Runtag | Org | What are the training data types used? | What feature types used? | Describe any external resources used. | Describe the captioning model used. |
---|---|---|---|---|---|
internvl_40b (vtt_eval) | BUPT_MCPRL | image | visual | none | internvl |
PolySmart_run1_primary_mainTask (vtt_eval) | PolySmart | video | visual | LLAVA finetuned on tv16-tv23 VTT dataset | LLAVA finetuned on tv16-tv23 VTT dataset |
PolySmart_run2_mainTask | PolySmart | video | visual | video caption datasets | llava |
PolySmart_run3_mainTask (vtt_eval) | PolySmart | video | visual | none | llava |
PolySmart_run4_mainTask (vtt_eval) | PolySmart | video | visual | none | llava video |
internvl_40b_cogvlm (vtt_eval) | BUPT_MCPRL | image | visual | internvl and cogvlm | internvl and cogvlm |
tv24_kslab_1_primary (vtt_eval) | kslab | image | visual | For keyframe extraction we used GoogLeNet, a deep convolutional neural network for image recognition trained on large datasets such as ImageNet, to extract robust per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots; frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the generated captions into a one-sentence description we used BART (Bidirectional and Auto-Regressive Transformers), a transformer-based sequence-to-sequence model well suited to text summarization (https://arxiv.org/pdf/1910.13461). See the keyframe-selection and caption-aggregation sketches after this table. | We used LLaVA, a visual language model, to generate a caption for each frame. LLaVA is a large multimodal model trained end-to-end by combining the CLIP image encoder with the Llama2 LLM, and has achieved SOTA on the ScienceQA dataset; we chose it because it generates accurate per-frame captions (https://llava-vl.github.io/). |
cogvlm (vtt_eval) | BUPT_MCPRL | image | visual | cogvlm | cogvlm |
tv24_kslab_2 (vtt_eval) | kslab | image | visual | As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network for image recognition trained on large datasets such as ImageNet, to extract robust per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots; frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the generated captions into a one-sentence description we used BART (Bidirectional and Auto-Regressive Transformers), a transformer-based sequence-to-sequence model well suited to text summarization (https://arxiv.org/pdf/1910.13461). | BLIP2 (Bootstrapping Language-Image Pretraining) was used as the captioning model. BLIP2 is a vision-language model built on frozen pre-trained image and language encoders; it is effective at understanding the visual content of the extracted keyframes and generating captions that describe the scene, objects, and actions. Each keyframe was passed through BLIP2 to generate a descriptive caption (https://arxiv.org/abs/2301.12597). See the captioning sketch after this table. |
tv24_kslab_3 (vtt_eval) | kslab | image | visual | As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network for image recognition trained on large datasets such as ImageNet, to extract robust per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots; frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the captions we used GPT-4, a large-scale Transformer-based language model developed by OpenAI whose performance stems from its parameter scale and reinforcement-learning fine-tuning; it synthesized the individual captions into a single summary sentence (https://openai.com/index/hello-gpt-4o/). | We used LLaVA, a visual language model, to generate a caption for each frame. LLaVA is a large multimodal model trained end-to-end by combining the CLIP image encoder with the Llama2 LLM, and has achieved SOTA on the ScienceQA dataset; we chose it because it generates accurate per-frame captions (https://llava-vl.github.io/). |
tv24_kslab_4 (vtt_eval) | kslab | image | visual | As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network for image recognition trained on large datasets such as ImageNet, to extract robust per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots; frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the captions we used GPT-4, a large-scale Transformer-based language model developed by OpenAI whose performance stems from its parameter scale and reinforcement-learning fine-tuning; it synthesized the individual captions into a single summary sentence (https://openai.com/index/hello-gpt-4o/). | BLIP2 (Bootstrapping Language-Image Pretraining) was used as the captioning model. BLIP2 is a vision-language model built on frozen pre-trained image and language encoders; it is effective at understanding the visual content of the extracted keyframes and generating captions that describe the scene, objects, and actions. Each keyframe was passed through BLIP2 to generate a descriptive caption (https://arxiv.org/abs/2301.12597). |
SoftbankMeisei_vtt_main_run1 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), VQA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, VisualGenome, ALT200M, SBU. Pseudo data: V3C1/V3C2 captions augmented by back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and by GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. EVA-CLIP was used to rerank and score each model's output; see the reranking and back-translation sketches after this table. | BLIP2, BLIP3, InstructBLIP, LLaVA, GIT |
SoftbankMeisei_vtt_main_run2 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), VQA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, VisualGenome, ALT200M, SBU. Pseudo data: V3C1/V3C2 captions augmented by back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and by GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. GPT-4o was used to summarize each model's captions, and EVA-CLIP to compute confidence scores. | BLIP2, BLIP3, InstructBLIP, LLaVA, GIT |
SoftbankMeisei_vtt_main_run3 (vtt_eval) (paper) | softbank-meisei | image | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), VQA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, VisualGenome, ALT200M, SBU. Pseudo data: V3C1/V3C2 captions augmented by back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and by GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. GPT-4o was used to summarize each model's captions, and EVA-CLIP to compute confidence scores. | BLIP2, BLIP3, InstructBLIP, LLaVA, GIT |
SoftbankMeisei_vtt_main_run4 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. Pseudo data: V3C1/V3C2 captions augmented by back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and by GPT-3.5. The pre-training data for the model modules (LLM, vision model) is omitted. EVA-CLIP was used to compute confidence scores. | BLIP2 |
VTM and VTC for two model primary (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of v3c1 | videollama2 and mplug2 |
VTC for two model primary (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of v3c1 | videollama2 and mplug2 |
VTM for two model primary (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of v3c1 | videollama2 and mplug2 |
VTM and VTC for videollama2 primary (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of v3c1 | videollama2 |
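
The kslab runs describe keyframe selection as kernel temporal segmentation over per-frame GoogLeNet features. The sketch below illustrates that pipeline under stated assumptions: torchvision's ImageNet-pretrained GoogLeNet supplies the features, and the `ruptures` kernel change-point detector stands in for the team's KTS implementation; the frame stride, RBF kernel, and penalty value are illustrative, not the team's settings.

```python
# Keyframe-selection sketch: per-frame GoogLeNet features followed by kernel
# change-point detection (a stand-in for KTS). Assumes torchvision,
# opencv-python, and ruptures are installed.
import cv2
import numpy as np
import ruptures as rpt
import torch
from torchvision import models, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = models.googlenet(weights=models.GoogLeNet_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()          # keep the 1024-d pooled features
backbone.eval().to(device)

preprocess = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def frame_features(video_path, stride=15):
    """Sample every `stride`-th frame and return its GoogLeNet features."""
    cap = cv2.VideoCapture(video_path)
    feats, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            x = preprocess(rgb).unsqueeze(0).to(device)
            with torch.no_grad():
                feats.append(backbone(x).squeeze(0).cpu().numpy())
        idx += 1
    cap.release()
    return np.stack(feats)

def keyframe_indices(feats, penalty=10.0):
    """Kernel change-point detection over the feature sequence; the middle
    sampled frame of each segment is taken as that segment's keyframe."""
    algo = rpt.KernelCPD(kernel="rbf").fit(feats)
    boundaries = [0] + algo.predict(pen=penalty)
    return [(a + b) // 2 for a, b in zip(boundaries[:-1], boundaries[1:])]
```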
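
Several runs (kslab runs 2 and 4, and the SoftbankMeisei ensemble) caption each keyframe with BLIP2. A minimal per-keyframe captioning sketch using the Hugging Face Transformers BLIP-2 classes is shown below; the `Salesforce/blip2-opt-2.7b` checkpoint and the generation length are assumptions, not the submitters' configurations.

```python
# Per-keyframe captioning sketch with BLIP-2 via Hugging Face Transformers.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype).to(device).eval()

def caption_keyframe(image_path: str) -> str:
    """Generate a short caption for one extracted keyframe."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    with torch.no_grad():
        ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(ids, skip_special_tokens=True)[0].strip()

# One caption per keyframe; these are later aggregated into a single sentence.
captions = [caption_keyframe(p) for p in ["kf_000.jpg", "kf_001.jpg"]]
```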
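
The kslab runs aggregate the per-frame captions into one sentence with BART (runs 1 and 2) or GPT-4 (runs 3 and 4). The sketch below shows the BART variant, assuming the `facebook/bart-large-cnn` summarization checkpoint and illustrative length limits; it is a sketch of the aggregation step, not the team's exact setup.

```python
# Caption-aggregation sketch: concatenate the frame-level captions and
# condense them into a single sentence with a BART summarizer.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def aggregate_captions(captions: list[str]) -> str:
    """Join the frame-level captions and summarize them into one description."""
    joined = " ".join(captions)
    out = summarizer(joined, max_length=30, min_length=8, do_sample=False)
    return out[0]["summary_text"]

print(aggregate_captions([
    "a man is riding a bicycle on a city street",
    "a man in a helmet pedals past parked cars",
    "shops and parked cars line the street behind the cyclist",
]))
```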
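
The SoftbankMeisei runs score or rerank the candidate captions produced by their model ensemble with EVA-CLIP. The sketch below uses plain OpenAI CLIP (`openai/clip-vit-base-patch32`) as a stand-in for EVA-CLIP, and averaging image-text similarity over the keyframes is an assumed scoring rule, not the team's published one.

```python
# CLIP-style reranking sketch: each candidate caption (one per captioning
# model) is scored against the video's keyframes and the best one is kept.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rerank(captions: list[str], keyframe_paths: list[str]):
    """Return the caption with the highest mean similarity to the keyframes."""
    images = [Image.open(p).convert("RGB") for p in keyframe_paths]
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    scores = out.logits_per_text.mean(dim=1)   # shape: (num_captions,)
    best = int(scores.argmax())
    return captions[best], scores.tolist()
```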
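
The SoftbankMeisei pseudo-data entry describes augmenting V3C1/V3C2 captions by back-translation through the Google Translate API over seven pivot languages. A sketch assuming the `google-cloud-translate` v2 client is below; credential setup is omitted and the request flow is an assumption, with the pivot codes taken from the run description (zh-cn written as zh-CN for the API).

```python
# Back-translation augmentation sketch: English -> pivot language -> English
# to obtain paraphrased pseudo captions.
from google.cloud import translate_v2 as translate

client = translate.Client()
PIVOTS = ["cs", "de", "fr", "ja", "ko", "ru", "zh-CN"]

def back_translate(caption: str, pivot: str) -> str:
    """Round-trip one caption through the pivot language."""
    fwd = client.translate(caption, source_language="en",
                           target_language=pivot)["translatedText"]
    return client.translate(fwd, source_language=pivot,
                            target_language="en")["translatedText"]

augmented = [back_translate("a man rides a bicycle down a city street", p)
             for p in PIVOTS]
```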