The Thirty-Third Text REtrieval Conference
(TREC 2024)

Video-to-Text Description Generation Task Appendix

For each run, the following fields are listed: runtag, organization, the training data types used, the feature types used, a description of any external resources used, and a description of the captioning model used.

Runtag: internvl_40b (vtt_eval)
Org: BUPT_MCPRL
Training data types: image
Feature types: visual
External resources: no
Captioning model: InternVL

Runtag: PolySmart_run1_primary_mainTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: LLaVA fine-tuned on the TV16-TV23 VTT datasets
Captioning model: LLaVA fine-tuned on the TV16-TV23 VTT datasets

Runtag: PolySmart_run2_mainTask
Org: PolySmart
Training data types: video
Feature types: visual
External resources: video caption datasets
Captioning model: LLaVA

Runtag: PolySmart_run3_mainTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: none
Captioning model: LLaVA

Runtag: PolySmart_run4_mainTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: none
Captioning model: LLaVA-Video

Runtag: internvl_40b_cogvlm (vtt_eval)
Org: BUPT_MCPRL
Training data types: image
Feature types: visual
External resources: InternVL and CogVLM
Captioning model: InternVL and CogVLM

Runtag: tv24_kslab_1_primary (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the multiple generated captions into a one-sentence description, we used BART (Bidirectional and Auto-Regressive Transformers), a transformer-based model for sequence-to-sequence tasks such as text summarization; here it synthesizes the individual captions into a single summary sentence (https://arxiv.org/pdf/1910.13461).
Captioning model: We used LLaVA, a visual language model, to generate a caption for each frame. LLaVA is a large-scale multimodal model trained end-to-end by combining the CLIP image encoder with the Llama 2 LLM, and has achieved SOTA on the ScienceQA dataset. We chose it because it generates more accurate captions for each frame (https://llava-vl.github.io/). A brief code sketch of this captioning and aggregation step follows this entry.

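The following is a minimal sketch, not kslab's actual code, of the per-frame captioning and caption-aggregation steps described above: a LLaVA checkpoint captions each keyframe and a BART summarizer fuses the captions into one sentence. The model IDs (llava-hf/llava-1.5-7b-hf, facebook/bart-large-cnn), the prompt, the keyframe file names, and all generation settings are illustrative assumptions.

```python
# Sketch only: caption each keyframe with LLaVA, then fuse the captions with BART.
# Model IDs, prompt, file names, and generation settings are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
llava = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=dtype
).to(device)

def caption_frame(frame: Image.Image) -> str:
    """Generate a one-sentence caption for a single keyframe."""
    prompt = "USER: <image>\nDescribe this frame in one sentence. ASSISTANT:"
    inputs = processor(images=frame, text=prompt, return_tensors="pt").to(device, dtype)
    out = llava.generate(**inputs, max_new_tokens=60)
    return processor.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()

# BART as a sequence-to-sequence summarizer over the concatenated frame captions.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",
                      device=0 if device == "cuda" else -1)

def aggregate(captions: list) -> str:
    """Fuse the per-frame captions into a single video-level sentence."""
    joined = " ".join(captions)
    return summarizer(joined, max_length=40, min_length=10, do_sample=False)[0]["summary_text"]

keyframes = [Image.open(p) for p in ["kf_001.jpg", "kf_002.jpg", "kf_003.jpg"]]  # hypothetical paths
print(aggregate([caption_frame(f) for f in keyframes]))
```
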
Runtag: cogvlm (vtt_eval)
Org: BUPT_MCPRL
Training data types: image
Feature types: visual
External resources: CogVLM
Captioning model: CogVLM

Runtag: tv24_kslab_2 (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the multiple generated captions into a one-sentence description, we used BART (Bidirectional and Auto-Regressive Transformers), a transformer-based model for sequence-to-sequence tasks such as text summarization; here it synthesizes the individual captions into a single summary sentence (https://arxiv.org/pdf/1910.13461).
Captioning model: The BLIP2 (Bootstrapping Language-Image Pre-training) model was used as the captioning model. BLIP2 is a vision-language model that uses frozen pre-trained image and language encoders to generate captions for images. It is effective at understanding the visual content of keyframes extracted from videos and at generating relevant captions that describe the scene, objects, and actions. Each keyframe was passed through BLIP2, which generated a descriptive caption for it (https://arxiv.org/abs/2301.12597). A brief code sketch of this captioning step follows this entry.

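As a rough illustration of BLIP2 keyframe captioning, here is a minimal sketch using the Hugging Face transformers BLIP-2 classes. The checkpoint name (Salesforce/blip2-opt-2.7b), the file names, and the generation settings are assumptions, not necessarily the ones kslab used.

```python
# Sketch only: caption keyframes with a BLIP-2 checkpoint from Hugging Face.
# Checkpoint, file names, and generation settings are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def caption_keyframe(path: str) -> str:
    """Run one keyframe image through BLIP-2 and return its caption."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out = blip2.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()

captions = [caption_keyframe(p) for p in ["kf_001.jpg", "kf_002.jpg"]]  # hypothetical paths
print(captions)
```
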
Runtag: tv24_kslab_3 (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes. We used GPT-4, an advanced language model, to aggregate the text. GPT-4 is a large-scale language model developed by OpenAI that reflects the latest advances in natural language processing (NLP); it is based on the Transformer architecture and achieves high performance through its large parameter count and fine-tuning with reinforcement learning. For this task, we used GPT-4 to synthesize the individual captions into a single summary sentence (https://openai.com/index/hello-gpt-4o/). A brief code sketch of this aggregation step follows this entry.
Captioning model: We used LLaVA, a visual language model, to generate a caption for each frame. LLaVA is a large-scale multimodal model trained end-to-end by combining the CLIP image encoder with the Llama 2 LLM, and has achieved SOTA on the ScienceQA dataset. We chose it because it generates more accurate captions for each frame (https://llava-vl.github.io/).

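A minimal sketch of the caption-aggregation step, assuming the OpenAI Python client (openai>=1.0) and a GPT-4-class chat model; the prompt wording, example captions, and the model name are illustrative, not kslab's exact setup.

```python
# Sketch only: aggregate per-frame captions into one sentence with a GPT-4-class model.
# Assumes OPENAI_API_KEY is set; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()

def aggregate_captions(frame_captions: list) -> str:
    """Ask the chat model to fuse the per-frame captions into a single video description."""
    prompt = (
        "Each of the following captions describes one keyframe of the same video:\n"
        + "\n".join(f"- {c}" for c in frame_captions)
        + "\nWrite a single sentence that describes the whole video."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of GPT-4-class model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

print(aggregate_captions(["a man holds a guitar", "a man plays guitar on a stage", "an audience claps"]))
```
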
Runtag: tv24_kslab_4 (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes (a brief code sketch of this keyframe selection follows this entry). We used GPT-4, an advanced language model, to aggregate the text. GPT-4 is a large-scale language model developed by OpenAI that reflects the latest advances in natural language processing (NLP); it is based on the Transformer architecture and achieves high performance through its large parameter count and fine-tuning with reinforcement learning. For this task, we used GPT-4 to synthesize the individual captions into a single summary sentence (https://openai.com/index/hello-gpt-4o/).
Captioning model: The BLIP2 (Bootstrapping Language-Image Pre-training) model was used as the captioning model. BLIP2 is a vision-language model that uses frozen pre-trained image and language encoders to generate captions for images. It is effective at understanding the visual content of keyframes extracted from videos and at generating relevant captions that describe the scene, objects, and actions. Each keyframe was passed through BLIP2, which generated a descriptive caption for it (https://arxiv.org/abs/2301.12597).

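The keyframe-selection stage shared by these kslab runs (GoogLeNet frame features plus KTS-style change detection) could look roughly like the sketch below. It uses torchvision's pretrained GoogLeNet and replaces the full KTS dynamic program with a simple top-k change heuristic; the frame stride, top_k, and this simplification are assumptions, not kslab's implementation.

```python
# Sketch only: GoogLeNet per-frame features + a simple change-based keyframe picker.
# Approximates the described KTS step with a top-k heuristic; stride and top_k are arbitrary.
import cv2
import numpy as np
import torch
from torchvision.models import googlenet, GoogLeNet_Weights

weights = GoogLeNet_Weights.IMAGENET1K_V1
model = googlenet(weights=weights)
model.fc = torch.nn.Identity()   # keep the 1024-d pooled feature, drop the classifier
model.eval()
preprocess = weights.transforms()

def frame_features(video_path: str, stride: int = 15):
    """Sample every `stride`-th frame and return GoogLeNet features plus frame indices."""
    cap = cv2.VideoCapture(video_path)
    feats, indices, i = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            rgb = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).permute(2, 0, 1)
            with torch.no_grad():
                feats.append(model(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy())
            indices.append(i)
        i += 1
    cap.release()
    return np.stack(feats), indices

def select_keyframes(feats: np.ndarray, indices: list, top_k: int = 5) -> list:
    """Pick the frames where the feature trajectory changes the most (stand-in for KTS)."""
    diffs = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    change_points = np.argsort(diffs)[-top_k:]
    return sorted(int(indices[i + 1]) for i in change_points)

feats, idx = frame_features("video.mp4")   # hypothetical path
print(select_keyframes(feats, idx))        # frame indices to caption
```
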
Runtag: SoftbankMeisei_vtt_main_run1 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), video QA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. EVA-CLIP is used to rerank and score the output of each model (a brief code sketch of this reranking follows this entry).
Captioning model: BLIP2, BLIP3, InstructBLIP, LLaVA, GIT

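EVA-CLIP reranking of the ensemble's candidate captions could be done roughly as below, scoring image-text similarity with an EVA-CLIP checkpoint served through open_clip. The model and pretrained tags, the use of a single representative frame, and the helper name are assumptions, not softbank-meisei's actual setup; check open_clip.list_pretrained() for the EVA-CLIP tags available in your install.

```python
# Sketch only: rank candidate captions by EVA-CLIP image-text similarity.
# Model/pretrained tags are assumptions; see open_clip.list_pretrained().
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-L-14", pretrained="merged2b_s4b_b131k"
)
tokenizer = open_clip.get_tokenizer("EVA02-L-14")
model.eval()

def rerank_captions(frame: Image.Image, candidates: list) -> list:
    """Score each candidate caption against a representative frame, highest similarity first."""
    image = preprocess(frame).unsqueeze(0)
    text = tokenizer(candidates)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze(0)
    return sorted(zip(candidates, sims.tolist()), key=lambda p: p[1], reverse=True)

# Example: keep the best of the per-model captions (placeholders below).
best_caption, best_score = rerank_captions(
    Image.open("keyframe.jpg"),                      # hypothetical representative frame
    ["a man rides a bike", "a dog runs on grass"],
)[0]
print(best_caption, best_score)
```
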
Runtag: SoftbankMeisei_vtt_main_run2 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), video QA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5 (a brief code sketch of the back-translation augmentation follows this entry). The pre-training data for each model module (LLM, vision model) is omitted. GPT-4o is used to summarize the captions from each model, and EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP2, BLIP3, InstructBLIP, LLaVA, GIT

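The back-translation augmentation over the listed pivot languages could be implemented along these lines with the google-cloud-translate v2 client; the client choice, credential setup, example caption, and helper names are assumptions, not softbank-meisei's actual tooling.

```python
# Sketch only: English -> pivot language -> English back-translation to create
# paraphrased pseudo-captions. Assumes google-cloud-translate (v2 client) with
# GOOGLE_APPLICATION_CREDENTIALS configured; helper names are hypothetical.
from google.cloud import translate_v2 as translate

client = translate.Client()
PIVOT_LANGS = ["cs", "de", "fr", "ja", "ko", "ru", "zh-CN"]  # languages listed in the run description

def back_translate(caption: str, pivot: str) -> str:
    """Translate a caption into the pivot language and back to English."""
    forward = client.translate(caption, source_language="en", target_language=pivot)
    back = client.translate(forward["translatedText"], source_language=pivot, target_language="en")
    return back["translatedText"]

def augment(caption: str) -> list:
    """Return the distinct paraphrases produced by all pivot languages."""
    paraphrases = {back_translate(caption, lang) for lang in PIVOT_LANGS}
    paraphrases.discard(caption)
    return sorted(paraphrases)

print(augment("a man is riding a bicycle down a city street"))
```
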
Runtag: SoftbankMeisei_vtt_main_run3 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: image
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), video QA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. GPT-4o is used to summarize the captions from each model, and EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP2, BLIP3, InstructBLIP, LLaVA, GIT

Runtag: SoftbankMeisei_vtt_main_run4 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for the model modules (LLM, vision model) is omitted. EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP2

Runtag: VTM and VTC for two model primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG2

Runtag: VTC for two model primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG2

Runtag: VTM for two model primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG2

Runtag: VTM and VTC for videollama2 primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2