The Thirty-Third Text REtrieval Conference
(TREC 2024)

Video-to-Text Description Generation Task Appendix

For each run, the following fields are listed: runtag, organization, the training data types used, the feature types used, a description of any external resources used, and a description of the captioning model used.

Runtag: internvl_40b (vtt_eval)
Org: BUPT_MCPRL
Training data types: image
Feature types: visual
External resources: no
Captioning model: InternVL

Runtag: PolySmart_run1_primary_mainTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: LLaVA fine-tuned on the TV16-TV23 VTT datasets
Captioning model: LLaVA fine-tuned on the TV16-TV23 VTT datasets

Runtag: PolySmart_run2_mainTask
Org: PolySmart
Training data types: video
Feature types: visual
External resources: video caption datasets
Captioning model: LLaVA

Runtag: PolySmart_run3_mainTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: none
Captioning model: LLaVA

Runtag: PolySmart_run4_mainTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: none
Captioning model: LLaVA-Video

Runtag: internvl_40b_cogvlm (vtt_eval)
Org: BUPT_MCPRL
Training data types: image
Feature types: visual
External resources: InternVL and CogVLM
Captioning model: InternVL and CogVLM

Runtag: tv24_kslab_1_primary (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the multiple generated captions into a one-sentence description, we used BART (Bidirectional and Auto-Regressive Transformers), a transformer-based model for sequence-to-sequence tasks such as text summarization; here it synthesizes the individual captions into a single summary sentence (https://arxiv.org/pdf/1910.13461).
Captioning model: We used LLaVA, a visual language model, to generate a caption for each frame. LLaVA is a large-scale multimodal model trained end-to-end by combining the CLIP image encoder with the Llama 2 LLM, and has achieved SOTA on the ScienceQA dataset. We chose it because it generates more accurate captions for each frame (https://llava-vl.github.io/). A brief code sketch of this captioning and aggregation step follows this entry.

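The following is a minimal sketch, not kslab's actual code, of the per-frame captioning and caption-aggregation steps described above: a LLaVA checkpoint captions each keyframe and a BART summarizer fuses the captions into one sentence. The model IDs (llava-hf/llava-1.5-7b-hf, facebook/bart-large-cnn), the prompt, the keyframe file names, and all generation settings are illustrative assumptions.

```python
# Sketch only: caption each keyframe with LLaVA, then fuse the captions with BART.
# Model IDs, prompt, file names, and generation settings are assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration, pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
llava = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=dtype
).to(device)

def caption_frame(frame: Image.Image) -> str:
    """Generate a one-sentence caption for a single keyframe."""
    prompt = "USER: <image>\nDescribe this frame in one sentence. ASSISTANT:"
    inputs = processor(images=frame, text=prompt, return_tensors="pt").to(device, dtype)
    out = llava.generate(**inputs, max_new_tokens=60)
    return processor.decode(out[0], skip_special_tokens=True).split("ASSISTANT:")[-1].strip()

# BART as a sequence-to-sequence summarizer over the concatenated frame captions.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn",
                      device=0 if device == "cuda" else -1)

def aggregate(captions: list) -> str:
    """Fuse the per-frame captions into a single video-level sentence."""
    joined = " ".join(captions)
    return summarizer(joined, max_length=40, min_length=10, do_sample=False)[0]["summary_text"]

keyframes = [Image.open(p) for p in ["kf_001.jpg", "kf_002.jpg", "kf_003.jpg"]]  # hypothetical paths
print(aggregate([caption_frame(f) for f in keyframes]))
```
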
Runtag: cogvlm (vtt_eval)
Org: BUPT_MCPRL
Training data types: image
Feature types: visual
External resources: CogVLM
Captioning model: CogVLM

Runtag: tv24_kslab_2 (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes. To aggregate the multiple generated captions into a one-sentence description, we used BART (Bidirectional and Auto-Regressive Transformers), a transformer-based model for sequence-to-sequence tasks such as text summarization; here it synthesizes the individual captions into a single summary sentence (https://arxiv.org/pdf/1910.13461).
Captioning model: The BLIP2 (Bootstrapping Language-Image Pre-training) model was used as the captioning model. BLIP2 is a vision-language model that uses frozen pre-trained image and language encoders to generate captions for images. It is effective at understanding the visual content of keyframes extracted from videos and at generating relevant captions that describe the scene, objects, and actions. Each keyframe was passed through BLIP2, which generated a descriptive caption for it (https://arxiv.org/abs/2301.12597). A brief code sketch of this captioning step follows this entry.

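As a rough illustration of BLIP2 keyframe captioning, here is a minimal sketch using the Hugging Face transformers BLIP-2 classes. The checkpoint name (Salesforce/blip2-opt-2.7b), the file names, and the generation settings are assumptions, not necessarily the ones kslab used.

```python
# Sketch only: caption keyframes with a BLIP-2 checkpoint from Hugging Face.
# Checkpoint, file names, and generation settings are assumptions.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
blip2 = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

def caption_keyframe(path: str) -> str:
    """Run one keyframe image through BLIP-2 and return its caption."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device, dtype)
    out = blip2.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True).strip()

captions = [caption_keyframe(p) for p in ["kf_001.jpg", "kf_002.jpg"]]  # hypothetical paths
print(captions)
```
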
Runtag: tv24_kslab_3 (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes. We used GPT-4, an advanced language model, to aggregate the text. GPT-4 is a large-scale language model developed by OpenAI that reflects the latest advances in natural language processing (NLP); it is based on the Transformer architecture and achieves high performance through its large parameter count and fine-tuning with reinforcement learning. For this task, we used GPT-4 to synthesize the individual captions into a single summary sentence (https://openai.com/index/hello-gpt-4o/). A brief code sketch of this aggregation step follows this entry.
Captioning model: We used LLaVA, a visual language model, to generate a caption for each frame. LLaVA is a large-scale multimodal model trained end-to-end by combining the CLIP image encoder with the Llama 2 LLM, and has achieved SOTA on the ScienceQA dataset. We chose it because it generates more accurate captions for each frame (https://llava-vl.github.io/).

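A minimal sketch of the caption-aggregation step, assuming the OpenAI Python client (openai>=1.0) and a GPT-4-class chat model; the prompt wording, example captions, and the model name are illustrative, not kslab's exact setup.

```python
# Sketch only: aggregate per-frame captions into one sentence with a GPT-4-class model.
# Assumes OPENAI_API_KEY is set; prompt and model name are illustrative.
from openai import OpenAI

client = OpenAI()

def aggregate_captions(frame_captions: list) -> str:
    """Ask the chat model to fuse the per-frame captions into a single video description."""
    prompt = (
        "Each of the following captions describes one keyframe of the same video:\n"
        + "\n".join(f"- {c}" for c in frame_captions)
        + "\nWrite a single sentence that describes the whole video."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",  # illustrative choice of GPT-4-class model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,
    )
    return resp.choices[0].message.content.strip()

print(aggregate_captions(["a man holds a guitar", "a man plays guitar on a stage", "an audience claps"]))
```
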
Runtag: tv24_kslab_4 (vtt_eval)
Org: kslab
Training data types: image
Feature types: visual
External resources: As in previous years, we extracted keyframes from the video, generated a caption for each, and aggregated them. For keyframe extraction we used GoogLeNet, a deep convolutional neural network designed for image recognition. Because it was trained on large datasets such as ImageNet, it provides robust image recognition capabilities for the keyframe selection process; we used it to extract per-frame features (https://arxiv.org/abs/1409.4842). Kernel temporal segmentation (KTS) was then used to divide the video into shots and identify keyframes: frames with large changes in the GoogLeNet features are output as keyframes (a brief code sketch of this keyframe selection follows this entry). We used GPT-4, an advanced language model, to aggregate the text. GPT-4 is a large-scale language model developed by OpenAI that reflects the latest advances in natural language processing (NLP); it is based on the Transformer architecture and achieves high performance through its large parameter count and fine-tuning with reinforcement learning. For this task, we used GPT-4 to synthesize the individual captions into a single summary sentence (https://openai.com/index/hello-gpt-4o/).
Captioning model: The BLIP2 (Bootstrapping Language-Image Pre-training) model was used as the captioning model. BLIP2 is a vision-language model that uses frozen pre-trained image and language encoders to generate captions for images. It is effective at understanding the visual content of keyframes extracted from videos and at generating relevant captions that describe the scene, objects, and actions. Each keyframe was passed through BLIP2, which generated a descriptive caption for it (https://arxiv.org/abs/2301.12597).

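The keyframe-selection stage shared by these kslab runs (GoogLeNet frame features plus KTS-style change detection) could look roughly like the sketch below. It uses torchvision's pretrained GoogLeNet and replaces the full KTS dynamic program with a simple top-k change heuristic; the frame stride, top_k, and this simplification are assumptions, not kslab's implementation.

```python
# Sketch only: GoogLeNet per-frame features + a simple change-based keyframe picker.
# Approximates the described KTS step with a top-k heuristic; stride and top_k are arbitrary.
import cv2
import numpy as np
import torch
from torchvision.models import googlenet, GoogLeNet_Weights

weights = GoogLeNet_Weights.IMAGENET1K_V1
model = googlenet(weights=weights)
model.fc = torch.nn.Identity()   # keep the 1024-d pooled feature, drop the classifier
model.eval()
preprocess = weights.transforms()

def frame_features(video_path: str, stride: int = 15):
    """Sample every `stride`-th frame and return GoogLeNet features plus frame indices."""
    cap = cv2.VideoCapture(video_path)
    feats, indices, i = [], [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % stride == 0:
            rgb = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).permute(2, 0, 1)
            with torch.no_grad():
                feats.append(model(preprocess(rgb).unsqueeze(0)).squeeze(0).numpy())
            indices.append(i)
        i += 1
    cap.release()
    return np.stack(feats), indices

def select_keyframes(feats: np.ndarray, indices: list, top_k: int = 5) -> list:
    """Pick the frames where the feature trajectory changes the most (stand-in for KTS)."""
    diffs = np.linalg.norm(np.diff(feats, axis=0), axis=1)
    change_points = np.argsort(diffs)[-top_k:]
    return sorted(int(indices[i + 1]) for i in change_points)

feats, idx = frame_features("video.mp4")   # hypothetical path
print(select_keyframes(feats, idx))        # frame indices to caption
```
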
Runtag: SoftbankMeisei_vtt_main_run1 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), video QA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. EVA-CLIP is used to rerank and score the output of each model (a brief code sketch of this reranking follows this entry).
Captioning model: BLIP2, BLIP3, InstructBLIP, LLaVA, GIT

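EVA-CLIP reranking of the ensemble's candidate captions could be done roughly as below, scoring image-text similarity with an EVA-CLIP checkpoint served through open_clip. The model and pretrained tags, the use of a single representative frame, and the helper name are assumptions, not softbank-meisei's actual setup; check open_clip.list_pretrained() for the EVA-CLIP tags available in your install.

```python
# Sketch only: rank candidate captions by EVA-CLIP image-text similarity.
# Model/pretrained tags are assumptions; see open_clip.list_pretrained().
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "EVA02-L-14", pretrained="merged2b_s4b_b131k"
)
tokenizer = open_clip.get_tokenizer("EVA02-L-14")
model.eval()

def rerank_captions(frame: Image.Image, candidates: list) -> list:
    """Score each candidate caption against a representative frame, highest similarity first."""
    image = preprocess(frame).unsqueeze(0)
    text = tokenizer(candidates)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T).squeeze(0)
    return sorted(zip(candidates, sims.tolist()), key=lambda p: p[1], reverse=True)

# Example: keep the best of the per-model captions (placeholders below).
best_caption, best_score = rerank_captions(
    Image.open("keyframe.jpg"),                      # hypothetical representative frame
    ["a man rides a bike", "a dog runs on grass"],
)[0]
print(best_caption, best_score)
```
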
Runtag: SoftbankMeisei_vtt_main_run2 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), video QA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5 (a brief code sketch of the back-translation augmentation follows this entry). The pre-training data for each model module (LLM, vision model) is omitted. GPT-4o is used to summarize the captions from each model, and EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP2, BLIP3, InstructBLIP, LLaVA, GIT

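The back-translation augmentation over the listed pivot languages could be implemented along these lines with the google-cloud-translate v2 client; the client choice, credential setup, example caption, and helper names are assumptions, not softbank-meisei's actual tooling.

```python
# Sketch only: English -> pivot language -> English back-translation to create
# paraphrased pseudo-captions. Assumes google-cloud-translate (v2 client) with
# GOOGLE_APPLICATION_CREDENTIALS configured; helper names are hypothetical.
from google.cloud import translate_v2 as translate

client = translate.Client()
PIVOT_LANGS = ["cs", "de", "fr", "ja", "ko", "ru", "zh-CN"]  # languages listed in the run description

def back_translate(caption: str, pivot: str) -> str:
    """Translate a caption into the pivot language and back to English."""
    forward = client.translate(caption, source_language="en", target_language=pivot)
    back = client.translate(forward["translatedText"], source_language=pivot, target_language="en")
    return back["translatedText"]

def augment(caption: str) -> list:
    """Return the distinct paraphrases produced by all pivot languages."""
    paraphrases = {back_translate(caption, lang) for lang in PIVOT_LANGS}
    paraphrases.discard(caption)
    return sorted(paraphrases)

print(augment("a man is riding a bicycle down a city street"))
```
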
Runtag: SoftbankMeisei_vtt_main_run3 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: image
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. BLIP3: MINT-1T, OBELICS, BLIP3-KALE, BLIP3-OCR-200M. InstructBLIP: image captioning (MSCOCO, Flickr30k, NoCaps), image QA (VQAv2), video QA (MSVD QA, MSRVTT QA), etc. LLaVA: customized CC3M. GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for each model module (LLM, vision model) is omitted. GPT-4o is used to summarize the captions from each model, and EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP2, BLIP3, InstructBLIP, LLaVA, GIT

Runtag: SoftbankMeisei_vtt_main_run4 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP2: MSCOCO, SBU, Visual Genome, LAION-400M. Pseudo data: augmented captions for V3C1 and V3C2 via back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for the model modules (LLM, vision model) is omitted. EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP2

Runtag: VTM and VTC for two model primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG2

Runtag: VTC for two model primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG2

Runtag: VTM for two model primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG2

Runtag: VTM and VTC for videollama2 primary (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2