Runtag | Org | What are the training data types used? | What feature types used? | Describe any external resources used. | Describe the captioning model used. |
---|---|---|---|---|---|
PolySmart_run1_primary_robustnessTask (vtt_eval) | PolySmart | video | visual | TRECVid VTT 2016-2023 datasets | LLaVA |
PolySmart_run2_robustnessTask (vtt_eval) | PolySmart | video | visual | video captioning dataset | LLaVA |
PolySmart_run3_robustnessTask (vtt_eval) | PolySmart | video | visual | none | LLaVA |
PolySmart_run4_robustnessTask (vtt_eval) | PolySmart | video | visual | none | LLaVA (videos) |
SoftbankMeisei_vtt_sub_run2 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. EVA-CLIP is used to compute confidence scores. | BLIP2 |
SoftbankMeisei_vtt_sub_run3 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. Pseudo-labels: V3C1/V3C2 data augmented by back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and by GPT-3.5. Pre-training data for the model modules (LLM, vision model) is omitted. EVA-CLIP is used to compute confidence scores. | BLIP2 |
VTM and VTC for two model (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | VideoLLaMA2 and mPLUG-2 |
VTC for two model (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | VideoLLaMA2 and mPLUG-2 |
VTM for two model (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | mPLUG-2 and VideoLLaMA2 |
VTM and VTC for videollama2 robust (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | VideoLLaMA2 |
SoftbankMeisei_vtt_sub_run1 (vtt_eval) (paper) | softbank-meisei | video | visual | GIT: CC3M, CC12M, MSCOCO, VisualGenome, ALT200M, SBU | GIT-Video |
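
The SoftbankMeisei_vtt_sub_run3 row describes augmenting V3C1/V3C2 captions by back-translation through the listed pivot languages. The sketch below illustrates that idea under stated assumptions: it uses the `google-cloud-translate` v2 Python client (the run may have used a different client or endpoint), and the `back_translate`/`augment` helpers are hypothetical names introduced here for illustration.

```python
# Minimal back-translation augmentation sketch (assumption: google-cloud-translate v2 client).
# Each English caption is translated into a pivot language and back into English,
# yielding a paraphrased caption that can be added to the training data.
from google.cloud import translate_v2 as translate

PIVOT_LANGS = ["cs", "de", "fr", "ja", "ko", "ru", "zh-cn"]  # languages listed in the table row

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be configured


def back_translate(caption: str, pivot: str) -> str:
    """Translate an English caption into `pivot` and back into English."""
    forward = client.translate(caption, source_language="en", target_language=pivot)
    backward = client.translate(forward["translatedText"],
                                source_language=pivot, target_language="en")
    return backward["translatedText"]


def augment(captions: list[str]) -> list[str]:
    """Return the original captions plus one back-translated variant per pivot language."""
    augmented = list(captions)
    for caption in captions:
        for lang in PIVOT_LANGS:
            augmented.append(back_translate(caption, lang))
    return augmented
```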
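
Both SoftbankMeisei BLIP2 runs report using EVA-CLIP to compute confidence scores for candidate captions. The sketch below shows one plausible way to do CLIP-style caption scoring with `open_clip`; the ViT-B-32/laion2b_s34b_b79k checkpoint is a stand-in for the team's EVA-CLIP model, and frame sampling, prompt formatting, and the scoring function itself are assumptions, not the submitted pipeline.

```python
# CLIP-style caption confidence scoring sketch (stand-in checkpoint; the runs used EVA-CLIP).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # swap in an EVA-CLIP checkpoint as available
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def caption_confidence(frame_paths: list[str], captions: list[str]) -> torch.Tensor:
    """Return one cosine-similarity score per caption, averaged over sampled video frames."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths])
    tokens = tokenizer(captions)
    with torch.no_grad():
        image_feats = model.encode_image(frames)
        text_feats = model.encode_text(tokens)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # (frames x captions) similarity matrix, averaged over frames -> one score per caption
    return (image_feats @ text_feats.T).mean(dim=0)


# Usage: keep the highest-scoring candidate caption for a video.
# scores = caption_confidence(["frame_000.jpg", "frame_001.jpg"], candidate_captions)
# best = candidate_captions[scores.argmax()]
```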