The Thirty-Third Text REtrieval Conference
(TREC 2024)

Video-to-Text Robustness subtask Appendix

Fields reported for each run: run tag, organization, training data types used, feature types used, external resources used, and captioning model used.
Run tag: PolySmart_run1_primary_robustnessTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: TRECVID VTT 2016-2023 dataset
Captioning model: LLaVA
Run tag: PolySmart_run2_robustnessTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: video captioning dataset
Captioning model: LLaVA
Run tag: PolySmart_run3_robustnessTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: none
Captioning model: LLaVA
Run tag: PolySmart_run4_robustnessTask (vtt_eval)
Org: PolySmart
Training data types: video
Feature types: visual
External resources: none
Captioning model: LLaVA (videos)
Run tag: SoftbankMeisei_vtt_sub_run2 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP-2: MSCOCO, SBU, Visual Genome, LAION-400M. EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP-2
Run tag: SoftbankMeisei_vtt_sub_run3 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: BLIP-2: MSCOCO, SBU, Visual Genome, LAION-400M. Pseudo: augmented data for V3C1 and V3C2 via back-translation using the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and augmentation with GPT-3.5. The pre-training data for the model modules (LLM, vision model) is omitted. EVA-CLIP is used to compute confidence scores.
Captioning model: BLIP-2
Run tag: VTM and VTC for two model (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG-2
Run tag: VTC for two model (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2 and mPLUG-2
Run tag: VTM for two model (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: mPLUG-2 and VideoLLaMA2
Run tag: VTM and VTC for videollama2 robust (vtt_eval) (paper)
Org: ruc_aim3
Training data types: video
Feature types: both audio and visual
External resources: subset of V3C1
Captioning model: VideoLLaMA2
Run tag: SoftbankMeisei_vtt_sub_run1 (vtt_eval) (paper)
Org: softbank-meisei
Training data types: video
Feature types: visual
External resources: GIT: CC3M, CC12M, MSCOCO, Visual Genome, ALT200M, SBU
Captioning model: GIT-Video