Runtag | Org | What are the training data types used? | What feature types used? | Describe any external resources used. | Describe the captioning model used. |
---|---|---|---|---|---|
PolySmart_run1_primary_robustnessTask (vtt_eval) | PolySmart | video | visual | TRECVid VTT 2016-2023 datasets | LLaVA |
PolySmart_run2_robustnessTask (vtt_eval) | PolySmart | video | visual | video captioning dataset | LLaVA |
PolySmart_run3_robustnessTask (vtt_eval) | PolySmart | video | visual | none | LLaVA |
PolySmart_run4_robustnessTask (vtt_eval) | PolySmart | video | visual | none | LLaVA (videos) |
SoftbankMeisei_vtt_sub_run2 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. EVA-CLIP is used to compute confidence scores. | BLIP2 |
SoftbankMeisei_vtt_sub_run3 (vtt_eval) (paper) | softbank-meisei | video | visual | BLIP2: MSCOCO, SBU, VisualGenome, LAION400M. Pseudo-labels: V3C1/V3C2 data augmented by back-translation with the Google Translate API (cs, de, fr, ja, ko, ru, zh-cn) and by GPT-3.5. Pre-training data for the model modules (LLM, vision model) is omitted. EVA-CLIP is used to compute confidence scores. | BLIP2 |
VTM and VTC for two model (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | VideoLLaMA2 and mPLUG-2 |
VTC for two model (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | VideoLLaMA2 and mPLUG-2 |
VTM for two model (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | mPLUG-2 and VideoLLaMA2 |
VTM and VTC for videollama2 robust (vtt_eval) (paper) | ruc_aim3 | video | both audio and visual | subset of V3C1 | VideoLLaMA2 |
SoftbankMeisei_vtt_sub_run1 (vtt_eval) (paper) | softbank-meisei | video | visual | GIT: CC3M, CC12M, MSCOCO, VisualGenome, ALT200M, SBU | GIT-Video |
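
The SoftbankMeisei_vtt_sub_run3 row describes augmenting V3C1/V3C2 captions by back-translation through the listed pivot languages. The sketch below illustrates that idea under stated assumptions: it uses the `google-cloud-translate` v2 Python client (the run may have used a different client or endpoint), and the `back_translate`/`augment` helpers are hypothetical names introduced here for illustration.

```python
# Minimal back-translation augmentation sketch (assumption: google-cloud-translate v2 client).
# Each English caption is translated into a pivot language and back into English,
# yielding a paraphrased caption that can be added to the training data.
from google.cloud import translate_v2 as translate

PIVOT_LANGS = ["cs", "de", "fr", "ja", "ko", "ru", "zh-cn"]  # languages listed in the table row

client = translate.Client()  # requires GOOGLE_APPLICATION_CREDENTIALS to be configured


def back_translate(caption: str, pivot: str) -> str:
    """Translate an English caption into `pivot` and back into English."""
    forward = client.translate(caption, source_language="en", target_language=pivot)
    backward = client.translate(forward["translatedText"],
                                source_language=pivot, target_language="en")
    return backward["translatedText"]


def augment(captions: list[str]) -> list[str]:
    """Return the original captions plus one back-translated variant per pivot language."""
    augmented = list(captions)
    for caption in captions:
        for lang in PIVOT_LANGS:
            augmented.append(back_translate(caption, lang))
    return augmented
```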
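
Both SoftbankMeisei BLIP2 runs report using EVA-CLIP to compute confidence scores for candidate captions. The sketch below shows one plausible way to do CLIP-style caption scoring with `open_clip`; the ViT-B-32/laion2b_s34b_b79k checkpoint is a stand-in for the team's EVA-CLIP model, and frame sampling, prompt formatting, and the scoring function itself are assumptions, not the submitted pipeline.

```python
# CLIP-style caption confidence scoring sketch (stand-in checkpoint; the runs used EVA-CLIP).
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # swap in an EVA-CLIP checkpoint as available
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()


def caption_confidence(frame_paths: list[str], captions: list[str]) -> torch.Tensor:
    """Return one cosine-similarity score per caption, averaged over sampled video frames."""
    frames = torch.stack([preprocess(Image.open(p)) for p in frame_paths])
    tokens = tokenizer(captions)
    with torch.no_grad():
        image_feats = model.encode_image(frames)
        text_feats = model.encode_text(tokens)
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    # (frames x captions) similarity matrix, averaged over frames -> one score per caption
    return (image_feats @ text_feats.T).mean(dim=0)


# Usage: keep the highest-scoring candidate caption for a video.
# scores = caption_confidence(["frame_000.jpg", "frame_001.jpg"], candidate_captions)
# best = candidate_captions[scores.argmax()]
```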