Runtag | Org | Run type | Retrieval model | External resources |
---|---|---|---|---|
certh.iti.avs.24.main.run.1 (paper) | CERTH-ITI | automatic | A trainable fusion network combines text and video similarities from several cross-modal networks; the similarities are normalized using the 2022, 2023, and 2024 queries (see the fusion sketch below the table). | The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions. |
SoftbankMeisei - Main Run 1 (paper) | softbank-meisei | automatic | Pre-trained embedding models, mostly from OpenCLIP | LLM (GPT), image generation (Stable Diffusion) |
SoftbankMeisei - Main Run 2 (paper) | softbank-meisei | automatic | Pre-trained embedding models, mostly from OpenCLIP | LLM (GPT), image generation (Stable Diffusion) |
SoftbankMeisei - Main Run 3 (paper) | softbank-meisei | automatic | Pre-trained embedding models, mostly from OpenCLIP | LLM (GPT), image generation (Stable Diffusion) |
SoftbankMeisei - Main Run 4 (paper) | softbank-meisei | automatic | Pre-trained embedding models, mostly from OpenCLIP | LLM (GPT), image generation (Stable Diffusion) |
Expansion_Fusion_Rerank_Auto_Decompose | NII_UIT | automatic | InternVL, BEiT-3, CLIP-L/14 DataComp, CLIP-H/14 Laion2B, CLIP-H/14 DFN5b, OpenAI RN101, BLIP-2, XCLIP, VILA-1.5 | We used pre-trained versions of all models. |
Expansion_Fusion_Rerank_Manual_Decompose | NII_UIT | manual | InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5 | We used pre-trained versions of all models. |
Expansion_Fusion_Reranking | NII_UIT | automatic | InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5 | We used pre-trained versions of all models. |
certh.iti.avs.24.main.run.2 (paper) | CERTH-ITI | automatic | A trainable fusion network combines text and video similarities from several cross-modal networks; the similarities are normalized using only this year's queries. | The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions. |
certh.iti.avs.24.main.run.3 (paper) | CERTH-ITI | automatic | A trainable fusion network combines text and video similarities from several cross-modal networks; no similarity normalization is performed. | The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions. |
run4 (paper) | WHU-NERCMS | automatic | BLIP, CLIP, SLIP, BLIP2, LaCLIP, stable-diffusion | We used the pre-trained weights of the models. |
run3 (paper) | WHU-NERCMS | automatic | BLIP, CLIP, BLIP2, LaCLIP, SLIP, stable-diffusion | We used the pre-trained weights of the models. |
run2 (paper) | WHU-NERCMS | automatic | BLIP, CLIP, BLIP2, LaCLIP, SLIP, stable-diffusion | We used the pre-trained weights of the models. |
add_captioning (paper) | ruc_aim3 | manual | CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; Captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large | None |
baseline (paper) | ruc_aim3 | automatic | CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2 | None |
add_QArerank (paper) | ruc_aim3 | automatic | CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; Captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large; GPT4-turbo, InternVL2-26B | None |
add_captioning_QArerank (paper) | ruc_aim3 | automatic | CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; Captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large; GPT4-turbo, InternVL2-26B | None |
Expan_Fu_Rerank_M_Decompose_CRerank | NII_UIT | manual | InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5 | We used pre-trained versions of all models. |
Manual_run1 (paper) | WHU-NERCMS | manual | BLIP, CLIP, SLIP, LaCLIP, BLIP2, stable-diffusion | We used the pre-trained weights of the models. |
novelty search | PolySmart | automatic | GenImg search (Improved-ITV visual feature) | SD 2.1 for generating images |
run4_polySmart | PolySmart | automatic | Original query | n/a |
relevance_feedback_run4 (paper) | WHU-NERCMS | relevance feedback | BLIP, CLIP, SLIP, LaCLIP, BLIP2, stable-diffusion | We used the pre-trained weights of the models. |
relevance_feedback_run1 (paper) | WHU-NERCMS | relevance feedback | BLIP, CLIP, SLIP, LaCLIP, BLIP2, stable-diffusion | We used the pre-trained weights of the models. |
auto_run1 (paper) | WHU-NERCMS | automatic | BLIP, CLIP, SLIP, LaCLIP, BLIP2, stable-diffusion | We used the pre-trained weights of the models. |
rf_run2 (paper) | WHU-NERCMS | relevance feedback | BLIP, CLIP, SLIP, LaCLIP, BLIP2, stable-diffusion | We used the pre-trained weights of the models. |
RF_run3 (paper) | WHU-NERCMS | relevance feedback | BLIP, CLIP, SLIP, BLIP2, LaCLIP, stable-diffusion | We used the pre-trained weights of the models. |
rucmm_avs_M_run1 | RUCMM | automatic | An average ensemble of 7 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. Models are selected based on infAP and Spearman's coefficient. | None. |
rucmm_avs_M_run2 | RUCMM | automatic | An ensemble of 6 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. The ensemble weights are learned with gradient descent and greedy search to maximize infAP on a mixed set of TV22 and TV23 queries. | None. |
rucmm_avs_M_run3 | RUCMM | automatic | A LAFF model tuned to maximize performance on TV22-23, using CLIP-ViT-L-14/336px + blip-base + CLIP-ViT-B-32 as text features and CLIP-ViT-L-14/336px + blip-base + CLIP-ViT-B-32 + irCSN + beit + wsl + video-LLaMA + DINOv2 as video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2, OpenCLIP, and the open-vocabulary detection model YOLO8x-worldv2. | None. |
rucmm_avs_M_run4 | RUCMM | automatic | A LAFF model tuned to maximize performance on TV22-23, using CLIP-ViT-L-14/336px + blip-base + CLIP-ViT-B-32 as text features and CLIP-ViT-L-14/336px + blip-base + CLIP-ViT-B-32 + irCSN + beit + wsl + video-LLaMA + DINOv2 as video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2 and OpenCLIP. | None. |
Fusion_Query_No_Reranking | NII_UIT | automatic | InternVL-G, BEiT-3, OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP | We used pre-trained versions of all models. |
PolySmartAndVIREO_run1 | PolySmart | automatic | Ensemble of four models | n/a |
PolySmartAndVIREO_run2 | PolySmart | automatic | Rewritten query | GPT4o |
PolySmartAndVIREO_run3 | PolySmart | automatic | Verified captioning query | BLIP2 |
Expansion_Fusion_Rerank_Auto_Decompose_Pij | NII_UIT | automatic | InternVL-G, BEiT-3, OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5 | We used pre-trained versions of all models. |
PolySmartAndVIREO_manual_run2 | PolySmart | manual | Manual run 2 for the main queries | GPT4o |
PolySmartAndVIREO_manual_run3 | PolySmart | manual | Manually rewritten queries | BLIP2, manual selection |
PolySmartAndVIREO_manual_run1 | PolySmart | manual | Manual ensemble run | ensemble |
PolySmartAndVIREO_manual_run4 | PolySmart | manual | Manual run 4 | manual run 4 |
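
Many of the runs above (e.g., the CERTH-ITI, NII_UIT, ruc_aim3, and RUCMM entries) describe the same late-fusion pattern: each cross-modal model scores every query against every shot, the per-model score matrices are normalized, and a weighted combination of the normalized scores produces the final ranking. The sketch below illustrates only that general pattern; it assumes per-query min-max normalization and either equal or externally supplied fusion weights, and all function and variable names are illustrative rather than taken from any team's code.

```python
import numpy as np

def minmax_normalize(scores: np.ndarray) -> np.ndarray:
    """Rescale each query's row of similarity scores to [0, 1]."""
    lo = scores.min(axis=1, keepdims=True)
    hi = scores.max(axis=1, keepdims=True)
    return (scores - lo) / np.maximum(hi - lo, 1e-8)

def fuse_scores(per_model_scores, weights=None):
    """Weighted late fusion of (num_queries x num_shots) similarity matrices."""
    if weights is None:
        # Equal weights give a plain average ensemble.
        weights = [1.0 / len(per_model_scores)] * len(per_model_scores)
    normalized = [minmax_normalize(s) for s in per_model_scores]
    return sum(w * s for w, s in zip(weights, normalized))

# Toy example: three cross-modal models scoring 4 queries against 1000 shots.
rng = np.random.default_rng(0)
model_scores = [rng.random((4, 1000)) for _ in range(3)]
fused = fuse_scores(model_scores)            # equal-weight ensemble
top10 = np.argsort(-fused, axis=1)[:, :10]   # top-10 shot indices per query
```

In the learned variants (e.g., rucmm_avs_M_run2 or the CERTH-ITI trainable fusion network), the weights would instead be fitted on held-out queries such as TV22/TV23 to maximize infAP rather than fixed in advance.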