The Thirty-Third Text REtrieval Conference
(TREC 2024)

Ad-hoc Video Search Main Task Appendix

Each entry below gives the runtag, the submitting organization, whether the run is manual or automatic, a description of the retrieval model used, and a description of any external resources used.

Runtag: certh.iti.avs.24.main.run.1 (paper)
Organization: CERTH-ITI
Run type: automatic
Retrieval model: A trainable network combines text and video similarities from several cross-modal networks (a fusion sketch follows this entry). The similarities were normalized over the queries from 2022, 2023, and 2024.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

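The CERTH-ITI runs score each shot with several cross-modal networks and fuse the per-model similarities with a trainable network. Below is a minimal PyTorch sketch of that idea; the number of input models, the layer sizes, and the fake input data are illustrative assumptions, not CERTH-ITI's actual configuration.

```python
# Hypothetical sketch of a trainable similarity-fusion network.
import torch
import torch.nn as nn

NUM_MODELS = 4  # assumed number of cross-modal networks being fused

class SimilarityFusion(nn.Module):
    def __init__(self, num_models: int):
        super().__init__()
        # A tiny MLP mapping the vector of per-model similarities
        # to a single fused retrieval score.
        self.mlp = nn.Sequential(
            nn.Linear(num_models, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, sims: torch.Tensor) -> torch.Tensor:
        # sims: (batch, num_models), one similarity per network,
        # assumed already normalized (e.g. z-scored over a query pool).
        return self.mlp(sims).squeeze(-1)

fusion = SimilarityFusion(NUM_MODELS)
sims = torch.randn(8, NUM_MODELS)  # fake normalized similarities
print(fusion(sims).shape)          # torch.Size([8]): one score per pair
```

Such a network would be trained with a ranking or caption-matching loss on the video-caption datasets listed above; normalizing each model's similarities over a pool of queries, as the run descriptions state, keeps differently scaled score distributions comparable before fusion.
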
Runtag: SoftbankMeisei - Main Run 1 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Runtag: SoftbankMeisei - Main Run 2 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Runtag: SoftbankMeisei - Main Run 3 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Runtag: SoftbankMeisei - Main Run 4 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP (a basic retrieval sketch follows this entry).
External resources: LLM (GPT); image generation (Stable Diffusion).

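All four softbank-meisei runs build on pre-trained embedding models, mostly from OpenCLIP. The sketch below shows only the basic text-to-shot retrieval step with one OpenCLIP model; the model name, pretrained tag, and the stand-in precomputed shot features are assumptions for illustration, and the GPT and Stable Diffusion components are omitted.

```python
# A minimal sketch of text-to-video retrieval with OpenCLIP embeddings.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")  # assumed checkpoint
tokenizer = open_clip.get_tokenizer("ViT-H-14")

# Stand-in for precomputed keyframe embeddings, one per video shot.
shot_feats = torch.nn.functional.normalize(torch.randn(1000, 1024), dim=-1)

with torch.no_grad():
    q = model.encode_text(tokenizer(["a person riding a horse on a beach"]))
    q = torch.nn.functional.normalize(q, dim=-1)

ranking = (shot_feats @ q.T).squeeze(-1).argsort(descending=True)
print(ranking[:10])  # top-10 shot indices for the query
```
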
Runtag: Expansion_Fusion_Rerank_Auto_Decompose
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL, BEiT-3, CLIP-L/14 DataComp, CLIP-H/14 Laion2B, CLIP-H/14 DFN5b, OpenAI RN101, BLIP-2, XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: Expansion_Fusion_Rerank_Manual_Decompose
Organization: NII_UIT
Run type: manual
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: Expansion_Fusion_Reranking
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5 (a rank-fusion sketch follows this entry).
External resources: We used the pre-trained weights of all models.

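The NII_UIT runtags indicate query expansion/decomposition, fusion across the listed models, and reranking. The appendix does not specify the fusion rule, so the sketch below uses reciprocal rank fusion (RRF), one standard way to combine ranked lists from heterogeneous models, purely as an illustration.

```python
# Hedged sketch: reciprocal rank fusion over per-model ranked lists.
from collections import defaultdict

def rrf(ranked_lists, k: int = 60):
    """Fuse ranked lists of shot ids; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, shot_id in enumerate(ranking):
            scores[shot_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists, as if from three of the models listed above.
lists = [["s3", "s1", "s2"], ["s1", "s3", "s4"], ["s2", "s1", "s3"]]
print(rrf(lists))  # ['s1', 's3', 's2', 's4']
```
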
Runtag: certh.iti.avs.24.main.run.2 (paper)
Organization: CERTH-ITI
Run type: automatic
Retrieval model: A trainable network combines text and video similarities from several cross-modal networks. The similarities were normalized over only this year's queries.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

Runtag: certh.iti.avs.24.main.run.3 (paper)
Organization: CERTH-ITI
Run type: automatic
Retrieval model: A trainable network combines text and video similarities from several cross-modal networks. No normalization of the similarities was performed.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

Runtag: run4 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, SLIP, BLIP2, LaCLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: run3 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, BLIP2, LaCLIP, SLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: run2 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, BLIP2, LaCLIP, SLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: add_captioning (paper)
Organization: ruc_aim3
Run type: manual
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large.
External resources: None.

Runtag: baseline (paper)
Organization: ruc_aim3
Run type: automatic
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2.
External resources: None.

Runtag: add_QArerank (paper)
Organization: ruc_aim3
Run type: automatic
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large; QA reranking: GPT4-turbo, InternVL2-26B.
External resources: None.

Runtag: add_captioning_QArerank (paper)
Organization: ruc_aim3
Run type: automatic
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large; QA reranking: GPT4-turbo, InternVL2-26B (a QA-reranking sketch follows this entry).
External resources: None.

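The ruc_aim3 QA-rerank runs add GPT4-turbo and InternVL2-26B on top of the embedding ensemble. One plausible reading, sketched below, is to ask a large model whether each top-ranked shot actually matches the topic and reorder by its answer; `ask_vlm` is a hypothetical stand-in, and ruc_aim3's actual prompting and scoring are not described in this appendix.

```python
# Illustrative QA-style reranking; ask_vlm is a hypothetical stand-in
# for a GPT4-turbo / InternVL2-26B call.
def ask_vlm(query: str, caption: str) -> float:
    """Hypothetical: return a 0-1 'does this shot match?' confidence."""
    return float(any(w in caption for w in query.lower().split()))

def qa_rerank(query, shots, top_k=100):
    # Rerank only the head of the list; leave the tail untouched.
    head = shots[:top_k]
    head.sort(key=lambda s: ask_vlm(query, s["caption"]), reverse=True)
    return head + shots[top_k:]

shots = [{"id": i, "caption": c} for i, c in
         enumerate(["a dog runs", "city at night", "a dog on a beach"])]
print([s["id"] for s in qa_rerank("dog on beach", shots)])  # [0, 2, 1]
```
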
Runtag: Expan_Fu_Rerank_M_Decompose_CRerank
Organization: NII_UIT
Run type: manual
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: Manual_run1 (paper)
Organization: WHU-NERCMS
Run type: manual
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: novelty search
Organization: PolySmart
Run type: automatic
Retrieval model: GenImg search (Improved-ITV visual feature); a generated-image search sketch follows this entry.
External resources: Stable Diffusion 2.1 (SD2.1) for generating images.

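GenImg search turns the textual topic into images and retrieves shots by visual similarity. A hedged sketch follows: Stable Diffusion 2.1 generates an image for the query, and an off-the-shelf OpenCLIP image encoder stands in for PolySmart's Improved-ITV visual feature; the shot features are fake placeholders.

```python
# Sketch of generated-image search (query -> image -> visual retrieval).
import torch
import open_clip
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1")
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # stand-in encoder

query = "a person welding metal in a workshop"
image = pipe(query).images[0]  # generate one image for the query

# Stand-in for precomputed shot features from the same visual encoder.
shot_feats = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)
with torch.no_grad():
    v = model.encode_image(preprocess(image).unsqueeze(0))
    v = torch.nn.functional.normalize(v, dim=-1)
print((shot_feats @ v.T).squeeze(-1).topk(10).indices)  # top-10 shots
```
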
Runtag: run4_polySmart
Organization: PolySmart
Run type: automatic
Retrieval model: Original query.
External resources: n/a

Runtag: relevance_feedback_run4 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion (a relevance-feedback sketch follows this entry).
External resources: We used the pre-trained weights of the models.

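The WHU-NERCMS relevance-feedback runs refine the search with user judgments. Their exact update is not given in this appendix; a classic option, sketched below, is a Rocchio-style move of the query embedding toward judged-relevant shots and away from non-relevant ones, with illustrative alpha/beta/gamma weights.

```python
# Rocchio-style relevance feedback in a joint embedding space (sketch).
import numpy as np

def rocchio(query_vec, pos, neg, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant shots, away from non-relevant."""
    q = alpha * query_vec
    if len(pos):
        q += beta * np.mean(pos, axis=0)
    if len(neg):
        q -= gamma * np.mean(neg, axis=0)
    return q / np.linalg.norm(q)

q = np.random.randn(512); q /= np.linalg.norm(q)
pos = np.random.randn(3, 512)  # embeddings of shots judged relevant
neg = np.random.randn(5, 512)  # embeddings of shots judged non-relevant
q_new = rocchio(q, pos, neg)   # re-issue the search with q_new
print(q_new.shape)
```
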
Runtag: relevance_feedback_run1 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: auto_run1 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: rf_run2 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: RF_run3 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, BLIP2, LaCLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: rucmm_avs_M_run1
Organization: RUCMM
Run type: automatic
Retrieval model: An average ensemble of 7 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. Models were selected based on infAP and Spearman's coefficient.
External resources: None.

Runtag: rucmm_avs_M_run2
Organization: RUCMM
Run type: automatic
Retrieval model: An ensemble of 6 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. The model weights are learned with gradient descent and greedy search to maximize infAP on a mixed set of TV22 and TV23 queries (a weight-learning sketch follows this entry).
External resources: None.

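Run 2 learns the ensemble weights rather than averaging. infAP itself is not differentiable, so the sketch below optimizes softmax-parameterized weights with a pairwise ranking loss as a stand-in objective; RUCMM's actual loss, training data, and greedy-search step are not reproduced here.

```python
# Hedged sketch: learning fusion weights over 6 LAFF model scores.
import torch

num_models, n_pairs = 6, 256
w = torch.zeros(num_models, requires_grad=True)  # raw fusion weights
opt = torch.optim.Adam([w], lr=0.05)

# Fake per-model scores for relevant vs. non-relevant (query, shot) pairs.
pos = torch.randn(n_pairs, num_models) + 0.5
neg = torch.randn(n_pairs, num_models)

for _ in range(200):
    weights = torch.softmax(w, dim=0)        # keep weights positive
    margin = pos @ weights - neg @ weights   # fused score differences
    loss = torch.nn.functional.softplus(-margin).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(w, dim=0))               # learned model weights
```
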
Runtag: rucmm_avs_M_run3
Organization: RUCMM
Run type: automatic
Retrieval model: A LAFF model tuned to maximize performance on TV22-23, with CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 as text features and CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 + irCSN + BEiT + WSL + Video-LLaMA + DINOv2 as video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2, OpenCLIP, and the open-vocabulary detection model YOLOv8x-worldv2.
External resources: None.

Runtag: rucmm_avs_M_run4
Organization: RUCMM
Run type: automatic
Retrieval model: A LAFF model tuned to maximize performance on TV22-23, with CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 as text features and CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 + irCSN + BEiT + WSL + Video-LLaMA + DINOv2 as video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2 and OpenCLIP.
External resources: None.

Runtag: Fusion_Query_No_Reranking
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL-G, BEiT-3, OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP.
External resources: We used the pre-trained weights of all models.

Runtag: PolySmartAndVIREO_run1
Organization: PolySmart
Run type: automatic
Retrieval model: An ensemble of four models.
External resources: n/a

Runtag: PolySmartAndVIREO_run2
Organization: PolySmart
Run type: automatic
Retrieval model: Rewritten query (a query-rewriting sketch follows this entry).
External resources: GPT-4o

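Run 2 retrieves with queries rewritten by GPT-4o before embedding. A minimal sketch with the OpenAI Python client follows; the prompt wording is an assumption, as the appendix does not give PolySmart's actual prompt.

```python
# Illustrative GPT-4o query rewriting before retrieval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(topic: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Rewrite this video-search topic as one "
                              "concrete, visually descriptive sentence: "
                              + topic}])
    return resp.choices[0].message.content.strip()

print(rewrite_query("a person opens a door and enters a room"))
```
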
Runtag: PolySmartAndVIREO_run3
Organization: PolySmart
Run type: automatic
Retrieval model: Verified captioning query.
External resources: BLIP2

Runtag: Expansion_Fusion_Rerank_Auto_Decompose_Pij
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL-G, BEiT-3, OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: PolySmartAndVIREO_manual_run2
Organization: PolySmart
Run type: manual
Retrieval model: Manual run 2 for the main queries.
External resources: GPT-4o

Runtag: PolySmartAndVIREO_manual_run3
Organization: PolySmart
Run type: manual
Retrieval model: Manually rewritten queries.
External resources: BLIP2, manual pick

Runtag: PolySmartAndVIREO_manual_run1
Organization: PolySmart
Run type: manual
Retrieval model: Manual ensemble run.
External resources: Ensemble

Runtag: PolySmartAndVIREO_manual_run4
Organization: PolySmart
Run type: manual
Retrieval model: Manual run 4.
External resources: Manual run 4.