The Thirty-Third Text REtrieval Conference
(TREC 2024)

Ad-hoc Video Search Main Task Appendix

Each entry below gives the runtag, the submitting organization, whether the run is manual or automatic, a description of the retrieval model used, and a description of any external resources used.

Runtag: certh.iti.avs.24.main.run.1 (paper)
Organization: CERTH-ITI
Run type: automatic
Retrieval model: A trainable network combines text and video similarities from several cross-modal networks (a fusion sketch follows this entry). The similarities were normalized over the queries from 2022, 2023, and 2024.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

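The CERTH-ITI runs score each shot with several cross-modal networks and fuse the per-model similarities with a trainable network. Below is a minimal PyTorch sketch of that idea; the number of input models, the layer sizes, and the fake input data are illustrative assumptions, not CERTH-ITI's actual configuration.

```python
# Hypothetical sketch of a trainable similarity-fusion network.
import torch
import torch.nn as nn

NUM_MODELS = 4  # assumed number of cross-modal networks being fused

class SimilarityFusion(nn.Module):
    def __init__(self, num_models: int):
        super().__init__()
        # A tiny MLP mapping the vector of per-model similarities
        # to a single fused retrieval score.
        self.mlp = nn.Sequential(
            nn.Linear(num_models, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, sims: torch.Tensor) -> torch.Tensor:
        # sims: (batch, num_models), one similarity per network,
        # assumed already normalized (e.g. z-scored over a query pool).
        return self.mlp(sims).squeeze(-1)

fusion = SimilarityFusion(NUM_MODELS)
sims = torch.randn(8, NUM_MODELS)  # fake normalized similarities
print(fusion(sims).shape)          # torch.Size([8]): one score per pair
```

Such a network would be trained with a ranking or caption-matching loss on the video-caption datasets listed above; normalizing each model's similarities over a pool of queries, as the run descriptions state, keeps differently scaled score distributions comparable before fusion.
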
Runtag: SoftbankMeisei - Main Run 1 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Runtag: SoftbankMeisei - Main Run 2 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Runtag: SoftbankMeisei - Main Run 3 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Runtag: SoftbankMeisei - Main Run 4 (paper)
Organization: softbank-meisei
Run type: automatic
Retrieval model: Pre-trained embedding models, mostly from OpenCLIP (a basic retrieval sketch follows this entry).
External resources: LLM (GPT); image generation (Stable Diffusion).

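All four softbank-meisei runs build on pre-trained embedding models, mostly from OpenCLIP. The sketch below shows only the basic text-to-shot retrieval step with one OpenCLIP model; the model name, pretrained tag, and the stand-in precomputed shot features are assumptions for illustration, and the GPT and Stable Diffusion components are omitted.

```python
# A minimal sketch of text-to-video retrieval with OpenCLIP embeddings.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-H-14", pretrained="laion2b_s32b_b79k")  # assumed checkpoint
tokenizer = open_clip.get_tokenizer("ViT-H-14")

# Stand-in for precomputed keyframe embeddings, one per video shot.
shot_feats = torch.nn.functional.normalize(torch.randn(1000, 1024), dim=-1)

with torch.no_grad():
    q = model.encode_text(tokenizer(["a person riding a horse on a beach"]))
    q = torch.nn.functional.normalize(q, dim=-1)

ranking = (shot_feats @ q.T).squeeze(-1).argsort(descending=True)
print(ranking[:10])  # top-10 shot indices for the query
```
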
Runtag: Expansion_Fusion_Rerank_Auto_Decompose
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL, BEiT-3, CLIP-L/14 DataComp, CLIP-H/14 Laion2B, CLIP-H/14 DFN5b, OpenAI RN101, BLIP-2, XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: Expansion_Fusion_Rerank_Manual_Decompose
Organization: NII_UIT
Run type: manual
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: Expansion_Fusion_Reranking
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5 (a rank-fusion sketch follows this entry).
External resources: We used the pre-trained weights of all models.

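The NII_UIT runtags indicate query expansion/decomposition, fusion across the listed models, and reranking. The appendix does not specify the fusion rule, so the sketch below uses reciprocal rank fusion (RRF), one standard way to combine ranked lists from heterogeneous models, purely as an illustration.

```python
# Hedged sketch: reciprocal rank fusion over per-model ranked lists.
from collections import defaultdict

def rrf(ranked_lists, k: int = 60):
    """Fuse ranked lists of shot ids; higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, shot_id in enumerate(ranking):
            scores[shot_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Toy ranked lists, as if from three of the models listed above.
lists = [["s3", "s1", "s2"], ["s1", "s3", "s4"], ["s2", "s1", "s3"]]
print(rrf(lists))  # ['s1', 's3', 's2', 's4']
```
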
Runtag: certh.iti.avs.24.main.run.2 (paper)
Organization: CERTH-ITI
Run type: automatic
Retrieval model: A trainable network combines text and video similarities from several cross-modal networks. The similarities were normalized over only this year's queries.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

Runtag: certh.iti.avs.24.main.run.3 (paper)
Organization: CERTH-ITI
Run type: automatic
Retrieval model: A trainable network combines text and video similarities from several cross-modal networks. No normalization of the similarities was performed.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

Runtag: run4 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, SLIP, BLIP2, LaCLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: run3 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, BLIP2, LaCLIP, SLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: run2 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, BLIP2, LaCLIP, SLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: add_captioning (paper)
Organization: ruc_aim3
Run type: manual
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large.
External resources: None.

Runtag: baseline (paper)
Organization: ruc_aim3
Run type: automatic
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2.
External resources: None.

Runtag: add_QArerank (paper)
Organization: ruc_aim3
Run type: automatic
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large; QA reranking: GPT4-turbo, InternVL2-26B.
External resources: None.

Runtag: add_captioning_QArerank (paper)
Organization: ruc_aim3
Run type: automatic
Retrieval model: CLIP: CLIP_ViT_L@336px, CLIP_ViT_B16, CLIP_VIT_L14, CLIP_VIT_B32; BLIP: BLIP_VIT_L_COCO, BLIP_VIT_L_Flickr, BLIP_VIT_B_coco, BLIP_VIT_B_flickr, BLIP2_coco; OpenCLIP: CLIP_l14-datacomp, EVACLIP, ViCLIP, CLIP_h14-laion; Align, Flava, ImageBind, InternVideo2; captioning: sentence transformers, OpenAI_ada_v2, OpenAI_large; QA reranking: GPT4-turbo, InternVL2-26B (a QA-reranking sketch follows this entry).
External resources: None.

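The ruc_aim3 QA-rerank runs add GPT4-turbo and InternVL2-26B on top of the embedding ensemble. One plausible reading, sketched below, is to ask a large model whether each top-ranked shot actually matches the topic and reorder by its answer; `ask_vlm` is a hypothetical stand-in, and ruc_aim3's actual prompting and scoring are not described in this appendix.

```python
# Illustrative QA-style reranking; ask_vlm is a hypothetical stand-in
# for a GPT4-turbo / InternVL2-26B call.
def ask_vlm(query: str, caption: str) -> float:
    """Hypothetical: return a 0-1 'does this shot match?' confidence."""
    return float(any(w in caption for w in query.lower().split()))

def qa_rerank(query, shots, top_k=100):
    # Rerank only the head of the list; leave the tail untouched.
    head = shots[:top_k]
    head.sort(key=lambda s: ask_vlm(query, s["caption"]), reverse=True)
    return head + shots[top_k:]

shots = [{"id": i, "caption": c} for i, c in
         enumerate(["a dog runs", "city at night", "a dog on a beach"])]
print([s["id"] for s in qa_rerank("dog on beach", shots)])  # [0, 2, 1]
```
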
Runtag: Expan_Fu_Rerank_M_Decompose_CRerank
Organization: NII_UIT
Run type: manual
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: Manual_run1 (paper)
Organization: WHU-NERCMS
Run type: manual
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: novelty search
Organization: PolySmart
Run type: automatic
Retrieval model: GenImg search (Improved-ITV visual feature); a generated-image search sketch follows this entry.
External resources: Stable Diffusion 2.1 (SD2.1) for generating images.

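GenImg search turns the textual topic into images and retrieves shots by visual similarity. A hedged sketch follows: Stable Diffusion 2.1 generates an image for the query, and an off-the-shelf OpenCLIP image encoder stands in for PolySmart's Improved-ITV visual feature; the shot features are fake placeholders.

```python
# Sketch of generated-image search (query -> image -> visual retrieval).
import torch
import open_clip
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1")
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")  # stand-in encoder

query = "a person welding metal in a workshop"
image = pipe(query).images[0]  # generate one image for the query

# Stand-in for precomputed shot features from the same visual encoder.
shot_feats = torch.nn.functional.normalize(torch.randn(1000, 512), dim=-1)
with torch.no_grad():
    v = model.encode_image(preprocess(image).unsqueeze(0))
    v = torch.nn.functional.normalize(v, dim=-1)
print((shot_feats @ v.T).squeeze(-1).topk(10).indices)  # top-10 shots
```
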
Runtag: run4_polySmart
Organization: PolySmart
Run type: automatic
Retrieval model: Original query.
External resources: n/a

Runtag: relevance_feedback_run4 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion (a relevance-feedback sketch follows this entry).
External resources: We used the pre-trained weights of the models.

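The WHU-NERCMS relevance-feedback runs refine the search with user judgments. Their exact update is not given in this appendix; a classic option, sketched below, is a Rocchio-style move of the query embedding toward judged-relevant shots and away from non-relevant ones, with illustrative alpha/beta/gamma weights.

```python
# Rocchio-style relevance feedback in a joint embedding space (sketch).
import numpy as np

def rocchio(query_vec, pos, neg, alpha=1.0, beta=0.75, gamma=0.15):
    """Move the query toward relevant shots, away from non-relevant."""
    q = alpha * query_vec
    if len(pos):
        q += beta * np.mean(pos, axis=0)
    if len(neg):
        q -= gamma * np.mean(neg, axis=0)
    return q / np.linalg.norm(q)

q = np.random.randn(512); q /= np.linalg.norm(q)
pos = np.random.randn(3, 512)  # embeddings of shots judged relevant
neg = np.random.randn(5, 512)  # embeddings of shots judged non-relevant
q_new = rocchio(q, pos, neg)   # re-issue the search with q_new
print(q_new.shape)
```
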
Runtag: relevance_feedback_run1 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: auto_run1 (paper)
Organization: WHU-NERCMS
Run type: automatic
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: rf_run2 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, LaCLIP, BLIP2, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: RF_run3 (paper)
Organization: WHU-NERCMS
Run type: relevance feedback
Retrieval model: BLIP, CLIP, SLIP, BLIP2, LaCLIP, Stable Diffusion.
External resources: We used the pre-trained weights of the models.

Runtag: rucmm_avs_M_run1
Organization: RUCMM
Run type: automatic
Retrieval model: An average ensemble of 7 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. Models were selected based on infAP and Spearman's coefficient.
External resources: None.

Runtag: rucmm_avs_M_run2
Organization: RUCMM
Run type: automatic
Retrieval model: An ensemble of 6 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. The model weights are learned with gradient descent and greedy search to maximize infAP on a mixed set of TV22 and TV23 queries (a weight-learning sketch follows this entry).
External resources: None.

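Run 2 learns the ensemble weights rather than averaging. infAP itself is not differentiable, so the sketch below optimizes softmax-parameterized weights with a pairwise ranking loss as a stand-in objective; RUCMM's actual loss, training data, and greedy-search step are not reproduced here.

```python
# Hedged sketch: learning fusion weights over 6 LAFF model scores.
import torch

num_models, n_pairs = 6, 256
w = torch.zeros(num_models, requires_grad=True)  # raw fusion weights
opt = torch.optim.Adam([w], lr=0.05)

# Fake per-model scores for relevant vs. non-relevant (query, shot) pairs.
pos = torch.randn(n_pairs, num_models) + 0.5
neg = torch.randn(n_pairs, num_models)

for _ in range(200):
    weights = torch.softmax(w, dim=0)        # keep weights positive
    margin = pos @ weights - neg @ weights   # fused score differences
    loss = torch.nn.functional.softplus(-margin).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print(torch.softmax(w, dim=0))               # learned model weights
```
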
Runtag: rucmm_avs_M_run3
Organization: RUCMM
Run type: automatic
Retrieval model: A LAFF model tuned to maximize performance on TV22-23, with CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 as text features and CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 + irCSN + BEiT + WSL + Video-LLaMA + DINOv2 as video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2, OpenCLIP, and the open-vocabulary detection model YOLOv8x-worldv2.
External resources: None.

Runtag: rucmm_avs_M_run4
Organization: RUCMM
Run type: automatic
Retrieval model: A LAFF model tuned to maximize performance on TV22-23, with CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 as text features and CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 + irCSN + BEiT + WSL + Video-LLaMA + DINOv2 as video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2 and OpenCLIP.
External resources: None.

Runtag: Fusion_Query_No_Reranking
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL-G, BEiT-3, OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP.
External resources: We used the pre-trained weights of all models.

Runtag: PolySmartAndVIREO_run1
Organization: PolySmart
Run type: automatic
Retrieval model: An ensemble of four models.
External resources: n/a

Runtag: PolySmartAndVIREO_run2
Organization: PolySmart
Run type: automatic
Retrieval model: Rewritten query (a query-rewriting sketch follows this entry).
External resources: GPT-4o

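Run 2 retrieves with queries rewritten by GPT-4o before embedding. A minimal sketch with the OpenAI Python client follows; the prompt wording is an assumption, as the appendix does not give PolySmart's actual prompt.

```python
# Illustrative GPT-4o query rewriting before retrieval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_query(topic: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": "Rewrite this video-search topic as one "
                              "concrete, visually descriptive sentence: "
                              + topic}])
    return resp.choices[0].message.content.strip()

print(rewrite_query("a person opens a door and enters a room"))
```
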
Runtag: PolySmartAndVIREO_run3
Organization: PolySmart
Run type: automatic
Retrieval model: Verified captioning query.
External resources: BLIP2

Runtag: Expansion_Fusion_Rerank_Auto_Decompose_Pij
Organization: NII_UIT
Run type: automatic
Retrieval model: InternVL-G, BEiT-3, OpenCLIP-L/14 DataComp, OpenCLIP-H/14 Laion2B, OpenCLIP-H/14 DFN5b, OpenAI RN101, BLIP-2 (COCO), XCLIP, VILA-1.5.
External resources: We used the pre-trained weights of all models.

Runtag: PolySmartAndVIREO_manual_run2
Organization: PolySmart
Run type: manual
Retrieval model: Manual run 2 for the main queries.
External resources: GPT-4o

Runtag: PolySmartAndVIREO_manual_run3
Organization: PolySmart
Run type: manual
Retrieval model: Manually rewritten queries.
External resources: BLIP2, manual pick

Runtag: PolySmartAndVIREO_manual_run1
Organization: PolySmart
Run type: manual
Retrieval model: Manual ensemble run.
External resources: Ensemble

Runtag: PolySmartAndVIREO_manual_run4
Organization: PolySmart
Run type: manual
Retrieval model: Manual run 4.
External resources: Manual run 4.