The Thirty-Third Text REtrieval Conference
(TREC 2024)

Ad-hoc Video Search Progress Task Appendix

Each entry below lists the run tag (Run), the submitting organization (Org), whether the run is manual or automatic (Type), a description of the retrieval model used, and a description of any external resources used.
Run: SoftbankMeisei - Progress Run 1
Org: softbank-meisei
Type: automatic
Retrieval model: Pretrained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Run: SoftbankMeisei - Progress Run 2
Org: softbank-meisei
Type: automatic
Retrieval model: Pretrained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Run: SoftbankMeisei - Progress Run 3
Org: softbank-meisei
Type: automatic
Retrieval model: Pretrained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

Run: SoftbankMeisei - Progress Run 4
Org: softbank-meisei
Type: automatic
Retrieval model: Pretrained embedding models, mostly from OpenCLIP.
External resources: LLM (GPT); image generation (Stable Diffusion).

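The four SoftbankMeisei runs above pair OpenCLIP-style embedding retrieval with an LLM and Stable Diffusion. A minimal sketch of one way a generated image can be folded into text-to-video search follows; the function names (embed_text, embed_image, generate_image) and the fusion weight alpha are placeholders of mine, not details from the SoftbankMeisei paper.

    import numpy as np

    def cosine_sim(a: np.ndarray, b: np.ndarray) -> np.ndarray:
        # Row-normalize, then take dot products.
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    def fused_scores(query: str, shot_embs: np.ndarray,
                     embed_text, embed_image, generate_image,
                     alpha: float = 0.7) -> np.ndarray:
        """Blend text->shot and generated-image->shot similarities.

        embed_text / embed_image: hypothetical OpenCLIP-style encoders.
        generate_image: hypothetical Stable Diffusion wrapper.
        """
        text_emb = embed_text(query)                   # shape (d,)
        img_emb = embed_image(generate_image(query))   # shape (d,)
        s_text = cosine_sim(text_emb[None, :], shot_embs)[0]
        s_img = cosine_sim(img_emb[None, :], shot_embs)[0]
        return alpha * s_text + (1.0 - alpha) * s_img  # higher = better
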
Run: Expansion_Fusion_Reranking_Progress
Org: NII_UIT
Type: automatic
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 (DataComp), OpenCLIP-H/14 (LAION-2B), OpenCLIP-H/14 (DFN-5B), OpenAI RN101, BLIP-2 (COCO), X-CLIP, VILA-1.5.
External resources: We used pretrained versions of all models.

Run: Expansion_Fusion_Rerank_Auto_Decompose_P
Org: NII_UIT
Type: automatic
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 (DataComp), OpenCLIP-H/14 (LAION-2B), OpenCLIP-H/14 (DFN-5B), OpenAI RN101, BLIP-2 (COCO), X-CLIP, VILA-1.5.
External resources: We used pretrained versions of all models.

Run: Expansion_Fusion_Rerank_Manual_Decompose_P
Org: NII_UIT
Type: manual
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 (DataComp), OpenCLIP-H/14 (LAION-2B), OpenCLIP-H/14 (DFN-5B), OpenAI RN101, BLIP-2 (COCO), X-CLIP, VILA-1.5.
External resources: We used pretrained versions of all models.

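The NII_UIT runs above fuse similarity scores from many heterogeneous pretrained models. Below is a minimal late-fusion sketch under my own assumptions (per-model z-normalization followed by a weighted average); the team's actual fusion and reranking pipeline is described in their notebook paper.

    import numpy as np

    def zscore(s: np.ndarray) -> np.ndarray:
        # Put each model's scores on a comparable scale.
        return (s - s.mean()) / (s.std() + 1e-8)

    def fuse(model_scores: list, weights: list = None) -> np.ndarray:
        """model_scores: one (num_shots,) similarity vector per model."""
        if weights is None:
            weights = [1.0] * len(model_scores)
        stacked = np.stack([w * zscore(s)
                            for w, s in zip(weights, model_scores)])
        return stacked.mean(axis=0)

    # Usage: ranked = np.argsort(-fuse([s_internvl, s_beit3, s_openclip]))
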
Run: certh.iti.avs.24.progress.run.1
Org: CERTH-ITI
Type: automatic
Retrieval model: A trainable network learns to combine the text-video similarities produced by several cross-modal networks. The similarities were normalized using the queries from 2022, 2023, and 2024.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

Run: certh.iti.avs.24.progress.run.2
Org: CERTH-ITI
Type: automatic
Retrieval model: A trainable network learns to combine the text-video similarities produced by several cross-modal networks. The similarities were normalized using only this year's queries.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

Run: certh.iti.avs.24.progress.run.3
Org: CERTH-ITI
Type: automatic
Retrieval model: A trainable network learns to combine the text-video similarities produced by several cross-modal networks. No normalization of the similarities was performed.
External resources: The model was trained on MSR-VTT, TGIF, VATEX, and ActivityNet Captions.

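As a rough illustration of the CERTH-ITI setup, the sketch below combines per-model similarity scores with a small trainable network and rescales each model's scores against a pool of queries (runs 1 and 2 differ only in which query pool is used). The layer sizes, framework, and normalization details are my assumptions, not the team's published architecture.

    import torch
    import torch.nn as nn

    class FusionNet(nn.Module):
        """Map one similarity score per cross-modal model to a fused score."""
        def __init__(self, num_models: int):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(num_models, 32), nn.ReLU(), nn.Linear(32, 1))

        def forward(self, sims: torch.Tensor) -> torch.Tensor:
            # sims: (batch, num_models) similarities for (query, shot) pairs.
            return self.mlp(sims).squeeze(-1)

    def normalize_against_pool(sims: torch.Tensor,
                               pool: torch.Tensor) -> torch.Tensor:
        # Rescale each model's scores with statistics gathered over a pool
        # of queries (e.g., the 2022-2024 AVS queries).
        return (sims - pool.mean(dim=0)) / (pool.std(dim=0) + 1e-8)
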
Run: Expan_Fu_Rerank_M_Decompose_P_CRerank
Org: NII_UIT
Type: manual
Retrieval model: InternVL-G, BEiT-3 (COCO), OpenCLIP-L/14 (DataComp), OpenCLIP-H/14 (LAION-2B), OpenCLIP-H/14 (DFN-5B), OpenAI RN101, BLIP-2 (COCO), X-CLIP, VILA-1.5.
External resources: We used pretrained versions of all models.

Run: rucmm_avs_P_run1
Org: RUCMM
Type: automatic
Retrieval model: An average ensemble of 7 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. Models are selected based on infAP and Spearman's rank correlation coefficient.
External resources: None.

Run: rucmm_avs_P_run3
Org: RUCMM
Type: automatic
Retrieval model: An ensemble of 6 LAFF models trained on ChinaOpen-100k, V3C1-PC, and TGIF-MSRVTT10K-VATEX. The ensemble weights are learned with gradient descent and greedy search to maximize infAP on a mixed set of TV22 and TV23 queries.
External resources: None.

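Run 3 above learns its ensemble weights with gradient descent plus greedy search against infAP. A bare-bones greedy forward-selection loop of the kind that could drive such a search is sketched below; metric stands in for infAP computed against TV22/TV23 judgments, and everything here is an assumption rather than RUCMM's actual code.

    import numpy as np

    def greedy_select(candidates: dict, metric, max_size: int = 6):
        """Greedily add the model whose inclusion most improves `metric`
        on the averaged score matrix of the selected subset."""
        chosen, best = [], -np.inf
        while len(chosen) < max_size:
            gains = {
                name: metric(np.mean([candidates[n]
                                      for n in chosen + [name]], axis=0))
                for name in candidates if name not in chosen
            }
            if not gains:
                break
            name, score = max(gains.items(), key=lambda kv: kv[1])
            if score <= best:
                break  # no remaining model improves the ensemble
            chosen.append(name)
            best = score
        return chosen, best
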
Run: rucmm_avs_P_run2
Org: RUCMM
Type: automatic
Retrieval model: A LAFF model tuned to maximize performance on TV22-23, with CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 as its text features, and CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 + irCSN + BEiT + WSL + Video-LLaMA + DINOv2 as its video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2, OpenCLIP, and the open-vocabulary detection model YOLOv8x-worldv2.
External resources: None.

Run: rucmm_avs_P_run4
Org: RUCMM
Type: automatic
Retrieval model: A LAFF model tuned to maximize performance on TV22-23, with CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 as its text features, and CLIP-ViT-L-14/336px + BLIP-base + CLIP-ViT-B-32 + irCSN + BEiT + WSL + Video-LLaMA + DINOv2 as its video features. It is pre-trained on V3C1-PC and fine-tuned on TGIF-MSRVTT10K-VATEX, with reranking by BLIP-2 and OpenCLIP.
External resources: None.

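Runs 2 and 4 rerank an initial LAFF ranking with heavier models (BLIP-2 and OpenCLIP, plus an open-vocabulary detector in run 2). A generic top-K reranking sketch follows; the blending scheme, the cutoff k, and the rerank_score callable are placeholders of mine, not RUCMM's published procedure.

    import numpy as np

    def zscore(s: np.ndarray) -> np.ndarray:
        return (s - s.mean()) / (s.std() + 1e-8)

    def rerank_top_k(base_scores: np.ndarray, query: str, shots: list,
                     rerank_score, k: int = 1000,
                     beta: float = 0.5) -> np.ndarray:
        """Blend base and reranker scores on the top-k shots;
        keep the tail of the base ranking unchanged."""
        order = np.argsort(-base_scores)
        top = order[:k]
        second = np.array([rerank_score(query, shots[i]) for i in top])
        blend = (beta * zscore(base_scores[top])
                 + (1 - beta) * zscore(second))
        return np.concatenate([top[np.argsort(-blend)], order[k:]])
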
Run: Fusion_Query_No_Reranking_P
Org: NII_UIT
Type: automatic
Retrieval model: InternVL-G, BEiT-3, OpenCLIP-L/14 (DataComp), OpenCLIP-H/14 (LAION-2B), OpenCLIP-H/14 (DFN-5B), OpenAI RN101, BLIP-2 (COCO), X-CLIP.
External resources: We used pretrained versions of all models.

Run: Expansion_Fusion_Rerank_Auto_Decompose_P_Pij
Org: NII_UIT
Type: automatic
Retrieval model: InternVL-G, BEiT-3, OpenCLIP-L/14 (DataComp), OpenCLIP-H/14 (LAION-2B), OpenCLIP-H/14 (DFN-5B), OpenAI RN101, BLIP-2 (COCO), X-CLIP, VILA-1.5.
External resources: We used pretrained versions of all models.

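Several NII_UIT run tags mention query decomposition. One plausible reading (my assumption; the run descriptions do not spell it out) is to split a complex query into sub-queries, score each independently, and aggregate per shot:

    import numpy as np

    def decomposed_scores(sub_queries: list, score_fn,
                          how: str = "mean") -> np.ndarray:
        """score_fn(q) -> (num_shots,) similarity vector per sub-query."""
        per_part = np.stack([score_fn(q) for q in sub_queries])
        # Mean rewards shots matching most parts; min requires all parts.
        return (per_part.mean(axis=0) if how == "mean"
                else per_part.min(axis=0))

    # e.g. "a man in a red hat riding a bicycle" might decompose into
    # ["a man wearing a red hat", "a man riding a bicycle"].
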
Run: PolySmartAndVIREO_progress_run4
Org: PolySmart
Type: automatic
Retrieval model: Original query; progress run.
External resources: Improved-ITV model.

Run: progress_manual_run4
Org: PolySmart
Type: manual
Retrieval model: Progress manual run 4.
External resources: Improved-ITV model.

Run: PolySmartAndVIREO_progressrun_manual_run3
Org: PolySmart
Type: manual
Retrieval model: Manually selected generated images.
External resources: Improved-ITV features.

Run: PolySmartAndVIREO_progressrun_manual_run2
Org: PolySmart
Type: manual
Retrieval model: run2
External resources: run2