TREC 2025 Proceedings

tv25_Meisei_A4

Submission Details

Organization
meisei
Track
Adhoc Video Search
Task
Video Search Task
Date
2025-07-27

Run Description

Is this run manual or automatic?
automatic
Describe the retrieval model used.
We used a two-stage retrieval pipeline. In the first stage, we employed pretrained embedding models such as CLIP to compute text–image similarity and retrieve relevant candidates. In the second stage, for tasks requiring fine-grained understanding (e.g., VQA), we applied a vision-language model (VLM) to perform detailed re-ranking or YES/NO verification of the candidates.
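The first stage described above can be sketched as a cosine-similarity top-k search over precomputed embeddings. This is an illustrative sketch, not the submitted system: the function name `retrieve_candidates` and the plain numpy vectors are assumptions; in the actual run the embeddings would come from a pretrained model such as CLIP.

```python
import numpy as np

def retrieve_candidates(text_emb, shot_embs, k=2):
    """Stage 1: rank video shots by cosine similarity to the text query.

    text_emb  : (d,) query embedding (e.g., from a CLIP text encoder)
    shot_embs : (n, d) per-shot embeddings (e.g., from a CLIP image encoder)
    Returns the indices and scores of the top-k candidate shots.
    """
    t = text_emb / np.linalg.norm(text_emb)
    v = shot_embs / np.linalg.norm(shot_embs, axis=1, keepdims=True)
    scores = v @ t                       # cosine similarity per shot
    order = np.argsort(-scores)[:k]      # top-k candidate indices
    return order, scores[order]

# Stage 2 (not shown): a vision-language model re-ranks the top-k
# candidates, or answers a YES/NO verification prompt per candidate.
```

The top-k candidates returned here would then be passed to the VLM for the second, fine-grained stage.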
Describe any external resources used.
Apart from publicly available pretrained models, no additional external resources were used.
Training type:
D

Evaluation Files

Paper