TREC 2025 Proceedings

decomp

Submission Details

Organization
ncsu-las
Track
Adhoc Video Search
Task
Video Search Task
Date
2025-07-29

Run Description

Is this run manual or automatic?
automatic
Describe the retrieval model used.
We extract SigLIP2-base-patch16-naflex embeddings at 1 keyframe per second. Each user query is decomposed into visual components; each component is expanded to 100 paraphrase variants using GPT-4.1-mini, and the text embeddings of all variants are averaged into a single query vector. Initial retrieval uses SigLIP similarity against the keyframe embeddings, returning the top 2,500 candidate shots. Each candidate shot is then scored 10 times with Phi-3.5-Vision and the scores are averaged. The candidates are re-ranked by these aggregated judgments, and the top 1,000 are submitted.
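The pipeline above can be sketched as follows. This is a minimal illustration, not the submitted implementation: the SigLIP2 encoder, GPT-4.1-mini expansion, and Phi-3.5-Vision judge are replaced with random stand-ins, and the candidate/submission cutoffs (2,500 and 1,000) are the ones stated in the run description.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 768  # assumed embedding width; stand-in for the SigLIP2 text/image space

def embed_text(texts):
    """Stand-in for the SigLIP2-base-patch16-naflex text encoder (random unit vectors)."""
    v = rng.normal(size=(len(texts), DIM))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def expand_component(component, n=100):
    """Stand-in for GPT-4.1-mini paraphrase expansion (100 variants per component)."""
    return [f"{component}, variant {i}" for i in range(n)]

def vlm_score(shot_id, trials=10):
    """Stand-in for Phi-3.5-Vision relevance judgments, averaged over 10 trials."""
    return float(np.mean(rng.uniform(0.0, 1.0, size=trials)))

# 1. Decompose the query into visual components (GPT-4.1-mini in the actual run).
components = ["a red car", "a rainy street at night"]

# 2. Expand each component, embed every variant, and average into one query vector.
variants = [v for c in components for v in expand_component(c)]
query_vec = embed_text(variants).mean(axis=0)
query_vec /= np.linalg.norm(query_vec)

# 3. First-stage retrieval: SigLIP similarity over 1-fps keyframe embeddings.
shot_embs = embed_text([f"shot {i}" for i in range(5000)])  # stand-in shot index
sims = shot_embs @ query_vec
candidates = np.argsort(-sims)[:2500]

# 4. Re-rank candidates by averaged VLM score; submit the top 1,000.
reranked = sorted(candidates.tolist(), key=lambda s: -vlm_score(s))
submission = reranked[:1000]
```

The averaging in step 2 collapses all variant embeddings into a single vector, so first-stage retrieval remains a single dot-product pass over the index regardless of how many variants GPT-4.1-mini produces.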
Describe any external resources used.
We used SigLIP2-base-patch16-naflex for embeddings, GPT-4.1-mini for query decomposition and expansion, and Phi-3.5-Vision for candidate evaluation.
Training type:
D

Evaluation Files

Paper