TREC 2025 Proceedings

clap

Submission Details

Organization
ncsu-las
Track
Adhoc Video Search
Task
Video Search Task
Date
2025-07-28

Run Description

Is this run manual or automatic?
automatic
Describe the retrieval model used.
gpt-4.1-mini decomposes the query into visual and (non-speech) audio components. The visual component is searched using SigLIP2-base-patch16-naflex embeddings, and the audio component is searched using CLAP embeddings. The normalized scores from both searches are summed to produce the final ranking. If the LLM decides there is no audio component, only the SigLIP2 embeddings are used.
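The score fusion described above can be sketched as follows. This is a minimal illustration, not the run's actual code: the description does not specify the normalization method, so min-max normalization is assumed, and all function and variable names are hypothetical.

```python
# Hedged sketch of late fusion over visual (SigLIP2) and audio (CLAP)
# retrieval scores. Min-max normalization is an assumption; the run
# description only says "normalized scores ... are added together".

def min_max_normalize(scores):
    """Scale a {shot_id: score} dict into [0, 1]; constant scores map to 0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(visual_scores, audio_scores=None):
    """Rank shots by the sum of normalized visual and audio scores.
    When the LLM found no audio component, audio_scores is None and
    the ranking falls back to the visual scores alone."""
    v = min_max_normalize(visual_scores)
    if audio_scores is None:
        fused = v
    else:
        a = min_max_normalize(audio_scores)
        fused = {k: v.get(k, 0.0) + a.get(k, 0.0) for k in set(v) | set(a)}
    return sorted(fused, key=fused.get, reverse=True)
```

For example, a shot ranked mid-list by the visual search can rise to the top when the audio search scores it highly, which is the intended effect of combining the two modalities.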
Describe any external resources used.
gpt-4.1-mini is used to decompose the query, SigLIP2-base-patch16-naflex embeddings are used for visual search, and LAION-AI/CLAP is used for audio (sound) search.
Training type:
D

Evaluation Files

Paper