TREC 2025 Proceedings
runid4
Submission Details
- Organization
- ufmg
- Track
- Tip-of-the-Tongue Search
- Task
- Retrieval Task
- Date
- 2025-09-10
Run Description
- Please describe in detail how this run was generated
- This run combines the three previous runs. For each query, a relevance classifier built on gpt-5-nano judged whether the top-1 result retrieved by each run was semantically aligned with the user's query, and we selected the run whose top-1 result was judged most relevant.
Previous Runs Summary:
Prompt Tuning – Domain Split
This run used prompt-tuned LoRA adapters on top of meta-llama/Llama-3.1-8B-Instruct, with separate models for movie and general queries. Each query was first labeled by a query-type classifier (movie vs. other), then routed to the corresponding adapter. Rewrites were generated using domain-specific prompts and evaluated in a Tree-of-Thoughts (ToT) retrieval framework with dense retrieval plus cross-encoder reranking.
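The routing step described above can be sketched as follows. Note that `classify_query` and the adapter names are hypothetical stand-ins for the actual query-type classifier and LoRA adapters used in the run:

```python
# Illustrative sketch of movie-vs-general query routing.
# classify_query and the adapter names below are assumptions for
# illustration, not the components actually used in the run.

def classify_query(query: str) -> str:
    """Stand-in query-type classifier: returns 'movie' or 'general'."""
    movie_cues = ("movie", "film", "actor", "actress", "scene")
    return "movie" if any(cue in query.lower() for cue in movie_cues) else "general"

# Hypothetical adapter identifiers, one per domain.
ADAPTERS = {
    "movie": "lora-movie-rewriter",
    "general": "lora-general-rewriter",
}

def route(query: str) -> str:
    """Pick the domain-specific LoRA adapter for a query."""
    return ADAPTERS[classify_query(query)]
```

Each query is thus rewritten by exactly one adapter before entering the shared retrieval pipeline.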
DPO – Movie vs General Split
This run used a set of LoRA adapters trained via Direct Preference Optimization (DPO) to align rewrites with dense or cross retriever preferences. The query was first classified, and different adapters were used for the movie vs. general domain. Rewrites were scored based on how well they matched previously learned preferences, and used for retrieval via the same ToT framework.
DPO – General Model (No Classification)
This run removed domain-specific routing and used a general-purpose DPO model trained on all query types. A single dense-aligned and a single cross-aligned adapter were used across all queries. Rewrites were generated with consistent prompts and evaluated using the same dense + reranking pipeline without any per-query logic or adaptation.
GPT-5-nano Relevance Scoring
To combine these systems, we used a lightweight gpt-5-nano classifier to assess the semantic relevance of the top-1 document retrieved by each run for every query. The model received the query and the text of the top-1 result and was asked to classify its relevance.
The classifier produced a three-way decision for each run’s top result:
2 = Relevant
1 = Maybe
0 = Not relevant
For each query, we selected the run with the highest relevance score.
If two or more runs tied, we applied a fixed priority order:
General DPO → Movies DPO → Prompt Tuning.
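The selection and tie-breaking logic above can be sketched as follows; `score_relevance` would be the gpt-5-nano classifier call, which is stubbed out here as an assumed input:

```python
# Sketch of the run-selection logic: each run's top-1 result receives
# a relevance score (2 = relevant, 1 = maybe, 0 = not relevant), the
# highest-scoring run wins, and ties resolve by the fixed priority
# order. The scores dict stands in for the gpt-5-nano classifier output.

# Tie-break priority: earlier in this list wins.
PRIORITY = ["general_dpo", "movies_dpo", "prompt_tuning"]

def select_run(scores: dict[str, int]) -> str:
    """Pick the run with the highest relevance score; on ties,
    prefer the run that appears earliest in PRIORITY."""
    best = max(scores[run] for run in PRIORITY)
    for run in PRIORITY:
        if scores[run] == best:
            return run
    raise ValueError("scores must cover every run in PRIORITY")
```

For example, if General DPO and Prompt Tuning both score 1 while Movies DPO scores 0, General DPO is selected by priority.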
- Specify datasets used in this run.
- ["This year's TREC TOT training data"]
- (if you checked "other", describe here)
- Are you 100% confident that no data from https://github.com/microsoft/Tip-of-the-Tongue-Known-Item-Retrieval-Dataset-for-Movie-Identification or iRememberThisMovie.com (besides the training data provided as part of this year's track) was used for producing this run (including any data used for pretraining models that you are building on top of)?
- no
- Did you use any of the official baseline runs in any way to produce this run?
- no
- If you did use any of the official baseline runs in any way to produce this run, please describe how below in sufficient detail (e.g., as reranking candidates or in ensemble with other approaches).
Evaluation Files
Paper