TREC 2025 Proceedings
runid2
Submission Details
- Organization
- ufmg
- Track
- Tip-of-the-Tongue Search
- Task
- Retrieval Task
- Date
- 2025-09-10
Run Description
- Please describe in detail how this run was generated
- This run was produced by a multi-stage pipeline combining prompt-tuned LLMs, dense retrieval, cross-encoder reranking, and a Tree-of-Thoughts reasoning framework.
1. Rewrite Generation via Prompt-Tuned LLaMA-3
We fine-tuned four LoRA adapters on top of meta-llama/Llama-3.1-8B-Instruct, each specialized in query rewriting for one of two domains:
- Movies:
movies-dense: optimized for dense retrieval
movies-cross: optimized for cross-encoder reranking
- General Domain (e.g., people, places, etc.):
all-dense: for dense retrieval
all-cross: for reranking
Each adapter was trained via prompt tuning to transform vague or partial user queries into precise, informative rewrites. Target rewrites were selected from a pool of LLM-generated candidates based on retrieval performance, i.e., the candidate that achieved the best retrieval rank for the gold item.
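The target-selection step can be sketched as follows; `select_target_rewrite` and the example candidate pool are hypothetical stand-ins for our actual selection tooling:

```python
def select_target_rewrite(candidate_ranks):
    """Return the candidate rewrite whose retrieval rank for the gold
    item is best (lowest). `candidate_ranks` maps each LLM-generated
    rewrite to the rank the gold answer reached when retrieving with it."""
    return min(candidate_ranks, key=candidate_ranks.get)

# Illustrative pool of candidates with made-up gold-item ranks:
pool = {
    "a 90s movie about a sinking ship": 3,
    "1997 romantic disaster film about the Titanic": 1,
    "movie where a ship hits an iceberg at night": 7,
}
target = select_target_rewrite(pool)  # the rank-1 candidate
```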
At inference time, each test query was classified as either "movie" or "other" (e.g., "person", "place", etc.) using a Query Classifier, and routed to the appropriate prompt and adapter. Two rewrites were generated per query:
- One using the *-dense adapter
- One using the *-cross adapter
Both rewrites were stored and passed to the retrieval module.
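The routing logic above can be sketched as below; `classify` and the adapter callables are toy placeholders for the actual Query Classifier and the four LoRA-adapted models:

```python
def generate_rewrites(query, classify, adapters):
    """Route a query to the domain-specific adapters and produce the
    two rewrites (dense- and cross-optimized) described above."""
    domain = "movies" if classify(query) == "movie" else "all"
    return {
        "dense": adapters[f"{domain}-dense"](query),
        "cross": adapters[f"{domain}-cross"](query),
    }

# Toy stand-ins for the classifier and the four adapters:
classify = lambda q: "movie" if "film" in q else "other"
adapters = {name: (lambda q, n=name: f"[{n}] {q}")
            for name in ("movies-dense", "movies-cross",
                         "all-dense", "all-cross")}

rewrites = generate_rewrites("that film with the iceberg", classify, adapters)
```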
2. Retrieval with Tree-of-Thoughts (ToT) Framework
We employed a Tree-of-Thoughts architecture that combines LLM-based reasoning with dense retrieval and cross-encoder reranking:
a. Initial Retrieval:
- An embedding encoder (all-mpnet-base-v2) generated query embeddings.
- Dense similarity search was performed over either a general-purpose index or a movie-specific index, depending on the query classification.
b. Reranking:
- Candidates were reranked using a cross-encoder (cross-encoder/ms-marco-MiniLM-L-12-v2), scoring each candidate against both the original vague query and the rewrite from the *-cross adapter.
c. Tree Expansion:
- A greedy search explored the thought space:
  - At each level, the LLM (LLaMA-3.1-8B with base weights) was prompted to produce new thoughts and rewrites, guided by a specialized multi-turn prompt.
  - Each rewrite was embedded, searched via dense retrieval, and reranked again.
  - The highest-scoring node was expanded, repeating until the maximum depth or convergence was reached.
This iterative process simulates how a human might refine their query through successive hypotheses. The resulting ranked list was built by aggregating the best reranked scores across all nodes.
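A minimal sketch of this greedy loop, with `propose` standing in for the base-weights LLaMA-3.1-8B thought generator and `retrieve_and_rerank` wrapping dense search plus cross-encoder scoring (both signatures are our illustrative assumptions, not the actual implementation):

```python
def tot_greedy_search(query, propose, retrieve_and_rerank, max_depth=3):
    """Greedy Tree-of-Thoughts expansion. `retrieve_and_rerank(text)`
    returns [(doc_id, score)] sorted best-first. The best reranked
    score seen for each document across all expanded nodes is
    aggregated into the final ranking."""
    best = {}               # doc_id -> best reranker score across nodes
    node = query
    for _ in range(max_depth):
        for doc_id, score in retrieve_and_rerank(node):
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
        children = propose(node)
        if not children:                    # convergence: no new thoughts
            break
        # greedily expand the child whose top reranked hit scores highest
        node = max(children, key=lambda t: retrieve_and_rerank(t)[0][1])
    return sorted(best.items(), key=lambda kv: -kv[1])
```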
3. Final Output Construction
For each test query, we returned a ranked list of item IDs, ordered by reranker score.
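As a sketch, such a ranked list can be emitted in the standard six-column TREC run format (qid, Q0, doc_id, rank, score, run_id); the helper name and data are illustrative:

```python
def write_trec_run(results, run_id, path):
    """Write one ranked list per query in TREC run format.
    `results` maps qid -> [(doc_id, score)], best first."""
    with open(path, "w") as f:
        for qid, ranked in results.items():
            for rank, (doc_id, score) in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {doc_id} {rank} {score:.4f} {run_id}\n")
```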
- Specify datasets used in this run.
- ["This year's TREC TOT training data"]
- (if you checked "other", describe here)
- Are you 100% confident that no data from https://github.com/microsoft/Tip-of-the-Tongue-Known-Item-Retrieval-Dataset-for-Movie-Identification or iRememberThisMovie.com (besides the training data provided as part of this year's track) was used for producing this run (including any data used for pretraining models that you are building on top of)?
- no
- Did you use any of the official baseline runs in any way to produce this run?
- no
- If you did use any of the official baseline runs in any way to produce this run, please describe how below in sufficient detail (e.g., as reranking candidates or in ensemble with other approaches).
Evaluation Files
Paper