TREC 2025 Proceedings
lex-stronger-test
Submission Details
- Organization
- DUTH
- Track
- Tip-of-the-Tongue Search
- Task
- Retrieval Task
- Date
- 2025-08-31
Run Description
- Please describe in detail how this run was generated
- Index: built over the official TREC ToT 2025 Wikipedia corpus using IterDictIndexer (blocks=false).
Retrieval per query: multiple lexical retrievers – BM25 variants (b=0.55, k1=1.5; b=0.45, k1=1.6; b=0.60, k1=1.3), PL2, InL2, DFIC, DPH, BB2, DFRee, DirichletLM (mu=1500), Hiemstra_LM (c=0.7 and c=0.35).
PRF: RM3 on BM25 pipelines (fb_terms=40, fb_docs=15, lambda=0.55; plus mid/light variants 30/12/0.60 and 20/8/0.50).
Depth: each retriever returns up to 10,000 docs.
Fusion: Reciprocal Rank Fusion with k=60.
Query processing: remove control characters and normalize quotes/slashes; keep up to 128 tokens (Terrier further limits to ~64 terms). If the Terrier parser raises an error, we retry with a punctuation-stripped query.
Output: top 1000 doc ids per query; if fusion yields fewer, we fill from a BM25 fallback run to guarantee 1000 results per qid.
No manual intervention on test queries; parameters were tuned on the provided dev sets.
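The query cleanup, RRF fusion, and fill-to-1000 steps above can be sketched as follows. This is a minimal illustration, not the run's actual code: the function names and the exact quote-normalization mapping are our own assumptions; only the RRF formula (1/(k + rank), k=60) and the pad-from-BM25 behavior come from the description.

```python
import re
from collections import defaultdict

def sanitize_query(q, max_tokens=128):
    """Remove control characters, normalize curly quotes and backslashes
    (assumed mapping), and keep at most max_tokens whitespace tokens."""
    q = re.sub(r"[\x00-\x1f\x7f]", " ", q)
    q = q.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    q = q.replace("\\", "/")
    return " ".join(q.split()[:max_tokens])

def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankings of 1/(k + rank_d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Descending RRF score; lexicographic tie-break for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def top_k_with_fallback(fused, fallback, k=1000):
    """Keep the top-k fused doc ids; if fewer than k, pad from a fallback
    ranking (e.g. plain BM25) so every qid gets exactly k results."""
    out = list(fused[:k])
    seen = set(out)
    for doc_id in fallback:
        if len(out) >= k:
            break
        if doc_id not in seen:
            out.append(doc_id)
            seen.add(doc_id)
    return out
```

For example, a document ranked 2nd and 1st in two input rankings scores 1/62 + 1/61 under RRF, which outranks a document ranked 1st in only one list (1/61); the low k=60 keeps deep-ranked agreement influential across the 10,000-deep candidate lists.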
- Specify datasets used in this run.
- ["This year's TREC TOT training data"]
- (if you checked "other", describe here)
- Official TREC ToT 2025 Wikipedia corpus for retrieval; train/dev splits only for tuning. No external corpora.
- Are you 100% confident that no data from https://github.com/microsoft/Tip-of-the-Tongue-Known-Item-Retrieval-Dataset-for-Movie-Identification or iRememberThisMovie.com (besides the training data provided as part of this year's track) was used for producing this run (including any data used for pretraining models that you are building on top of)?
- Yes I am confident that no data from those sources except the official track training data was used to produce this run
- Did you use any of the official baseline runs in any way to produce this run?
- no
- If you did use any of the official baseline runs in any way to produce this run, please describe how below in sufficient detail (e.g., as reranking candidates or in ensemble with other approaches).
- Unsupervised lexical ensemble with BM25/DFR/LM + RM3, deep retrieval (10k candidates per retriever), and RRF fusion (k=60). Automatic run; no manual edits to test queries.
Evaluation Files
Paper