Runtag | Org | Specify datasets used in this run. | (if you checked "other", describe here) | Are you 100% confident that no data from https://github.com/microsoft/Tip-of-the-Tongue-Known-Item-Retrieval-Dataset-for-Movie-Identification or iRememberThisMovie.com (besides the training data provided as part of this year's track) was used for producing this run (including any data used for pretraining models that you are building on top of)? | Did you use any of the official baseline runs in any way to produce this run? | If you did use any of the official baseline runs in any way to produce this run, please describe how below in sufficient detail (e.g., as reranking candidates or in ensemble with other approaches). |
---|---|---|---|---|---|---|
webis-base (trec_eval) (paper) | webis | ['Other'] | We did not explicitly use training or validation data to produce this run, but we used GPT-4o-mini to reduce the queries for BM25F retrieval. | no | no | The webis-base run did not use the official baseline runs.<br>We use ChatNoir (BM25F on text + title, parameters tuned on ClueWeb09) as the first-stage retrieval system. We use the union of six queries submitted against ChatNoir (always retrieving the top-1000 per query): the original query and five long-query reductions (i.e., GPT-4o-mini prompted in different ways to reduce the long original query). All retrieved results are subsequently re-ranked by monoT5-base (castorini/monot5-base-msmarco-10k) using the original query (see the query-reduction and monoT5 sketches after this table). |
webis-tot-01 (trec_eval) (paper) | webis | ['Other'] | We did not explicitly use training or validation data to produce this run, but we used GPT-4o-mini to reduce the queries for BM25F retrieval. | no | no | The webis-tot-01 run does not use the official baseline runs.<br>We use ChatNoir (BM25F on text + title, parameters tuned on ClueWeb09) as the first-stage retrieval system. We use the union of six queries submitted against ChatNoir (always retrieving the top-1000 per query): the original query and five long-query reductions (i.e., GPT-4o-mini prompted in different ways to reduce the long original query). All retrieved results are subsequently re-ranked by monoT5-base (castorini/monot5-base-msmarco-10k) using the original query and two reduced queries. For each of these monoT5-base rankings, we re-score the top-100 documents with monoT5-3b against the original query; these re-scored documents form our top results, and the remaining positions are filled with the monoT5-base-scored documents. |
webis-tot-02 (trec_eval) (paper) | webis | ['Other'] | We did not explicitly use training or validation data to produce this run, but we used GPT-4o-mini to reduce the queries for BM25F retrieval. | no | no | The webis-tot-02 run does not use the official baseline runs as a basis.<br>We use ChatNoir (BM25F on text + title, parameters tuned on ClueWeb09) as the first-stage retrieval system. We use the union of six queries submitted against ChatNoir (always retrieving the top-1000 per query): the original query and five long-query reductions (i.e., GPT-4o-mini prompted in different ways to reduce the long original query). All retrieved results are subsequently re-ranked by monoT5-base (castorini/monot5-base-msmarco-10k) using the original query and two reduced queries. For each of these monoT5-base rankings, we re-score the top-100 documents with monoT5-3b against the original query and the two reduced queries, yielding three scores per top document that we subsequently fuse with min-max-normalized reciprocal rank fusion as implemented in ranx (see the fusion sketch after this table). The remaining positions are filled with the monoT5-base-scored documents. |
webis-tot-04 (trec_eval) (paper) | webis | ['Other'] | We did not explicitly use training or validation data to produce this run, but we used GPT-4o-mini to reduce the queries for BM25F retrieval. | no | no | The webis-tot-04 run does not use any of the official baselines.<br>We use ChatNoir (BM25F on text + title, parameters tuned on ClueWeb09) as the first-stage retrieval system. We use the union of six queries submitted against ChatNoir (always retrieving the top-1000 per query): the original query and five long-query reductions (i.e., GPT-4o-mini prompted in different ways to reduce the long original query). All retrieved results are subsequently re-ranked by monoT5-base (castorini/monot5-base-msmarco-10k) using the original query and two reduced queries. For each of these monoT5-base rankings, we re-score the top-100 documents with a DeBERTa model trained on the subset of the TOMT-KIS dataset that has a link to Wikipedia (heuristically enriched via a BM25 title search against Wikipedia); see the cross-encoder sketch after this table. |
webis-tot-03 (trec_eval) (paper) | webis | ['Other'] | We used ChatGPT for long-query reduction and the TOMT-KIS dataset (https://ceur-ws.org/Vol-3366/paper-03.pdf) to train a DeBERTa re-ranker. | no | no | The webis-tot-03 run does not use the official baseline runs.<br>We use ChatNoir (BM25F on text + title, parameters tuned on ClueWeb09) as the first-stage retrieval system. We use the union of six queries submitted against ChatNoir (always retrieving the top-1000 per query): the original query and five long-query reductions (i.e., GPT-4o-mini prompted in different ways to reduce the long original query). All retrieved results are subsequently re-ranked by monoT5-base (castorini/monot5-base-msmarco-10k) using the original query and two reduced queries. For each of these monoT5-base rankings, we re-score the top-100 documents with a DeBERTa model trained on the subset of the TOMT-KIS dataset that has a link to Wikipedia (heuristically enriched via a BM25 title search against Wikipedia). |
dpr-lst-rerank (trec_eval) (paper) | yalenlp | ["This year's TREC TOT training data", 'Other'] | In addition to using this year’s provided training and development sets, we used Fröbe et al.’s TOMT-KIS dataset of 1.3M+ Reddit posts scraped from r/TipOfMyTongue in a similar manner to Borges et al.’s approach from TREC TOT 2023. Also, we manually collected some recent queries about landmarks from r/TipOfMyTongue.<br>Our approach involved generating robust synthetic training datasets for the celebrity and landmark domains. To achieve this, we collected ~36k unique titles of Wikipedia articles about celebrities from 96 Wikipedia articles. We also collected the unique titles of Wikipedia articles about celebrities that appeared in the top 5,000 Wikipedia articles each month from December 2015 to July 2024 (provided by Wikimedia Statistics). Finally, we collected ~18k unique titles of Wikipedia articles about landmarks from 296 Wikipedia articles. | no | yes | We used the gpt_post.py script that was provided as part of the GPT-4 baseline for TREC TOT 2023 to match our reranked titles to document IDs in the corpus. We used this script to generate the TREC files for our submissions. We made one slight modification to the script, which was limiting the number of results per query ID to 1000. This was because we noticed that the script could produce more than 1000 results even when we gave it exactly 1000 titles (because one title could be matched to multiple document IDs). We also used the code from the BM25 baseline to select negatives for training dense retriever models. |
dpr-pnt-lst-rerank (trec_eval) (paper) | yalenlp | ["This year's TREC TOT training data", 'Other'] | In addition to using this year’s provided training and development sets, we used Fröbe et al.’s TOMT-KIS dataset of 1.3M+ Reddit posts scraped from r/TipOfMyTongue in a similar manner to Borges et al.’s approach from TREC TOT 2023. Also, we manually collected some recent queries about landmarks from r/TipOfMyTongue.<br>Our approach involved generating robust synthetic training datasets for the celebrity and landmark domains. To achieve this, we collected ~36k unique titles of Wikipedia articles about celebrities from 96 Wikipedia articles. We also collected the unique titles of Wikipedia articles about celebrities that appeared in the top 5,000 Wikipedia articles each month from December 2015 to July 2024 (provided by Wikimedia Statistics). Finally, we collected ~18k unique titles of Wikipedia articles about landmarks from 296 Wikipedia articles. | no | yes | We used the gpt_post.py script that was provided as part of the GPT-4 baseline for TREC TOT 2023 to match our reranked titles to document IDs in the corpus. We used this script to generate the TREC files for our submissions. We made one slight modification to the script, which was limiting the number of results per query ID to 1000. This was because we noticed that the script could produce more than 1000 results even when we gave it exactly 1000 titles (because one title could be matched to multiple document IDs). We also used the code from the BM25 baseline to select negatives for training dense retriever models. |
dpr-router-lst-rerank (trec_eval) (paper) | yalenlp | ["This year's TREC TOT training data", 'Other'] | In addition to using this year’s provided training and development sets, we used Fröbe et al.’s TOMT-KIS dataset of 1.3M+ Reddit posts scraped from r/TipOfMyTongue in a similar manner to Borges et al.’s approach from TREC TOT 2023. Also, we manually collected some recent queries about landmarks from r/TipOfMyTongue.<br>Our approach involved generating robust synthetic training datasets for the celebrity and landmark domains. To achieve this, we collected ~36k unique titles of Wikipedia articles about celebrities from 96 Wikipedia articles. We also collected the unique titles of Wikipedia articles about celebrities that appeared in the top 5,000 Wikipedia articles each month from December 2015 to July 2024 (provided by Wikimedia Statistics). Finally, we collected ~18k unique titles of Wikipedia articles about landmarks from 296 Wikipedia articles. | no | yes | We used the gpt_post.py script that was provided as part of the GPT-4 baseline for TREC TOT 2023 to match our reranked titles to document IDs in the corpus. We used this script to generate the TREC files for our submissions. We made one slight modification to the script, which was limiting the number of results per query ID to 1000. This was because we noticed that the script could produce more than 1000 results even when we gave it exactly 1000 titles (because one title could be matched to multiple document IDs). We also used the code from the BM25 baseline to select negatives for training dense retriever models. |
ThinkIR_BM25 (trec_eval) (paper) | IISER-K | ["This year's TREC TOT training data"] | | no | no | |
ThinIR_BM25_layer_2 (trec_eval) (paper) | IISER-K | ["This year's TREC TOT training data"] | | no | no | |
ThinkIR_semantic (trec_eval) (paper) | IISER-K | ["This year's TREC TOT training data"] | | no | no | |
ThinkIR_4_layer_2_w_small (trec_eval) (paper) | IISER-K | ["This year's TREC TOT training data"] | | no | no | |
baseline-bm25 (trec_eval) (paper) | coordinators | ["This year's TREC TOT training data"] | | Yes, I am confident that no data from those sources except the official track training data was used to produce this run. | yes | This is the official BM25 baseline run. |
baseline-dense (trec_eval) (paper) | coordinators | ["This year's TREC TOT training data"] | | Yes, I am confident that no data from those sources except the official track training data was used to produce this run. | yes | This is the official dense retrieval baseline run. |
rg4o_t100_test (trec_eval) | h2oloo | ["This year's TREC TOT training data"] | | no | yes | gpt4o-1 - a baseline like https://github.com/TREC-ToT/bench/blob/main/GPT4.md (generalized to more types)<br>DR - https://github.com/TREC-ToT/bench/blob/main/DENSE.md results<br>QD - RRF(BM25 of 20 queries from GPT-4o per ToT query + the original ToT query)<br>PR - PromptRetriever<br>FS - RRF(PR, RRF(gpt4o-1, DR, QD))<br>rg4o_t100 - top-100 (window 100) reranking with RankG4o on titles (single pass over FS); see the RRF sketch after this table |
fs_test (trec_eval) | h2oloo | ["This year's TREC TOT training data"] | | no | yes | gpt4o-1 - a baseline like https://github.com/TREC-ToT/bench/blob/main/GPT4.md (generalized to more types)<br>DR - https://github.com/TREC-ToT/bench/blob/main/DENSE.md results<br>QD - RRF(BM25 of 20 queries from GPT-4o per ToT query + the original ToT query)<br>PR - PromptRetriever<br>FS - RRF(PR, RRF(gpt4o-1, DR, QD)) |
pr_test (trec_eval) | h2oloo | ['Other'] | None | no | no | PR - PromptRetriever |
rag-sequence-nq (trec_eval) | SUNY-BingU | ["This year's TREC TOT training data"] | | no | no | |
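
The webis runs above reduce the long ToT queries with GPT-4o-mini before submitting them to ChatNoir's BM25F first stage. A minimal sketch of such a reduction call with the OpenAI Python client is given below; the prompt wording and the `reduce_query` helper are illustrative assumptions, not the prompts the webis team actually used.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def reduce_query(tot_query: str) -> str:
    """Compress a verbose tip-of-the-tongue post into a short keyword query for BM25F."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                # Illustrative instruction only; the webis runs use five different prompts.
                "content": "Reduce the user's verbose description of a movie, celebrity, "
                           "or landmark to a short keyword query suitable for BM25 retrieval. "
                           "Return only the reduced query.",
            },
            {"role": "user", "content": tot_query},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```

The webis runs prompt GPT-4o-mini in five different ways and submit the union of the five reduced queries plus the original query to ChatNoir, keeping the top-1000 results per query.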
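
All webis runs then re-rank the retrieved results with monoT5-base (castorini/monot5-base-msmarco-10k), and webis-tot-01/02 additionally re-score the top-100 with monoT5-3b. The sketch below shows the standard monoT5 scoring recipe with Hugging Face transformers, assuming the usual "Query: … Document: … Relevant:" prompt and true/false token probabilities; batching, truncation settings, and the placeholder candidates are assumptions rather than the webis configuration.

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# monoT5 checkpoint named in the webis run descriptions.
MODEL_NAME = "castorini/monot5-base-msmarco-10k"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME).eval()

# monoT5 frames relevance as generating "true" or "false" for a prompted query-document pair.
TRUE_ID = tokenizer.encode("true")[0]
FALSE_ID = tokenizer.encode("false")[0]

def monot5_score(query: str, document: str) -> float:
    """Probability of generating 'true' for the monoT5 prompt (higher = more relevant)."""
    prompt = f"Query: {query} Document: {document} Relevant:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
    decoder_start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=decoder_start).logits[0, 0]
    # Softmax over just the "true"/"false" logits; the "true" probability is the score.
    probs = torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)
    return probs[0].item()

# Re-rank first-stage candidates (texts are placeholders) by descending monoT5 score.
candidates = {"doc-a": "first candidate text ...", "doc-b": "second candidate text ..."}
query = "original tip-of-the-tongue query"
reranked = sorted(candidates, key=lambda d: monot5_score(query, candidates[d]), reverse=True)
```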
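
webis-tot-02 fuses the three monoT5-3b score lists (original query plus two reduced queries) with min-max-normalized reciprocal rank fusion as implemented in ranx. A small sketch of that fusion step, with made-up scores and document ids, assuming ranx's `Run` and `fuse` interfaces:

```python
from ranx import Run, fuse

# Hypothetical monoT5-3b scores for one topic, obtained with the original query
# and two reduced queries (all values and ids are made up).
run_original = Run({"q1": {"doc1": 0.91, "doc2": 0.40, "doc3": 0.72}})
run_reduced_a = Run({"q1": {"doc1": 0.55, "doc2": 0.80, "doc3": 0.60}})
run_reduced_b = Run({"q1": {"doc1": 0.70, "doc2": 0.35, "doc3": 0.90}})

# Min-max-normalize each run, then combine with reciprocal rank fusion, as described
# for webis-tot-02; `fused` is a ranx Run that can be saved in TREC format or inspected.
fused = fuse(
    runs=[run_original, run_reduced_a, run_reduced_b],
    norm="min-max",
    method="rrf",
)
```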
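
webis-tot-03 and webis-tot-04 instead re-score the monoT5-base top-100 with a DeBERTa cross-encoder fine-tuned on the Wikipedia-linked subset of TOMT-KIS. That checkpoint is not named in the table, so the model path below is a placeholder, and the scoring function is a generic cross-encoder sketch rather than the exact webis implementation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder path: the fine-tuned webis DeBERTa re-ranker is not published in this table.
MODEL_NAME = "path/to/deberta-tomt-kis-reranker"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def cross_encoder_score(query: str, document: str) -> float:
    """Encode the (query, document) pair jointly; the relevance logit is the re-ranking score."""
    inputs = tokenizer(query, document, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0]
    # Assumption: a single-logit head, or a two-class head whose last logit is "relevant".
    return logits[-1].item()
```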
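
The h2oloo runs build their candidate pool as FS = RRF(PR, RRF(gpt4o-1, DR, QD)), i.e., a nested reciprocal rank fusion over four systems. The self-contained sketch below shows plain RRF with the common k = 60 constant; the constant and the example rankings are assumptions, not the h2oloo settings.

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over input rankings of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-system rankings for one ToT query (document ids are made up).
gpt4o_1 = ["d3", "d1", "d7"]  # GPT-4-baseline-style run, generalized to more types
dr      = ["d1", "d3", "d9"]  # dense retrieval baseline results
qd      = ["d7", "d1", "d3"]  # RRF of BM25 over 20 GPT-4o queries + the original ToT query
pr      = ["d1", "d9", "d3"]  # PromptRetriever

# FS = RRF(PR, RRF(gpt4o-1, DR, QD)), as described for the h2oloo runs.
inner = rrf([gpt4o_1, dr, qd])
fs = rrf([pr, inner])
print(fs)
```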