Runtag | Org | What type of manually annotated information does the system use? | How is conversation understanding (NLP/rewriting) performed in this run (check all that apply)? | What data is used for conversational query understanding in this run (check all that apply)? | How is ranking performed in this run (check all that apply)? | What data is used to develop the ranking method in this run (check all that apply)? | Please specify all the methods used to handle feedback or clarification responses from the user (check all that apply). | Please describe the method used to generate the final conversational responses from one or more retrieved passages (check all that apply). | Please describe the external resources used by this run, if applicable. | Please provide a short description of this run. | Please provide a priority for assessing this run. (If resources do not allow all runs to be assessed, NIST will work in priority order, resolving ties arbitrarily). |
---|---|---|---|---|---|---|---|---|---|---|---|
iiresearch_ikat2024_rag_top5_bge_reranker (trec_eval) (ptkb.trec_eval) (paper) | ii_research | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method performs supervised query expansion', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data', 'method uses data provided from CAsT datasets'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method uses a dense retrieval model (e.g.¸ DPR¸ ANCE¸ ColBERT¸ etc.)', 'method uses hybrid or fusion of sparse/dense retrieval approaches', 'method uses feature-based learning-to-rank (i.e.¸ supervised learning)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)', 'method performs re-ranking with large langauge models (LLaMA¸ GPT-x¸ etc.) (please describe specifics in the description field below)'] | ['method is trained with CAsT datasets', 'method uses iKAT 23 data', 'method is trained with TREC Deep Learning Track and/or MS MARCO dataset'] | ['method detects and uses feedback'] | ['method uses multiple sources (multiple passages)', 'method uses large language models to generate the summary.'] | For this run, we used the Wikipedia dump as an external corpus to supplement response generation. | In this run, I employed a fine-tuned LLaMA3 model to rewrite the query. For statement ranking, I utilized multiple models, including SBERT, GPT-4o, and a fine-tuned Gemma model, to comprehensively rank the statements. For passage ranking and the generation task, I initially used BM25 for preliminary ranking, followed by BGE for reranking. During the generation process, a retrieval-augmented approach was applied to determine whether additional retrieval was necessary to further optimize the generation output. | 1 (top) |
infosense_llama_pssgqrs_wghtdrerank_2 (trec_eval) (ptkb.trec_eval) (paper) | infosenselab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method identifies intent/discourse/sentiment (e.g. feedback¸ clarification¸ etc.)', 'method performs generative query rewriting (CQR)¸ including models like BART/T5', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)', 'method does not perform summarization (i.e. uses passages as-is)', 'method uses large language models to generate the summary.'] | Meta-Llama-3.1-8B-Instruct, msmarco-distilbert-base-v4, and all-MiniLM-L12-v2. No external data was used. | The run performs the following workflow:
- Receives a user utterance.
- Using Llama, summarize the previous turn response into less than 150 characters (unless it’s the first turn) and add it to a conversation history.
- Using Llama, generate a clarified user utterance from the conversation history.
- Using Llama, return a dictionary of relevant PTKB queries from the clarified user utterance.
- Using Llama, generate a 10-sentence passage from the clarified user utterance and all relevant PTKBs to be used as a query.
- Using Llama, for up to three individual PTKBs, generate an additional 10-sentence passage from the clarified user utterance and the individual PTKB to be used as a query.
- For each query, use BM25 to retrieve 5,000 documents. Documents are only kept the first time they are found (i.e. if query A finds a document, it is stored. If query B finds it again, it is ignored).
- The pool of documents is reranked iteratively for each query and by both msmarco-distilbert-base-v4 and all-MiniLM-L12-v2. Scores are weighted by their respective PTKB score. The query with all relevant PTKBs uses a weight of 1.
- Llama generates the final answer from the clarified user utterance and the 3 top-ranking documents. | 3 |
infosense_llama_pssgqrs_wghtdrerank_1 (trec_eval) (ptkb.trec_eval) (paper) | infosenselab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method identifies intent/discourse/sentiment (e.g. feedback¸ clarification¸ etc.)', 'method performs generative query rewriting (CQR)¸ including models like BART/T5', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)', 'method does not perform summarization (i.e. uses passages as-is)', 'method uses large language models to generate the summary.'] | Meta-Llama-3.1-8B-Instruct and all-MiniLM-L12-v2. No external data was used. | [Similar to v2 of the run with the same name, but using a single reranker] The run performs the following workflow:
- Receives a user utterance.
- Using Llama, summarize the previous turn response into less than 150 characters (unless it’s the first turn) and add it to a conversation history.
- Using Llama, generate a clarified user utterance from the conversation history.
- Using Llama, return a dictionary of relevant PTKB queries from the clarified user utterance.
- Using Llama, generate a 10-sentence passage from the clarified user utterance and all relevant PTKBs to be used as a query.
- Using Llama, for up to three individual PTKBs, generate an additional 10-sentence passage from the clarified user utterance and the individual PTKB to be used as a query.
- For each query, use BM25 to retrieve 5,000 documents. Documents are only kept the first time they are found (i.e. if query A finds a document, it is stored. If query B finds it again, it is ignored).
- The pool of documents is reranked iteratively for each query using all-MiniLM-L12-v2. Scores are weighted by their respective PTKB score. The query with all relevant PTKBs uses a weight of 1.
- Llama generates the final answer from the clarified user utterance and the 3 top-ranking documents. | 4 |
infosense_llama_short_long_qrs_2 (trec_eval) (ptkb.trec_eval) (paper) | infosenselab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method identifies intent/discourse/sentiment (e.g. feedback¸ clarification¸ etc.)', 'method performs generative query rewriting (CQR)¸ including models like BART/T5', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)', 'method does not perform summarization (i.e. uses passages as-is)', 'method uses large language models to generate the summary.'] | Meta-Llama-3.1-8B-Instruct, msmarco-distilbert-base-v4, and all-MiniLM-L12-v2. No external data was used. | [Similar to v1 of the run with the same name, but with a few additional enhancements] The run performs the following workflow:
- Receives a user utterance.
- Using Llama, summarize the previous turn response into less than 150 characters (unless it’s the first turn) and add it to a conversation history.
- Using Llama, generate text with suggestions of what to look for in relevant PTKBs using the conversation history.
- Using Llama, return a dictionary of relevant PTKB queries from the generated suggestions.
- Using Llama, generate a 5-sentence passage from the conversation history and the relevant PTKBs to be used as a query.
- Using Llama, generate a 10-sentence article-style passage from the conversation history and relevant PTKBs to be used as a query.
- For each query, use BM25 to retrieve 5,000 documents. Documents are only kept the first time they are found (i.e. if query A finds a document, it is stored. If query B finds it again, it is ignored).
- The pool of documents is reranked iteratively for each query and by both msmarco-distilbert-base-v4 and all-MiniLM-L12-v2. The scoring is equally weighted.
- Llama generates the final answer from the conversation history and the 3 top-ranking documents. | 2 |
RALI_gpt4o_fusion_rerank (trec_eval) (ptkb.trec_eval) (paper) | rali lab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method detects and uses feedback'] | ['method uses multiple sources (multiple passages)', 'method uses large language models to generate the summary.'] | We used Pyserini implementation of BM25. We used Open AI gpt-4o API. We also used a pretrained monoT5 reranker available at https://huggingface.co/castorini/monot5-base-msmarco-10k | This pipeline consists of four key steps. (1) query rewriting: for each conversation turn, GPT-4o generates three types of rewritten queries: a de-contextualized but non-personalized utterance, a pseudo-response of this rewritten utterance concatenated with itself, and a de-contextualized & personalized utterance. (2) The next step is ranking list fusion, where the BM25 ranking lists obtained for the three rewritten queries are combined. (3) The third step is reranking, where the top 50 documents from the fused ranking list are reranked using the monoT5 model, w.r.t. the de-contextualized and personalized utterance. (4) Finally, GPT-4o generates a response for the user's current query, considering the full conversation context, PTKB, and the top 3 documents from the reranking step. | 1 (top) |
RALI_gpt4o_no_personalize_fusion_rerank (trec_eval) (ptkb.trec_eval) (paper) | rali lab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method detects and uses feedback'] | ['method uses large language models to generate the summary.'] | We used Pyserini implementation of BM25. We used Open AI gpt-4o API. We also used a pretrained monoT5 reranker available at https://huggingface.co/castorini/monot5-base-msmarco-10k | This pipeline consists of four key steps. (1) query rewriting: for each user turn, GPT-4o generates four types of rewritten queries: a de-contextualized but non-personalized utterance, a pseudo-response of this rewritten utterance concatenated with itself, another de-contextualized but non-personalized utterance which is different from the first one, and finally a de-contextualized & personalized utterance. (2) The next step is ranking list fusion, where the BM25 ranking lists obtained for the first three non-personalized (just de-contextualized) rewritten queries are combined. (3) The third step is reranking, where the top 50 documents from the fused ranking list are reranked using the monoT5 model, using the personalized rewritten query. (4) Finally, GPT-4o generates a response for the user's current query, considering the full conversation context, PTKB, and the top 3 documents from the reranking step. | 2 |
RALI_gpt4o_no_personalize_fusion_norerank (trec_eval) (ptkb.trec_eval) (paper) | rali lab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)'] | ['method uses iKAT 23 data'] | ['method detects and uses feedback'] | ['method uses other approaches (please specify in description below)'] | We used Pyserini implementation of BM25. We used Open AI gpt-4o API. | This run is retrieval-only, i.e. does not participate in response evaluation. It consists of two key steps. (1) query rewriting: for each user turn, GPT-4o generates three types of rewritten queries: a de-contextualized but non-personalized utterance, a pseudo-response of this rewritten utterance concatenated with itself, another de-contextualized but non-personalized utterance which is different from the first one. (2) The next step is ranking list fusion, where the BM25 ranking lists obtained for the aforementioned three non-personalized (just de-contextualized) rewritten queries are combined. | 4 |
RALI_gpt4o_fusion_norerank (trec_eval) (ptkb.trec_eval) (paper) | rali lab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)'] | ['method uses iKAT 23 data'] | ['method detects and uses feedback'] | ['method uses other approaches (please specify in description below)'] | We used Pyserini implementation of BM25. We used Open AI gpt-4o API. | This run is retrieval-only, i.e. does not participate in response evaluation. It consists of 2 key steps. (1) query rewriting: for each user turn, GPT-4o generates three types of rewritten queries: a de-contextualized but non-personalized utterance, a pseudo-response of this rewritten utterance concatenated with itself, and a de-contextualized, personalized utterance. (2) The next step is ranking list fusion, where the BM25 ranking lists obtained for the three rewritten queries are combined. | 3 |
iiresearch_ikat2024_rag_top5_monot5_reranker (trec_eval) (ptkb.trec_eval) (paper) | ii_research | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method performs supervised query expansion', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data', 'method uses data provided from CAsT datasets'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method uses a dense retrieval model (e.g.¸ DPR¸ ANCE¸ ColBERT¸ etc.)', 'method uses hybrid or fusion of sparse/dense retrieval approaches', 'method uses feature-based learning-to-rank (i.e.¸ supervised learning)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)', 'method performs re-ranking with large langauge models (LLaMA¸ GPT-x¸ etc.) (please describe specifics in the description field below)'] | ['method is trained with CAsT datasets', 'method uses iKAT 23 data', 'method is trained with TREC Deep Learning Track and/or MS MARCO dataset'] | ['method detects and uses feedback'] | ['method uses multiple sources (multiple passages)', 'method uses large language models to generate the summary.'] | For this run, we used the Wikipedia dump as an external corpus to supplement response generation. | In this run, I employed a fine-tuned LLaMA3 model to rewrite the query. For statement ranking, I utilized multiple models, including SBERT, GPT-4o, and a fine-tuned Gemma model, to comprehensively rank the statements. For passage ranking and the generation task, I initially used BM25 for preliminary ranking, followed by monoT5 for reranking. During the generation process, a retrieval-augmented approach was applied to determine whether additional retrieval was necessary to further optimize the generation output. | 2 |
nii_res_gen (trec_eval) (ptkb.trec_eval) (paper) | nii | generation-only: system uses the given ranked list and only generates a response based upon that. | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ["method does not perform ranking (i.e.¸ it's a generation-only run)"] | ['method uses provided automatic baseline', 'method uses iKAT 23 data'] | ['method detects and uses feedback'] | ['method uses multiple sources (multiple passages)', 'method uses large language models to generate the summary.'] | Using Gemini-1.5-flash, no further resources were used. | This run uses Gemini-1.5-flash to generate responses. | 1 (top) |
infosense_llama_short_long_qrs_3 (trec_eval) (ptkb.trec_eval) (paper) | infosenselab | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method identifies intent/discourse/sentiment (e.g. feedback¸ clarification¸ etc.)', 'method performs generative query rewriting (CQR)¸ including models like BART/T5', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)', 'method does not perform summarization (i.e. uses passages as-is)', 'method uses large language models to generate the summary.'] | Meta-Llama-3.1-70B-Instruct, msmarco-distilbert-base-v4, and all-MiniLM-L12-v2. No external data was used. | [Similar to v2 with the same name, but using Llama 3.1 70B instead of 8B] The run performs the following workflow:
- Receives a user utterance.
- Using Llama, summarize the previous turn response into less than 150 characters (unless it’s the first turn) and add it to a conversation history.
- Using Llama, generate text with suggestions of what to look for in relevant PTKBs using the conversation history.
- Using Llama, return a dictionary of relevant PTKB queries from the generated suggestions.
- Using Llama, generate a 5-sentence passage from the conversation history and the relevant PTKBs to be used as a query.
- Using Llama, generate a 10-sentence article-style passage from the conversation history and relevant PTKBs to be used as a query.
- For each query, use BM25 to retrieve 5,000 documents. Documents are only kept the first time they are found (i.e. if query A finds a document, it is stored. If query B finds it again, it is ignored).
- The pool of documents is reranked iteratively for each query and by both msmarco-distilbert-base-v4 and all-MiniLM-L12-v2. The scoring is equally weighted.
- Llama generates the final answer from the conversation history and the 3 top-ranking documents. | 1 (top) |
nii_auto_base (trec_eval) (ptkb.trec_eval) (paper) | nii | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | GPT-4o
Clueweb 22 dataset (BM25)
cross-encoder/ms-marco-MiniLM-L-12-v2 | This submission follows the outlined pipeline:
Rewrite the utterance with context using GPT-4o.
Rewrite the utterance again to extract only the required information using GPT-4o.
Rank PTKB based on the rewritten utterance and context with GPT-4o.
Generate queries based on the rewritten utterance and each relevant PTKB using GPT-4o.
Retrieve and re-rank documents based on the rewritten utterance using BM25 and a cross-encoder. | 1 (top) |
nii_manu_base (trec_eval) (ptkb.trec_eval) (paper) | nii | manual: system uses only manually rewritten utterances | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | GPT-4o
Clueweb 22 dataset (BM25)
cross-encoder/ms-marco-MiniLM-L-12-v2 | This process follows the outlined pipeline:
Rewrite the manually rewritten utterance again to extract only the required information using GPT-4o.
Rank PTKB based on the rewritten utterance and context with GPT-4o.
Generate queries based on the rewritten utterance and each relevant PTKB using GPT-4o.
Retrieve and re-rank documents based on the rewritten utterance using BM25 and a cross-encoder. | 1 (top) |
nii_auto_ptkb_rr (trec_eval) (ptkb.trec_eval) (paper) | nii | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | GPT-4o
Clueweb 22 dataset (BM25)
cross-encoder/ms-marco-MiniLM-L-12-v2 | This submission follows the outlined pipeline:
Rewrite the utterance with context using GPT-4o.
Rewrite the utterance again to extract only the required information using GPT-4o.
Rank PTKB based on the rewritten utterance and context with GPT-4o.
Generate queries based on the rewritten utterance and each relevant PTKB using GPT-4o.
Retrieve and re-rank documents based on the rewritten utterance using BM25 and a cross-encoder.
Retrieve and re-rank documents based on the related PTKB using a cross-encoder. | 1 (top) |
nii_manu_ptkb_rr (trec_eval) (ptkb.trec_eval) (paper) | nii | manual: system uses only manually rewritten utterances | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | GPT-4o
Clueweb 22 dataset (BM25)
cross-encoder/ms-marco-MiniLM-L-12-v2 | This submission follows the outlined pipeline:
Rewrite the manually rewritten utterance again to extract only the required information using GPT-4o.
Rank PTKB based on the rewritten utterance and context with GPT-4o.
Generate queries based on the rewritten utterance and each relevant PTKB using GPT-4o.
Retrieve and re-rank documents based on the rewritten utterance using BM25 and a cross-encoder.
Re-rank documents based on the related PTKB using a cross-encoder. | 2 |
dcu_manual_qe_summ_TopP_3 (trec_eval) (ptkb.trec_eval) (paper) | DCU-ADAPT | manual: system uses only manually rewritten utterances | ['method uses other query understanding method (please describe below)'] | ['method uses iKAT provided manually rewritten utterances (note: this makes it a manual run)'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses single source (single passage)'] | Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
Re-writing: T5 based Query rewriting fine-tuned on CANARD Dataset
Abstractive summarizer: pegasus-xsum
BM25
Clueweb-22 | This is a manual run; therefore, the resolved utterances are used as queries. BM25 is used to retrieve the top 1000 passages, followed by re-ranking with a cross-encoder.
An abstractive summary is generated from the top 3 passages retrieved by BM25; this summary is treated as the extracted knowledge and is used to enrich the query in the subsequent turn. | 3 |
dcu_manual_qe_summ_ptkb_TopP_3 (trec_eval) (ptkb.trec_eval) (paper) | DCU-ADAPT | manual: system uses manually rewritten utterances and ground-truth PTKB provenance statements | ['method uses other query understanding method (please describe below)'] | ['method uses iKAT provided manually rewritten utterances (note: this makes it a manual run)'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
Abstractive summarizer: pegasus-xsum
BM25
Clueweb-22 | This is a manual run; therefore, the resolved utterances, along with their ground-truth PTKB provenance statements, are used as the query. BM25 is used to retrieve the top 1000 passages, followed by re-ranking with a cross-encoder. An abstractive summary is generated from the top 3 passages retrieved by BM25; this summary is treated as the extracted knowledge and is used to enrich the query in the subsequent turn. | 1 (top) |
dcu_auto_qe_key_topP-50_topK-5 (trec_eval) (ptkb.trec_eval) (paper) | DCU-ADAPT | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses other query understanding method (please describe below)'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data', 'method is trained with TREC Deep Learning Track and/or MS MARCO dataset'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
BM25
Clueweb-22
pre-trained Sentence-BERT model (paraphrase-MiniLM-L6-v2)
Yet Another Keyword Extractor (Yake): Unsupervised Approach for Automatic Keyword Extraction using Text Features. | In this automatic run, the user utterance is used as the query. BM25 is used to retrieve the top 1000 passages, followed by re-ranking with a cross-encoder. We extract the top 5 keywords from the user utterance and from the top 50 passages retrieved by BM25; these keywords are treated as the extracted knowledge and are used to enrich the query in the subsequent turn. PTKB provenance ranking is done by selecting the top-3 PTKB statements with the highest cosine similarity to the enriched query. | 2 |
dcu_auto_qre_sim (trec_eval) (ptkb.trec_eval) (paper) | DCU-ADAPT | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method performs generative query rewriting (CQR)¸ including models like BART/T5'] | ['method uses iKAT 23 data', 'method uses other external data (please specify in the external resources field below)'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data', 'method is trained with TREC Deep Learning Track and/or MS MARCO dataset'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
Re-writing: T5 based Query rewriting fine-tuned on CANARD Dataset
BM25
Clueweb-22
pre-trained Sentence-BERT model (paraphrase-MiniLM-L6-v2) | In this automatic run, related historical queries from the conversational context are identified by a cosine similarity below 0.50 to the user utterance and are used as query context. The user utterance and query context are used to rewrite the query with a T5-based query rewriter fine-tuned on the CANARD dataset. BM25 is used to retrieve the top 1000 passages, followed by re-ranking with a cross-encoder. PTKB provenance ranking is done by selecting the top-3 PTKB statements with the highest cosine similarity to the rewritten query. | 1 (top) |
dcu_auto_qe_summ_TopP_3 (trec_eval) (ptkb.trec_eval) (paper) | DCU-ADAPT | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses other query understanding method (please describe below)'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data', 'method is trained with TREC Deep Learning Track and/or MS MARCO dataset'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
Abstractive summarizer: pegasus-xsum
BM25
Clueweb-22
pre-trained Sentence-BERT model (paraphrase-MiniLM-L6-v2) | In this automatic run, the user utterance is used as the query. BM25 is used to retrieve the top 1000 passages, followed by re-ranking with a cross-encoder. An abstractive summary of the top 3 passages retrieved by BM25 is treated as the extracted knowledge and is used to enrich the query in the subsequent turn. PTKB provenance ranking is done by selecting the top-3 PTKB statements with the highest cosine similarity to the enriched query. | 3 |
dcu_auto_qe_summ_ptkb_TopP_ (trec_eval) (ptkb.trec_eval) (paper) | DCU-ADAPT | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses other query understanding method (please describe below)'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data', 'method is trained with TREC Deep Learning Track and/or MS MARCO dataset'] | ['method does not treat them specially'] | ['method uses other approaches (please specify in description below)'] | Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
Abstractive summarizer: pegasus-xsum
BM25
Clueweb-22
pre-trained Sentence-BERT model (paraphrase-MiniLM-L6-v2) | In this automatic run, the user utterance, along with the top-3 PTKB statements with the highest cosine similarity, is used as the query. BM25 is used to retrieve the top 1000 passages, followed by re-ranking with a cross-encoder. An abstractive summary of the top 3 passages retrieved by BM25 is treated as the extracted knowledge and is used to enrich the query in the subsequent turn. PTKB provenance ranking is done by selecting the top-3 PTKB statements with the highest cosine similarity to the enriched query. | 3 |
NII_automatic_GeRe (trec_eval) (ptkb.trec_eval) (paper) | nii | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method performs generative query rewriting (CQR)¸ including models like BART/T5', 'method uses large language models like LLaMA and GPT-x.'] | ['method uses iKAT 23 data'] | ['method uses traditional unsupervised sparse retrieval (e.g.¸ QL¸ BM25¸ etc.)', 'method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)'] | ['method uses iKAT 23 data'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)', 'method uses large language models to generate the summary.'] | claude-3-opus, gpt-4o, BM25 on clueweb22, cross-encoder/ms-marco-MiniLM-L-6-v2 for the reranking. | 1) We start by generating the answer to the conversation using an LLM given the context.
2) For the generated answer in the previous step, we use the LLM to generate 5 queries.
3) Next, we retrieve 300 documents per query using BM25 and re-rank them using the cross-encoder.
4) We then combine all the re-ranked documents generated for the 5 queries, remove the overlaps, re-rank them with respect to the answer generated in the first step, and take the top 1000 documents.
5) PTKB statements are selected using LLMs given the context.
6) The final answer is generated using the top 5 documents given the context created in the above steps, only in the case of the gpt4 LLM.
7) We performed steps 1-5 for both gpt4 and claude3 and combined the documents from both runs.
8) We re-rank the documents again with respect to the answer generated in the first step by gpt4o.
9) We remove the duplicates from the combined document pool and then take the top 1000. | 1 (top) |
ksu_created_query_reranking (trec_eval) (ptkb.trec_eval) | ksu | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method performs re-ranking with a pre-trained neural language model (BERT¸ Roberta¸ T5¸ etc.) (please describe specifics in the description field below)', 'method performs re-ranking with large langauge models (LLaMA¸ GPT-x¸ etc.) (please describe specifics in the description field below)'] | ['method uses provided automatic baseline'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)', 'method uses supervised generative summarization (e.g. PEGASUS or similar)'] | What data is used for conversational query understanding in this run (check all that apply)?
→ Only the LLM; no datasets are used.
Models:
- sentence-t5-xxl
- llama3.1 8b
- pegasus-xsum
Datasets:
None | 1. Utilize a large language model (llama3.1 8b) to rewrite the utterance considering the context.
2. Use a model suitable for sentence similarity (sentence-t5-xxl) to calculate the similarity between the rewritten utterance and PTKB, and use the top 3 results.
3. Utilize a large language model (llama3.1 8b) to rewrite the utterance again, considering the top 3 PTKB results and the context.
4. Utilize a large language model (llama3.1 8b) to decompose the rewritten utterance into several simple queries.
5. Utilize a large language model (llama3.1 8b) to convert the decomposed queries into queries suitable for BM25.
6. Utilize a large language model (llama3.1 8b) to perform a BM25 search using the decomposed and BM25-suitable queries (from step 5) and retrieve the top 10 results.
7. Utilize a large language model (llama3.1 8b) to generate questions that can be answered from the top 10 passages.
8. Calculate the similarity between the generated questions and the decomposed queries (from step 4) and retrieve the top 1 passage.
9. Use a summarization model (pegasus-xsum) to summarize the passages for each decomposed query (if there are 5 decomposed queries, summarize 5 passages).
10. Utilize a large language model (llama3.1 8b) to generate a response to the rewritten utterance (from step 3) using the summarized passages. | 1 (top) |
gpt4-MQ-debertav3 (trec_eval) (ptkb.trec_eval) (paper) | uva | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | gpt4 | gpt4 mq with splade, rerank debertav3 with single qr | 2 |
gpt4-mq-rr-fusion (trec_eval) (ptkb.trec_eval) (paper) | uva | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | gpt4 | gpt4 and fusion rr | 1 (top) |
gpt-single-QR-rr-debertav3 (trec_eval) (ptkb.trec_eval) (paper) | uva | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | gpt4 | gpt-single-QR-rr-debertav3 | 3 |
qd1 (trec_eval) (ptkb.trec_eval) (paper) | uva | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | gpt4 | gpt4 QR with different prompt + bm25 and msmarcominilm rerank | 4 |
baseline-auto-t5-bm25-minilm (trec_eval) (ptkb.trec_eval) (paper) | coordinators | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses other query understanding method (please describe below)'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | baseline-auto-t5-bm25-minilm | baseline-auto-t5-bm25-minilm | 1 (top) |
baseline-auto-convgqr-bm25-minilm (trec_eval) (ptkb.trec_eval) (paper) | coordinators | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses other query understanding method (please describe below)'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | baseline-auto-convgqr-bm25-minilm | baseline-auto-convgqr-bm25-minilm | 1 (top) |
baseline-auto-llama3.1-splade-minilm (trec_eval) (ptkb.trec_eval) (paper) | coordinators | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | baseline-auto-llama3.1-splade-minilm | baseline-auto-llama3.1-splade-minilm | 1 (top) |
baseline-auto-gpt4o-splade-minilm (trec_eval) (ptkb.trec_eval) (paper) | coordinators | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | baseline-auto-gpt4o-splade-minilm | baseline-auto-gpt4o-splade-minilm | 1 (top) |
baseline-auto-gpt4-bm25-minilm (trec_eval) (ptkb.trec_eval) (paper) | coordinators | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses other query understanding method (please describe below)'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | baseline-auto-gpt4-bm25-minilm | baseline-auto-gpt4-bm25-minilm | 1 (top) |
baseline-auto-gpt4o-bm25-minilm-genonly (trec_eval) (ptkb.trec_eval) (paper) | coordinators | automatic: system does not use any manually annotated data and relies only on the user utterance and system responses (canonical responses of previous turns) | ['method uses large language models like LLaMA and GPT-x.'] | ['method uses other external data (please specify in the external resources field below)'] | ['method uses other ranking method (please describe below)'] | ['method is trained on other datasets (please describe below)'] | ['method does not treat them specially'] | ['method uses multiple sources (multiple passages)'] | baseline-auto-gpt4o-bm25-minilm-genonly | baseline-auto-gpt4o-bm25-minilm-genonly. Run used for the generation only task.
Don't eval the generation if possible. | 1 (top) |
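
Most of the pipelines described above share a few recurring building blocks: LLM-based query rewriting, PTKB ranking by embedding similarity, BM25 retrieval with cross-encoder re-ranking, and fusion of ranking lists from multiple rewrites. The sketches below illustrate these patterns in Python under stated assumptions; none is any team's actual implementation.

A minimal sketch of the LLM query-rewriting step that most automatic runs describe (with GPT-4o or a LLaMA model). The prompt wording and the `rewrite_utterance` helper are assumptions, not taken from any submission; it assumes the `openai` Python package (v1+) and an `OPENAI_API_KEY` in the environment.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rewrite_utterance(history: list[str], utterance: str, model: str = "gpt-4o") -> str:
    """Ask the LLM for a self-contained (de-contextualized) rewrite of the utterance."""
    prompt = (
        "Rewrite the last user utterance so that it is self-contained and "
        "unambiguous without the conversation history.\n\n"
        "Conversation so far:\n" + "\n".join(history) + "\n\n"
        "Last utterance: " + utterance + "\n\nRewrite:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```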
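Several DCU-ADAPT and ksu runs rank PTKB statements by encoding the (rewritten) utterance and each statement with a Sentence-BERT model and keeping the top 3 by cosine similarity. A hedged sketch, assuming the paraphrase-MiniLM-L6-v2 model listed in the DCU resource fields:

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("paraphrase-MiniLM-L6-v2")

def top_ptkb(query: str, ptkb_statements: list[str], top_k: int = 3):
    """Return the top-k PTKB statements by cosine similarity to the query."""
    q_emb = encoder.encode(query, convert_to_tensor=True)
    s_emb = encoder.encode(ptkb_statements, convert_to_tensor=True)
    sims = util.cos_sim(q_emb, s_emb)[0]  # one similarity score per statement
    best = sims.argsort(descending=True)[:top_k]
    return [(ptkb_statements[int(i)], float(sims[int(i)])) for i in best]
```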
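The most common retrieval pattern in the table (nii, DCU-ADAPT, infosense, and the baselines) is first-stage BM25 followed by cross-encoder re-ranking. A minimal sketch using Pyserini and sentence-transformers; the index path and cut-offs are illustrative assumptions, and it assumes the raw stored field of each hit contains the passage text:

```python
from pyserini.search.lucene import LuceneSearcher
from sentence_transformers import CrossEncoder

searcher = LuceneSearcher("indexes/ikat-clueweb22-passages")  # hypothetical index path
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, k_bm25: int = 1000, k_out: int = 100):
    """BM25 first stage, then score (query, passage) pairs with a cross-encoder."""
    hits = searcher.search(query, k=k_bm25)
    passages = [(h.docid, searcher.doc(h.docid).raw()) for h in hits]
    scores = reranker.predict([(query, text) for _, text in passages])
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i][0], float(scores[i])) for i in order[:k_out]]
```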
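The RALI runs fuse the BM25 ranking lists obtained for their different query rewrites before re-ranking. The submissions do not name the fusion formula, so reciprocal rank fusion is shown here only as one plausible choice:

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists: list[list[str]], k: int = 60, depth: int = 1000):
    """Fuse several ranked docid lists into one by summing 1/(k + rank) per list."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, docid in enumerate(ranking[:depth], start=1):
            scores[docid] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fused = reciprocal_rank_fusion([bm25_rewrite1, bm25_rewrite2, bm25_rewrite3])
```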