The Thirty-Third Text REtrieval Conference
(TREC 2024)

Retrieval-Augmented Generation Track: Augmented Generation Task Appendix

Each run entry below lists the following fields, in order:
Run tag (with links to trec_eval results, llm_eval results, and the team paper where available) and Org
Is this a manual (a human making choices based on seeing the topics/results) or automatic run?
Does this run leverage proprietary models in any step of the generation pipeline?
Does this run leverage open-weight LLMs (> 5B parameters) in any step of the generation pipeline?
Does this run leverage smaller open-weight language models in any step of the generation pipeline?
Was this run padded with results from a baseline run?
Please describe how you went about prompt engineering the generation pipeline
Please provide a short description of this run
Please give this run a priority for inclusion in manual assessments
baseline_rag24.test_gpt-4o_top20 (trec_eval) (llm_eval) (paper) coordinators
automatic
yes
no
no
no
Uses a variation of the ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py
First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large); Second Stage (top-3K): RRF(First Stage, monoT5-3B); Third Stage (top-100): RRF(Second Stage, RankZephyr); Generation (top-20): GPT-4o. (A reciprocal rank fusion sketch follows this entry.)
1 (top)
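The three coordinator baselines above fuse rankings with reciprocal rank fusion (RRF) at every stage. Below is a minimal sketch of RRF for illustration only; it is not the ragnarok implementation, and the smoothing constant k=60 is a commonly used default rather than necessarily the value used in these runs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of segment IDs into a single list.

    rankings: iterable of ranked lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse three rankings of segment IDs.
runs = [["s1", "s2", "s3"], ["s2", "s1", "s4"], ["s3", "s2", "s5"]]
print(reciprocal_rank_fusion(runs))  # "s2" comes first: it ranks highly in all three runs
```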
baseline_rag24.test_l31_70b_instruct_top20 (trec_eval) (llm_eval) (paper) coordinators
automatic
no
yes
no
no
Uses a variation of the ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py
First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large); Second Stage (top-3K): RRF(First Stage, monoT5-3B); Third Stage (top-100): RRF(Second Stage, RankZephyr); Generation (top-20): L3.1-70B-Instruct
2
baseline_rag24.test_command-r-plus_top20 (trec_eval) (llm_eval) (paper) coordinators
automatic
no
yes
no
yes
https://github.com/castorini/ragnarok/blob/06159e704d260b3e243499d0e6290ac9868cbd7c/src/ragnarok/generate/cohere.py#L49
First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large); Second Stage (top-3K): RRF(First Stage, monoT5-3B); Third Stage (top-100): RRF(Second Stage, RankZephyr); Generation (top-20): Command R Plus (API)
3
agtask-bm25-colbert_faiss-gpt4o-llama70b (trec_eval) (llm_eval) (paper) softbank-meisei
automatic
yes
yes
no
no
We manually wrote the prompt, checked the generated responses, and adjusted the prompt for whatever was lacking or not working properly. This loop was repeated until the results were satisfactory.
The generation process for this run is as follows: 1. For each query, the top-20 segments from the provided retrieval list were given along with the prompt to generate the response using GPT-4o (Azure API). 2. Topics for which responses could not be generated, either because they were caught by content filtering or because the format remained inappropriate even after three tries, were filtered out. 3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model. 4. Postprocessing: a. Responses from Llama 3.1 were reformatted. b. A script automatically removes any mistaken inline citations in the response "text" (not the "citations" key) and recalculates the response length (a sketch of this step follows this entry).
1 (top)
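Step 4b above strips mistaken inline citations from the response "text" fields and recalculates the response length. A minimal sketch of such a postprocessing step is shown below; the bracketed-number citation pattern and the word-count length are assumptions, not necessarily what the team's script does.

```python
import re

# Bracketed inline markers such as "[3]" or "[1][4]"; the exact pattern the
# team removed is an assumption.
INLINE_CITATION = re.compile(r"\s*\[\d+\](?:\s*\[\d+\])*")

def strip_inline_citations(answer_sentences):
    """answer_sentences: list of {"text": ..., "citations": [...]} objects.
    Removes inline markers from "text" (the "citations" key is untouched) and
    returns the cleaned sentences plus a recomputed length (word count here).
    """
    for sentence in answer_sentences:
        sentence["text"] = INLINE_CITATION.sub("", sentence["text"]).strip()
    response_length = sum(len(s["text"].split()) for s in answer_sentences)
    return answer_sentences, response_length

cleaned, length = strip_inline_citations(
    [{"text": "Basil needs warmth [2].", "citations": [2]}])
print(cleaned, length)  # [{'text': 'Basil needs warmth.', 'citations': [2]}] 3
```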
ldilab_gpt_4o (trec_eval) (llm_eval) ldisnu
automatic
yes
no
no
yes
For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines.
The best result from the evaluation using GPT-4o was a clear and concise overall score calculation, demonstrating the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process.
1 (top)
Ranked_Iterative_Fact_Extraction_and_Refinement (trec_eval) (llm_eval) (paper) TREMA-UNH
automatic
yes
yes
yes
yes
I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness.
Key fact sentences are first extracted from the highest-ranked document using an LLM, focusing on information directly relevant to the query. These extracted facts are then verified across the remaining documents, with the LLM identifying supporting sentences and confirming their reliability. Additional relevant facts are extracted from the remaining text, and a rule-based method removes redundancies. Finally, a smoothing process improves the flow and coherence of the output, resulting in a polished set of key facts. (A sketch of this extract-verify-refine loop follows this entry.)
1 (top)
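The two TREMA-UNH runs describe an extract-verify-deduplicate-smooth loop over the ranked documents. The sketch below only illustrates that control flow under stated assumptions: llm() is a hypothetical helper, and the prompts are placeholders rather than the team's.

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; stands in for whatever model the team used."""
    raise NotImplementedError

def extract_and_refine(query, ranked_docs):
    # 1. Extract key fact sentences relevant to the query from the top document.
    facts = llm(f"List key fact sentences relevant to '{query}':\n{ranked_docs[0]}").splitlines()

    # 2. Verify each fact against the remaining documents, keeping only facts
    #    for which a supporting sentence is found.
    kept = [fact for fact in facts
            if any(llm(f"Quote a sentence supporting this fact, or answer 'no'.\n"
                       f"Fact: {fact}\nDocument: {doc}").strip().lower() != "no"
                   for doc in ranked_docs[1:])]

    # 3. Extract additional relevant facts from the remaining documents.
    for doc in ranked_docs[1:]:
        kept += llm(f"List additional fact sentences relevant to '{query}':\n{doc}").splitlines()

    # 4. Rule-based redundancy removal (an exact-duplicate filter as a stand-in).
    unique = list(dict.fromkeys(kept))

    # 5. Smooth the verified facts into a fluent, coherent answer.
    return llm("Rewrite these facts as a coherent answer:\n" + "\n".join(unique))
```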
Enhanced_Iterative_Fact_Refinement_and_Prioritization (trec_eval) (llm_eval) (paper) TREMA-UNH
automatic
yes
yes
yes
yes
I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness.
The iterative fact verification process follows the RIFER method, with the LLM verifying key facts across documents and removing redundant content. An additional prompt-based refinement step removes any irrelevant parts of the key facts, ensuring conciseness. The key facts are then sorted by relevance using the LLM, and a final smoothing process enhances clarity and coherence, resulting in a polished output.
1 (top)
gpt_mini (trec_eval) (llm_eval) KML
manual
yes
no
no
yes
gpt mini with prompt
gpt mini with prompt.
1 (top)
ginger_top_5 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, summarization of the top clusters, and fluency enhancement to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 5 candidates from the retrieval baseline provided by the organizers. The response is limited to 3 sentences.
1 (top)
baseline_top_5 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We summarize the top 5 candidates from the retrieval baseline provided by the organizers with GPT-4. The response is limited to 3 sentences.
2
ginger-fluency_top_5 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, and summarization of the top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 5 candidates from the retrieval baseline provided by the organizers. The response is limited to 3 sentences.
3
ginger-fluency_top_10 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, and summarization of the top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 10 candidates from the retrieval baseline provided by the organizers.
4
ginger-fluency_top_20 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, and summarization of the top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 20 candidates from the retrieval baseline provided by the organizers.
5
iiia_dedup_p1_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. (A sketch of this sentence-level attribution step follows this entry.)
1 (top)
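The IIIA runs split the generated answer into sentences and have an annotator model decide, for each (context-doc, sentence) pair, whether the sentence can be cited from that document. A minimal sketch follows; the NLI cross-encoder is a stand-in, since the submissions do not name the annotator model, and the entailment-class index is also an assumption.

```python
import nltk
from sentence_transformers import CrossEncoder

nltk.download("punkt", quiet=True)

# Off-the-shelf NLI cross-encoder as a stand-in annotator (assumption; the
# actual annotator model used in these runs is not named).
annotator = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def annotate_citations(answer_text, context_docs):
    """For each answer sentence, list the context docs it can be cited from.

    context_docs: list of {"docid": ..., "text": ...} dicts (assumed format).
    """
    annotated = []
    for sentence in nltk.sent_tokenize(answer_text):
        pairs = [(doc["text"], sentence) for doc in context_docs]
        scores = annotator.predict(pairs)  # one row of class logits per pair
        supported = [doc["docid"] for doc, row in zip(context_docs, scores)
                     if row.argmax() == 1]  # 'entailment' class index is assumed
        annotated.append({"text": sentence, "citations": supported})
    return annotated
```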
iiia_dedup_p2_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_dedup_p1_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_dedup_p2_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
ag_rag_gpt35_expansion_rrf_20 (trec_eval) (llm_eval) IITD-IRL
automatic
yes
no
no
yes
The generation pipeline utilizes all 20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. It was an automatic run. (A sketch of the schema-constrained output follows this entry.)
This run aims to leverage the full context limit with a larger LLM.
1 (top)
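The run above constrains the GPT-3.5 output with a Pydantic schema so the response follows a defined format with IEEE-style citations. Below is a minimal sketch of such a schema and of validating a raw JSON output against it; the field names are illustrative assumptions, not the team's actual schema.

```python
from typing import List
from pydantic import BaseModel, Field

# Illustrative schema (assumption); the team's actual field names and
# structure are not given in the submission. Uses Pydantic v2.
class CitedSentence(BaseModel):
    text: str = Field(description="One sentence of the answer")
    citations: List[int] = Field(description="Indices of supporting segments, IEEE-style [n]")

class RagAnswer(BaseModel):
    answer: List[CitedSentence]

# Validate a raw model output (a JSON string) before building the submission.
raw = '{"answer": [{"text": "Basil prefers full sun.", "citations": [2, 5]}]}'
validated = RagAnswer.model_validate_json(raw)
print(validated.answer[0].citations)  # [2, 5]
```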
iiia_standard_p1_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_standard_p2_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_standard_p1_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_standard_p2_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
ag_rag_mistral_expansion_rrf_20 (trec_eval) (llm_eval) IITD-IRL
automatic
no
yes
no
yes
The generation pipeline utilizes all 20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following the instruction to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It uses Mistral for generation and NLTK to chunk the sentences. Post-processing is then applied to extract citations (a sketch follows this entry). It was an automatic run.
This run aims to leverage a small LLM with the top-20 passages, using the prompt to steer the model toward the desired response.
2
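The Mistral runs instruct the model to emit the response between << and >>, chunk it into sentences with NLTK, and post-process to extract citations. A minimal sketch under those assumptions (bracketed [n] markers are assumed as the IEEE-style citations):

```python
import re
import nltk

nltk.download("punkt", quiet=True)

def postprocess(raw_output):
    """Extract the response between << and >>, chunk it into sentences with
    NLTK, and pull bracketed [n] citations out of each sentence."""
    match = re.search(r"<<(.*?)>>", raw_output, flags=re.DOTALL)
    response = match.group(1).strip() if match else raw_output.strip()

    result = []
    for sentence in nltk.sent_tokenize(response):
        cites = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        result.append({"text": re.sub(r"\s*\[\d+\]", "", sentence).strip(),
                       "citations": cites})
    return result

print(postprocess("<<Basil needs warmth [3]. It dislikes frost [1][4].>>"))
```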
ag_rag_mistral_expansion_rrf_15 (trec_eval) (llm_eval) IITD-IRL
automatic
no
yes
no
yes
The generation pipeline utilizes the top 15 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following the instruction to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It uses Mistral for generation and NLTK to chunk the sentences. Post-processing is then applied to extract citations. It was an automatic run.
This run aims to leverage a small LLM with the top-15 passages, using the prompt to steer the model toward the desired response.
5
ag_rag_mistral_expansion_rrf_7 (trec_eval) (llm_eval) IITD-IRL
automatic
no
yes
no
yes
The generation pipeline utilizes the top 7 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following the instruction to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It uses Mistral for generation and NLTK to chunk the sentences. Post-processing is then applied to extract citations. It was an automatic run.
This run aims to leverage a small LLM with the top-7 passages, using the prompt to steer the model toward the desired response, and also to evaluate the quality of the retrieval.
6
ag_rag_gpt35_expansion_rrf_15 (trec_eval) (llm_eval) IITD-IRL
automatic
yes
no
no
yes
The generation pipeline utilizes the top-20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. It was an automatic run.
This experiment aims to compare the performance of the bigger LLM with the smaller one: given a similar set of passages, which can generate more relevant responses?
3
ag_rag_gpt35_expansion_rrf_7 (trec_eval) (llm_eval) IITD-IRL
automatic
yes
no
no
yes
The generation pipeline utilizes the first 7 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format.
This run examines system performance when only a few passages are supplied, and how much information the LLM adds beyond the passages provided in the context.
4
cir_gpt-4o-mini_Jaccard_50_0.5_100_301_p0 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Jaccard, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: Summarization Prompt. (An MMR sketch follows this entry.)
5
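The CIR runs rerank the retrieved segments with maximal marginal relevance (MMR) for diversity, parameterized by a similarity measure (Jaccard or cosine) and a lambda value encoded in the run name. A minimal sketch of greedy MMR with token-set Jaccard similarity follows; the relevance scores and inputs are placeholders.

```python
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr(segments, relevance, lam=0.5, k=50):
    """Greedy MMR: trade relevance off against similarity to already selected
    segments. lam=1.0 reduces to plain relevance ranking; lam=0.5 and k=50
    correspond to the parameters encoded in this run's name."""
    selected, candidates = [], list(range(len(segments)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((jaccard(segments[i], segments[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

segments = ["basil likes sun", "basil likes full sun", "water basil weekly"]
print(mmr(segments, relevance=[1.0, 0.9, 0.8], lam=0.5, k=2))  # [0, 2]: 2 is less redundant than 1
```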
cir_gpt-4o-mini_Jaccard_50_1.0_100_301_p0 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Jaccard, MMR Lambda: 1.0, Number of reranked segments used for generation: 50, Prompt: Summarization Prompt
7
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
1 (top)
cir_gpt-4o-mini_Cosine_50_0.25_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.25, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
8
iiia_standard_p1_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_standard_p2_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
webis-ag-run0-taskrag (trec_eval) (llm_eval) (paper) webis
automatic
yes
yes
yes
no
Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step.
We decompose the RAG pipeline into 3 individual generation tasks. 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies Extract to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the merged evidence into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform with the final submission format. (A structural sketch follows this entry.)
1 (top)
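The webis taskrag runs split generation into three prompted tasks (Extract, Combine, Condense), merge per-document extracts pairwise in a tree-like fashion, and parse explicit [n] references from the final response. The sketch below mirrors only that structure; llm() is a hypothetical helper and the prompts are placeholders, not the team's.

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical LLM call standing in for the models used by the run."""
    raise NotImplementedError

def extract(query, doc, idx):
    return llm(f"Extract the most salient information for '{query}' from document [{idx}]:\n{doc}")

def combine(a, b):
    return llm(f"Merge these two pieces of evidence, keeping their [n] references:\n{a}\n{b}")

def condense(query, evidence):
    return llm(f"Write a concise answer to '{query}' from this evidence, keeping [n] references:\n{evidence}")

def taskrag(query, docs):
    # Extract from every document, then merge pairwise, tree-like, until one
    # piece of evidence remains.
    level = [extract(query, d, i) for i, d in enumerate(docs)]
    while len(level) > 1:
        level = [combine(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    response = condense(query, level[0])
    # Parse explicit references such as [0] into the submission's citation format.
    citations = sorted({int(n) for n in re.findall(r"\[(\d+)\]", response)})
    return response, citations
```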
cir_gpt-4o-mini_Cosine_50_0.75_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.75, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
9
webis-ag-run1-taskrag (trec_eval) (llm_eval) (paper) webis
automatic
yes
yes
yes
no
Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step.
We decompose the RAG pipeline into 3 individual generation tasks. 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies Extract to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the merged evidence into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform with the final submission format.
2
cir_gpt-4o-mini_Cosine_50_1.0_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 1.0, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
6
cir_gpt-4o-mini_Cosine_20_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 20, Prompt: Explicit RAG task prompt
10 (bottom)
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p2 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: RAG for beginner level
3
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p3 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: RAG for expert level
4
iiia_standard_p1_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_standard_p2_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
cir_gpt-4o-mini_no_reranking_50_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Number of segments used for generation: 50, Prompt: Explicit RAG task prompt
2
iiia_dedup_p1_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
webis-ag-run3-reuserag (trec_eval) (llm_eval) (paper) webis
automatic
no
no
yes
no
N/A
Segments from the baseline run were clustered automatically using SBERT embeddings. The top-ranked sentences from each cluster were concatenated to form the response. (A clustering sketch follows this entry.)
4
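The run above clusters sentences from the baseline segments with SBERT embeddings and concatenates the top-ranked sentences of the clusters. A minimal sketch with sentence-transformers and k-means follows; the embedding model, the cluster count, and the "best-ranked sentence per cluster" selection are assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Embedding model and cluster count are assumptions; the submission only
# states that SBERT embeddings and automatic clustering were used.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_and_compose(sentences, ranks, n_clusters=3):
    """Cluster sentences, then pick the best-ranked sentence from each cluster
    (ranks follow the baseline retrieval order, lower = better) and join them."""
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    picked = []
    for cluster in range(n_clusters):
        members = [i for i, label in enumerate(labels) if label == cluster]
        if members:
            picked.append(min(members, key=lambda i: ranks[i]))
    return " ".join(sentences[i] for i in sorted(picked, key=lambda i: ranks[i]))
```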
webis-ag-run2-reuserag (trec_eval) (llm_eval) (paper) webis
automatic
no
no
yes
no
Baseline introduction, middle, and conclusion sentences were used as 'prompts' to cluster sentences into 3 groups.
This run uses the baseline retrieval run as retrieval input. For generation, all sentences are split into 3 groups based on the prompt sentences by computing semantic similarity with SBERT. We then concatenate the top-ranked sentences to form the response.
3
iiia_dedup_p1_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_dedup_p2_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_dedup_p2_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
cohere+post_processing (trec_eval) (llm_eval) KML
manual
yes
no
no
no
Cohere prompt and post-processing.
A Cohere prompt for each individual answer and a single prompt for the whole RAG response; post-processing was then applied to the answers.
2
UDInfolab.AG-v1 (trec_eval) (llm_eval) InfoLab
manual
yes
no
no
no
topk20 + openai
topk20 + openai
2
UDInfolab.AG-v2 (trec_eval) (llm_eval) InfoLab
automatic
yes
no
no
no
automatic
Topk20+AG
1 (top)
gpt_mini_double_prompt (trec_eval) (llm_eval) KML
manual
yes
no
no
no
A two-step prompt instead of a single prompt: the first step for finding the answer, the second for citations.
A two-step prompt instead of a single prompt: the first step for finding the answer, the second for citations.
1 (top)