Runtag | Org | Is this a manual (human-intervention) or automatic run? | Is this a generation-only run (Retrieval is not used!)? | Does the retrieval pipeline leverage neural networks? | Was this run padded with results from a baseline run? | What views of the corpora does the pipeline leverage? | Does this pipeline additionally leverage web search or other corpora? | Does this run leverage proprietary models in any step of the retrieval pipeline? | Does this run leverage proprietary models in any step of the generation pipeline? | Does this run leverage open-weight LLMs (> 5B parameters) in any step of the retrieval pipeline? | Does this run leverage open-weight LLMs (> 5B parameters) in any step of the generation pipeline? | Does this run leverage smaller open-weight language models in any step of the retrieval pipeline? | Does this run leverage smaller open-weight language models in any step of the generation pipeline? | What would you categorize the retrieval pipeline as? | Please describe how you went about prompt engineering the generation pipeline | Please provide a short description of this run | Please give this run a priority for inclusion in manual assessments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
baseline_frag_rag24.test_gpt-4o_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | No | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Uses a variation of the ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large) Second Stage (top-3K): RRF(First Stage, monoT5-3B) Third Stage (top-100): RRF(Second Stage, RankZephyr) Generation (top-20): GPT-4o (RRF is illustrated in a sketch following the table) | 1 (top) |
neurag (trec_eval) (llm_eval) | neu | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | No | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | cot:
"""
You are a cognitive scientist, to answer the following question:
{question}
I will provide you with several retrieved passages:
Passages:
{passages}
Task Description:
Please extract foundational knowledge that may be familiar to the model or advanced information beyond the model's already familiar foundational knowledge from these passages, and analyze the role of these contents.
Summarize and consolidate these contents, which should deepen the model's understanding of the question through familiarity with these basic and advanced pieces of information.
This process aims to encourage the model to comprehend the question more thoroughly and expand its knowledge boundaries.
"""
generation:
"""
This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful and detailed answers to the user's question based on the context references. The assistant should also indicate when the answer cannot be found in the context references.
Please give a full and complete answer for the question. Cite each context document inline that supports your answer within brackets [] using the IEEE format. Each sentence should have at most three citations. Order the citations in decreasing order of importance. Never include or mention anything about references, this is already provided, just answer the question such that each sentence has one or more sentence-level citations and say nothing else. To deepen the language model's understanding of the question through familiarity with basic and advanced pieces of information. Encourage the language model to comprehend the question more thoroughly and expand its knowledge boundaries. I retrieved some foundational knowledge that is familiar to the model or advanced information beyond the language model's already familiar foundational knowledge from these passages.
""" | We are based on gpt-4o. First, call the first step to generate cot for passages, and then send cot to generate the answer. | 1 (top) |
neuragfix (trec_eval) (llm_eval) | neu | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | No | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | cot:
You are a cognitive scientist, to answer the following question:
{question}
I will provide you with several retrieved passages:
Passages:
{passages}
Task Description:
Please extract foundational knowledge that may be familiar to the model or advanced information beyond the model's already familiar foundational knowledge from these passages, and analyze the role of these contents.
Summarize and consolidate these contents, which should deepen the model's understanding of the question through familiarity with these basic and advanced pieces of information.
This process aims to encourage the model to comprehend the question more thoroughly and expand its knowledge boundaries.
generation:
This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful and detailed answers to the user's question based on the context references. The assistant should also indicate when the answer cannot be found in the context references.
Please give a full and complete answer for the question. Cite each context document inline that supports your answer within brackets [] using the IEEE format. Each sentence should have at most three citations. Order the citations in decreasing order of importance. Never include or mention anything about references, this is already provided, just answer the question such that each sentence has one or more sentence-level citations and say nothing else. To deepen the language model's understanding of the question through familiarity with basic and advanced pieces of information. Encourage the language model to comprehend the question more thoroughly and expand its knowledge boundaries. I retrieved some foundational knowledge that is familiar to the model or advanced information beyond the language model's already familiar foundational knowledge from these passages. | This run is based on GPT-4o. The CoT prompt is called first to generate a chain of thought over the retrieved passages; the chain of thought is then passed to the generation prompt to produce the answer. The script provided by the organizers was used to make corrections. | 2 |
baseline_frag_rag24.test_command-r-plus_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Uses a variation of ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large)
Second Stage (top-3K): RRF(First Stage, monoT5-3B)
Third Stage (top-100): RRF(Second Stage, RankZephyr)
Generation (top-20): Command R Plus | 3 |
UWCrag (trec_eval) (llm_eval) (paper) | WaterlooClarke | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Ensemble/Fusion of First Stages | Few shot prompting | Ask the LLM to generate and extract a summary from the reference documents, and attribute them in one prompt. | 1 (top) |
UWCrag_stepbystep (trec_eval) (llm_eval) (paper) | WaterlooClarke | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Ensemble/Fusion of First Stages | Few shot prompting | Ask the LLM to extract and summarize an answer from the retrieved documents and attribute them, but for citations, do one sentence of the answer per prompt. | 1 (top) |
UWCgarag (trec_eval) (llm_eval) (paper) | WaterlooClarke | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Ensemble/Fusion of First Stages | Few shot prompting | Ask the LLM to generate an answer first, then do retrieval using the query and the generated answer, and then attribute it. | 1 (top) |
rag_bm25-colbert_faiss-gpt4o-llama70b (trec_eval) (llm_eval) (paper) | softbank-meisei | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | yes | yes | no | Ensemble/Fusion of First Stages | Manually wrote the prompt, checked the generated responses, adjusted the prompt for what was lacking or not working properly.
The above loop was repeated until the output was found satisfactory. | The retrieval process of this run is as follows:
1. Topic list preprocessing stage
a. Used GPT-4o to correct grammar, spelling mistakes, and text incompletions
b. Manually checked the topic list to make sure no errors remained
2. BM25 to retrieve the top-100 segments
3. Vector embeddings generation stage
a. Used castorini/tct_colbert-v2-hnp-msmarco to generate embeddings for the segment corpus.
b. Used faiss indexing to create an index at document level (containing segment embeddings).
c. Used castorini/tct_colbert-v2-msmarco-cqe to generate embeddings for the preprocessed topics.
4. For each topic, filtered the set of documents to search for based on the bm25 top-100 retrieval results.
5. Retrieved the top-100 segments from each filtered document for the query.
6. Grouped all sets of retrieved segments and sorted them in descending order
7. The top-100 from the sorted list were submitted as the result
Generation process of this run is as follows:
1. The top-20 segments for each query were given along with the prompt to generate the response using GPT-4o (Azure API)
2. Filtered out the topics for which responses could not be generated, either because they were caught by content filtering or because the format was still inappropriate after three tries.
3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model.
4. Postprocessing:
a. Reformatted responses from Llama3.1
b. Script to automatically remove any mistaken inline citations in response "text" (not the "citations" key), and recalculate the response length. | 1 (top) |
ragtask-bm25-rank_zephyr-gpt4o-llama70b (trec_eval) (llm_eval) (paper) | softbank-meisei | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Ensemble/Fusion of First Stages | Manually wrote the prompt, analyzed the generated responses and adjusted the prompt for whatever was lacking in the response.
The above process was repeated until the output was found satisfactory. | Retrieval process of this run is as follows:
1. Topic list preprocessing stage:
a. Used GPT-4o to preprocess the query in order to correct grammar, spelling errors, and text incompletions
b. Manually checked all 301 topics to correct any errors that still existed
2. BM25 to retrieve the relevant top-100 segments
3. RankZephyr to rerank the retrieved top-100 segments
Generation process of this run is as follows:
1. The top-20 segments for each query were given along with the prompt to generate the response using GPT-4o (Azure API)
2. Filtered out the topics for which responses could not be generated, either because they were caught by content filtering or because the format was still inappropriate after three tries.
3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model.
4. Postprocessing:
a. Reformatted responses from Llama3.1
b. Script to automatically remove any mistaken inline citations in response "text" (not the "citations" key), and recalculate the response length. | 2 |
LAS-splade-mxbai-rrf-mmr8 (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Generation-in-the-Loop Pipeline | Slight modifications to the Ragnarok example (https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py) to specify a word count range and ask for citations for each sentence. Also added an example output to the prompt. | Topic decomposition with GPT4o, query topic and subtopics with SPLADE retrieval, mxbai embeddings to rerank retrieved results, RRF to combine results from topic/subtopics, MMR to choose the top 20 (see the MMR sketch following the table), validated relevance of the top-20 segments with a GPT4o prompt, then provided relevant segments to GPT4o for answer generation. | 1 (top) |
LAS-splade-mxbai-mmr8-RAG (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | no | no | Multi-Stage Pipeline pointwise | Slight modifications to the Ragnarok example to specify word count and ask for reference citations for each sentence. Also provided an example output in the prompt. | SPLADE retrieval for topic question, mxbai embedding rerank, MMR to select top 20 after rerank, GPT4o relevance evaluation of top 20 segments, then provided relevant segments to GPT4o for answer generation | 2 |
LAS-T5-mxbai-mmr8-RAG (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | no | no | Multi-Stage Pipeline pointwise | Modified the Ragnarok example to specify word count, include citations for each sentence, and provide an output example | First-stage retrieval with T5 embeddings and cosine similarity, rerank with MXBAI embeddings, select top-20 reranked with MMR, validate relevance of top 20 with GPT4o query, pass relevant segments to GPT4o for answer generation. | 3 |
BEST_cot_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using Chain-of-Thought prompting and giving two examples of baseline generation results to the LLM (gpt-3.5-turbo) before generating the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. Chain-of-Thought prompting is employed, where two example baseline generation results are provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 1 (top) |
BEST_gpt3.5_another_prompt (trec_eval) (llm_eval) | citi | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using a different prompt with the LLM (gpt-3.5-turbo) to generate the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. Our prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 3 |
SECOND_cot_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | no | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. Our prompt is provided to GPT-3.5 turbo before generating the final results. | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. Our prompt is provided to GPT-3.5 turbo before generating the final results. | 2 |
BEST_gpt3.5_new_prompt (trec_eval) (llm_eval) | citi | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using a different prompt with the LLM (gpt-3.5-turbo) to generate the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. Our prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 4 |
BEST_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | yes | yes | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using the baseline prompt with the LLM (gpt-3.5-turbo) to generate the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. The baseline prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 5 |
SECOND_gpt3.5_new_prompt (trec_eval) (llm_eval) | citi | manual | no | no | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise | This run involves using a different prompt with the LLM (gpt-3.5-turbo) to generate the results. | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. Our prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 6 |
SECOND_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | no | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise | This run involves using the baseline prompt with the LLM (gpt-3.5-turbo) to generate the results. | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. The baseline prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 7 |
iiresearch-bm25-top10-llama3-8b-instruct (trec_eval) (llm_eval) (paper) | ii_research | manual | no | no | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | no | no | Traditional Only | Instruct the LLM to be an AI assistant tasked with answering questions based on a set of provided references; every time it answers a question, each sentence must be followed by a citation list indicating the specific document indexes from the provided references that support that sentence. | A standard RAG pipeline retrieves the top 10 relevant documents using BM25 as additional context and instructs LLaMa3-8B-instruct to generate responses. | 1 (top) |
FT-llama3 (trec_eval) (llm_eval) | uog-tht | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise | automatic | BM25+Mixedbread for retriever
Llama3-8B finetuned with citation-enabled QA datasets for generator | 1 (top) |
zeph_test_rag_rrf_expand_query (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The generation pipeline utilizes all 20 segments. The model was guided to produce a structured output by following the Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. | The pipeline consists of three stages. The first stage leverages BM25 combined with dense retrieval. The second stage employs the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Prior to dense retrieval, the query is processed by a small LLM to generate similar queries. Retrieval is then performed using both the original and generated queries, followed by the application of RRF. | 1 (top) |
zeph_test_rag_rrf_raw_query (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The generation pipeline utilizes all 20 segments. The model was guided to produce a structured output by following the Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Raw queries are used for retrieval. Following the first stage, RRF is performed on the BM25 and dense retrieval results. | 2 |
zeph_test_rag24_doc_query_expansion+rrf (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The generation pipeline utilizes all 20 segments. The model was guided to produce a structured output by following the Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. | The pipeline consists of three stages. The first stage leverages BM25 combined with dense retrieval. The second stage employs the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before dense retrieval, the query is expanded to generate a small passage on the central theme. Retrieval is then performed using both the raw query and the generated paragraph. RRF is performed at the first stage. | 4 |
LAS-splade-mxbai-rrf-mmr8-doc (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | Both Corpora | no | yes | yes | no | no | no | no | Generation-in-the-Loop Pipeline | Modified the Ragnarok example to include a word count and an instruction to provide citations for each sentence, and added a sample output example to the prompt. | Topic decomposition with GPT4o, SPLADE segment retrieval for topic+sub-topics, mxbai embeddings rerank, RRF to aggregate retrieval, MMR to get top-20 for generation, validate relevance of top-20 with GPT4o, get entire document for relevant segments, generate answer by prompting gpt4o with documents. | 4 |
oneshot_post_sentenced (trec_eval) (llm_eval) | buw | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | no | no | Learned Dense Only | Prompt engineering was a manual process of starting with a simple prompt and improving it iteratively based on reviews of the generation results. Some of the challenges included keeping responses:
- brief
- non-repetitive
- without self referencing or citing | The steps for this run are as follows:
1. Retrieve the top segments from the database using the embeddings (Weaviate vector DB with dense vectors)
2. Feed all segments all at once to the LLM in its system prompt and have it generate an answer for the query with the segments
3. Get the answer and separate it into sentences
4. Run the sentences through the embedding model
5. Look at similarity between the answer sentence and the segments used sentence by sentence (response_sentence -> segment_sentence)
6. Cite segments that produce >= 0.7 cosine similarity (0.7 was estimated through analysing the distribution of cosine similarities across queries); see the citation-attribution sketch following the table | 1 (top) |
LAS_splad_mxbai-rrf-occams_50_RAG (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | yes | yes | Ensemble/Fusion of First Stages | The title was used in the direct generation, very similar to the baseline. sys_prompt = """This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful and detailed answers to the user's question based on the context. The assistant should also indicate when the answer cannot be found in the context.""" user_prompt = f"""INSTRUCTION: Please give a fluent 400-word summary that answers the given question as completely as possible. Each sentence of the answer should represent unique information inferred from the context provided. After each sentence, cite each context document that supports your answer within brackets [] using the IEEE format. An example answer should look like: During the investigation, President Bill Clinton initially denied the accusations by stating, “I never told anybody to lie, not a single time; never” and emphasized, “These allegations are false. And I need to go back to work for the American people” [1]. On January 26, 1998, Clinton publicly remarked, “I did not have sexual relations with that woman, Miss Lewinsky” [1]. However, on August 17, 1998, Clinton admitted before a federal grand jury that he had engaged in an “improper physical relationship” with Monica Lewinsky [1], [16]. During the Senate trial that followed, Clinton’s defense hinged on the interpretation of the word ‘is’ to contest the truthfulness of his earlier denials [26]. Despite his previous denials, the evidence, including the infamous blue dress with Clinton’s DNA, contradicted his initial statements, ultimately leading to charges of perjury and obstruction of justice [1], [16], [18]. QUESTION: {topic_text} CONTEXTS: {context_text} INSTRUCTION:Please give a complete answer to the question. Cite each context document that supports your answer within brackets [] using the IEEE format.""" return sys_prompt, user_prompt | SPLADE full feature, segment summarizer, which is very similar to our priority 1 run; however, redundancy is removed by employing an extractive summarizer, occams, which approximates a bounded maximal coverage of the bigrams in the top 50 segments | 5 |
dilab_repllama_listt5_pass1_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 4 |
qrant_bge_gemini (trec_eval) (llm_eval) | SGU | manual | yes | yes | yes | MS MARCO v2.1 Segment Only | no | no | yes | no | yes | yes | no | Traditional Only | Due to hardware problems and limitations, we only used 80% of the organizer's data. We used Qdrant (cosine) as the storage and the public bge embedding model, then used Gemini 1.5 Pro to generate the answer. | Due to hardware problems and limitations, we only used 80% of the organizer's data. We used Qdrant (cosine) as the storage and the public bge embedding model, then used Gemini 1.5 Pro to generate the answer. | 1 (top) |
PG-mistral (trec_eval) (llm_eval) | uog-tht | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Multi-Stage Pipeline pointwise | automatic | BM25+Mixedbread
Mistral + Post citing | 1 (top) |
ISIR-IRIT-zephyr_p2 (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | no | yes | Multi-Stage Pipeline pointwise | Several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed, for example, that the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. We additionally directed the model to only use the passages it finds relevant to the answer, thus allowing for a sort of additional reranking. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic evaluation metrics (citation evaluation). | This run first retrieves a list of passages using BM25 + MonoT5. The top 2 documents are fed to an LLM (zephyr 3b). The LLM is instructed to generate an answer with citations. We additionally direct the model to only use the passages it finds relevant to the answer, thus allowing for a sort of additional reranking. | 1 (top) |
ISIR-IRIT-zephyr_query_gen (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | yes | yes | Generation-in-the-Loop Pipeline | This run consists of two generation prompts: Answer generation and subquery generation. As a part of the retrieval pipeline, an LLM is prompted to generate a list of subqueries to be later used for retrieval. The subquery generation is done using few-shot prompting. We first selected our fewshot examples. We used an external dataset (HAGRID) and selected from their training-set queries that have long answers (with 2 or more citations mentioned), for each query, we gathered a number of subqueries from Google Search API + freely written one by us. We select a subset of these queries with the best retrieval results on the training set. These queries are used as example subqueries in the prompt. For answer generation, several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed that, for instance the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. We additionally directed the model to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic metrics of evaluation (citation evaluation) | In this run, we augment the retrieval pipeline with subqueries. We first instruct an LLM (zephyr 3B) to generate a list of 4 subqueries given the user's query (official topic) using fewshot prompting. The examples in the fewshot prompting are selected from a trainingset based on how well they improved the retrieval results on their respective user queries. For each generated subquery + the user's query we run a two-stage retrieval model (BM25+MonoT5). We obtain a list of relevant passages for each subquery + user query. We aggregate these lists into one list using a reranking approach. The final list is fed to an LLM (zephyr 3B) that is instructed to generate an answer with its citations. The model is additionally directed to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking. | 2 |
ISIR-IRIT-zephyr_sprompt_3p (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | no | yes | Multi-Stage Pipeline pointwise | Several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed, for example, that the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic evaluation metrics (citation evaluation). | In this run, a two-stage retrieval pipeline is used to first retrieve a list of passages relevant to the topic. An LLM (Zephyr 3B) is then instructed to generate an answer with citations using the top 3 passages returned in the retrieval stage. The prompt used in this run is simple and straightforward, instructing the model to provide a comprehensive answer. | 4 |
ISIR-IRIT-zephyr_query_gen_3p (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | yes | yes | Generation-in-the-Loop Pipeline | This run consists of two generation prompts: Answer generation and subquery generation. As a part of the retrieval pipeline, an LLM is prompted to generate a list of subqueries to be later used for retrieval. The subquery generation is done using few-shot prompting. We first selected our fewshot examples. We used an external dataset (HAGRID) and selected from their training-set queries that have long answers (with 2 or more citations mentioned), for each query, we gathered a number of subqueries from Google Search API + freely written one by us. We select a subset of these queries with the best retrieval results on the training set. These queries are used as example subqueries in the prompt. For answer generation, several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed that, for instance the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. We additionally directed the model to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic metrics of evaluation (citation evaluation) | In this run, we augment the retrieval pipeline with subqueries. We first instruct an LLM (zephyr 3B) to generate a list of 4 subqueries given the user's query (official topic) using fewshot prompting. The examples in the fewshot prompting are selected from a trainingset based on how well they improved the retrieval results on their respective user queries. For each generated subquery + the user's query we run a two-stage retrieval model (BM25+MonoT5). We obtain a list of relevant passages for each subquery + user query. We aggregate these lists into one list using a reranking approach. The top 3 passages from the final list are fed to an LLM (zephyr 3B) that is instructed to generate an answer with its citations. The model is additionally directed to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking | 3 |
ICL-mistral (trec_eval) (llm_eval) | uog-tht | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise | automatic | BM25+Mixedbread
ICL + Output Processing | 1 (top) |
Ranked_Iterative_Fact_Extraction_and_Refinement_RIFER_-_bm25 (trec_eval) (llm_eval) (paper) | TREMA-UNH | automatic | no | no | yes | MS MARCO v2.1 Segment Only | no | no | yes | yes | yes | no | yes | Traditional Only | I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness. | Key fact sentences are first extracted from the highest-ranked document using an LLM, focusing on information directly relevant to the query. These extracted facts are then verified across the remaining documents, with the LLM identifying supporting sentences and confirming their reliability. Additional relevant facts are extracted from the remaining text, and a rule-based method removes redundancies. Finally, a smoothing process improves the flow and coherence of the output, resulting in a polished set of key facts. | 1 (top) |
zeph_test_rag_rrf_expand_query_mistral (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 3 |
zeph_test_rag_rrf_expand_mistral_top_15 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 5 |
iiia_dedup_p1_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_dedup_p2_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
zeph_test_rag_rrf_expand_top_5 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. A smaller number of passages seems to produce a shorter response. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 6 |
zeph_test_rag_rrf_expand_top_10 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 7 |
iiia_dedup_p1_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_dedup_p2_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p1_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p2_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p1_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p2_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
dilab_repllama_listt5_pass2_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 3 |
dilab_repllama_listt5_pass3_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 1 (top) |
dilab_repllama_listt5_pass4_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 2 |
ielab-b70bf-70bqfs-ad_hoc (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | no | no | Generation-in-the-Loop Pipeline | Copy from Paper: https://arxiv.org/abs/2407.01796 | Retriever: Fusion(stella+BM25) -> Pooled -> setwise
Generator: Llama-3.1-70b-instruct, quantized, vllm
Prompt: Few shot prompt
Attributor: In Context from generator prompt, ad_hoc | 2 |
ielab-b8bf-8bfs-ad_hoc (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Copy from paper: https://arxiv.org/abs/2407.01796 | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8B-instruct
Prompt: Few shot prompt with example
Attributor: In context from generator prompt, ad_hoc | 9 |
iiia_dedup_p1_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p1_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p2_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p2_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p1_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in reverse order of relevance. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p1_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in order of relevance. | This run uses the top-20 documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
ielab-b70bf-70bqp-70bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-70b-instruct, quantised, vllm
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-70b-instruct | 1 (top) |
iiia_standard_p2_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p2_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
ielab-b8bf-8bzs-ad_hoc (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Copied from the paper: https://arxiv.org/abs/2407.01796 | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8b-instruct
Prompt: Zero-shot Prompt without examples
Attributor: in-context; answers are generated directly (ad_hoc) | 7 |
ielab-b8bf-8bp-8ba (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: zero-shot | 4 |
webis-manual (trec_eval) (llm_eval) (paper) | webis | manual | no | yes | yes | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | We did not directly use prompts, as the responses were manually generated by humans for 31 topics (ca. 40 hours of manual work; creating a manual response often takes between 1 and 2 hours per topic). The responses padded from the baseline are from baseline_rag24.test_gpt-4o_top20 without any modification. The topics for which we manually generated responses are ['2024-109837', '2024-142125', '2024-127349', '2024-224701', '2024-126326', '2024-224013', '2024-42464', '2024-34595', '2024-213491', '2024-37566', '2024-34710', '2024-29222', '2024-42738', '2024-6587', '2024-214744', '2024-6778', '2024-215952', '2024-79154', '2024-23680', '2024-41849', '2024-24226', '2024-36302', '2024-152259', '2024-34582', '2024-27366', '2024-145979', '2024-36935', '2024-216592', '2024-32912', '2024-153051', '2024-128784']. | We created manual responses for 31 topics (ca. 40 hours of manual work; creating a manual response often takes between 1 and 2 hours per topic). The responses padded from the baseline are from baseline_rag24.test_gpt-4o_top20 without any modification. The topics for which we manually generated responses are ['2024-109837', '2024-142125', '2024-127349', '2024-224701', '2024-126326', '2024-224013', '2024-42464', '2024-34595', '2024-213491', '2024-37566', '2024-34710', '2024-29222', '2024-42738', '2024-6587', '2024-214744', '2024-6778', '2024-215952', '2024-79154', '2024-23680', '2024-41849', '2024-24226', '2024-36302', '2024-152259', '2024-34582', '2024-27366', '2024-145979', '2024-36935', '2024-216592', '2024-32912', '2024-153051', '2024-128784']. | 1 (top) |
ielab-b-8bp-8bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25)
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: few-shot | 3 |
ielab-b8b-8bp-8bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: few-shot | 7 |
webis-rag-run0-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | This run uses the webis-01 retrieval run as retrieval input. For generation, we decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies 'Extract' to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the result into the final response (a sketch of this tree-like orchestration appears after this table). Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform to the final submission format. | 1 (top) |
buw (trec_eval) (llm_eval) | buw | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | yes | yes | yes | Traditional Only | The prompt engineering in the code involves carefully crafting the input queries to the language model (LLM) and designing a pipeline to ensure that the generated responses are relevant, accurate, and well-supported by retrieved data. Here's how it is done:
1. Query Construction: The `query` is generated for each `topic_id`, ensuring that the input to the LLM is clear and directly related to the topic being addressed. This step sets the foundation for generating a contextually relevant response.
2. Contextual Retrieval: The pipeline retrieves relevant segments from a collection based on the `query`. These segments provide the LLM with a focused context, which is crucial for generating a response that aligns closely with the desired topic.
3. LLM Prompting: The segments are passed to the LLM as part of the prompt. The design of this prompt is crucial because it ensures that the model considers the most relevant information during response generation. This approach helps in obtaining a more precise and contextually appropriate output.
4. Response Splitting and Refinement: The generated response from the LLM is split into sentences, each of which is treated as a distinct unit of information. This allows for a more granular alignment between the prompt (query and segments) and the LLM's output, ensuring that each sentence can be individually analyzed for relevance.
5. Similarity-Based Citations: After generating the response, the sentences are compared with the original retrieved segments using cosine similarity. This process refines the LLM's output by tying each sentence back to the most relevant segments, enhancing the accuracy and reliability of the final response (a sketch of this similarity-based citation step appears after this table). | This generation pipeline revolves around a systematic approach to retrieving and generating relevant content from a collection based on a given query. The pipeline is structured as follows:
1. Query Handling: For each `topic_id`, a query is executed to retrieve relevant text segments using a near-text search, returning metadata that includes the distance and certainty metrics.
2. Data Retrieval: The retrieved objects are then processed to fetch the associated vectors, text segments, and document IDs. This is necessary because the initial query response does not include the vectors.
3. Response Generation: A request is made to an LLM using the query and retrieved segments, generating a response that is then split into individual sentences for further processing.
4. Similarity Calculation: Each sentence is encoded into a vector and compared with the vectors of the retrieved segments using cosine similarity. This determines the relevance of each segment to the sentence.
5. Thresholding and Citation Assignment: The most relevant segments, determined by a calculated threshold, are assigned as citations for each sentence.
6. Output Formatting: The output is structured into a JSON object, which includes metadata such as the run ID, topic, references, response length, and the generated response with citations. This object is then serialized and written to a JSONL file. | 2 |
webis-rag-run1-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | This run uses the webis-01 retrieval run as retrieval input. For generation, we decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies 'Extract' to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the result into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform to the final submission format. | 2 |
ielab-b8bf-8bp-8bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: few-shot | 4 |
webis-rag-run3-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | yes | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | This run uses the webis-01 retrieval run as retrieval input. For generation, we decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies 'Extract' to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the result into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform to the final submission format. | 4 |
buw_2 (trec_eval) (llm_eval) | buw | manual | yes | yes | no | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | yes | Traditional Only | The prompt engineering in the provided code involves crafting queries and responses for efficient interaction with the language model and the retrieval system. Here's how it is done:
1. **Query Construction**: The `query` is formulated based on the `topic_id`, which is used to retrieve relevant documents from the collection. The prompt engineering here ensures that the query is precise and relevant to the topic, optimizing the retrieval of the most pertinent text segments.
2. **LLM Response Prompting**: After retrieving the relevant text segments, these segments are used as context in a prompt sent to the language model (`make_llm_request(query, segment_list)`). The prompt is designed to generate a coherent and informative response from the model based on the provided context.
3. **Sentence-Level Analysis**: The LLM-generated response is split into individual sentences, each treated as a distinct prompt. This allows for a more granular analysis of the response, making it easier to match each sentence with relevant citations based on semantic similarity.
4. **Similarity-Based Citation**: For each sentence in the LLM's response, its semantic similarity with the retrieved text segments is calculated. This ensures that the model's output is supported by relevant and contextually similar citations, enhancing the credibility and relevance of the generated content.
Through these steps, the prompts are carefully engineered to retrieve, generate, and align information effectively, ensuring high-quality, contextually appropriate outputs. | dense | 2 |
webis-rag-run4-reuserag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | no | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Baseline introduction, middle, and conclusion sentences were used as 'prompts' to cluster sentences into 3 groups. | This run uses the webis-01 retrieval run as retrieval input. For generation, we split all sentences into 3 groups based on the prompt sentences by calculating semantic similarity with SBERT, then concatenate the top-ranked sentences to form the response. | 5 |
webis-rag-run5-reuserag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | no | yes | no | yes | yes | Multi-Stage Pipeline pointwise+pair/listwise | N/A | Segments from the webis-01 run were clustered automatically using SBERT embeddings. The top-ranked sentences from each cluster were concatenated to form the response (a sketch of this clustering approach appears after this table). | 6 |
buw_3 (trec_eval) (llm_eval) | buw | manual | yes | yes | no | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | yes | Traditional Only | Manually adjusted the prompt wording to fine-tune the generation. | dense | 2 |
ruc001 (trec_eval) (llm_eval) | Ruc01 | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | no | yes | Traditional Only | Q: What state is home to the university that is represented in sports by George Washington Colonials men's basketball?
A: First, the education institution has a sports team named George Washington Colonials men's basketball in is George Washington University , Second, George Washington University is in Washington D.C. The answer is {Washington, D.C.}.
Q: Who lists Pramatha Chaudhuri as an influence and wrote Jana Gana Mana?
A: First, Bharoto Bhagyo Bidhata wrote Jana Gana Mana. Second, Bharoto Bhagyo Bidhata lists Pramatha Chaudhuri as an influence. The answer is {Bharoto Bhagyo Bidhata}.
Q: Who was the artist nominated for an award for You Drive Me Crazy?
A: First, the artist nominated for an award for You Drive Me Crazy is Britney Spears. The answer is {Jason Allen Alexander}.
Q: What person born in Siegen influenced the work of Vincent Van Gogh?
A: First, Peter Paul Rubens, Claude Monet and etc. influenced the work of Vincent Van Gogh. Second, Peter Paul Rubens born in Siegen. The answer is {Peter Paul Rubens}.
Q: What is the country close to Russia where Mikheil Saakashvii holds a government position?
A: First, China, Norway, Finland, Estonia and Georgia is close to Russia. Second, Mikheil Saakashvii holds a government position at Georgia. The answer is {Georgia}.
Q: What drug did the actor who portrayed the character Urethane Wheels Guy overdosed on?
A: First, Mitchell Lee Hedberg portrayed character Urethane Wheels Guy. Second, Mitchell Lee Hedberg overdose Heroin. The answer is {Heroin}.""" | Serialization pipeline of a retrieve-reorder-then-generate model. | 1 (top) |
buw_5 (trec_eval) (llm_eval) | buw | manual | yes | yes | no | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | yes | Traditional Only | Fine-tuned the prompt manually until the expected output was produced. | dense | 1 (top) |
UDInfolab.bgeV2 (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | yes | no | no | Learned Dense Only | Tested several prompts | This run uses BGE + gpt-4o-mini | 3 |
UDInfolab.RAG.AnsAI (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise | I started with all the documents; using BGE I retrieved the 500 most similar, then in the next stage used a reranker to obtain the 100 most relevant documents, and finally used GPT-4 for generation. | BGE + reranker + doc2query | 2 |
UDInfolab.RAG.Query (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise | I started with all the documents; using BGE I retrieved the 500 most similar, then in the next stage used a reranker to obtain the 100 most relevant documents; however, I used an LLM to rewrite the query to make it better suited to the retrieval step. | BGE + reranker (modified query) | 1 (top) |
ielab-b70bf-70bqp-rarr (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-70b-instruct, quantised, vllm
Prompt: Plain answer generation without citation ids
Attributor: RARR | 9 |
UDInfolab.RAG.bge.tuned (trec_eval) (llm_eval) | InfoLab | manual | yes | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | no | no | Multi-Stage Pipeline pointwise | BGE + tuned parameters | BGE + tuned parameters | 6 |
UDInfolab.RAG.bge.QueryAgm.tuned (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | no | no | Multi-Stage Pipeline pointwise | BGE + reranker used in the retrieval step + OpenAI with modified parameters | BGE + reranker used in the retrieval step + OpenAI with modified parameters | 5 |
UDInfolab.RAG.bge.QueryAnsAI.tuned (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | yes | no | no | Traditional Only | BGE + reranker + query2doc used in the retrieval step + OpenAI with modified parameters | BGE + reranker + query2doc used in the retrieval step + OpenAI with modified parameters | 4 |
listgalore_gpt4o_ragnarokv4_top20 (trec_eval) (llm_eval) | h2oloo | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Claude-aided prompt building | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large) Second Stage (top-3K): RRF(First Stage, monoT5-3B) Third Stage (top-100): RRF(RankGPT4-o, RankLLaMA3.1-70B, RankZephyr) (see the RRF sketch after this table)
Generation (top-20): Ragnarok V4 Prompt - GPT-4o | 1 (top) |
listgalore_l31-70b_ragnarokv4_top20 (trec_eval) (llm_eval) | h2oloo | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | no | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Claude-aided prompt building | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large) Second Stage (top-3K): RRF(First Stage, monoT5-3B) Third Stage (top-100): RRF(RankGPT4-o, RankLLaMA3.1-70B, RankZephyr)
Generation (top-20): Ragnarok V4 Prompt - L3.1-70B | 2 |
listgalore_gpt4o_ragnarokv4nocite_top20 (trec_eval) (llm_eval) | h2oloo | automatic | yes | no | no | Neither Corpora | no | no | yes | no | no | no | no | Traditional Only | Claude-aided prompt building | Ragnarok (prompt v4 - No Retrieval): GPT-4o top20 | 3 |
listgalore_l31-70b_ragnarokv4nocite_top20 (trec_eval) (llm_eval) | h2oloo | automatic | yes | no | no | Neither Corpora | no | no | no | no | yes | no | no | Traditional Only | Claude-aided prompt building | Ragnarok (prompt v4 - No Retrieval): L3.1-70B top20 | 4 |
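Several of the pipelines above, including the listgalore runs, fuse ranked lists from multiple retrievers and rerankers with reciprocal rank fusion (RRF). The following is a minimal sketch of RRF over generic ranked lists of document IDs; the function name, the toy example lists, and the smoothing constant k=60 are illustrative assumptions, not details taken from any submitted run.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs: each list contributes 1/(k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a lexical and a dense ranking (toy doc IDs).
fused = reciprocal_rank_fusion([
    ["d3", "d1", "d7", "d9"],   # e.g. a BM25 + Rocchio ranking
    ["d1", "d9", "d3", "d4"],   # e.g. a dense embedding ranking
])
print(fused)  # d1 and d3 rise to the top because both lists rank them highly
```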
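The buw and IIIA-UNIPD descriptions above attach citations after generation by splitting the response into sentences and matching each sentence against the retrieved segments. Below is a rough sketch of such a post-hoc, similarity-based citation step, assuming sentence-transformers embeddings; the model name, the threshold, and the data layout are placeholders rather than the participants' actual settings (the IIIA runs in particular use an annotator model rather than cosine similarity for this check).

```python
from sentence_transformers import SentenceTransformer, util

def assign_citations(sentences, segments, threshold=0.6):
    """Attach to each response sentence the doc IDs of sufficiently similar segments."""
    model = SentenceTransformer("all-MiniLM-L6-v2")           # placeholder model
    sent_emb = model.encode(sentences, convert_to_tensor=True)
    seg_emb = model.encode([s["text"] for s in segments], convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, seg_emb)                    # [n_sentences, n_segments]
    cited = []
    for i, sentence in enumerate(sentences):
        refs = [segments[j]["docid"] for j in range(len(segments))
                if float(sims[i][j]) >= threshold]
        cited.append({"text": sentence, "citations": refs})
    return cited
```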
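The webis reuse runs (webis-rag-run4/5-reuserag) build responses without a generative step by grouping retrieved sentences via SBERT embeddings and concatenating top-ranked sentences from each group. A sketch of the clustering variant is given below, assuming sentence-transformers plus scikit-learn KMeans; the model name, the cluster count, and the per-cluster selection rule are assumptions, not the team's actual configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_and_reuse(sentences, n_clusters=3, per_cluster=2):
    """Cluster retrieval-ranked sentences and keep the best-ranked ones per cluster."""
    model = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder model
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    response = []
    for c in range(n_clusters):
        # Input sentences are assumed to be ordered by retrieval rank, so the
        # first members of each cluster are its highest-ranked sentences.
        members = [s for s, label in zip(sentences, labels) if label == c]
        response.extend(members[:per_cluster])
    return " ".join(response)
```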
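The webis taskrag runs decompose generation into Extract, Combine, and Condense prompts applied in a tree of pairwise merges. A compact sketch of that orchestration follows; `llm` stands for a hypothetical completion function, and the prompt wording is illustrative only, not the prompts actually used.

```python
def extract(llm, query, doc):
    return llm(f"Extract the information in this document that is salient to "
               f"'{query}', keeping its reference marker.\n\n{doc}")

def combine(llm, query, a, b):
    return llm(f"Merge these two evidence summaries for '{query}', keeping all "
               f"reference markers such as [0].\n\n{a}\n\n{b}")

def condense(llm, query, evidence):
    return llm(f"Write the final answer to '{query}' from this evidence, citing "
               f"sources inline with their markers.\n\n{evidence}")

def tree_rag(llm, query, docs):
    """Extract from each doc, merge pairwise in a tree, then condense the root."""
    nodes = [extract(llm, query, d) for d in docs]
    while len(nodes) > 1:                      # pairwise merges, tree-like
        nodes = [combine(llm, query, nodes[i], nodes[i + 1])
                 if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return condense(llm, query, nodes[0])
```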