Runtag | Org | Is this a manual (human-intervention) or automatic run? | Is this a generation-only run (Retrieval is not used!)? | Does the retrieval pipeline leverage neural networks? | Was this run padded with results from a baseline run? | What views of the corpora does the pipeline leverage? | Does this pipeline additionally leverage web search or other corpora? | Does this run leverage proprietary models in any step of the retrieval pipeline? | Does this run leverage proprietary models in any step of the generation pipeline? | Does this run leverage open-weight LLMs (> 5B parameters) in any step of the retrieval pipeline? | Does this run leverage open-weight LLMs (> 5B parameters) in any step of the generation pipeline? | Does this run leverage smaller open-weight language models in any step of the retrieval pipeline? | Does this run leverage smaller open-weight language models in any step of the generation pipeline? | What would you categorize the retrieval pipeline as? | Please describe how you went about prompt engineering the generation pipeline | Please provide a short description of this run | Please give this run a priority for inclusion in manual assessments |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
baseline_frag_rag24.test_gpt-4o_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | No | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Uses a variation of the ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large) Second Stage (top-3K): RRF(First Stage, monoT5-3B) Third Stage (top-100): RRF(Second Stage, RankZephyr) Generation (top-20): GPT-4o (RRF is illustrated in a sketch following the table) | 1 (top) |
neurag (trec_eval) (llm_eval) | neu | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | No | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | cot:
"""
You are a cognitive scientist, to answer the following question:
{question}
I will provide you with several retrieved passages:
Passages:
{passages}
Task Description:
Please extract foundational knowledge that may be familiar to the model or advanced information beyond the model's already familiar foundational knowledge from these passages, and analyze the role of these contents.
Summarize and consolidate these contents, which should deepen the model's understanding of the question through familiarity with these basic and advanced pieces of information.
This process aims to encourage the model to comprehend the question more thoroughly and expand its knowledge boundaries.
"""
generation:
"""
This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful and detailed answers to the user's question based on the context references. The assistant should also indicate when the answer cannot be found in the context references.
Please give a full and complete answer for the question. Cite each context document inline that supports your answer within brackets [] using the IEEE format. Each sentence should have at most three citations. Order the citations in decreasing order of importance. Never include or mention anything about references, this is already provided, just answer the question such that each sentence has one or more sentence-level citations and say nothing else. To deepen the language model's understanding of the question through familiarity with basic and advanced pieces of information. Encourage the language model to comprehend the question more thoroughly and expand its knowledge boundaries. I retrieved some foundational knowledge that is familiar to the model or advanced information beyond the language model's already familiar foundational knowledge from these passages.
""" | We are based on gpt-4o. First, call the first step to generate cot for passages, and then send cot to generate the answer. | 1 (top) |
neuragfix (trec_eval) (llm_eval) | neu | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | No | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | cot:
You are a cognitive scientist, to answer the following question:
{question}
I will provide you with several retrieved passages:
Passages:
{passages}
Task Description:
Please extract foundational knowledge that may be familiar to the model or advanced information beyond the model's already familiar foundational knowledge from these passages, and analyze the role of these contents.
Summarize and consolidate these contents, which should deepen the model's understanding of the question through familiarity with these basic and advanced pieces of information.
This process aims to encourage the model to comprehend the question more thoroughly and expand its knowledge boundaries.
generation:
This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful and detailed answers to the user's question based on the context references. The assistant should also indicate when the answer cannot be found in the context references.
Please give a full and complete answer for the question. Cite each context document inline that supports your answer within brackets [] using the IEEE format. Each sentence should have at most three citations. Order the citations in decreasing order of importance. Never include or mention anything about references, this is already provided, just answer the question such that each sentence has one or more sentence-level citations and say nothing else. To deepen the language model's understanding of the question through familiarity with basic and advanced pieces of information. Encourage the language model to comprehend the question more thoroughly and expand its knowledge boundaries. I retrieved some foundational knowledge that is familiar to the model or advanced information beyond the language model's already familiar foundational knowledge from these passages. | This run is based on GPT-4o. The CoT prompt is called first to generate a chain of thought over the retrieved passages; the chain of thought is then passed to the generation prompt to produce the answer. The script provided by the organizers was used to make corrections. | 2 |
baseline_frag_rag24.test_command-r-plus_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Uses a variation of ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large)
Second Stage (top-3K): RRF(First Stage, monoT5-3B)
Third Stage (top-100): RRF(Second Stage, RankZephyr)
Generation (top-20): Command R Plus | 3 |
UWCrag (trec_eval) (llm_eval) (paper) | WaterlooClarke | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Ensemble/Fusion of First Stages | Few shot prompting | Ask the LLM to generate and extract a summary from the reference documents, and attribute them in one prompt. | 1 (top) |
UWCrag_stepbystep (trec_eval) (llm_eval) (paper) | WaterlooClarke | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Ensemble/Fusion of First Stages | Few shot prompting | Ask the LLM to extract and summarize an answer from the retrieved documents and attribute them, but for citations, do one sentence of the answer per prompt. | 1 (top) |
UWCgarag (trec_eval) (llm_eval) (paper) | WaterlooClarke | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Ensemble/Fusion of First Stages | Few shot prompting | Ask the LLM to generate an answer first, then do retrieval using the query and the generated answer, and then attribute it. | 1 (top) |
rag_bm25-colbert_faiss-gpt4o-llama70b (trec_eval) (llm_eval) (paper) | softbank-meisei | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | yes | yes | no | Ensemble/Fusion of First Stages | Manually wrote the prompt, checked the generated responses, adjusted the prompt for what was lacking or not working properly.
The above loop was repeated until the output was found satisfactory. | The retrieval process of this run is as follows:
1. Topic list preprocessing stage
a. Used GPT-4o to correct grammar, spelling mistakes, and text incompletions
b. Manually checked the topic list to make sure no errors remained
2. BM25 to retrieve the top-100 segments
3. Vector embeddings generation stage
a. Used castorini/tct_colbert-v2-hnp-msmarco to generate embeddings for the segment corpus.
b. Used faiss indexing to create an index at document level (containing segment embeddings).
c. Used castorini/tct_colbert-v2-msmarco-cqe to generate embeddings for the preprocessed topics.
4. For each topic, filtered the set of documents to search for based on the bm25 top-100 retrieval results.
5. Retrieved the top-100 segments from each filtered document for the query.
6. Grouped all sets of retrieved segments and sorted them in descending order
7. The top-100 from the sorted list were submitted as the result
Generation process of this run is as follows:
1. The top-20 segments for each query were given along with the prompt to generate the response using GPT-4o (Azure API)
2. Filtered out the topics for which responses could not be generated, either because they were caught by content filtering or because the format was still inappropriate after three tries.
3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model.
4. Postprocessing:
a. Reformatted responses from Llama3.1
b. Script to automatically remove any mistaken inline citations in response "text" (not the "citations" key), and recalculate the response length. | 1 (top) |
ragtask-bm25-rank_zephyr-gpt4o-llama70b (trec_eval) (llm_eval) (paper) | softbank-meisei | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Ensemble/Fusion of First Stages | Manually wrote the prompt, analyzed the generated responses and adjusted the prompt for whatever was lacking in the response.
The above process was repeated until the output was found satisfactory. | Retrieval process of this run is as follows:
1. Topic list preprocessing stage:
a. Used GPT-4o to preprocess the query in order to correct grammar, spelling errors, and text incompletions
b. Manually checked all 301 topics to correct any errors that still existed
2. BM25 to retrieve the relevant top-100 segments
3. RankZephyr to rerank the retrieved top-100 segments
Generation process of this run is as follows:
1. The top-20 segments for each query were given along with the prompt to generate the response using GPT-4o (Azure API)
2. Filtered out the topics for which responses could not be generated, either because they were caught by content filtering or because the format was still inappropriate after three tries.
3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model.
4. Postprocessing:
a. Reformatted responses from Llama3.1
b. Script to automatically remove any mistaken inline citations in response "text" (not the "citations" key), and recalculate the response length. | 2 |
LAS-splade-mxbai-rrf-mmr8 (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Generation-in-the-Loop Pipeline | Slight modifications to the Ragnarok example (https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py) to specify a word count range and ask for citations for each sentence. Also added an example output to the prompt. | Topic decomposition with GPT4o, query topic and subtopics with SPLADE retrieval, mxbai embeddings to rerank retrieved results, RRF to combine results from topic/subtopics, MMR to choose the top 20 (see the MMR sketch following the table), validated relevance of the top-20 segments with a GPT4o prompt, then provided relevant segments to GPT4o for answer generation. | 1 (top) |
LAS-splade-mxbai-mmr8-RAG (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | no | no | Multi-Stage Pipeline pointwise | Slight modifications to the Ragnarok example to specify word count and ask for reference citations for each sentence. Also provided an example output in the prompt. | SPLADE retrieval for topic question, mxbai embedding rerank, MMR to select top 20 after rerank, GPT4o relevance evaluation of top 20 segments, then provided relevant segments to GPT4o for answer generation | 2 |
LAS-T5-mxbai-mmr8-RAG (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | no | no | Multi-Stage Pipeline pointwise | Modified the Ragnarok example to specify word count, include citations for each sentence, and provide an output example | First-stage retrieval with T5 embeddings and cosine similarity, rerank with MXBAI embeddings, select top-20 reranked with MMR, validate relevance of top 20 with GPT4o query, pass relevant segments to GPT4o for answer generation. | 3 |
BEST_cot_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using Chain-of-Thought prompting and giving two examples of baseline generation results to the LLM (gpt-3.5-turbo) before generating the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. Chain-of-Thought prompting is employed, where two example baseline generation results are provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 1 (top) |
BEST_gpt3.5_another_prompt (trec_eval) (llm_eval) | citi | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using a different prompt with the LLM (gpt-3.5-turbo) to generate the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. Our prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 3 |
SECOND_cot_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | no | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. Our prompt is provided to GPT-3.5 turbo before generating the final results. | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. Our prompt is provided to GPT-3.5 turbo before generating the final results. | 2 |
BEST_gpt3.5_new_prompt (trec_eval) (llm_eval) | citi | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using a different prompt with the LLM (gpt-3.5-turbo) to generate the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. Our prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 4 |
BEST_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | yes | yes | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise+pair/listwise | This run involves using the baseline prompt with the LLM (gpt-3.5-turbo) to generate the results. | This process involves using the Cohere embedding model to encode documents and queries, retrieving the top results, and then refining them by reranking the top 100 with the GPT-3.5 turbo model. The baseline prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 5 |
SECOND_gpt3.5_new_prompt (trec_eval) (llm_eval) | citi | manual | no | no | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise | This run involves using a different prompt with the LLM (gpt-3.5-turbo) to generate the results. | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. Our prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 6 |
SECOND_gpt3.5 (trec_eval) (llm_eval) | citi | manual | no | no | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | no | no | Multi-Stage Pipeline pointwise | This run involves using the baseline prompt with the LLM (gpt-3.5-turbo) to generate the results. | This run involves using Cohere's embedding to retrieve the documents and then using Cohere's reranker to rerank these results. The baseline prompt is provided to GPT-3.5 turbo before generating the final results. Postprocessing is added after generation to further refine the results and delete any unneeded references. | 7 |
iiresearch-bm25-top10-llama3-8b-instruct (trec_eval) (llm_eval) (paper) | ii_research | manual | no | no | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | no | no | Traditional Only | Instruct the LLM to be an AI assistant tasked with answering questions based on a set of provided references; every time it answers a question, each sentence must be followed by a citation list indicating the specific document indexes from the provided references that support that sentence. | A standard RAG pipeline retrieves the top 10 relevant documents using BM25 as additional context and instructs LLaMa3-8B-instruct to generate responses. | 1 (top) |
FT-llama3 (trec_eval) (llm_eval) | uog-tht | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise | automatic | BM25+Mixedbread for retriever
Llama3-8B finetuned with citation-enabled QA datasets for generator | 1 (top) |
zeph_test_rag_rrf_expand_query (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The generation pipeline utilizes all 20 segments. The model was guided to produce a structured output by following the Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. | The pipeline consists of three stages. The first stage leverages BM25 combined with dense retrieval. The second stage employs the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Prior to dense retrieval, the query is processed by a small LLM to generate similar queries. Retrieval is then performed using both the original and generated queries, followed by the application of RRF. | 1 (top) |
zeph_test_rag_rrf_raw_query (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The generation pipeline utilizes all 20 segments. The model was guided to produce a structured output by following the Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Raw queries are used for retrieval. Following the first stage, RRF is performed on the BM25 and dense retrieval results. | 2 |
zeph_test_rag24_doc_query_expansion+rrf (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The generation pipeline utilizes all 20 segments. The model was guided to produce a structured output by following the Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. | The pipeline consists of three stages. The first stage leverages BM25 combined with dense retrieval. The second stage employs the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before dense retrieval, the query is expanded to generate a small passage on the central theme. Retrieval is then performed using both the raw query and the generated paragraph. RRF is performed at the first stage. | 4 |
LAS-splade-mxbai-rrf-mmr8-doc (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | Both Corpora | no | yes | yes | no | no | no | no | Generation-in-the-Loop Pipeline | Modified the Ragnarok example to include a word count and an instruction to provide citations for each sentence, and added a sample output example to the prompt. | Topic decomposition with GPT4o, SPLADE segment retrieval for topic+sub-topics, mxbai embeddings rerank, RRF to aggregate retrieval, MMR to get top-20 for generation, validate relevance of top-20 with GPT4o, get entire document for relevant segments, generate answer by prompting gpt4o with documents. | 4 |
oneshot_post_sentenced (trec_eval) (llm_eval) | buw | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | no | no | Learned Dense Only | Prompt engineering was a manual process of starting with a simple prompt and improving it iteratively based on reviews of the generation results. Some of the challenges included keeping responses:
- brief
- non-repetitive
- without self referencing or citing | The steps for this run are as follows:
1. Retrieve the top segments from the database using the embeddings (Weaviate vector DB with dense vectors)
2. Feed all segments all at once to the LLM in its system prompt and have it generate an answer for the query with the segments
3. Get the answer and separate it into sentences
4. Run the sentences through the embedding model
5. Look at similarity between the answer sentence and the segments used sentence by sentence (response_sentence -> segment_sentence)
6. Cite segments that produce >= 0.7 cosine similarity (0.7 was estimated through analysing the distribution of cosine similarities across queries); see the citation-attribution sketch following the table | 1 (top) |
LAS_splad_mxbai-rrf-occams_50_RAG (trec_eval) (llm_eval) (paper) | ncsu-las | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | no | yes | yes | Ensemble/Fusion of First Stages | The title was used in the direct generation, very similar to the baseline. sys_prompt = """This is a chat between a user and an artificial intelligence assistant. The assistant gives helpful and detailed answers to the user's question based on the context. The assistant should also indicate when the answer cannot be found in the context.""" user_prompt = f"""INSTRUCTION: Please give a fluent 400-word summary that answers the given question as completely as possible. Each sentence of the answer should represent unique information inferred from the context provided. After each sentence, cite each context document that supports your answer within brackets [] using the IEEE format. An example answer should look like: During the investigation, President Bill Clinton initially denied the accusations by stating, “I never told anybody to lie, not a single time; never” and emphasized, “These allegations are false. And I need to go back to work for the American people” [1]. On January 26, 1998, Clinton publicly remarked, “I did not have sexual relations with that woman, Miss Lewinsky” [1]. However, on August 17, 1998, Clinton admitted before a federal grand jury that he had engaged in an “improper physical relationship” with Monica Lewinsky [1], [16]. During the Senate trial that followed, Clinton’s defense hinged on the interpretation of the word ‘is’ to contest the truthfulness of his earlier denials [26]. Despite his previous denials, the evidence, including the infamous blue dress with Clinton’s DNA, contradicted his initial statements, ultimately leading to charges of perjury and obstruction of justice [1], [16], [18]. QUESTION: {topic_text} CONTEXTS: {context_text} INSTRUCTION:Please give a complete answer to the question. Cite each context document that supports your answer within brackets [] using the IEEE format.""" return sys_prompt, user_prompt | SPLADE full feature, segment summarizer, which is very similar to our priority 1 run; however, redundancy is removed by employing an extractive summarizer, occams, which approximates a bounded maximal coverage of the bigrams in the top 50 segments | 5 |
dilab_repllama_listt5_pass1_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 4 |
qrant_bge_gemini (trec_eval) (llm_eval) | SGU | manual | yes | yes | yes | MS MARCO v2.1 Segment Only | no | no | yes | no | yes | yes | no | Traditional Only | Due to hardware problems and limitations, we only used 80% of the organizer's data. We used Qdrant (cosine) as the storage and the public bge embedding model, then used Gemini 1.5 Pro to generate the answer. | Due to hardware problems and limitations, we only used 80% of the organizer's data. We used Qdrant (cosine) as the storage and the public bge embedding model, then used Gemini 1.5 Pro to generate the answer. | 1 (top) |
PG-mistral (trec_eval) (llm_eval) | uog-tht | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Multi-Stage Pipeline pointwise | automatic | BM25+Mixedbread
Mistral + Post citing | 1 (top) |
ISIR-IRIT-zephyr_p2 (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | no | yes | Multi-Stage Pipeline pointwise | Several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed, for example, that the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. We additionally directed the model to only use the passages it finds relevant to the answer, thus allowing for a sort of additional reranking. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic evaluation metrics (citation evaluation). | This run first retrieves a list of passages using BM25 + MonoT5. The top 2 documents are fed to an LLM (zephyr 3b). The LLM is instructed to generate an answer with citations. We additionally direct the model to only use the passages it finds relevant to the answer, thus allowing for a sort of additional reranking. | 1 (top) |
ISIR-IRIT-zephyr_query_gen (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | yes | yes | Generation-in-the-Loop Pipeline | This run consists of two generation prompts: Answer generation and subquery generation. As a part of the retrieval pipeline, an LLM is prompted to generate a list of subqueries to be later used for retrieval. The subquery generation is done using few-shot prompting. We first selected our fewshot examples. We used an external dataset (HAGRID) and selected from their training-set queries that have long answers (with 2 or more citations mentioned), for each query, we gathered a number of subqueries from Google Search API + freely written one by us. We select a subset of these queries with the best retrieval results on the training set. These queries are used as example subqueries in the prompt. For answer generation, several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed that, for instance the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. We additionally directed the model to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic metrics of evaluation (citation evaluation) | In this run, we augment the retrieval pipeline with subqueries. We first instruct an LLM (zephyr 3B) to generate a list of 4 subqueries given the user's query (official topic) using fewshot prompting. The examples in the fewshot prompting are selected from a trainingset based on how well they improved the retrieval results on their respective user queries. For each generated subquery + the user's query we run a two-stage retrieval model (BM25+MonoT5). We obtain a list of relevant passages for each subquery + user query. We aggregate these lists into one list using a reranking approach. The final list is fed to an LLM (zephyr 3B) that is instructed to generate an answer with its citations. The model is additionally directed to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking. | 2 |
ISIR-IRIT-zephyr_sprompt_3p (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | no | yes | Multi-Stage Pipeline pointwise | Several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed, for example, that the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic evaluation metrics (citation evaluation). | In this run, a two-stage retrieval pipeline is used to first retrieve a list of passages relevant to the topic. An LLM (Zephyr 3B) is then instructed to generate an answer with citations using the top 3 passages returned in the retrieval stage. The prompt used in this run is simple and straightforward, instructing the model to provide a comprehensive answer. | 4 |
ISIR-IRIT-zephyr_query_gen_3p (trec_eval) (llm_eval) | IRIT | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | no | yes | yes | Generation-in-the-Loop Pipeline | This run consists of two generation prompts: Answer generation and subquery generation. As a part of the retrieval pipeline, an LLM is prompted to generate a list of subqueries to be later used for retrieval. The subquery generation is done using few-shot prompting. We first selected our fewshot examples. We used an external dataset (HAGRID) and selected from their training-set queries that have long answers (with 2 or more citations mentioned), for each query, we gathered a number of subqueries from Google Search API + freely written one by us. We select a subset of these queries with the best retrieval results on the training set. These queries are used as example subqueries in the prompt. For answer generation, several phrasings of a simple prompt were first tested on a few examples in order to detect and eliminate bad phrasing. We noticed that, for instance the model tends to ignore certain instructions when they are mentioned towards the end and when the prompt exceeds a certain length. Therefore, we kept the prompt simple and straightforward, instructing the model to generate an answer with citations and briefly describing how the citations are to be mentioned. We additionally directed the model to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking. A couple of prompts were tested on a larger scale (full test set) and we selected the best prompt according to automatic metrics of evaluation (citation evaluation) | In this run, we augment the retrieval pipeline with subqueries. We first instruct an LLM (zephyr 3B) to generate a list of 4 subqueries given the user's query (official topic) using fewshot prompting. The examples in the fewshot prompting are selected from a trainingset based on how well they improved the retrieval results on their respective user queries. For each generated subquery + the user's query we run a two-stage retrieval model (BM25+MonoT5). We obtain a list of relevant passages for each subquery + user query. We aggregate these lists into one list using a reranking approach. The top 3 passages from the final list are fed to an LLM (zephyr 3B) that is instructed to generate an answer with its citations. The model is additionally directed to only use the passages it finds relevant to the answer, allowing thus for a sort of an additional reranking | 3 |
ICL-mistral (trec_eval) (llm_eval) | uog-tht | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise | automatic | BM25+Mixedbread
ICL + Output Processing | 1 (top) |
Ranked_Iterative_Fact_Extraction_and_Refinement_RIFER_-_bm25 (trec_eval) (llm_eval) (paper) | TREMA-UNH | automatic | no | no | yes | MS MARCO v2.1 Segment Only | no | no | yes | yes | yes | no | yes | Traditional Only | I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness. | Key fact sentences are first extracted from the highest-ranked document using an LLM, focusing on information directly relevant to the query. These extracted facts are then verified across the remaining documents, with the LLM identifying supporting sentences and confirming their reliability. Additional relevant facts are extracted from the remaining text, and a rule-based method removes redundancies. Finally, a smoothing process improves the flow and coherence of the output, resulting in a polished set of key facts. | 1 (top) |
zeph_test_rag_rrf_expand_query_mistral (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 3 |
zeph_test_rag_rrf_expand_mistral_top_15 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 5 |
iiia_dedup_p1_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_dedup_p2_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
zeph_test_rag_rrf_expand_top_5 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. A smaller number of passages seems to produce a shorter response. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 6 |
zeph_test_rag_rrf_expand_top_10 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | The process required additional caution and post-processing. Mistral-7B often encountered issues such as errors with bullet points, overflowing citations, and formatting inconsistencies. It also tended to generate responses with incorrect citation formats.
To address these issues, the model was instructed to generate output using << and >> to enclose inline citations. Citations were then passed through post-processing to ensure they matched the sentences correctly. Sentence segmentation was achieved using NLTK.
A manually curated example was provided in the initial user prompt to clearly convey the formatting instructions. | It is a three-stage pipeline: the first stage leverages BM25 + dense retrieval, the second stage uses the Stella model for reranking, and the final stage uses Zephyr for list-wise sorting. Before the dense retrieval, the query was sent to a small LLM to generate similar queries. Using these similar queries and the original query, retrieval is performed and RRF is applied. | 7 |
iiia_dedup_p1_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_dedup_p2_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p1_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p2_reverse (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p1_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
iiia_standard_p2_straight (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. | 1 (top) |
dilab_repllama_listt5_pass2_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 3 |
dilab_repllama_listt5_pass3_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 1 (top) |
dilab_repllama_listt5_pass4_gpt4o (trec_eval) (llm_eval) | ldisnu | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | We used Repllama-7B as a bi-encoder for the first-stage retrieval and reranked the top-100 results using ListT5-3B with r=2 and tournament sort. The first reranking was done on Repllama's top-1000 results, with subsequent rerankings performed on previous top-100 results across multiple passes. The best result from the evaluation using GPT-4O was a precise overall score calculation, showcasing the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 2 |
ielab-b70bf-70bqfs-ad_hoc (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | no | no | Generation-in-the-Loop Pipeline | Copy from Paper: https://arxiv.org/abs/2407.01796 | Retriever: Fusion(stella+BM25) -> Pooled -> setwise
Generator: Llama-3.1-70b-instruct, quantized, vllm
Prompt: Few shot prompt
Attributor: In Context from generator prompt, ad_hoc | 2 |
ielab-b8bf-8bfs-ad_hoc (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Copy from paper: https://arxiv.org/abs/2407.01796 | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8B-instruct
Prompt: Few shot prompt with example
Attributor: In context from generator prompt, ad_hoc | 9 |
iiia_dedup_p1_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p1_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p2_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p2_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking to answer the questions based on the context and then appending the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the retrieval as context, the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only on the informations from the context provided. The answer was then split in sentences and each pair of (context-doc, sentence) was passed to an annotator model that checks if the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p1_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in reverse order of relevance. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p1_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in order of relevance. | This run uses the top-20 documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
ielab-b70bf-70bqp-70bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-70b-instruct, quantised, vllm
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-70b-instruct | 1 (top) |
iiia_standard_p2_reverse_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p2_straight_ht (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | yes | yes | Learned Dense Only | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are presented in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the retrieval as context; the context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the context provided. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is a citation from the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
ielab-b8bf-8bzs-ad_hoc (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Copied from the paper: https://arxiv.org/abs/2407.01796 | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8b-instruct
Prompt: Zero-shot Prompt without examples
Attributor: in-context; answers are generated directly (ad_hoc) | 7 |
ielab-b8bf-8bp-8ba (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: zero-shot | 4 |
webis-manual (trec_eval) (llm_eval) (paper) | webis | manual | no | yes | yes | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | We did not directly use prompts, as the responses were manually generated by humans for 31 topics (ca. 40 hours of manual work; creating a manual response often takes between 1 and 2 hours per topic). The responses padded from the baseline are from baseline_rag24.test_gpt-4o_top20 without any modification. The topics for which we manually generated responses are ['2024-109837', '2024-142125', '2024-127349', '2024-224701', '2024-126326', '2024-224013', '2024-42464', '2024-34595', '2024-213491', '2024-37566', '2024-34710', '2024-29222', '2024-42738', '2024-6587', '2024-214744', '2024-6778', '2024-215952', '2024-79154', '2024-23680', '2024-41849', '2024-24226', '2024-36302', '2024-152259', '2024-34582', '2024-27366', '2024-145979', '2024-36935', '2024-216592', '2024-32912', '2024-153051', '2024-128784']. | We created manual responses for 31 topics (ca. 40 hours of manual work; creating a manual response often takes between 1 and 2 hours per topic). The responses padded from the baseline are from baseline_rag24.test_gpt-4o_top20 without any modification. The topics for which we manually generated responses are ['2024-109837', '2024-142125', '2024-127349', '2024-224701', '2024-126326', '2024-224013', '2024-42464', '2024-34595', '2024-213491', '2024-37566', '2024-34710', '2024-29222', '2024-42738', '2024-6587', '2024-214744', '2024-6778', '2024-215952', '2024-79154', '2024-23680', '2024-41849', '2024-24226', '2024-36302', '2024-152259', '2024-34582', '2024-27366', '2024-145979', '2024-36935', '2024-216592', '2024-32912', '2024-153051', '2024-128784']. | 1 (top) |
ielab-b-8bp-8bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25)
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: few-shot | 3 |
ielab-b8b-8bp-8bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: few-shot | 7 |
webis-rag-run0-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | This run uses the webis-01 retrieval run as retrieval input. For generation, we decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies 'Extract' to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the result into the final response (a sketch of this tree-like orchestration appears after this table). Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform to the final submission format. | 1 (top) |
buw (trec_eval) (llm_eval) | buw | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | no | yes | yes | yes | Traditional Only | The prompt engineering in the code involves carefully crafting the input queries to the language model (LLM) and designing a pipeline to ensure that the generated responses are relevant, accurate, and well-supported by retrieved data. Here's how it is done:
1. Query Construction: The `query` is generated for each `topic_id`, ensuring that the input to the LLM is clear and directly related to the topic being addressed. This step sets the foundation for generating a contextually relevant response.
2. Contextual Retrieval: The pipeline retrieves relevant segments from a collection based on the `query`. These segments provide the LLM with a focused context, which is crucial for generating a response that aligns closely with the desired topic.
3. LLM Prompting: The segments are passed to the LLM as part of the prompt. The design of this prompt is crucial because it ensures that the model considers the most relevant information during response generation. This approach helps in obtaining a more precise and contextually appropriate output.
4. Response Splitting and Refinement: The generated response from the LLM is split into sentences, each of which is treated as a distinct unit of information. This allows for a more granular alignment between the prompt (query and segments) and the LLM's output, ensuring that each sentence can be individually analyzed for relevance.
5. Similarity-Based Citations: After generating the response, the sentences are compared with the original retrieved segments using cosine similarity. This process refines the LLM's output by tying each sentence back to the most relevant segments, enhancing the accuracy and reliability of the final response (a sketch of this similarity-based citation step appears after this table). | This generation pipeline revolves around a systematic approach to retrieving and generating relevant content from a collection based on a given query. The pipeline is structured as follows:
1. Query Handling: For each `topic_id`, a query is executed to retrieve relevant text segments using a near-text search, returning metadata that includes the distance and certainty metrics.
2. Data Retrieval: The retrieved objects are then processed to fetch the associated vectors, text segments, and document IDs. This is necessary because the initial query response does not include the vectors.
3. Response Generation: A request is made to an LLM using the query and retrieved segments, generating a response that is then split into individual sentences for further processing.
4. Similarity Calculation: Each sentence is encoded into a vector and compared with the vectors of the retrieved segments using cosine similarity. This determines the relevance of each segment to the sentence.
5. Thresholding and Citation Assignment: The most relevant segments, determined by a calculated threshold, are assigned as citations for each sentence.
6. Output Formatting: The output is structured into a JSON object, which includes metadata such as the run ID, topic, references, response length, and the generated response with citations. This object is then serialized and written to a JSONL file. | 2 |
webis-rag-run1-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | This run uses the webis-01 retrieval run as retrieval input. For generation, we decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies 'Extract' to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the result into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform to the final submission format. | 2 |
ielab-b8bf-8bp-8bafs (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-8b-instruct
Prompt: Plain answer generation without citation ids
Attributor: Llama-3.1-8b-instruct
Attributor_prompt: few-shot | 4 |
webis-rag-run3-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | yes | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | This run uses the webis-01 retrieval run as retrieval input. For generation, we decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies 'Extract' to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the result into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform to the final submission format. | 4 |
buw_2 (trec_eval) (llm_eval) | buw | manual | yes | yes | no | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | yes | Traditional Only | The prompt engineering in the provided code involves crafting queries and responses for efficient interaction with the language model and the retrieval system. Here's how it is done:
1. **Query Construction**: The `query` is formulated based on the `topic_id`, which is used to retrieve relevant documents from the collection. The prompt engineering here ensures that the query is precise and relevant to the topic, optimizing the retrieval of the most pertinent text segments.
2. **LLM Response Prompting**: After retrieving the relevant text segments, these segments are used as context in a prompt sent to the language model (`make_llm_request(query, segment_list)`). The prompt is designed to generate a coherent and informative response from the model based on the provided context.
3. **Sentence-Level Analysis**: The LLM-generated response is split into individual sentences, each treated as a distinct prompt. This allows for a more granular analysis of the response, making it easier to match each sentence with relevant citations based on semantic similarity.
4. **Similarity-Based Citation**: For each sentence in the LLM's response, its semantic similarity with the retrieved text segments is calculated. This ensures that the model's output is supported by relevant and contextually similar citations, enhancing the credibility and relevance of the generated content.
Through these steps, the prompts are carefully engineered to retrieve, generate, and align information effectively, ensuring high-quality, contextually appropriate outputs. | dense | 2 |
webis-rag-run4-reuserag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | no | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Baseline introduction, middle, and conclusion sentences were used as 'prompts' to cluster sentences into 3 groups. | This run uses the webis-01 retrieval run as retrieval input. For generation, we split all sentences into 3 groups based on the prompt sentences by calculating semantic similarity with SBERT, then concatenate the top-ranked sentences to form the response. | 5 |
webis-rag-run5-reuserag (trec_eval) (llm_eval) (paper) | webis | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | no | yes | no | yes | yes | Multi-Stage Pipeline pointwise+pair/listwise | N/A | Segments from the webis-01 run were clustered automatically using SBERT embeddings. The top-ranked sentences from each cluster were concatenated to form the response (a sketch of this clustering approach appears after this table). | 6 |
buw_3 (trec_eval) (llm_eval) | buw | manual | yes | yes | no | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | yes | Traditional Only | Manually adjusted the prompt wording to fine-tune the generation. | dense | 2 |
ruc001 (trec_eval) (llm_eval) | Ruc01 | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | no | yes | no | yes | Traditional Only | Q: What state is home to the university that is represented in sports by George Washington Colonials men's basketball?
A: First, the education institution has a sports team named George Washington Colonials men's basketball in is George Washington University , Second, George Washington University is in Washington D.C. The answer is {Washington, D.C.}.
Q: Who lists Pramatha Chaudhuri as an influence and wrote Jana Gana Mana?
A: First, Bharoto Bhagyo Bidhata wrote Jana Gana Mana. Second, Bharoto Bhagyo Bidhata lists Pramatha Chaudhuri as an influence. The answer is {Bharoto Bhagyo Bidhata}.
Q: Who was the artist nominated for an award for You Drive Me Crazy?
A: First, the artist nominated for an award for You Drive Me Crazy is Britney Spears. The answer is {Jason Allen Alexander}.
Q: What person born in Siegen influenced the work of Vincent Van Gogh?
A: First, Peter Paul Rubens, Claude Monet and etc. influenced the work of Vincent Van Gogh. Second, Peter Paul Rubens born in Siegen. The answer is {Peter Paul Rubens}.
Q: What is the country close to Russia where Mikheil Saakashvii holds a government position?
A: First, China, Norway, Finland, Estonia and Georgia is close to Russia. Second, Mikheil Saakashvii holds a government position at Georgia. The answer is {Georgia}.
Q: What drug did the actor who portrayed the character Urethane Wheels Guy overdosed on?
A: First, Mitchell Lee Hedberg portrayed character Urethane Wheels Guy. Second, Mitchell Lee Hedberg overdose Heroin. The answer is {Heroin}.""" | Serialization pipeline of a retrieve-reorder-then-generate model. | 1 (top) |
buw_5 (trec_eval) (llm_eval) | buw | manual | yes | yes | no | MS MARCO v2.1 Segment Only | yes | yes | yes | yes | yes | yes | yes | Traditional Only | Fine-tuned the prompt manually until the expected output was produced. | dense | 1 (top) |
UDInfolab.bgeV2 (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | yes | no | no | Learned Dense Only | Tested several prompts | This run uses BGE + gpt-4o-mini | 3 |
UDInfolab.RAG.AnsAI (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise | I started with all the documents; using BGE I retrieved the 500 most similar, then in the next stage used a reranker to obtain the 100 most relevant documents, and finally used GPT-4 for generation. | BGE + reranker + doc2query | 2 |
UDInfolab.RAG.Query (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise | I started with all the documents; using BGE I retrieved the 500 most similar, then in the next stage used a reranker to obtain the 100 most relevant documents; however, I used an LLM to rewrite the query to make it better suited to the retrieval step. | BGE + reranker (modified query) | 1 (top) |
ielab-b70bf-70bqp-rarr (trec_eval) (llm_eval) | ielab | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | no | no | yes | yes | yes | no | Generation-in-the-Loop Pipeline | Trial and error: run the pipeline on a single query, inspect the output, and refine the prompt based on the response (using a GPT-4 model). | Retriever: Fusion(Stella+BM25) -> Pooled -> Setwise
Generator: Llama-3.1-70b-instruct, quantised, vllm
Prompt: Plain answer generation without citation ids
Attributor: RARR | 9 |
UDInfolab.RAG.bge.tuned (trec_eval) (llm_eval) | InfoLab | manual | yes | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | no | no | Multi-Stage Pipeline pointwise | BGE + tuned parameters | BGE + tuned parameters | 6 |
UDInfolab.RAG.bge.QueryAgm.tuned (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | yes | no | no | no | Multi-Stage Pipeline pointwise | BGE + reranker used in the retrieval step + OpenAI with modified parameters | BGE + reranker used in the retrieval step + OpenAI with modified parameters | 5 |
UDInfolab.RAG.bge.QueryAnsAI.tuned (trec_eval) (llm_eval) | InfoLab | manual | no | yes | no | MS MARCO v2.1 Segment Only | no | no | yes | no | yes | no | no | Traditional Only | BGE + reranker + query2doc used in the retrieval step + OpenAI with modified parameters | BGE + reranker + query2doc used in the retrieval step + OpenAI with modified parameters | 4 |
listgalore_gpt4o_ragnarokv4_top20 (trec_eval) (llm_eval) | h2oloo | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | yes | yes | no | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Claude-aided prompt building | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large) Second Stage (top-3K): RRF(First Stage, monoT5-3B) Third Stage (top-100): RRF(RankGPT4-o, RankLLaMA3.1-70B, RankZephyr) (see the RRF sketch after this table)
Generation (top-20): Ragnarok V4 Prompt - GPT-4o | 1 (top) |
listgalore_l31-70b_ragnarokv4_top20 (trec_eval) (llm_eval) | h2oloo | automatic | no | yes | no | MS MARCO v2.1 Segment Only | no | yes | no | yes | yes | yes | no | Multi-Stage Pipeline pointwise+pair/listwise | Claude-aided prompt building | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large) Second Stage (top-3K): RRF(First Stage, monoT5-3B) Third Stage (top-100): RRF(RankGPT4-o, RankLLaMA3.1-70B, RankZephyr)
Generation (top-20): Ragnarok V4 Prompt - L3.1-70B | 2 |
listgalore_gpt4o_ragnarokv4nocite_top20 (trec_eval) (llm_eval) | h2oloo | automatic | yes | no | no | Neither Corpora | no | no | yes | no | no | no | no | Traditional Only | Claude-aided prompt building | Ragnarok (prompt v4 - No Retrieval): GPT-4o top20 | 3 |
listgalore_l31-70b_ragnarokv4nocite_top20 (trec_eval) (llm_eval) | h2oloo | automatic | yes | no | no | Neither Corpora | no | no | no | no | yes | no | no | Traditional Only | Claude-aided prompt building | Ragnarok (prompt v4 - No Retrieval): L3.1-70B top20 | 4 |
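Several of the pipelines above, including the listgalore runs, fuse ranked lists from multiple retrievers and rerankers with reciprocal rank fusion (RRF). The following is a minimal sketch of RRF over generic ranked lists of document IDs; the function name, the toy example lists, and the smoothing constant k=60 are illustrative assumptions, not details taken from any submitted run.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs: each list contributes 1/(k + rank) per doc."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a lexical and a dense ranking (toy doc IDs).
fused = reciprocal_rank_fusion([
    ["d3", "d1", "d7", "d9"],   # e.g. a BM25 + Rocchio ranking
    ["d1", "d9", "d3", "d4"],   # e.g. a dense embedding ranking
])
print(fused)  # d1 and d3 rise to the top because both lists rank them highly
```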
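The buw and IIIA-UNIPD descriptions above attach citations after generation by splitting the response into sentences and matching each sentence against the retrieved segments. Below is a rough sketch of such a post-hoc, similarity-based citation step, assuming sentence-transformers embeddings; the model name, the threshold, and the data layout are placeholders rather than the participants' actual settings (the IIIA runs in particular use an annotator model rather than cosine similarity for this check).

```python
from sentence_transformers import SentenceTransformer, util

def assign_citations(sentences, segments, threshold=0.6):
    """Attach to each response sentence the doc IDs of sufficiently similar segments."""
    model = SentenceTransformer("all-MiniLM-L6-v2")           # placeholder model
    sent_emb = model.encode(sentences, convert_to_tensor=True)
    seg_emb = model.encode([s["text"] for s in segments], convert_to_tensor=True)
    sims = util.cos_sim(sent_emb, seg_emb)                    # [n_sentences, n_segments]
    cited = []
    for i, sentence in enumerate(sentences):
        refs = [segments[j]["docid"] for j in range(len(segments))
                if float(sims[i][j]) >= threshold]
        cited.append({"text": sentence, "citations": refs})
    return cited
```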
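The webis reuse runs (webis-rag-run4/5-reuserag) build responses without a generative step by grouping retrieved sentences via SBERT embeddings and concatenating top-ranked sentences from each group. A sketch of the clustering variant is given below, assuming sentence-transformers plus scikit-learn KMeans; the model name, the cluster count, and the per-cluster selection rule are assumptions, not the team's actual configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def cluster_and_reuse(sentences, n_clusters=3, per_cluster=2):
    """Cluster retrieval-ranked sentences and keep the best-ranked ones per cluster."""
    model = SentenceTransformer("all-MiniLM-L6-v2")      # placeholder model
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    response = []
    for c in range(n_clusters):
        # Input sentences are assumed to be ordered by retrieval rank, so the
        # first members of each cluster are its highest-ranked sentences.
        members = [s for s, label in zip(sentences, labels) if label == c]
        response.extend(members[:per_cluster])
    return " ".join(response)
```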
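The webis taskrag runs decompose generation into Extract, Combine, and Condense prompts applied in a tree of pairwise merges. A compact sketch of that orchestration follows; `llm` stands for a hypothetical completion function, and the prompt wording is illustrative only, not the prompts actually used.

```python
def extract(llm, query, doc):
    return llm(f"Extract the information in this document that is salient to "
               f"'{query}', keeping its reference marker.\n\n{doc}")

def combine(llm, query, a, b):
    return llm(f"Merge these two evidence summaries for '{query}', keeping all "
               f"reference markers such as [0].\n\n{a}\n\n{b}")

def condense(llm, query, evidence):
    return llm(f"Write the final answer to '{query}' from this evidence, citing "
               f"sources inline with their markers.\n\n{evidence}")

def tree_rag(llm, query, docs):
    """Extract from each doc, merge pairwise in a tree, then condense the root."""
    nodes = [extract(llm, query, d) for d in docs]
    while len(nodes) > 1:                      # pairwise merges, tree-like
        nodes = [combine(llm, query, nodes[i], nodes[i + 1])
                 if i + 1 < len(nodes) else nodes[i]
                 for i in range(0, len(nodes), 2)]
    return condense(llm, query, nodes[0])
```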