The Thirty-Third Text REtrieval Conference
(TREC 2024)

Retrieval-Augmented Generation Track: Augmented Generation Task Appendix

Each run entry below lists the following fields, in order:
Run tag (with links to trec_eval results, llm_eval results, and the team paper where available) and Org
Is this a manual (a human making choices based on seeing the topics/results) or automatic run?
Does this run leverage proprietary models in any step of the generation pipeline?
Does this run leverage open-weight LLMs (> 5B parameters) in any step of the generation pipeline?
Does this run leverage smaller open-weight language models in any step of the generation pipeline?
Was this run padded with results from a baseline run?
Please describe how you went about prompt engineering the generation pipeline
Please provide a short description of this run
Please give this run a priority for inclusion in manual assessments
baseline_rag24.test_gpt-4o_top20 (trec_eval) (llm_eval) (paper) coordinators
automatic
yes
no
no
no
Uses a variation of the ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py
First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large); Second Stage (top-3K): RRF(First Stage, monoT5-3B); Third Stage (top-100): RRF(Second Stage, RankZephyr); Generation (top-20): GPT-4o. (A reciprocal rank fusion sketch follows this entry.)
1 (top)
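The three coordinator baselines above fuse rankings with reciprocal rank fusion (RRF) at every stage. Below is a minimal sketch of RRF for illustration only; it is not the ragnarok implementation, and the smoothing constant k=60 is a commonly used default rather than necessarily the value used in these runs.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of segment IDs into a single list.

    rankings: iterable of ranked lists, each ordered best-first.
    k: smoothing constant (60 is a common default; assumed here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, seg_id in enumerate(ranking, start=1):
            scores[seg_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy usage: fuse three rankings of segment IDs.
runs = [["s1", "s2", "s3"], ["s2", "s1", "s4"], ["s3", "s2", "s5"]]
print(reciprocal_rank_fusion(runs))  # "s2" comes first: it ranks highly in all three runs
```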
baseline_rag24.test_l31_70b_instruct_top20 (trec_eval) (llm_eval) (paper) coordinators
automatic
no
yes
no
no
Uses a variation of the ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py
First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large); Second Stage (top-3K): RRF(First Stage, monoT5-3B); Third Stage (top-100): RRF(Second Stage, RankZephyr); Generation (top-20): L3.1-70B-Instruct
2
baseline_rag24.test_command-r-plus_top20 (trec_eval) (llm_eval) (paper) coordinators
automatic
no
yes
no
yes
https://github.com/castorini/ragnarok/blob/06159e704d260b3e243499d0e6290ac9868cbd7c/src/ragnarok/generate/cohere.py#L49
First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large); Second Stage (top-3K): RRF(First Stage, monoT5-3B); Third Stage (top-100): RRF(Second Stage, RankZephyr); Generation (top-20): Command R Plus (API)
3
agtask-bm25-colbert_faiss-gpt4o-llama70b (trec_eval) (llm_eval) (paper) softbank-meisei
automatic
yes
yes
no
no
We manually wrote the prompt, checked the generated responses, and adjusted the prompt for whatever was lacking or not working properly. This loop was repeated until the results were satisfactory.
The generation process for this run is as follows: 1. For each query, the top-20 segments from the provided retrieval list were given along with the prompt to generate the response using GPT-4o (Azure API). 2. Topics for which responses could not be generated, either because they were caught by content filtering or because the format remained inappropriate even after three tries, were filtered out. 3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model. 4. Postprocessing: a. Responses from Llama 3.1 were reformatted. b. A script automatically removes any mistaken inline citations in the response "text" (not the "citations" key) and recalculates the response length (a sketch of this step follows this entry).
1 (top)
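Step 4b above strips mistaken inline citations from the response "text" fields and recalculates the response length. A minimal sketch of such a postprocessing step is shown below; the bracketed-number citation pattern and the word-count length are assumptions, not necessarily what the team's script does.

```python
import re

# Bracketed inline markers such as "[3]" or "[1][4]"; the exact pattern the
# team removed is an assumption.
INLINE_CITATION = re.compile(r"\s*\[\d+\](?:\s*\[\d+\])*")

def strip_inline_citations(answer_sentences):
    """answer_sentences: list of {"text": ..., "citations": [...]} objects.
    Removes inline markers from "text" (the "citations" key is untouched) and
    returns the cleaned sentences plus a recomputed length (word count here).
    """
    for sentence in answer_sentences:
        sentence["text"] = INLINE_CITATION.sub("", sentence["text"]).strip()
    response_length = sum(len(s["text"].split()) for s in answer_sentences)
    return answer_sentences, response_length

cleaned, length = strip_inline_citations(
    [{"text": "Basil needs warmth [2].", "citations": [2]}])
print(cleaned, length)  # [{'text': 'Basil needs warmth.', 'citations': [2]}] 3
```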
ldilab_gpt_4o (trec_eval) (llm_eval) ldisnu
automatic
yes
no
no
yes
For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines.
The best result from the evaluation using GPT-4o was a clear and concise overall score calculation, demonstrating the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process.
1 (top)
Ranked_Iterative_Fact_Extraction_and_Refinement (trec_eval) (llm_eval) (paper) TREMA-UNH
automatic
yes
yes
yes
yes
I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness.
Key fact sentences are first extracted from the highest-ranked document using an LLM, focusing on information directly relevant to the query. These extracted facts are then verified across the remaining documents, with the LLM identifying supporting sentences and confirming their reliability. Additional relevant facts are extracted from the remaining text, and a rule-based method removes redundancies. Finally, a smoothing process improves the flow and coherence of the output, resulting in a polished set of key facts. (A sketch of this extract-verify-refine loop follows this entry.)
1 (top)
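The two TREMA-UNH runs describe an extract-verify-deduplicate-smooth loop over the ranked documents. The sketch below only illustrates that control flow under stated assumptions: llm() is a hypothetical helper, and the prompts are placeholders rather than the team's.

```python
def llm(prompt: str) -> str:
    """Hypothetical LLM call; stands in for whatever model the team used."""
    raise NotImplementedError

def extract_and_refine(query, ranked_docs):
    # 1. Extract key fact sentences relevant to the query from the top document.
    facts = llm(f"List key fact sentences relevant to '{query}':\n{ranked_docs[0]}").splitlines()

    # 2. Verify each fact against the remaining documents, keeping only facts
    #    for which a supporting sentence is found.
    kept = [fact for fact in facts
            if any(llm(f"Quote a sentence supporting this fact, or answer 'no'.\n"
                       f"Fact: {fact}\nDocument: {doc}").strip().lower() != "no"
                   for doc in ranked_docs[1:])]

    # 3. Extract additional relevant facts from the remaining documents.
    for doc in ranked_docs[1:]:
        kept += llm(f"List additional fact sentences relevant to '{query}':\n{doc}").splitlines()

    # 4. Rule-based redundancy removal (an exact-duplicate filter as a stand-in).
    unique = list(dict.fromkeys(kept))

    # 5. Smooth the verified facts into a fluent, coherent answer.
    return llm("Rewrite these facts as a coherent answer:\n" + "\n".join(unique))
```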
Enhanced_Iterative_Fact_Refinement_and_Prioritization (trec_eval) (llm_eval) (paper) TREMA-UNH
automatic
yes
yes
yes
yes
I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness.
The iterative fact verification process follows the RIFER method, with the LLM verifying key facts across documents and removing redundant content. An additional prompt-based refinement step removes any irrelevant parts of the key facts, ensuring conciseness. The key facts are then sorted by relevance using the LLM, and a final smoothing process enhances clarity and coherence, resulting in a polished output.
1 (top)
gpt_mini (trec_eval) (llm_eval) KML
manual
yes
no
no
yes
gpt mini with prompt
gpt mini with prompt.
1 (top)
ginger_top_5 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, summarization of the top clusters, and fluency enhancement to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 5 candidates from the retrieval baseline provided by the organizers. The response is limited to 3 sentences.
1 (top)
baseline_top_5 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We summarize the top 5 candidates from the retrieval baseline provided by the organizers with GPT-4. The response is limited to 3 sentences.
2
ginger-fluency_top_5 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, and summarization of the top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 5 candidates from the retrieval baseline provided by the organizers. The response is limited to 3 sentences.
3
ginger-fluency_top_10 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, and summarization of the top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 10 candidates from the retrieval baseline provided by the organizers.
4
ginger-fluency_top_20 (trec_eval) (llm_eval) (paper) uis-iai
automatic
no
yes
no
yes
We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample.
We use a multi-stage pipeline comprising information nugget detection, clustering, ranking, and summarization of the top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step to ensure that responses are rooted in factual evidence from the passages. This run uses the top 20 candidates from the retrieval baseline provided by the organizers.
5
iiia_dedup_p1_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. (A sketch of this sentence-level attribution step follows this entry.)
1 (top)
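The IIIA runs split the generated answer into sentences and have an annotator model decide, for each (context-doc, sentence) pair, whether the sentence can be cited from that document. A minimal sketch follows; the NLI cross-encoder is a stand-in, since the submissions do not name the annotator model, and the entailment-class index is also an assumption.

```python
import nltk
from sentence_transformers import CrossEncoder

nltk.download("punkt", quiet=True)

# Off-the-shelf NLI cross-encoder as a stand-in annotator (assumption; the
# actual annotator model used in these runs is not named).
annotator = CrossEncoder("cross-encoder/nli-deberta-v3-base")

def annotate_citations(answer_text, context_docs):
    """For each answer sentence, list the context docs it can be cited from.

    context_docs: list of {"docid": ..., "text": ...} dicts (assumed format).
    """
    annotated = []
    for sentence in nltk.sent_tokenize(answer_text):
        pairs = [(doc["text"], sentence) for doc in context_docs]
        scores = annotator.predict(pairs)  # one row of class logits per pair
        supported = [doc["docid"] for doc, row in zip(context_docs, scores)
                     if row.argmax() == 1]  # 'entailment' class index is assumed
        annotated.append({"text": sentence, "citations": supported})
    return annotated
```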
iiia_dedup_p2_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_dedup_p1_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_dedup_p2_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
ag_rag_gpt35_expansion_rrf_20 (trec_eval) (llm_eval) IITD-IRL
automatic
yes
no
no
yes
The generation pipeline utilizes all 20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. It was an automatic run. (A sketch of the schema-constrained output follows this entry.)
This run aims to leverage the full context limit with a larger LLM.
1 (top)
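The run above constrains the GPT-3.5 output with a Pydantic schema so the response follows a defined format with IEEE-style citations. Below is a minimal sketch of such a schema and of validating a raw JSON output against it; the field names are illustrative assumptions, not the team's actual schema.

```python
from typing import List
from pydantic import BaseModel, Field

# Illustrative schema (assumption); the team's actual field names and
# structure are not given in the submission. Uses Pydantic v2.
class CitedSentence(BaseModel):
    text: str = Field(description="One sentence of the answer")
    citations: List[int] = Field(description="Indices of supporting segments, IEEE-style [n]")

class RagAnswer(BaseModel):
    answer: List[CitedSentence]

# Validate a raw model output (a JSON string) before building the submission.
raw = '{"answer": [{"text": "Basil prefers full sun.", "citations": [2, 5]}]}'
validated = RagAnswer.model_validate_json(raw)
print(validated.answer[0].citations)  # [2, 5]
```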
iiia_standard_p1_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_standard_p2_reverse_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_standard_p1_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
iiia_standard_p2_straight_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc.
1 (top)
ag_rag_mistral_expansion_rrf_20 (trec_eval) (llm_eval) IITD-IRL
automatic
no
yes
no
yes
The generation pipeline utilizes all 20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following the instruction to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It uses Mistral for generation and NLTK to chunk the sentences. Post-processing is then applied to extract citations (a sketch follows this entry). It was an automatic run.
This run aims to leverage a small LLM with the top-20 passages, using the prompt to steer the model toward the desired response.
2
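The Mistral runs instruct the model to emit the response between << and >>, chunk it into sentences with NLTK, and post-process to extract citations. A minimal sketch under those assumptions (bracketed [n] markers are assumed as the IEEE-style citations):

```python
import re
import nltk

nltk.download("punkt", quiet=True)

def postprocess(raw_output):
    """Extract the response between << and >>, chunk it into sentences with
    NLTK, and pull bracketed [n] citations out of each sentence."""
    match = re.search(r"<<(.*?)>>", raw_output, flags=re.DOTALL)
    response = match.group(1).strip() if match else raw_output.strip()

    result = []
    for sentence in nltk.sent_tokenize(response):
        cites = [int(n) for n in re.findall(r"\[(\d+)\]", sentence)]
        result.append({"text": re.sub(r"\s*\[\d+\]", "", sentence).strip(),
                       "citations": cites})
    return result

print(postprocess("<<Basil needs warmth [3]. It dislikes frost [1][4].>>"))
```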
ag_rag_mistral_expansion_rrf_15 (trec_eval) (llm_eval) IITD-IRL
automatic
no
yes
no
yes
The generation pipeline utilizes the top 15 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following the instruction to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It uses Mistral for generation and NLTK to chunk the sentences. Post-processing is then applied to extract citations. It was an automatic run.
This run aims to leverage a small LLM with the top-15 passages, using the prompt to steer the model toward the desired response.
5
ag_rag_mistral_expansion_rrf_7 (trec_eval) (llm_eval) IITD-IRL
automatic
no
yes
no
yes
The generation pipeline utilizes the top 7 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following the instruction to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It uses Mistral for generation and NLTK to chunk the sentences. Post-processing is then applied to extract citations. It was an automatic run.
This run aims to leverage a small LLM with the top-7 passages, using the prompt to steer the model toward the desired response, and also to evaluate the quality of the retrieval.
6
ag_rag_gpt35_expansion_rrf_15 (trec_eval) (llm_eval) IITD-IRL
automatic
yes
no
no
yes
The generation pipeline utilizes the top-20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. It was an automatic run.
This experiment aims to compare the performance of the bigger LLM with the smaller one: given a similar set of passages, which can generate more relevant responses?
3
ag_rag_gpt35_expansion_rrf_7 (trec_eval) (llm_eval) IITD-IRL
automatic
yes
no
no
yes
The generation pipeline utilizes the first 7 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format.
This run examines system performance when only a few passages are supplied, and how much information the LLM adds beyond the passages provided in the context.
4
cir_gpt-4o-mini_Jaccard_50_0.5_100_301_p0 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Jaccard, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: Summarization Prompt. (An MMR sketch follows this entry.)
5
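The CIR runs rerank the retrieved segments with maximal marginal relevance (MMR) for diversity, parameterized by a similarity measure (Jaccard or cosine) and a lambda value encoded in the run name. A minimal sketch of greedy MMR with token-set Jaccard similarity follows; the relevance scores and inputs are placeholders.

```python
def jaccard(a, b):
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def mmr(segments, relevance, lam=0.5, k=50):
    """Greedy MMR: trade relevance off against similarity to already selected
    segments. lam=1.0 reduces to plain relevance ranking; lam=0.5 and k=50
    correspond to the parameters encoded in this run's name."""
    selected, candidates = [], list(range(len(segments)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((jaccard(segments[i], segments[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

segments = ["basil likes sun", "basil likes full sun", "water basil weekly"]
print(mmr(segments, relevance=[1.0, 0.9, 0.8], lam=0.5, k=2))  # [0, 2]: 2 is less redundant than 1
```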
cir_gpt-4o-mini_Jaccard_50_1.0_100_301_p0 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Jaccard, MMR Lambda: 1.0, Number of reranked segments used for generation: 50, Prompt: Summarization Prompt
7
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
1 (top)
cir_gpt-4o-mini_Cosine_50_0.25_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.25, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
8
iiia_standard_p1_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_standard_p2_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
webis-ag-run0-taskrag (trec_eval) (llm_eval) (paper) webis
automatic
yes
yes
yes
no
Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step.
We decompose the RAG pipeline into 3 individual generation tasks. 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies Extract to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the merged evidence into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform with the final submission format. (A structural sketch follows this entry.)
1 (top)
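The webis taskrag runs split generation into three prompted tasks (Extract, Combine, Condense), merge per-document extracts pairwise in a tree-like fashion, and parse explicit [n] references from the final response. The sketch below mirrors only that structure; llm() is a hypothetical helper and the prompts are placeholders, not the team's.

```python
import re

def llm(prompt: str) -> str:
    """Hypothetical LLM call standing in for the models used by the run."""
    raise NotImplementedError

def extract(query, doc, idx):
    return llm(f"Extract the most salient information for '{query}' from document [{idx}]:\n{doc}")

def combine(a, b):
    return llm(f"Merge these two pieces of evidence, keeping their [n] references:\n{a}\n{b}")

def condense(query, evidence):
    return llm(f"Write a concise answer to '{query}' from this evidence, keeping [n] references:\n{evidence}")

def taskrag(query, docs):
    # Extract from every document, then merge pairwise, tree-like, until one
    # piece of evidence remains.
    level = [extract(query, d, i) for i, d in enumerate(docs)]
    while len(level) > 1:
        level = [combine(level[i], level[i + 1]) if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    response = condense(query, level[0])
    # Parse explicit references such as [0] into the submission's citation format.
    citations = sorted({int(n) for n in re.findall(r"\[(\d+)\]", response)})
    return response, citations
```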
cir_gpt-4o-mini_Cosine_50_0.75_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.75, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
9
webis-ag-run1-taskrag (trec_eval) (llm_eval) (paper) webis
automatic
yes
yes
yes
no
Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step.
We decompose the RAG pipeline into 3 individual generation tasks. 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies Extract to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the merged evidence into the final response. Attribution is achieved by prompting the model to include explicit references, i.e., [0], at each step. References are then parsed using regex to conform with the final submission format.
2
cir_gpt-4o-mini_Cosine_50_1.0_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 1.0, Number of reranked segments used for generation: 50, Prompt: Explicit RAG task prompt
6
cir_gpt-4o-mini_Cosine_20_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 20, Prompt: Explicit RAG task prompt
10 (bottom)
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p2 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: RAG for beginner level
3
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p3 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Reranking: Rerank retrieval results with MMR method for diversity, MMR similarity: Cosine, MMR Lambda: 0.5, Number of reranked segments used for generation: 50, Prompt: RAG for expert level
4
iiia_standard_p1_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_standard_p2_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
cir_gpt-4o-mini_no_reranking_50_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) CIR
automatic
yes
no
no
no
The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions.
Model: gpt-4o-mini, Number of segments used for generation: 50, Prompt: Explicit RAG task prompt
2
iiia_dedup_p1_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
webis-ag-run3-reuserag (trec_eval) (llm_eval) (paper) webis
automatic
no
no
yes
no
N/A
Segments from the baseline run were clustered automatically using SBERT embeddings. The top-ranked sentences from each cluster were concatenated to form the response. (A clustering sketch follows this entry.)
4
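The run above clusters sentences from the baseline segments with SBERT embeddings and concatenates the top-ranked sentences of the clusters. A minimal sketch with sentence-transformers and k-means follows; the embedding model, the cluster count, and the "best-ranked sentence per cluster" selection are assumptions.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

# Embedding model and cluster count are assumptions; the submission only
# states that SBERT embeddings and automatic clustering were used.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cluster_and_compose(sentences, ranks, n_clusters=3):
    """Cluster sentences, then pick the best-ranked sentence from each cluster
    (ranks follow the baseline retrieval order, lower = better) and join them."""
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)
    picked = []
    for cluster in range(n_clusters):
        members = [i for i, label in enumerate(labels) if label == cluster]
        if members:
            picked.append(min(members, key=lambda i: ranks[i]))
    return " ".join(sentences[i] for i in sorted(picked, key=lambda i: ranks[i]))
```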
webis-ag-run2-reuserag (trec_eval) (llm_eval) (paper) webis
automatic
no
no
yes
no
Baseline introduction, middle, and conclusion sentences were used as 'prompts' to cluster sentences into 3 groups.
This run uses the baseline retrieval run as retrieval input. For generation, all sentences are split into 3 groups based on the prompt sentences by computing semantic similarity with SBERT. We then concatenate the top-ranked sentences to form the response.
3
iiia_dedup_p1_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_dedup_p2_reverse_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in reverse order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
iiia_dedup_p2_straight_ht_ag (trec_eval) (llm_eval) IIIA-UNIPD
automatic
no
yes
yes
no
I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are provided in order of relevance. In this run the prompt also states that the answer should be a short statement.
This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement, using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence can be cited from the context-doc. In this run, a higher "temperature" value and a lower "top-p" value were used.
1 (top)
cohere+post_processing (trec_eval) (llm_eval) KML
manual
yes
no
no
no
Cohere prompt and post-processing.
A Cohere prompt for each individual answer and a single prompt for the whole RAG response; post-processing was then applied to the answers.
2
UDInfolab.AG-v1 (trec_eval) (llm_eval) InfoLab
manual
yes
no
no
no
topk20 + openai
topk20 + openai
2
UDInfolab.AG-v2 (trec_eval) (llm_eval) InfoLab
automatic
yes
no
no
no
automatic
Topk20+AG
1 (top)
gpt_mini_double_prompt (trec_eval) (llm_eval) KML
manual
yes
no
no
no
A two-step prompt instead of a single prompt: the first step for finding the answer, the second for citations.
A two-step prompt instead of a single prompt: the first step for finding the answer, the second for citations.
1 (top)