Runtag | Org | Is this a manual run (a human making choices based on seeing the topics/results) or an automatic run? | Does this run leverage proprietary models in any step of the generation pipeline? | Does this run leverage open-weight LLMs (> 5B parameters) in any step of the generation pipeline? | Does this run leverage smaller open-weight language models in any step of the generation pipeline? | Was this run padded with results from a baseline run? | Please describe how you went about prompt engineering the generation pipeline | Please provide a short description of this run | Please give this run a priority for inclusion in manual assessments |
---|---|---|---|---|---|---|---|---|---|
baseline_rag24.test_gpt-4o_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | yes | no | no | no | Uses a variation of ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large)<br>Second Stage (top-3K): RRF(First Stage, monoT5-3B)<br>Third Stage (top-100): RRF(Second Stage, RankZephyr)<br>Generation (top-20): GPT-4o | 1 (top) |
baseline_rag24.test_l31_70b_instruct_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | no | yes | no | no | Uses a variation of ChatQA template: https://github.com/castorini/ragnarok/blob/main/src/ragnarok/generate/templates/chat_qa.py | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large)<br>Second Stage (top-3K): RRF(First Stage, monoT5-3B)<br>Third Stage (top-100): RRF(Second Stage, RankZephyr)<br>Generation (top-20): L3.1-70B-Instruct | 2 |
baseline_rag24.test_command-r-plus_top20 (trec_eval) (llm_eval) (paper) | coordinators | automatic | no | yes | no | yes | https://github.com/castorini/ragnarok/blob/06159e704d260b3e243499d0e6290ac9868cbd7c/src/ragnarok/generate/cohere.py#L49 | First Stage (top-3K): RRF(BM25 + Rocchio, Snowflake Embed L, Snowflake Embed M, GTE Large)<br>Second Stage (top-3K): RRF(First Stage, monoT5-3B)<br>Third Stage (top-100): RRF(Second Stage, RankZephyr)<br>Generation (top-20): Command R Plus (API) | 3 |
agtask-bm25-colbert_faiss-gpt4o-llama70b (trec_eval) (llm_eval) (paper) | softbank-meisei | automatic | yes | yes | no | no | Manually wrote the prompt, checked the generated responses, and adjusted the prompt for whatever was lacking or not working properly. This loop was repeated until the results were satisfactory. | The generation process of this run is as follows:<br>1. The top-20 segments from the provided retrieval list for each query were given, together with the prompt, to GPT-4o (Azure API) to generate the response.<br>2. Topics for which no response could be generated, either because they were caught by content filtering or because the format was still invalid after three tries, were filtered out.<br>3. Responses for the filtered topics alone were generated using the Llama 3.1 70B Instruct model.<br>4. Postprocessing:<br>a. Reformatted the responses from Llama 3.1.<br>b. A script automatically removed any mistaken inline citations from the response "text" (not the "citations" key) and recalculated the response length. | 1 (top) |
ldilab_gpt_4o (trec_eval) (llm_eval) | ldisnu | automatic | yes | no | no | yes | For prompt engineering the generation pipeline, I created a clear, structured prompt that provided specific guidelines for evaluating responses. The prompt focused on four key criteria: citation requirement identification, citation supportiveness, fluency, and nugget coverage. I ensured the instructions were concise and emphasized returning only the final numerical score without explanations, reinforcing the importance of adhering strictly to the guidelines. | The best result from the evaluation using GPT-4o was a clear and concise overall score calculation, demonstrating the model's ability to follow the structured prompt and accurately apply the provided criteria to the evaluation process. | 1 (top) |
Ranked_Iterative_Fact_Extraction_and_Refinement (trec_eval) (llm_eval) (paper) | TREMA-UNH | automatic | yes | yes | yes | yes | I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness. | Key fact sentences are first extracted from the highest-ranked document using an LLM, focusing on information directly relevant to the query. These extracted facts are then verified across the remaining documents, with the LLM identifying supporting sentences and confirming their reliability. Additional relevant facts are extracted from the remaining text, and a rule-based method removes redundancies. Finally, a smoothing process improves the flow and coherence of the output, resulting in a polished set of key facts. | 1 (top) |
Enhanced_Iterative_Fact_Refinement_and_Prioritization (trec_eval) (llm_eval) (paper) | TREMA-UNH | automatic | yes | yes | yes | yes | I started with a basic prompt and improved it using an LLM to enhance clarity and grammar. After checking the results, I refined the prompt further by adding specific sentences to improve its effectiveness. | The iterative fact verification process follows the RIFER method, with the LLM verifying key facts across documents and removing redundant content. An additional prompt-based refinement step removes any irrelevant parts of the key facts, ensuring conciseness. The key facts are then sorted by relevance using the LLM, and a final smoothing process enhances clarity and coherence, resulting in a polished output. | 1 (top) |
gpt_mini (trec_eval) (llm_eval) | KML | manual | yes | no | no | yes | gpt mini with prompt | gpt mini with prompt. | 1 (top) |
ginger_top_5 (trec_eval) (llm_eval) (paper) | uis-iai | automatic | no | yes | no | yes | We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample. | We use a multi-stage pipeline consisting of information nugget detection, clustering, ranking, summarization of top clusters, and fluency enhancement to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step of the generation pipeline to ensure that responses are rooted in factual evidence from the passages. This run uses the top 5 candidates from the retrieval baseline provided by the organizers. The response is limited to 3 sentences. | 1 (top) |
baseline_top_5 (trec_eval) (llm_eval) (paper) | uis-iai | automatic | no | yes | no | yes | We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample. | We summarize the top 5 candidates from the retrieval baseline provided by the organizers with GPT-4. The response is limited to 3 sentences. | 2 |
ginger-fluency_top_5 (trec_eval) (llm_eval) (paper) | uis-iai | automatic | no | yes | no | yes | We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample. | We use a multi-stage pipeline consisting of information nugget detection, clustering, ranking, and summarization of top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step of the generation pipeline to ensure that responses are rooted in factual evidence from the passages. This run uses the top 5 candidates from the retrieval baseline provided by the organizers. The response is limited to 3 sentences. | 3 |
ginger-fluency_top_10 (trec_eval) (llm_eval) (paper) | uis-iai | automatic | no | yes | no | yes | We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample. | We use a multi-stage pipeline consisting of information nugget detection, clustering, ranking, and summarization of top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step of the generation pipeline to ensure that responses are rooted in factual evidence from the passages. This run uses the top 10 candidates from the retrieval baseline provided by the organizers. | 4 |
ginger-fluency_top_20 (trec_eval) (llm_eval) (paper) | uis-iai | automatic | no | yes | no | yes | We performed experiments on a subset of the TREC CAsT dataset and tuned the prompts manually to ensure the best results on the validation sample. | We use a multi-stage pipeline consisting of information nugget detection, clustering, ranking, and summarization of top clusters to generate responses. Generation is performed at the sentence level. Our pipeline operates on information nuggets at every step of the generation pipeline to ensure that responses are rooted in factual evidence from the passages. This run uses the top 20 candidates from the retrieval baseline provided by the organizers. | 5 |
iiia_dedup_p1_reverse_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
iiia_dedup_p2_reverse_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
iiia_dedup_p1_straight_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
iiia_dedup_p2_straight_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
ag_rag_gpt35_expansion_rrf_20 (trec_eval) (llm_eval) | IITD-IRL | automatic | yes | no | no | yes | The generation pipeline utilizes all 20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce a structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. Attempts to use smaller LLMs for generation were unstable, resulting in issues with sentence organization and citations. It was an automatic run. | This run aims to leverage the full context limit with a larger LLM. | 1 (top) |
iiia_standard_p1_reverse_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
iiia_standard_p2_reverse_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
iiia_standard_p1_straight_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
iiia_standard_p2_straight_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. | 1 (top) |
ag_rag_mistral_expansion_rrf_20 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | yes | The generation pipeline utilizes all 20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce a structured output by instructing it to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It utilizes Mistral and NLTK to chunk the sentences. Post-processing is then applied to extract the citations. It was an automatic run. | This run aims to leverage a small LLM with the top-20 passages, using the prompt to steer the model toward the desired response. | 2 |
ag_rag_mistral_expansion_rrf_15 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | yes | The generation pipeline utilizes the top 15 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce a structured output by instructing it to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It utilizes Mistral and NLTK to chunk the sentences. Post-processing is then applied to extract the citations. It was an automatic run. | This run aims to leverage a small LLM with the top-15 passages, using the prompt to steer the model toward the desired response. | 5 |
ag_rag_mistral_expansion_rrf_7 (trec_eval) (llm_eval) | IITD-IRL | automatic | no | yes | no | yes | The generation pipeline utilizes the top 7 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce a structured output by instructing it to generate the response between << and >>. The model was instructed to generate concise responses with citations in IEEE format. It utilizes Mistral and NLTK to chunk the sentences. Post-processing is then applied to extract the citations. It was an automatic run. | This run aims to leverage a small LLM with the top-7 passages, using the prompt to steer the model toward the desired response, and also to evaluate the quality of retrieval. | 6 |
ag_rag_gpt35_expansion_rrf_15 (trec_eval) (llm_eval) | IITD-IRL | automatic | yes | no | no | yes | The generation pipeline utilizes top-20 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce a structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. It was an automatic run. | This experiment aims to compare the performance of the bigger LLM with the smaller one: given a similar set of passages, which can generate relevant responses? | 3 |
ag_rag_gpt35_expansion_rrf_7 (trec_eval) (llm_eval) | IITD-IRL | automatic | yes | no | no | yes | The generation pipeline utilizes the first 7 segments from the provided baseline [retrieve_results_fs4_bm25+rocchio_snowael_snowaem_gtel+monot5_rrf+rz_rrf.rag24.test_top100.jsonl]. The model was guided to produce a structured output by following a Pydantic schema, ensuring the response adhered to a defined format. The model was instructed to generate concise responses with citations in IEEE format. | This run examines system performance when only a few passages are supplied, and how much information the LLM adds beyond the passages provided in the context. | 4 |
cir_gpt-4o-mini_Jaccard_50_0.5_100_301_p0 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Jaccard<br>MMR Lambda: 0.5<br>Number of reranked segments used for generation: 50<br>Prompt: Summarization Prompt | 5 |
cir_gpt-4o-mini_Jaccard_50_1.0_100_301_p0 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Jaccard<br>MMR Lambda: 1.0<br>Number of reranked segments used for generation: 50<br>Prompt: Summarization Prompt | 7 |
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 0.5<br>Number of reranked segments used for generation: 50<br>Prompt: Explicit RAG task prompt | 1 (top) |
cir_gpt-4o-mini_Cosine_50_0.25_100_301_p1 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 0.25<br>Number of reranked segments used for generation: 50<br>Prompt: Explicit RAG task prompt | 8 |
iiia_standard_p1_straight_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p2_straight_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
webis-ag-run0-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | yes | yes | yes | no | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | We decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies Extract to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the merged evidence into the final response. Attribution is achieved by prompting the model to include explicit references, e.g., [0], at each step. References are then parsed using regex to conform to the final submission format. | 1 (top) |
cir_gpt-4o-mini_Cosine_50_0.75_100_301_p1 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 0.75<br>Number of reranked segments used for generation: 50<br>Prompt: Explicit RAG task prompt | 9 |
webis-ag-run1-taskrag (trec_eval) (llm_eval) (paper) | webis | automatic | yes | yes | yes | no | Prompts were formulated using an iterative manual reformulation approach, with feedback regarding the quality of each prompted task at each step. | We decompose the RAG pipeline into 3 individual generation tasks: 'Extract' yields the most salient information from a doc given a query-doc pair; 'Combine' merges the extracted information of two docs; 'Condense' reformulates the merged evidence into a final response. The pipeline first applies Extract to each document, then combines all documents with pairwise merges in a tree-like fashion, and finally condenses the merged evidence into the final response. Attribution is achieved by prompting the model to include explicit references, e.g., [0], at each step. References are then parsed using regex to conform to the final submission format. | 2 |
cir_gpt-4o-mini_Cosine_50_1.0_100_301_p1 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 1.0<br>Number of reranked segments used for generation: 50<br>Prompt: Explicit RAG task prompt | 6 |
cir_gpt-4o-mini_Cosine_20_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 0.5<br>Number of reranked segments used for generation: 20<br>Prompt: Explicit RAG task prompt | 10 (bottom) |
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p2 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 0.5<br>Number of reranked segments used for generation: 50<br>Prompt: RAG for beginner level | 3 |
cir_gpt-4o-mini_Cosine_50_0.5_100_301_p3 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Reranking: Rerank retrieval results with MMR method for diversity<br>MMR similarity: Cosine<br>MMR Lambda: 0.5<br>Number of reranked segments used for generation: 50<br>Prompt: RAG for expert level | 4 |
iiia_standard_p1_reverse_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_standard_p2_reverse_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
cir_gpt-4o-mini_no_reranking_50_0.5_100_301_p1 (trec_eval) (llm_eval) (paper) | CIR | automatic | yes | no | no | no | The prompt template was designed according to different tasks and personas and adapted to achieve an output format that complies with the guidelines. The templates were manually tested with subsets of the topics, and the results were evaluated in group discussions. | Model: gpt-4o-mini<br>Number of segments used for generation: 50<br>Prompt: Explicit RAG task prompt | 2 |
iiia_dedup_p1_straight_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
webis-ag-run3-reuserag (trec_eval) (llm_eval) (paper) | webis | automatic | no | no | yes | no | N/A | Segments from the baseline run were clustered automatically using SBERT embeddings. The top ranked sentences from each cluster were concatenated to form the response. | 4 |
webis-ag-run2-reuserag (trec_eval) (llm_eval) (paper) | webis | automatic | no | no | yes | no | Baseline introduction, middle, and conclusion sentences were used as 'prompts' to cluster sentences into 3 groups. | This run uses the baseline retrieval run as retrieval input. For generation, all sentences are split into 3 groups based on their semantic similarity to the prompt sentences, calculated with SBERT. We then concatenate the top-ranked sentences to form the response. | 3 |
iiia_dedup_p1_reverse_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p2_reverse_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in reverse order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in reverse order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
iiia_dedup_p2_straight_ht_ag (trec_eval) (llm_eval) | IIIA-UNIPD | automatic | no | yes | yes | no | I created a simple prompt asking the model to answer the question based on the context, then appended the context and the question. In this run the context documents are prompted in order of relevance. In this run the prompt also states that the answer should be a short statement. | This run uses the top-20 deduplicated documents from the baseline as context. The context is passed in order of relevance to a generator (LLAMA-8B-Instruct) with the instruction to answer the query with a short statement using only the information from the provided context. The answer was then split into sentences, and each (context-doc, sentence) pair was passed to an annotator model that checks whether the sentence is supported by the context-doc. In this run a higher value for "temperature" and a lower value for "top-p" were used. | 1 (top) |
cohere+post_processing (trec_eval) (llm_eval) | KML | manual | yes | no | no | no | Cohere prompt and post-processing. | A Cohere prompt for each individual answer and a single prompt for the whole RAG response.<br>Post-processing was done on the answers. | 2 |
UDInfolab.AG-v1 (trec_eval) (llm_eval) | InfoLab | manual | yes | no | no | no | topk20 + openai | topk20 + openai | 2 |
UDInfolab.AG-v2 (trec_eval) (llm_eval) | InfoLab | automatic | yes | no | no | no | automatic | Topk20+AG | 1 (top) |
gpt_mini_double_prompt (trec_eval) (llm_eval) | KML | manual | yes | no | no | no | A two-step prompt instead of a single prompt: the first step for finding the answer, the second for citations. | A two-step prompt instead of a single prompt: the first step for finding the answer, the second for citations. | 1 (top) |
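
Several techniques named in the table recur across many runs; the sketches below illustrate them. First, the coordinator baseline runs fuse lexical and dense retrieval stages with reciprocal rank fusion (RRF). This is a minimal sketch of RRF only; the function name and the smoothing constant are illustrative and not taken from the ragnarok codebase.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of ranked lists, each ordered best-first.
    k: smoothing constant; 60 is the value commonly used in the literature.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a lexical and a dense run for one query.
bm25_run = ["d3", "d1", "d7"]
dense_run = ["d1", "d9", "d3"]
print(reciprocal_rank_fusion([bm25_run, dense_run]))  # ['d1', 'd3', 'd9', 'd7']
```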
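
The CIR runs rerank retrieved segments with maximal marginal relevance (MMR) to trade query relevance against redundancy, using Jaccard or cosine similarity and a lambda between 0.25 and 1.0. A hedged sketch of greedy MMR selection follows; the function names and the placeholder `relevance`/`similarity` callables are illustrative, not the CIR implementation.

```python
def mmr_rerank(query, segments, relevance, similarity, lam=0.5, top_k=50):
    """Greedily select top_k segments by maximal marginal relevance.

    relevance(query, seg) and similarity(seg_a, seg_b) return floats where
    higher means more relevant / more similar. lam=1.0 reproduces the original
    relevance ordering; smaller values favour diversity.
    """
    selected, remaining = [], list(segments)
    while remaining and len(selected) < top_k:
        def mmr_score(seg):
            redundancy = max((similarity(seg, s) for s in selected), default=0.0)
            return lam * relevance(query, seg) - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

def jaccard(a, b):
    """Token-set Jaccard similarity, one of the two options the CIR runs name."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0
```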
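
The IITD-IRL GPT-3.5 runs mention constraining the generator's output with a Pydantic schema. The following is a minimal sketch of what such a schema could look like; the field names are assumptions for illustration, not the team's actual schema.

```python
from pydantic import BaseModel, Field

class CitedSentence(BaseModel):
    text: str = Field(description="One answer sentence, without inline citation markers")
    citations: list[int] = Field(description="Indices of the supporting segments")

class RagAnswer(BaseModel):
    answer: list[CitedSentence]

# The JSON schema (RagAnswer.model_json_schema()) can be placed in the prompt or a
# function-calling request, and the raw reply validated with
# RagAnswer.model_validate_json(raw_response).
```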
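
The IIIA-UNIPD runs attach citations by passing each (context-doc, sentence) pair to an annotator model that decides whether the sentence is supported by that document. The run descriptions do not say which annotator model is used; the sketch below substitutes an off-the-shelf NLI cross-encoder and an arbitrary threshold purely to show the shape of such a check.

```python
from sentence_transformers import CrossEncoder

def attribute_sentences(answer_sentences, context_docs, threshold=0.5,
                        model_name="cross-encoder/nli-deberta-v3-base"):
    """Return, for each answer sentence, the indices of context docs that support it."""
    model = CrossEncoder(model_name)
    attributed = []
    for sentence in answer_sentences:
        pairs = [(doc, sentence) for doc in context_docs]
        # For this model family the label order is (contradiction, entailment, neutral).
        probs = model.predict(pairs, apply_softmax=True)
        cited = [i for i, p in enumerate(probs) if p[1] >= threshold]
        attributed.append({"text": sentence, "citations": cited})
    return attributed
```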
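
Both the webis 'taskrag' runs and the softbank-meisei postprocessing step strip or parse inline [i] markers to produce the citation format of the submission. A small illustrative helper; the return shape is simplified and not the official submission format.

```python
import re

CITATION = re.compile(r"\s*\[(\d+)\]")

def split_citations(sentence):
    """Return (clean_text, cited_indices) for one generated sentence."""
    indices = [int(m) for m in CITATION.findall(sentence)]
    clean = CITATION.sub("", sentence).strip()
    return clean, indices

text, cites = split_citations("Osteoporosis weakens bones over time [0] [3].")
# text == "Osteoporosis weakens bones over time." ; cites == [0, 3]
```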
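
Finally, the webis 'reuserag' runs build the response without an LLM by grouping candidate sentences via SBERT embeddings and concatenating the top-ranked sentence from each group. A rough sketch under the assumption of a generic SBERT model and k-means with three clusters; the team's exact model and clustering configuration are not given in the table.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

def reuse_response(sentences, n_clusters=3, model_name="all-MiniLM-L6-v2"):
    """sentences: candidate sentences in retrieval order, best first."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    # Keep the highest-ranked (earliest) sentence from each cluster, then join them.
    picked = {}
    for sentence, label in zip(sentences, labels):
        picked.setdefault(label, sentence)
    return " ".join(picked[label] for label in sorted(picked))
```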