The Thirty-Third Text REtrieval Conference
(TREC 2024)

Biomedical Generative Retrieval Main Task Appendix

Runtag | Org | Is this a generation-only run (Retrieval is not used!)? | Describe the document retrieval model used in this run. | Describe the LLMs used to generate the answers. | Please provide a short description of this run. | Additional details or comments. | Judging precedence
zero-shot-gpt4o-mini (paper) ur-iw
no
1. Query Expansion and Transformation with GPT4o-mini (gpt-4o-mini-2024-07-18) into a boolean query for the Elasticsearch query_string syntax. 2. BM25-based retrieval with default Elasticsearch scoring on title and abstract fields in a PubMed snapshot index. 3. (optional) Query refinement on 0 results (once). 4. Relevant snippet extraction on top 50 results. 5. Reranking of snippets and selecting the top 20 snippets. (A sketch of steps 1-3 follows this entry.)
Proprietary closed-source low-cost commercial model from OpenAI: GPT-4o mini.
Exploring the one-shot performance of gpt4o-mini for query expansion and transformation. Retrieval with Elasticsearch on title and abstract. Optional query refinement on 0 results. Zero-shot snippet extraction and snippet reranking based on question relevance with gpt4o-mini. One-shot answer generation with gpt4o-mini based on retrieved snippets.
2
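The query-expansion and BM25 steps of the run above can be illustrated with a minimal sketch. The index name ("pubmed"), field names, prompt wording, and refinement strategy shown here are illustrative assumptions, not details taken from the run; only the OpenAI and Elasticsearch client APIs are real.

```python
# Minimal sketch of LLM query expansion into an Elasticsearch query_string query,
# followed by BM25 retrieval on title/abstract, with one refinement retry on 0 hits.
from openai import OpenAI
from elasticsearch import Elasticsearch

llm = OpenAI()                                   # reads OPENAI_API_KEY from the environment
es = Elasticsearch("http://localhost:9200")      # assumed local PubMed snapshot index

def expand_query(question: str) -> str:
    """Ask the LLM to rewrite the question as a boolean query_string query."""
    resp = llm.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[
            {"role": "system",
             "content": "Rewrite the biomedical question as an Elasticsearch "
                        "query_string boolean query (AND/OR, quotes for phrases). "
                        "Return only the query."},           # prompt wording is an assumption
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content.strip()

def retrieve(question: str, k: int = 50):
    """BM25 retrieval on title and abstract; refine the query once if nothing is found."""
    bool_query = expand_query(question)
    for _ in range(2):
        hits = es.search(
            index="pubmed",
            query={"query_string": {"query": bool_query,
                                    "fields": ["title", "abstract"]}},
            size=k,
        )["hits"]["hits"]
        if hits:
            return hits
        bool_query = expand_query(question + " (the previous query returned no results; broaden it)")
    return []
```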
zero-shot-gemini-flash (paper) ur-iw
no
1. Query Expansion and Transformation with gemini-1.5-flash (gemini-1.5-flash-001) into a boolean query for the Elasticsearch query_string syntax. 2. BM25-based retrieval with default Elasticsearch scoring on title and abstract fields in a PubMed snapshot index. 3. (optional) Query refinement on 0 results (once). 4. Relevant snippet extraction on top 50 results. 5. Reranking of snippets and selecting the top 20 snippets.
Proprietary closed-source low-cost commercial model from Google: gemini-1.5-flash.
Exploring the one-shot performance of gemini-1.5-flash for query expansion and transformation. Retrieval with Elasticsearch on title and abstract. Optional query refinement on 0 results. Zero-shot snippet extraction and snippet reranking based on question relevance with gemini-1.5-flash. One-shot answer generation with gemini-1.5-flash based on retrieved snippets.
2
ten-shot-gpt4o-mini (paper) ur-iw
no
1. Query Expansion and Transformation with GPT4o-mini (gpt-4o-mini-2024-07-18) into a boolean query for the Elasticsearch query_string syntax. 2. BM25-based retrieval with default Elasticsearch scoring on title and abstract fields in a PubMed snapshot index. 3. (optional) Query refinement on 0 results (once). 4. Relevant snippet extraction on top 50 results. 5. Reranking of snippets and selecting the top 20 snippets.
Proprietary closed-source low-cost commercial model from OpenAI: GPT-4o mini.
Exploring the few-shot (ten-shot) performance of gpt4o-mini for query expansion and transformation. Retrieval with Elasticsearch on title and abstract. Optional query refinement on 0 results. Ten-shot snippet extraction and snippet reranking based on question relevance with gpt4o-mini. Ten-shot answer generation with gpt4o-mini based on retrieved snippets. Few-shot examples were taken from the BioASQ dataset for the retrieval tasks. For answer generation, the examples were taken from a preliminary run over the test set, with an LLM judge scoring the answer according to the evaluation instructions on the Biogen homepage. (A sketch of the few-shot prompt assembly follows this entry.)
1 (highest priority)
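A minimal sketch of how ten in-context examples might be packed into a chat prompt for the snippet-reranking step of the run above. The example fields ("question", "snippets", "ranking"), the prompt wording, and the helper names are assumptions; the BioASQ examples themselves are not reproduced here.

```python
# Sketch of ten-shot prompt assembly for snippet reranking (top-20 selection).
def format_task(question, snippets):
    """Number the snippets and pair them with the question."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(snippets, 1))
    return f"Question: {question}\nSnippets:\n{numbered}"

def build_few_shot_messages(examples, question, snippets):
    """examples: list of dicts with 'question', 'snippets', 'ranking' fields (assumed format)."""
    messages = [{"role": "system",
                 "content": "Rank the snippets by relevance to the question and "
                            "return the indices of the 20 most relevant ones."}]
    for ex in examples[:10]:                      # ten in-context demonstrations
        messages.append({"role": "user",
                         "content": format_task(ex["question"], ex["snippets"])})
        messages.append({"role": "assistant",
                         "content": ", ".join(map(str, ex["ranking"][:20]))})
    messages.append({"role": "user", "content": format_task(question, snippets)})
    return messages
```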
ten-shot-gemini-flash (paper) ur-iw
no
1. Query Expansion and Transformation with gemini-1.5-flash-001 into a boolean query for the Elasticsearch query_string syntax. 2. BM25-based retrieval with default Elasticsearch scoring on title and abstract fields in a PubMed snapshot index. 3. (optional) Query refinement on 0 results (once). 4. Relevant snippet extraction on top 50 results. 5. Reranking of snippets and selecting the top 20 snippets.
Proprietary closed-source low-cost commercial model from Google: gemini-1.5-flash-001.
Exploring the few-shot (ten-shot) performance of gemini-1.5-flash-001 for query expansion and transformation. Retrieval with Elasticsearch on title and abstract. Optional query refinement on 0 results. Ten-shot snippet extraction and snippet reranking based on question relevance with gemini-1.5-flash-001. Ten-shot answer generation with gemini-1.5-flash-001 based on retrieved snippets. Few-shot examples were taken from the BioASQ dataset for the retrieval tasks. For answer generation, the examples were taken from a preliminary run over the test set, with an LLM judge scoring the answer according to the evaluation instructions on the Biogen homepage.
1 (highest priority)
ten-shot-gpt4o-mini-wiki (paper) ur-iw
no
1. Prompting the LLM to generate Wikipedia article titles relevant to the question. 2. Retrieving and summarizing relevant information from the Wikipedia articles with the LLM (a sketch of steps 1-2 follows this entry). 3. Few-shot (10) query expansion and transformation with GPT4o-mini (gpt-4o-mini-2024-07-18) into a boolean query for the Elasticsearch query_string syntax, with the summarized information from Wikipedia as additional context. 4. BM25-based retrieval with default Elasticsearch scoring on title and abstract fields in a PubMed snapshot index. 5. (optional) Query refinement on 0 results (once). 6. Few-shot (10) relevant snippet extraction on top 50 results, with the summarized information from Wikipedia as additional context. 7. Few-shot (10) reranking of snippets and selecting the top 20 snippets, with the summarized information from Wikipedia as additional context.
Proprietary closed-source low-cost commercial model from OpenAI: GPT-4o mini.
Exploring the few-shot (ten-shot) performance of gpt4o-mini for query expansion and transformation. Retrieval with Elasticsearch on title and abstract. Optional query refinement on 0 results. Ten-shot snippet extraction and snippet reranking based on question relevance with gpt4o-mini. Ten-shot answer generation with gpt4o-mini based on retrieved snippets. Additional relevant context retrieved from Wikipedia was supplied in the retrieval step to help guide the LLM. Few-shot examples were taken from the BioASQ dataset for the retrieval tasks. For answer generation, the examples were taken from a preliminary run over the test set, with an LLM judge scoring the answer according to the evaluation instructions on the Biogen homepage.
1 (highest priority)
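Steps 1-2 of the Wikipedia-augmented pipeline above could look roughly like the sketch below. The prompt wording, helper names, and the choice of three titles are assumptions; the public Wikipedia REST summary endpoint and the OpenAI client are the only real APIs used.

```python
# Sketch: generate candidate Wikipedia titles with the LLM, then fetch their summaries
# as additional context for the later retrieval prompts.
import requests
from openai import OpenAI

llm = OpenAI()

def wikipedia_context(question: str, max_titles: int = 3) -> str:
    resp = llm.chat.completions.create(
        model="gpt-4o-mini-2024-07-18",
        messages=[{"role": "user",
                   "content": "List Wikipedia article titles relevant to this "
                              f"biomedical question, one per line:\n{question}"}],
    )
    titles = [t.strip() for t in resp.choices[0].message.content.splitlines() if t.strip()]
    summaries = []
    for title in titles[:max_titles]:
        r = requests.get(
            "https://en.wikipedia.org/api/rest_v1/page/summary/" + title.replace(" ", "_"),
            timeout=10,
        )
        if r.ok:
            summaries.append(r.json().get("extract", ""))
    return "\n\n".join(s for s in summaries if s)
```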
ten-shot-gemini-flash-wiki (paper) ur-iw
no
1. Prompting the LLM to generate Wikipedia article titles relevant to the question. 2. Retrieving and summarizing relevant information from the Wikipedia articles with the LLM. 3. Few-shot (10) query expansion and transformation with gemini-1.5-flash-001 into a boolean query for the Elasticsearch query_string syntax, with the summarized information from Wikipedia as additional context. 4. BM25-based retrieval with default Elasticsearch scoring on title and abstract fields in a PubMed snapshot index. 5. (optional) Query refinement on 0 results (once). 6. Few-shot (10) relevant snippet extraction on top 50 results, with the summarized information from Wikipedia as additional context. 7. Few-shot (10) reranking of snippets and selecting the top 20 snippets, with the summarized information from Wikipedia as additional context.
Proprietary closed-source low-cost commercial model from Google: gemini-1.5-flash-001.
Exploring the few-shot (ten-shot) performance of gemini-1.5-flash-001 for query expansion and transformation. Retrieval with Elasticsearch on title and abstract. Optional query refinement on 0 results. Ten-shot snippet extraction and snippet reranking based on question relevance with gemini-1.5-flash-001. Ten-shot answer generation with gemini-1.5-flash-001 based on retrieved snippets. Additional relevant context retrieved from Wikipedia was supplied in the retrieval step to help guide the LLM. Few-shot examples were taken from the BioASQ dataset for the retrieval tasks. For answer generation, the examples were taken from a preliminary run over the test set, with an LLM judge scoring the answer according to the evaluation instructions on the Biogen homepage.
1 (highest priority)
iiresearch_trec_bio2024_t5base_run (paper) ii_research
no
Phase 1: Relevance-based DocID initialization (M0). This phase treats the T5 model as a dense encoder, starting from T5-base and training it with a two-stage strategy whose first stage uses BM25 negatives. The result is model M0, whose relevance-optimized document representations are quantized with Residual Quantization (RQ) to construct DocIDs. Phase 2: Seq2seq pretraining + initial fine-tuning (M1, M2). After the relevance-based DocIDs are generated, seq2seq pretraining on pseudo queries associated with each document optimizes the model with cross-entropy loss, yielding M1; a rank-oriented fine-tuning step with a pairwise margin loss then yields M2, ensuring accurate ranking across the different decoding steps of the generated DocIDs. Phase 3: Document ranking and retrieval (M3). The trained model M2 is used to rank and retrieve the most relevant documents for a given query: the test query is provided as input and the model directly generates the DocIDs of the top 10 most relevant documents, leveraging the seq2seq architecture and the fine-tuning from the previous phases (a sketch of this decoding step follows this entry).
The LLM used to generate the answers in this run is GPT-4o, a specialized variant of GPT-4 optimized for handling large-scale generative tasks. This model is utilized to generate answers based on the retrieved document IDs and the associated content. Here's a detailed description of how GPT-4o is applied: Input to the LLM: The inputs to GPT-4o include the query, the IDs of the top 10 most relevant documents retrieved from the previous stages, and the summaries or relevant portions of those documents. This ensures that the model has access to the most pertinent information to generate accurate and contextually relevant answers. Answer Generation: GPT-4o processes the input data by leveraging its extensive pretraining on diverse text corpora, allowing it to understand complex queries and synthesize information from multiple documents. The model combines the content from the retrieved documents to generate a coherent, accurate answer that directly addresses the query.
Retrieval Phase: Due to the lack of a manually annotated dataset, I initially used the BM25 model to perform a basic ranking on a small portion of the dataset. This preliminary ranking provided supervision, allowing the model to learn the basic relationships between documents and queries. Afterward, the T5 model was employed for relevance-driven DocID initialization: the document ID, abstract content, and query were provided as input to the T5 model, enabling it to generate DocIDs that accurately reflect the relevance between the document and the query. The purpose of this process is to optimize document representation and improve the model's performance in large-scale retrieval tasks. Ultimately, the top 10 most relevant documents were retrieved and used as input for the GPT model to generate the final answer. Generation Phase: After retrieving the top 10 relevant documents, the system uses the GPT-4o model to generate the answer. GPT-4o processes the query, the document IDs, and the abstracts of the candidate documents to generate an answer that meets the query's requirements.
1 (highest priority)
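A sketch of the Phase 3 decoding step of the run above: the fine-tuned T5 generates DocID strings for a query via beam search. The checkpoint path is a placeholder, and constrained decoding to the set of valid DocIDs (as well as the RQ-based DocID construction) is omitted; only the Hugging Face transformers API is real.

```python
# Sketch: generative retrieval inference — decode top-k DocIDs with beam search.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("path/to/finetuned-M2")  # placeholder checkpoint

def retrieve_docids(query: str, k: int = 10):
    """Beam-search decode k DocID strings for the query (unconstrained sketch)."""
    inputs = tokenizer(query, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            num_beams=k,
            num_return_sequences=k,   # one DocID per returned beam
            max_new_tokens=32,
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```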
listgalore_gpt-4o_ragnarokv5biogen_top20 h2oloo
no
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RRF(RL3.1-70B, RG4o)
Top 20 Generation using GPT-4o prompted with Ragnarok V5 Biogen Prompt.
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RRF(RL3.1-70B, RG4o) -> Top 20 Generation using GPT-4o prompted with Ragnarok V5 Biogen Prompt. (A reciprocal-rank-fusion sketch follows this entry.)
1 (highest priority)
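The RRF stages in the pipeline above combine ranked lists by reciprocal rank fusion. A minimal sketch follows; k=60 is the conventional constant, not a value reported for this run.

```python
# Sketch: reciprocal rank fusion (RRF) of several ranked lists of document IDs.
from collections import defaultdict

def rrf(rankings, k: int = 60, top_n: int = 100):
    """rankings: list of ranked lists of doc IDs (best first). Returns the fused top_n."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```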
listgalore_l31-70b_ragnarokv5biogen_top20 h2oloo
no
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RRF(RL3.1-70B, RG4o)
Top 20 Generation using L3.1-70B prompted with Ragnarok V5 Biogen Prompt.
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RRF(RL3.1-70B, RG4o) -> Top 20 Generation using L3.1-70B prompted with Ragnarok V5 Biogen Prompt. (A sliding-window reranking sketch follows this entry.)
2
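The listwise reranker in the pipeline above is applied with a sliding window over a long candidate list. Below is a generic sketch of that pattern; `rerank_fn` stands in for the LiT5-style model call and is an assumption, as are the default window and stride values.

```python
# Sketch: sliding-window listwise reranking, processing overlapping windows bottom-up
# so that strong documents can bubble toward the top of the list.
def sliding_window_rerank(query, doc_ids, rerank_fn, window=100, stride=50):
    """rerank_fn(query, docs) -> the same docs, reordered by relevance."""
    docs = list(doc_ids)
    start = max(len(docs) - window, 0)
    while True:
        docs[start:start + window] = rerank_fn(query, docs[start:start + window])
        if start == 0:
            break
        start = max(start - stride, 0)
    return docs

# Usage with a no-op reranker, just to show the call shape:
# reranked = sliding_window_rerank("query text", candidates, rerank_fn=lambda q, d: d)
```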
rl31-70b_l31-70b_ragnarokv5biogen_top20 h2oloo
no
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RL3.1-70B
Top 20 Generation using L3.1-70B prompted with Ragnarok V5 Biogen Prompt.
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RL3.1-70B -> Top 20 Generation using L3.1-70B prompted with Ragnarok V5 Biogen Prompt.
3
rl31-70b_gpt-4o_ragnarokv5biogen_top20 h2oloo
no
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RL3.1-70B
Top 20 Generation using GPT-4o prompted with Ragnarok V5 Biogen Prompt.
BM25+Rocchio (Top 1K) -> RRF(MonoT5 (10/5 Segments to pick rep segment), LiT5v2-XL (100/50 sliding window)) (Top 100) -> RL3.1-70B -> Top 20 Generation using GPT-4o prompted with Ragnarok V5 Biogen Prompt.
4
rl31-70b_gpt-4o_ragnarokv5biogennc_top20 h2oloo
yes
None
GPT-4o Ragnarok V5 Biogen No Cite
GPT-4o Ragnarok V5 Biogen No Cite
5 (lowest priority)
webis-1 (paper) webis
no
Retrieve up to 10 PubMed articles from Elasticsearch using BM25. As the query, use the concatenated question, query, narrative, and simple yes-no, factual, or list answer (if the question type is known and if retrieval is run after generation). Stopwords are not removed from the query. Match the query on the article's abstract text (Elasticsearch BM25). Exclude PubMed articles of non-peer-reviewed publication types. After retrieval, split the retrieved article's abstract text into sentences and form passages from all sentence n-grams of up to 3 sentences (a sketch of this splitting follows this entry). Re-rank up to 50 passages pointwise with a TCT-ColBERT model (castorini/tct_colbert-v2-hnp-msmarco).
Generate a summary answer for each question with DSPy using a Mistral model (Mistral-7B-Instruct-v0.3, via Blablador API). Give the question and the top-10 passages to the model as context (numbered by rank), and prompt the model to return a summary answer with references (by rank) given in the text. Using DSPy, optimize the prompt by labeled few-shot prompting with 3 examples from the BioASQ 12b train set. After generation, convert internal reference numbering back to PubMed IDs.
Use retrieval and generation as above, while augmenting both retrieval and generation independently.
Using the retrieval and generation modules as described above, independently augment the generation module with retrieval and the retrieval module with generation. For generation-augmented retrieval, augment 3 times while not feeding back retrieval results to the generation module. For retrieval-augmented generation, augment 3 times while feeding back generation results to the retrieval module. Retrieval is implemented using PyTerrier. Generation is implemented using DSPy.
1 (highest priority)
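The passage-splitting step of the run above (all n-grams of 1 to 3 consecutive sentences from an abstract) can be sketched as follows. The run does not name a sentence splitter, so NLTK's tokenizer is used here purely for illustration.

```python
# Sketch: build passages from every run of 1-3 consecutive sentences in an abstract.
import nltk

nltk.download("punkt", quiet=True)  # sentence tokenizer data (assumed choice of splitter)

def sentence_ngram_passages(abstract: str, max_n: int = 3):
    sentences = nltk.sent_tokenize(abstract)
    passages = []
    for n in range(1, max_n + 1):
        for i in range(len(sentences) - n + 1):
            passages.append(" ".join(sentences[i:i + n]))
    return passages
```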
webis-2 (paper) webis
no
Retrieve up to 10 PubMed articles from Elasticsearch using BM25. As the query, use the concatenated question, narrative, and simple yes-no, factual, or list answer (if question type is known and if retrieval is run after generation). Stopwords are not removed from the query. Match the query on the article's abstract text (Elasticsearch BM25). Exclude PubMed articles of non-peer-reviewed publication types. After retrieval, split passages from the retrieved article's abstract text by splitting it into sentences and returning all sentence n-grams up to 3 sentences. Re-rank up to 10 passages pointwise with a TCT-ColBERT model (castorini/tct_colbert-v2-hnp-msmarco). Re-rank up to 3 passages pairwise with a duoT5 model (castorini/duot5-base-msmarco).
Generate a summary answer for each question with DSPy using a Mistral model (Mistral-7B-Instruct-v0.3, via the Blablador API). Give the question and the top-3 passages to the model as context (numbered by rank), and prompt the model to return a summary answer with references (by rank) given in the text. Use chain-of-thought prompting. Using DSPy, optimize the prompt by labeled few-shot prompting with 3 examples from the BioASQ 12b train set. After generation, convert internal reference numbering back to PubMed IDs. (A DSPy sketch of this setup follows this entry.)
Use retrieval and generation as above, while augmenting both retrieval and generation independently.
Using the retrieval and generation modules as described above, independently augment the generation module with retrieval and the retrieval module with generation. For generation-augmented retrieval, augment 3 times while not feeding back retrieval results to the generation module. For retrieval-augmented generation, augment 3 times while feeding back generation results to the retrieval module. Retrieval is implemented using PyTerrier. Generation is implemented using DSPy.
2
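A rough DSPy sketch of the generation module described in the run above. The signature fields, the model string, the endpoint placeholder, and the example construction are assumptions; only the general DSPy API calls (Signature, ChainOfThought, LabeledFewShot) are real, and the labeled BioASQ examples are not reproduced.

```python
# Sketch: chain-of-thought summary generation with a labeled 3-shot prompt in DSPy.
import dspy
from dspy.teleprompt import LabeledFewShot

class SummaryAnswer(dspy.Signature):
    """Answer the biomedical question from the numbered passages, citing by rank."""
    question = dspy.InputField()
    passages = dspy.InputField(desc="passages numbered by rank")
    answer = dspy.OutputField(desc="summary answer with [rank] references")

# Point the LM at an OpenAI-compatible endpoint (URL and key are placeholders).
dspy.configure(lm=dspy.LM("openai/Mistral-7B-Instruct-v0.3",
                          api_base="...", api_key="..."))

program = dspy.ChainOfThought(SummaryAnswer)

bioasq_examples = []  # fill with 3 (question, passages, answer) triples from BioASQ 12b
trainset = [
    dspy.Example(question=q, passages=p, answer=a).with_inputs("question", "passages")
    for q, p, a in bioasq_examples
]
optimized = LabeledFewShot(k=3).compile(program, trainset=trainset)

# prediction = optimized(question="...", passages="[1] ...\n[2] ...\n[3] ...")
```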
webis-3 (paper) webis
no
Retrieve up to 10 PubMed articles from Elasticsearch using BM25. As the query, use the concatenated question, narrative, and simple yes-no, factual, or list answer (if the question type is known and if retrieval is run after generation). Stopwords are not removed from the query. Match the query on the article's title, abstract text, and MeSH terms (Elasticsearch BM25; both title and abstract must match, MeSH terms should match, and MeSH terms are matched to medical entities extracted from the query using sciSpaCy; a sketch of this query follows this entry). Exclude PubMed articles of non-peer-reviewed publication types. After retrieval, split passages from the retrieved article's abstract text by splitting it into sentences and returning all sentence n-grams up to 3 sentences. The full title is also used as a separate passage. Re-rank up to 10 passages pointwise with a monoT5 model (castorini/monot5-base-msmarco).
Generate a summary answer for each question with DSPy using a Mistral model (Mistral-7B-Instruct-v0.3, via Blablador API). Give the question and the top-10 passages to the model as context (numbered by rank), and prompt the model to return a summary answer with references (by rank) given in the text. Using DSPy, optimize the prompt by labeled few-shot prompting with 3 examples from the BioASQ 12b train set. After generation, convert internal reference numbering back to PubMed IDs.
Use retrieval and generation as above, while augmenting both retrieval and generation independently.
Using the retrieval and generation modules as described above, independently augment the generation module with retrieval and the retrieval module with generation. For generation-augmented retrieval, augment 3 times while not feeding back retrieval results to the generation module. For retrieval-augmented generation, augment 3 times while feeding back generation results to the retrieval module. Retrieval is implemented using PyTerrier. Generation is implemented using DSPy.
3
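The entity-aware query of the run above (title and abstract must match, MeSH terms should match entities extracted with sciSpaCy) could be built roughly as sketched below. The index and field names are assumptions; the sciSpaCy model and Elasticsearch bool query syntax are real.

```python
# Sketch: extract biomedical entities with sciSpaCy and build an Elasticsearch bool query
# with required title/abstract matches and optional MeSH-term matches.
import spacy
from elasticsearch import Elasticsearch

nlp = spacy.load("en_core_sci_sm")            # sciSpaCy model (must be installed separately)
es = Elasticsearch("http://localhost:9200")   # assumed local PubMed index

def search(query: str, size: int = 10):
    entities = [ent.text for ent in nlp(query).ents]
    bool_query = {
        "bool": {
            "must": [
                {"match": {"title": query}},
                {"match": {"abstract": query}},
            ],
            "should": [{"match": {"mesh_terms": ent}} for ent in entities],
        }
    }
    return es.search(index="pubmed", query=bool_query, size=size)["hits"]["hits"]
```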
webis-gpt-1 (paper) webis
yes
Not used.
Generate a summary answer for each question with DSPy using a GPT-4o mini model (gpt-4o-mini-2024-07-18). Give the question to the model and prompt the model to return a summary answer. Using DSPy, optimize the prompt by labeled few-shot prompting with 3 examples from the BioASQ 12b train set.
Use generation as above, while not augmenting the generation.
Generation is implemented using DSPy.
4
webis-gpt-4 (paper) webis
no
Retrieve up to 10 PubMed articles from Elasticsearch using BM25. As the query, use the concatenated question, narrative, and simple yes-no, factual, or list answer (if question type is known and if retrieval is run after generation). Stopwords are removed from the query. Match the query on the article's abstract text (Elasticsearch BM25). Exclude PubMed articles with empty abstract text. After retrieval, split passages from the retrieved article's abstract text by splitting it into sentences and returning all sentence n-grams up to 3 sentences. Re-rank up to 50 passages pointwise with a monoT5 model (castorini/monot5-base-msmarco).
Generate a summary answer for each question with DSPy using a GPT-4o mini model (gpt-4o-mini-2024-07-18). Give the question and the top-3 passages to the model as context (numbered by rank), and prompt the model to return a summary answer with references (by rank) given in the text. Use the unoptimized prompt from DSPy. After generation, convert internal reference numbering back to PubMed IDs.
Use retrieval and generation as above, while cross-augmenting both retrieval and generation.
Using the retrieval and generation modules as described above, cross-augment the generation module with retrieval and the retrieval module with generation (i.e., in each step, the previous retrieval result is used to augment the next generation step and vice versa). Do 2 cross-augmentation steps (a sketch of this loop follows this entry). Retrieval is implemented using PyTerrier. Generation is implemented using DSPy.
3
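A minimal sketch of the 2-step cross-augmentation loop described in the run above. The `retrieve` and `generate` callables stand in for the PyTerrier retrieval and DSPy generation modules and are assumptions about their interfaces.

```python
# Sketch: cross-augmentation — each step's retrieval feeds the next generation and vice versa.
def cross_augment(question, retrieve, generate, steps: int = 2):
    """retrieve(question, answer) -> passages; generate(question, passages) -> answer."""
    answer, passages = None, []
    for _ in range(steps):
        passages = retrieve(question, answer)   # previous answer (if any) augments retrieval
        answer = generate(question, passages)   # fresh passages augment generation
    return answer, passages
```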
norarr.llm_only_2.llama3-8b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3.1 8B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list (a sketch of this attribution step follows this entry).
5 (lowest priority)
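The per-sentence attribution step of the run above could be sketched as follows. The prompt wording and the `ask_llm` helper are assumptions, not the run's implementation; any LLM client could back the helper.

```python
# Sketch: for each answer sentence, ask the LLM which of the top-20 articles support it.
def attribute(answer_sentences, top20_docs, ask_llm):
    """ask_llm(prompt) -> str. top20_docs: dicts with 'id' and 'abstract'. Returns {sentence: [doc ids]}."""
    attributions = {}
    for sentence in answer_sentences:
        docs_block = "\n".join(f"[{d['id']}] {d['abstract']}" for d in top20_docs)
        prompt = (
            "Which of the following articles support this sentence? "
            "Answer with the article ids, comma separated, or 'none'.\n\n"
            f"Sentence: {sentence}\n\nArticles:\n{docs_block}"
        )
        reply = ask_llm(prompt).strip()
        attributions[sentence] = [] if reply.lower() == "none" else [x.strip() for x in reply.split(",")]
    return attributions
```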
norarr.llm_only_2.llama3-70b_fixed ielab
yes
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list.
1 (highest priority)
webis-gpt-6 (paper) webis
yes
Retrieve up to 10 PubMed articles from Elasticsearch using BM25. As the query, use the concatenated question and simple yes-no, factual, or list answer (if the question type is known and if retrieval is run after generation). Stopwords are not removed from the query. Match the query on the article's abstract text (Elasticsearch BM25). Exclude PubMed articles with an empty title. Exclude PubMed articles of non-peer-reviewed publication types. After retrieval, use only the title as a passage for the article. Re-rank up to 10 passages pointwise with a monoT5 model (castorini/monot5-base-msmarco). Re-rank up to 3 passages pairwise with a duoT5 model (castorini/duot5-base-msmarco).
Generate a summary answer for each question with DSPy using a GPT-4o mini model (gpt-4o-mini-2024-07-18). Give the question and the top-3 passages to the model as context (numbered by rank), and prompt the model to return a summary answer with references (by rank) given in the text. Using DSPy, optimize the prompt by labeled few-shot prompting with 1 example from the BioASQ 12b train set. After generation, convert internal reference numbering back to PubMed IDs.
Use retrieval and generation as above, while augmenting both retrieval and generation independently.
Using the retrieval and generation modules as described above, independently augment the generation module with retrieval and the retrieval module with generation. For generation-augmented retrieval, augment 2 times while feeding back retrieval results to the generation module. For retrieval-augmented generation, augment 1 time while not feeding back generation results to the retrieval module. Retrieval is implemented using PyTerrier. Generation is implemented using DSPy.
3
webis-5 (paper) webis
no
Retrieve up to 10 PubMed articles from Elasticsearch using BM25. As the query, use the concatenated question, narrative, and simple yes-no, factual, or list answer (if question type is known and if retrieval is run after generation). Stopwords are not removed from the query. Match the query on the article's title and abstract text (Elasticsearch BM25, title should match and abstract must match). Exclude PubMed articles of non-peer-reviewed publication types. Exclude PubMed articles with empty title. After retrieval, split passages from the retrieved article's abstract text by splitting it into sentences and returning all sentence n-grams up to 3 sentences. The full title is also used as a separate passage. Re-rank up to 10 passages pointwise with a monoT5 model (castorini/monot5-base-msmarco).
Generate a summary answer for each question with DSPy using a Mistral model (Mistral-7B-Instruct-v0.3, via Blablador API). Give the question and the top-10 passages to the model as context (numbered by rank), and prompt the model to return a summary answer with references (by rank) given in the text. Using DSPy, optimize the prompt by labeled few-shot prompting with 3 examples from the BioASQ 12b train set. After generation, convert internal reference numbering back to PubMed IDs.
Use retrieval and generation as above, while augmenting both retrieval and generation independently.
Using the retrieval and generation modules as described above, independently augment the generation module with retrieval and the retrieval module with generation. For generation-augmented retrieval, augment 3 times while not feeding back retrieval results to the generation module. For retrieval-augmented generation, augment 3 times while feeding back generation results to the retrieval module. Retrieval is implemented using PyTerrier. Generation is implemented using DSPy.
4
norarr.llm_only_2.boolena.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: generate a boolean query from the topic and retrieve an initial list using the boolean query, then rerank with Stella to get the top-20 documents. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list.
2
rarr_attronly.llm_only_2.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR, using only the first step of the RARR pipeline: check whether the retrieved sentence agrees with the generated answer given the topic; if it does not agree, remove the reference from the list of references (a sketch of this check follows this entry).
1 (highest priority)
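A minimal sketch of the RARR-style agreement check used in the run above, which drops references whose retrieved sentence does not agree with the corresponding answer sentence given the topic. The prompt wording and the `ask_llm` helper are assumptions, not the run's code.

```python
# Sketch: prune references whose cited snippet the LLM judges not to agree with the answer sentence.
def prune_references(topic, attributions, snippets, ask_llm):
    """attributions: {answer sentence: [doc ids]}; snippets: {doc id: retrieved sentence}."""
    pruned = {}
    for sentence, doc_ids in attributions.items():
        kept = []
        for doc_id in doc_ids:
            prompt = (
                f"Topic: {topic}\nAnswer sentence: {sentence}\n"
                f"Retrieved sentence: {snippets[doc_id]}\n"
                "Does the retrieved sentence agree with the answer sentence? Reply yes or no."
            )
            if ask_llm(prompt).strip().lower().startswith("yes"):
                kept.append(doc_id)
        pruned[sentence] = kept
    return pruned
```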
rarr_attrfix.llm_only_2.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR: check whether the retrieved sentence agrees with the generated answer given the topic; if it does not agree, fix it.
2
rarr_qgen.llm_only_2.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR: generate queries for each sentence, then check whether the retrieved sentence agrees with the generated answer given the generated question; if it does not agree, remove the reference from the list of references.
3
rarr_qgenfix.llm_only_2.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR: generate several questions for each sentence in the answer, then check whether the retrieved sentence agrees with the generated answer given the generated question; if the answer can be fixed, fix it.
1 (highest priority)
rarr_attrfix_custprompt.llm_only_3_v2.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR, using only the first step of the RARR pipeline: check whether the retrieved sentence agrees with the generated answer given the topic; if it does not agree, remove the reference from the list of references. A customized prompt was used.
4
rarr_qgenfix_custprompt.llm_only_3_v2.llama3-70b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 70B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR: generate several questions for each sentence in the answer, then check whether the retrieved sentence agrees with the generated answer given the generated question; if the answer can be fixed, fix it. A custom prompt was used.
4
rarr_qgenfix.llm_only_2.llama3-8b_fixed ielab
no
Stella + BM25 fused ranking; the ranker types and the fusion hyperparameters were selected based on performance on the TREC-COVID dataset.
Llama 3 8B Instruct
First step: retrieve the top-20 PubMed articles with the described rankers for each topic. Second step: use the LLM to generate the answer to the original topic based on the retrieved documents. Third step: attribution; for each sentence of the generated answer, use the LLM to attribute it to articles in the top-20 ranked list. Fourth step: fix the attribution with RARR: generate several questions for each sentence in the answer, then check whether the retrieved sentence agrees with the generated answer given the generated question; if the answer can be fixed, fix it.
5 (lowest priority)