The Thirty-Third Text REtrieval Conference
(TREC 2024)

Plain Language Adaptation of Biomedical Abstracts: Term Replacement Task Appendix

Each run entry below reports the following fields:
Runtag
Org
Subtasks: Which subtasks did the run complete?
Manual intervention: What, if any, manual intervention was done to produce the run in response to the test data?
Base models: Which base models did this run use, if any -- e.g., BERT, GPT-4, Llama 13B, etc.?
Other data: Did the run use other data besides the training set? If so, please describe.
Description: Briefly describe salient features of this run, including what distinguishes it from your other runs.
Priority: Please give this run a priority for inclusion in manual assessments.
Runtag: bad
Org: plaba
Subtasks: Task 1A only (term identification)
Manual intervention: test
Base models: test
Other data: test
Description: test
Priority: 1 (top)
Runtag: good
Org: plaba
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: test
Base models: test
Other data: test
Description: test
Priority: 3 (bottom)
Runtag: gpt
Org: CLAC
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: No
Base models: gpt-3.5-turbo-0125
Other data: Only the training set
Description: gpt-3.5-turbo-0125, 7-shot run.
Priority: 2
Runtag: mistral
Org: CLAC
Subtasks: Task 1A only (term identification)
Manual intervention: No
Base models: mistral-large-latest
Other data: No
Description: mistral-large-latest, 7-shot, with a temperature of 0.4.
Priority: 1 (top)
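The two CLAC runs above describe a 7-shot prompting setup (with temperature 0.4 for the Mistral run). The sketch below shows how seven demonstrations might be packed into a chat prompt, using the OpenAI Python client for the gpt-3.5-turbo-0125 run; the prompt wording, the DEMONSTRATIONS list, and the identify_terms helper are illustrative assumptions rather than the team's actual code, and the Mistral run would presumably use Mistral's own SDK in the same way.

```python
# Hypothetical sketch of a 7-shot prompt for term identification (Task 1A).
# The demonstrations and prompt wording are illustrative, not the CLAC team's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Seven (sentence, expert terms) demonstration pairs, e.g. drawn from the PLABA training set.
DEMONSTRATIONS = [
    ("Patients exhibited dyspnea on exertion.", ["dyspnea", "exertion"]),
    # ... six more pairs ...
]

def identify_terms(sentence: str, temperature: float = 0.0) -> str:
    """Ask the model to list expert terms, preceded by the worked demonstrations."""
    messages = [{"role": "system",
                 "content": "Identify terms in biomedical sentences that a lay reader may not understand."}]
    for demo_sentence, demo_terms in DEMONSTRATIONS:
        messages.append({"role": "user", "content": demo_sentence})
        messages.append({"role": "assistant", "content": ", ".join(demo_terms)})
    messages.append({"role": "user", "content": sentence})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages,
        temperature=temperature,  # the Mistral run reports 0.4 here
    )
    return response.choices[0].message.content
```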
Runtag: MLPClassifier-identify-classify-replace-v1
Org: BU
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Flattened the nested JSON structure to make it easier to work with, and automated that step.
Base models: For the base model, I used a simple multi-layer perceptron (MLP) neural network for classification.
Other data: NA
Description: This run explores different classifiers (XGBoost, LightGBM) and uses an MLP, which can capture non-linear patterns. Overall model accuracy is about 65%. A closer look at per-action performance gives the following per-class scores:

  Class            Precision  Recall  F1-score
  SUBSTITUTE (0)   0.71       0.83    0.76
  EXPLAIN (1)      0.58       0.47    0.52
  GENERALIZE (2)   0.35       0.18    0.24
  EXEMPLIFY (3)    0.53       0.67    0.59
  OMIT (4)         0.26       0.12    0.17

Another notable feature of this run is the inclusion of logic to handle cases where the top two predicted actions have very close probabilities (within 0.05 of each other). It also handles cases where no matching description is found for a term-action pair.
Priority: 1 (top)
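A minimal sketch of the tie-handling logic this run describes, assuming scikit-learn's MLPClassifier over precomputed term features; the placeholder features, network size, and the predict_actions helper are illustrative assumptions rather than the BU team's actual code.

```python
# Illustrative sketch: MLP action classifier with handling for near-tied top-2 probabilities.
# The random placeholder features stand in for whatever term features the run actually used.
import numpy as np
from sklearn.neural_network import MLPClassifier

ACTIONS = ["SUBSTITUTE", "EXPLAIN", "GENERALIZE", "EXEMPLIFY", "OMIT"]  # class indices 0-4

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32))                    # placeholder term features
y_train = rng.integers(0, len(ACTIONS), size=200)       # placeholder action labels

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

def predict_actions(x, tie_margin=0.05):
    """Return the top action, or the top two when their probabilities are within tie_margin."""
    probs = clf.predict_proba([x])[0]
    top2 = np.argsort(probs)[::-1][:2]
    if probs[top2[0]] - probs[top2[1]] <= tie_margin:
        return [ACTIONS[i] for i in top2]               # near-tie: keep both candidate actions
    return [ACTIONS[top2[0]]]

print(predict_actions(X_train[0]))
```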
Runtag: gemini-1.5-pro_demon5_replace-demon5 (paper)
Org: ntu_nlp
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Format editing: I edited the format from plain text to JSON when it did not meet the required submission format. However, no changes were made to the content itself.
Base models: gemini-1.5-pro
Other data: No
Description: Uses gemini-1.5-pro as the base model. Step 1: entity extraction with 5 demonstrations. Step 2: entity replacement with 5 demonstrations.
Priority: 1 (top)
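The three ntu_nlp runs (this one and the two that follow) describe the same two-step pipeline: entity extraction with 5 demonstrations, then entity replacement with 5 demonstrations. A rough sketch of that flow with the google-generativeai client is given below; the prompt templates, the DEMO_EXTRACT and DEMO_REPLACE placeholders, and the helper names are assumptions, not the team's actual prompts.

```python
# Rough sketch of the two-step extract-then-replace pipeline (gemini-1.5-pro variant).
# DEMO_EXTRACT and DEMO_REPLACE stand in for the 5 worked demonstrations of each step.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

DEMO_EXTRACT = "..."   # 5 abstract -> expert-term-list demonstrations (placeholder)
DEMO_REPLACE = "..."   # 5 (abstract, term) -> plain-language replacement demonstrations (placeholder)

def extract_entities(abstract: str) -> list[str]:
    prompt = (f"{DEMO_EXTRACT}\n\n"
              f"List the expert terms in the following abstract, one per line.\n\n{abstract}")
    return model.generate_content(prompt).text.splitlines()

def replace_entity(abstract: str, term: str) -> str:
    prompt = (f"{DEMO_REPLACE}\n\n"
              f"Rewrite the term '{term}' from the abstract below in plain language.\n\nAbstract: {abstract}")
    return model.generate_content(prompt).text

# Step 1 then Step 2, per abstract:
# terms = extract_entities(abstract)
# replacements = {t: replace_entity(abstract, t) for t in terms}
```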
Runtag: gemini-1.5-flash_demon5_replace-demon5 (paper)
Org: ntu_nlp
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Format editing: I edited the format from plain text to JSON when it did not meet the required submission format. However, no changes were made to the content itself.
Base models: gemini-1.5-flash
Other data: No
Description: Uses gemini-1.5-flash as the base model. Step 1: entity extraction with 5 demonstrations. Step 2: entity replacement with 5 demonstrations.
Priority: 2
Runtag: gpt-4o-mini_demon5_replace-demon5 (paper)
Org: ntu_nlp
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Format editing: I edited the format from plain text to JSON when it did not meet the required submission format. However, no changes were made to the content itself.
Base models: gpt-4o-mini
Other data: No
Description: Uses gpt-4o-mini as the base model. Step 1: entity extraction with 5 demonstrations. Step 2: entity replacement with 5 demonstrations.
Priority: 3 (bottom)
Runtag: First
Org: IIITH
Subtasks: Tasks 1A and 1B (term identification and classification)
Manual intervention: No manual intervention was done to obtain this data except ensuring that the data was in the specified format.
Base models: For Task 1A, BioBERT was used for named entity recognition. For Task 1B, BioBERT was again used to obtain term embeddings, and a random forest classifier was used to classify the embeddings into the appropriate simplification actions.
Other data: Besides the given PLABA dataset, a pre-processed version of the BC5CDR dataset (BioCreative V CDR task corpus: a resource for relation extraction; Li et al., 2016) was also used. The two datasets were used in combination to train the BioBERT models.
Description: This run is computationally cheap, as it does not require the use of any LLMs. The given data was also complemented with other publicly available datasets.
Priority: 1 (top)
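A hedged sketch of the Task 1B half of this pipeline (BioBERT term embeddings fed to a random forest) follows; the dmis-lab/biobert-base-cased-v1.1 checkpoint, the mean pooling, the toy labels, and the hyperparameters are assumptions for illustration rather than the IIITH team's exact setup.

```python
# Illustrative sketch: embed terms with BioBERT, then classify them with a random forest.
# Checkpoint, pooling strategy, and the toy labels are assumptions for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def embed(term: str) -> list[float]:
    """Mean-pool BioBERT's last hidden state over the term's tokens."""
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).tolist()

# Toy examples; in the real run the term-action pairs come from the PLABA annotations.
terms = ["dyspnea", "shortness of breath while walking", "hypertension", "a heart attack"]
labels = ["SUBSTITUTE", "EXPLAIN", "SUBSTITUTE", "EXPLAIN"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit([embed(t) for t in terms], labels)
print(clf.predict([embed("myocardial infarction")]))
```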
Runtag: Roberta-base (paper)
Org: UM
Subtasks: Tasks 1A and 1B (term identification and classification)
Base models: RoBERTa-base
Other data: Nothing
Description: Multi-label token classification with roberta-base.
Priority: 1 (top)
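Hugging Face's stock token-classification head is single-label, so multi-label token classification as described in this run typically means adding a sigmoid/BCE head on top of the encoder. The sketch below shows one way that might look with roberta-base; the label count, the decision threshold, and the class name are assumptions, not the UM team's configuration.

```python
# Minimal sketch of multi-label token classification on top of roberta-base:
# one sigmoid output per (token, label) pair, trained with binary cross-entropy.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 4  # assumed size of the per-token label inventory

class MultiLabelTokenTagger(nn.Module):
    def __init__(self, name="roberta-base", num_labels=NUM_LABELS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(hidden)                       # (batch, seq_len, num_labels)
        loss = None
        if labels is not None:                                 # labels: multi-hot, same shape as logits
            loss = nn.BCEWithLogitsLoss()(logits, labels.float())
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = MultiLabelTokenTagger()
batch = tokenizer(["Patients exhibited dyspnea on exertion."], return_tensors="pt")
_, logits = model(**batch)
predictions = logits.sigmoid() > 0.5    # each token may carry several labels at once
```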
Runtag: roberta-gbc
Org: Yseop
Subtasks: Tasks 1A and 1B (term identification and classification)
Base models: pabRomero/BioMedRoBERTa-full-finetuned-ner-pablo and GradientBoostingClassifier
Other data: No
Description: Cleaned the training corpus and performed hyperparameter tuning of the two models.
Priority: 1 (top)
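Hyperparameter tuning of the GradientBoostingClassifier component could look roughly like the scikit-learn sketch below; the parameter grid, the placeholder features, and the scoring choice are illustrative assumptions rather than the Yseop team's actual search.

```python
# Illustrative hyperparameter search for the GradientBoostingClassifier component.
# The placeholder features and parameter grid are assumptions, not the actual search space.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))          # stand-in for term-level features
y = rng.integers(0, 4, size=300)        # stand-in for classification labels

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```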