The Thirty-Third Text REtrieval Conference
(TREC 2024)

Plain Language Adaptation of Biomedical Abstracts: Term Replacement Task Appendix

Each run entry below reports the following fields:
Runtag
Org
Subtasks: Which subtasks did the run complete?
Manual intervention: What, if any, manual intervention was done to produce the run in response to the test data?
Base models: Which base models did this run use, if any -- e.g., BERT, GPT-4, Llama 13B, etc.?
Other data: Did the run use other data besides the training set? If so, please describe.
Description: Briefly describe salient features of this run, including what distinguishes it from your other runs.
Priority: Please give this run a priority for inclusion in manual assessments.
Runtag: bad
Org: plaba
Subtasks: Task 1A only (term identification)
Manual intervention: test
Base models: test
Other data: test
Description: test
Priority: 1 (top)
Runtag: good
Org: plaba
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: test
Base models: test
Other data: test
Description: test
Priority: 3 (bottom)
Runtag: gpt
Org: CLAC
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: No
Base models: gpt-3.5-turbo-0125
Other data: Only the training set
Description: gpt-3.5-turbo-0125, 7-shot run.
Priority: 2
Runtag: mistral
Org: CLAC
Subtasks: Task 1A only (term identification)
Manual intervention: No
Base models: mistral-large-latest
Other data: No
Description: mistral-large-latest, 7-shot, with a temperature of 0.4.
Priority: 1 (top)
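The two CLAC runs above describe a 7-shot prompting setup (with temperature 0.4 for the Mistral run). The sketch below shows how seven demonstrations might be packed into a chat prompt, using the OpenAI Python client for the gpt-3.5-turbo-0125 run; the prompt wording, the DEMONSTRATIONS list, and the identify_terms helper are illustrative assumptions rather than the team's actual code, and the Mistral run would presumably use Mistral's own SDK in the same way.

```python
# Hypothetical sketch of a 7-shot prompt for term identification (Task 1A).
# The demonstrations and prompt wording are illustrative, not the CLAC team's actual prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Seven (sentence, expert terms) demonstration pairs, e.g. drawn from the PLABA training set.
DEMONSTRATIONS = [
    ("Patients exhibited dyspnea on exertion.", ["dyspnea", "exertion"]),
    # ... six more pairs ...
]

def identify_terms(sentence: str, temperature: float = 0.0) -> str:
    """Ask the model to list expert terms, preceded by the worked demonstrations."""
    messages = [{"role": "system",
                 "content": "Identify terms in biomedical sentences that a lay reader may not understand."}]
    for demo_sentence, demo_terms in DEMONSTRATIONS:
        messages.append({"role": "user", "content": demo_sentence})
        messages.append({"role": "assistant", "content": ", ".join(demo_terms)})
    messages.append({"role": "user", "content": sentence})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=messages,
        temperature=temperature,  # the Mistral run reports 0.4 here
    )
    return response.choices[0].message.content
```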
Runtag: MLPClassifier-identify-classify-replace-v1
Org: BU
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Flattened the nested JSON structure to make it easier to work with, and automated that step.
Base models: For the base model, I used a simple multi-layer perceptron (MLP) neural network for classification.
Other data: NA
Description: This run explores different classifiers (XGBoost, LightGBM) and uses an MLP, which can capture non-linear patterns. Overall model accuracy is about 65%. A closer look at per-action performance gives the following per-class scores:

  Class            Precision  Recall  F1-score
  SUBSTITUTE (0)   0.71       0.83    0.76
  EXPLAIN (1)      0.58       0.47    0.52
  GENERALIZE (2)   0.35       0.18    0.24
  EXEMPLIFY (3)    0.53       0.67    0.59
  OMIT (4)         0.26       0.12    0.17

Another notable feature of this run is the inclusion of logic to handle cases where the top two predicted actions have very close probabilities (within 0.05 of each other). It also handles cases where no matching description is found for a term-action pair.
Priority: 1 (top)
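A minimal sketch of the tie-handling logic this run describes, assuming scikit-learn's MLPClassifier over precomputed term features; the placeholder features, network size, and the predict_actions helper are illustrative assumptions rather than the BU team's actual code.

```python
# Illustrative sketch: MLP action classifier with handling for near-tied top-2 probabilities.
# The random placeholder features stand in for whatever term features the run actually used.
import numpy as np
from sklearn.neural_network import MLPClassifier

ACTIONS = ["SUBSTITUTE", "EXPLAIN", "GENERALIZE", "EXEMPLIFY", "OMIT"]  # class indices 0-4

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 32))                    # placeholder term features
y_train = rng.integers(0, len(ACTIONS), size=200)       # placeholder action labels

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(X_train, y_train)

def predict_actions(x, tie_margin=0.05):
    """Return the top action, or the top two when their probabilities are within tie_margin."""
    probs = clf.predict_proba([x])[0]
    top2 = np.argsort(probs)[::-1][:2]
    if probs[top2[0]] - probs[top2[1]] <= tie_margin:
        return [ACTIONS[i] for i in top2]               # near-tie: keep both candidate actions
    return [ACTIONS[top2[0]]]

print(predict_actions(X_train[0]))
```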
Runtag: gemini-1.5-pro_demon5_replace-demon5 (paper)
Org: ntu_nlp
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Format editing: I edited the format from plain text to JSON when it did not meet the required submission format. However, no changes were made to the content itself.
Base models: gemini-1.5-pro
Other data: No
Description: Uses gemini-1.5-pro as the base model. Step 1: entity extraction with 5 demonstrations. Step 2: entity replacement with 5 demonstrations.
Priority: 1 (top)
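The three ntu_nlp runs (this one and the two that follow) describe the same two-step pipeline: entity extraction with 5 demonstrations, then entity replacement with 5 demonstrations. A rough sketch of that flow with the google-generativeai client is given below; the prompt templates, the DEMO_EXTRACT and DEMO_REPLACE placeholders, and the helper names are assumptions, not the team's actual prompts.

```python
# Rough sketch of the two-step extract-then-replace pipeline (gemini-1.5-pro variant).
# DEMO_EXTRACT and DEMO_REPLACE stand in for the 5 worked demonstrations of each step.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

DEMO_EXTRACT = "..."   # 5 abstract -> expert-term-list demonstrations (placeholder)
DEMO_REPLACE = "..."   # 5 (abstract, term) -> plain-language replacement demonstrations (placeholder)

def extract_entities(abstract: str) -> list[str]:
    prompt = (f"{DEMO_EXTRACT}\n\n"
              f"List the expert terms in the following abstract, one per line.\n\n{abstract}")
    return model.generate_content(prompt).text.splitlines()

def replace_entity(abstract: str, term: str) -> str:
    prompt = (f"{DEMO_REPLACE}\n\n"
              f"Rewrite the term '{term}' from the abstract below in plain language.\n\nAbstract: {abstract}")
    return model.generate_content(prompt).text

# Step 1 then Step 2, per abstract:
# terms = extract_entities(abstract)
# replacements = {t: replace_entity(abstract, t) for t in terms}
```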
Runtag: gemini-1.5-flash_demon5_replace-demon5 (paper)
Org: ntu_nlp
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Format editing: I edited the format from plain text to JSON when it did not meet the required submission format. However, no changes were made to the content itself.
Base models: gemini-1.5-flash
Other data: No
Description: Uses gemini-1.5-flash as the base model. Step 1: entity extraction with 5 demonstrations. Step 2: entity replacement with 5 demonstrations.
Priority: 2
Runtag: gpt-4o-mini_demon5_replace-demon5 (paper)
Org: ntu_nlp
Subtasks: Tasks 1A, 1B, and 1C (term identification, classification, and replacement text generation)
Manual intervention: Format editing: I edited the format from plain text to JSON when it did not meet the required submission format. However, no changes were made to the content itself.
Base models: gpt-4o-mini
Other data: No
Description: Uses gpt-4o-mini as the base model. Step 1: entity extraction with 5 demonstrations. Step 2: entity replacement with 5 demonstrations.
Priority: 3 (bottom)
Runtag: First
Org: IIITH
Subtasks: Tasks 1A and 1B (term identification and classification)
Manual intervention: No manual intervention was done to obtain this data except ensuring that the data was in the specified format.
Base models: For Task 1A, BioBERT was used for named entity recognition. For Task 1B, BioBERT was again used to obtain term embeddings, and a random forest classifier was used to classify the embeddings into the appropriate simplification actions.
Other data: Besides the given PLABA dataset, a pre-processed version of the BC5CDR dataset (BioCreative V CDR task corpus: a resource for relation extraction; Li et al., 2016) was also used. The two datasets were used in combination to train the BioBERT models.
Description: This run is computationally cheap, as it does not require the use of any LLMs. The given data was also complemented with other publicly available datasets.
Priority: 1 (top)
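A hedged sketch of the Task 1B half of this pipeline (BioBERT term embeddings fed to a random forest) follows; the dmis-lab/biobert-base-cased-v1.1 checkpoint, the mean pooling, the toy labels, and the hyperparameters are assumptions for illustration rather than the IIITH team's exact setup.

```python
# Illustrative sketch: embed terms with BioBERT, then classify them with a random forest.
# Checkpoint, pooling strategy, and the toy labels are assumptions for illustration only.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.ensemble import RandomForestClassifier

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
encoder = AutoModel.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

def embed(term: str) -> list[float]:
    """Mean-pool BioBERT's last hidden state over the term's tokens."""
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state    # shape (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).tolist()

# Toy examples; in the real run the term-action pairs come from the PLABA annotations.
terms = ["dyspnea", "shortness of breath while walking", "hypertension", "a heart attack"]
labels = ["SUBSTITUTE", "EXPLAIN", "SUBSTITUTE", "EXPLAIN"]

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit([embed(t) for t in terms], labels)
print(clf.predict([embed("myocardial infarction")]))
```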
Runtag: Roberta-base (paper)
Org: UM
Subtasks: Tasks 1A and 1B (term identification and classification)
Base models: RoBERTa-base
Other data: Nothing
Description: Multi-label token classification with roberta-base.
Priority: 1 (top)
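Hugging Face's stock token-classification head is single-label, so multi-label token classification as described in this run typically means adding a sigmoid/BCE head on top of the encoder. The sketch below shows one way that might look with roberta-base; the label count, the decision threshold, and the class name are assumptions, not the UM team's configuration.

```python
# Minimal sketch of multi-label token classification on top of roberta-base:
# one sigmoid output per (token, label) pair, trained with binary cross-entropy.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

NUM_LABELS = 4  # assumed size of the per-token label inventory

class MultiLabelTokenTagger(nn.Module):
    def __init__(self, name="roberta-base", num_labels=NUM_LABELS):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        logits = self.classifier(hidden)                       # (batch, seq_len, num_labels)
        loss = None
        if labels is not None:                                 # labels: multi-hot, same shape as logits
            loss = nn.BCEWithLogitsLoss()(logits, labels.float())
        return loss, logits

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = MultiLabelTokenTagger()
batch = tokenizer(["Patients exhibited dyspnea on exertion."], return_tensors="pt")
_, logits = model(**batch)
predictions = logits.sigmoid() > 0.5    # each token may carry several labels at once
```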
Runtag: roberta-gbc
Org: Yseop
Subtasks: Tasks 1A and 1B (term identification and classification)
Base models: pabRomero/BioMedRoBERTa-full-finetuned-ner-pablo and GradientBoostingClassifier
Other data: No
Description: Cleaned the training corpus and performed hyperparameter tuning of the two models.
Priority: 1 (top)
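Hyperparameter tuning of the GradientBoostingClassifier component could look roughly like the scikit-learn sketch below; the parameter grid, the placeholder features, and the scoring choice are illustrative assumptions rather than the Yseop team's actual search.

```python
# Illustrative hyperparameter search for the GradientBoostingClassifier component.
# The placeholder features and parameter grid are assumptions, not the actual search space.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))          # stand-in for term-level features
y = rng.integers(0, 4, size=300)        # stand-in for classification labels

param_grid = {
    "n_estimators": [100, 300],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}
search = GridSearchCV(GradientBoostingClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```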