The Thirty-Third Text REtrieval Conference
(TREC 2024)

Medical Video Question Answering: Video Corpus Visual Answer Localization (VCVAL) Task Appendix

Columns: Runtag | Org | Describe the video retrieval model used | Describe the answer localization model used | Additional Details/Comments
Runtag: run-2, Org: TJUMI
Video retrieval model: We implemented a pipeline in which we first crawled YouTube to gather video titles for all available video_ids in video_id.json. Next, we performed semantic matching between the questions in vcval-test-questions.txt and the crawled video titles. This allowed us to identify and select the videos that were genuinely relevant to the VCVAL task, ensuring that the chosen videos were pertinent to the questions and improving the overall quality of our retrieval results.
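As a rough illustration of this kind of question-to-title semantic matching, a minimal sketch using a sentence-transformers bi-encoder is given below. The encoder choice, the crawled_titles.json file, and the top-k cutoff are assumptions for illustration; the run description does not specify them.

    import json
    from sentence_transformers import SentenceTransformer, util

    # Assumed inputs: the official id list, plus a {video_id: title} mapping that a
    # separate YouTube-crawling step has already produced (the crawler is omitted).
    with open("video_id.json") as f:
        video_ids = json.load(f)
    with open("crawled_titles.json") as f:            # hypothetical crawl output
        titles_by_id = json.load(f)
    with open("vcval-test-questions.txt") as f:
        questions = [line.strip() for line in f if line.strip()]

    ids = [v for v in video_ids if v in titles_by_id]
    titles = [titles_by_id[v] for v in ids]

    model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed encoder
    title_emb = model.encode(titles, convert_to_tensor=True)
    question_emb = model.encode(questions, convert_to_tensor=True)

    # Cosine similarity between every question and every crawled title; keep top-k videos.
    scores = util.cos_sim(question_emb, title_emb)
    top = scores.topk(k=min(10, len(ids)), dim=1).indices
    for question, row in zip(questions, top):
        print(question, [ids[int(i)] for i in row])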
Answer localization model: We propose a causal-interference-resilient method for accurate medical video localization that improves both localization accuracy and information retrieval efficiency. First, we incorporate a causal inference model and employ a front-door adjustment technique that effectively removes visual-text confounders. Second, we fuse video and text features through the AFF module: to address semantic and scale inconsistencies among the input features, it introduces a multi-scale channel attention module that incorporates local channel context into global channel statistics. Additionally, we use large language model techniques to build a text-assisted localization module that exploits semantic correlations within the text to improve localization precision. Finally, an Answer Span Predictor module computes the start and end boundaries of the answer. Together, these components yield strong performance on medical video localization tasks.
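For readers unfamiliar with attentional feature fusion, the following is a minimal PyTorch sketch of an AFF-style block with multi-scale channel attention over temporal features. The 1-D formulation, channel width, and reduction ratio are our own assumptions, not the team's implementation.

    import torch
    import torch.nn as nn

    class MSCAM1d(nn.Module):
        """Multi-scale channel attention: local (per-position) plus global (pooled) context."""
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            hidden = channels // reduction
            self.local = nn.Sequential(
                nn.Conv1d(channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv1d(hidden, channels, kernel_size=1),
            )
            self.glob = nn.Sequential(
                nn.AdaptiveAvgPool1d(1),
                nn.Conv1d(channels, hidden, kernel_size=1), nn.ReLU(inplace=True),
                nn.Conv1d(hidden, channels, kernel_size=1),
            )

        def forward(self, x):                         # x: (batch, channels, time)
            return torch.sigmoid(self.local(x) + self.glob(x))

    class AFF1d(nn.Module):
        """Attentional feature fusion: learned weights decide the video/text mix."""
        def __init__(self, channels: int):
            super().__init__()
            self.attn = MSCAM1d(channels)

        def forward(self, video_feat, text_feat):     # both (batch, channels, time)
            w = self.attn(video_feat + text_feat)
            return w * video_feat + (1 - w) * text_feat

    fused = AFF1d(256)(torch.randn(2, 256, 64), torch.randn(2, 256, 64))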
Additional details/comments: Nothing
Runtag: run-1, Org: TJUMI
Video retrieval model: We first crawled YouTube to collect video titles for all available video_ids in video_id.json, then performed semantic matching between the questions in vcval-test-questions.txt and the crawled titles. This allowed us to pinpoint the videos that were genuinely relevant to the VCVAL task, ensuring that the selected videos were directly related to the questions and improving the overall quality of our retrieval results.
Answer localization model: We propose a method resilient to causal interference for precise medical video localization, improving both accuracy and retrieval efficiency. First, we integrate a causal inference model with a front-door adjustment to remove visual-text confounders. Next, we merge video and text features via the AFF module to handle semantic and scale discrepancies, using a multi-scale channel attention module that blends local channel context with global channel statistics. Furthermore, we apply large language model techniques to build a text-assisted localization module that uses semantic correlations in the text to refine precision. Lastly, an Answer Span Predictor determines the start and end boundaries. Together, these components improve performance on medical video localization.
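As a small illustration of the final step described above, a minimal answer span prediction head over fused per-clip features could look like the sketch below; the feature dimension and the cross-entropy objective against gold frame indices are assumptions for illustration only.

    import torch
    import torch.nn as nn

    class SpanPredictor(nn.Module):
        """Scores every time step as a candidate start or end of the answer span."""
        def __init__(self, dim: int = 256):
            super().__init__()
            self.start_head = nn.Linear(dim, 1)
            self.end_head = nn.Linear(dim, 1)

        def forward(self, fused):                     # fused: (batch, time, dim)
            start_logits = self.start_head(fused).squeeze(-1)   # (batch, time)
            end_logits = self.end_head(fused).squeeze(-1)
            return start_logits, end_logits

    # Training signal: cross-entropy against the ground-truth start/end frame indices.
    model = SpanPredictor()
    start_logits, end_logits = model(torch.randn(4, 120, 256))
    loss = (nn.functional.cross_entropy(start_logits, torch.tensor([3, 10, 0, 55]))
            + nn.functional.cross_entropy(end_logits, torch.tensor([20, 40, 7, 90])))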
Additional details/comments: Nothing
Org: PolySmart
Video retrieval model: Sentence transformer and fine-grained text search.
Answer localization model: Cross-modal mutual knowledge model.
Additional details/comments: NA
Org: PolySmart
Video retrieval model: Sentence transformer and fine-grained text search.
Answer localization model: Cross-modal mutual knowledge model.
Additional details/comments: NA
Org: PolySmart
Video retrieval model: Sentence transformer and fine-grained text search.
Answer localization model: Cross-modal mutual knowledge model.
Additional details/comments: NA
Org: PolySmart
Video retrieval model: Sentence transformer and fine-grained text search.
Answer localization model: Cross-modal mutual knowledge model.
Additional details/comments: NA
Org: PolySmart
Video retrieval model: Sentence transformer and fine-grained text search.
Answer localization model: Cross-modal mutual knowledge model.
Additional details/comments: NA
Runtag: run1, Org: NCSU
Video retrieval model: We used a pre-trained BERT model and fine-tuned it on a medical video dataset. Model structure: we used bert-base-uncased to encode the input question, with a linear layer added on top of the BERT pooled output to classify video IDs. Training process: the pooled output from BERT is fed into the linear layer to obtain logits over video IDs, and the model is trained with supervised learning and cross-entropy loss, with hyperparameters tuned on the validation set.
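A minimal sketch of the described retrieval setup, assuming the Hugging Face transformers API; the number of video IDs, the example question, and the gold label below are placeholders, not the team's actual values.

    import torch
    import torch.nn as nn
    from transformers import BertModel, BertTokenizerFast

    NUM_VIDEO_IDS = 3000  # placeholder size of the video-ID label space

    class VideoIdClassifier(nn.Module):
        """bert-base-uncased encodes the question; a linear head scores video IDs."""
        def __init__(self, num_video_ids: int):
            super().__init__()
            self.bert = BertModel.from_pretrained("bert-base-uncased")
            self.classifier = nn.Linear(self.bert.config.hidden_size, num_video_ids)

        def forward(self, input_ids, attention_mask):
            pooled = self.bert(input_ids=input_ids,
                               attention_mask=attention_mask).pooler_output
            return self.classifier(pooled)            # logits over video IDs

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = VideoIdClassifier(NUM_VIDEO_IDS)
    batch = tokenizer(["How do I splint a broken finger?"], return_tensors="pt",
                      padding=True, truncation=True)
    logits = model(batch["input_ids"], batch["attention_mask"])
    loss = nn.functional.cross_entropy(logits, torch.tensor([42]))  # gold video-ID index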
Answer localization model: We used the T5 model for answer localization. Model structure: we fine-tuned a pre-trained T5 model that takes the question and the video subtitles as input and generates the answer text. Training process: preprocessing parses the video subtitle files (SRT format), which are fed into the model along with the question; the model is trained with supervised learning to generate text containing the answer to the given question. Answer localization: TF-IDF is used to vectorize the subtitle text and the generated answer text, and cosine similarity finds the most similar subtitle segments; the predicted answer timestamps are computed from these segments, and performance is evaluated with Intersection over Union (IoU).
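The TF-IDF localization step could be sketched as follows with scikit-learn, assuming the SRT files have already been parsed into (start, end, text) segments and that generated_answer is the T5 output; the example subtitles below are invented.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Invented example segments standing in for parsed SRT subtitles.
    subtitles = [
        (12.0, 17.5, "Place the splint along the injured finger."),
        (17.5, 24.0, "Wrap the tape gently around the splint."),
        (24.0, 31.0, "Check that circulation is not cut off."),
    ]
    generated_answer = "Wrap tape around the splint to hold it in place."

    texts = [text for _, _, text in subtitles]
    matrix = TfidfVectorizer().fit_transform(texts + [generated_answer])

    # Cosine similarity between the generated answer (last row) and each subtitle segment.
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    best = scores.argmax()
    start_sec, end_sec, _ = subtitles[best]
    print(f"Predicted answer span: {start_sec:.1f}s to {end_sec:.1f}s (score {scores[best]:.2f})")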
Additional details/comments: We used preprocessing techniques to handle videos of varying quality and background noise, improving the model's robustness. Several post-processing steps were added to answer localization, including filtering based on the number of sentences and adjusting timestamps, to improve prediction accuracy and consistency.
Runtag: Seahawk run-1, Org: NCstate
Video retrieval model: We developed a keyword extraction model based on a BiLSTM network integrated with spaCy, which extracts keywords from the questions. During retrieval, we consider both the number of extracted keywords that a video's subtitles contain and the distances between those keywords in the subtitles; videos that cover more keywords with shorter distances between them are ranked higher.
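A toy sketch of that scoring idea is shown below; the exact weighting of keyword coverage against keyword distance is our own assumption, since the run description gives no formula, and the example keywords and transcript are invented.

    def keyword_score(keywords: set, transcript_tokens: list) -> float:
        """Higher score for transcripts that cover more keywords within a tighter span."""
        positions = [i for i, tok in enumerate(transcript_tokens) if tok in keywords]
        if not positions:
            return 0.0
        covered = {transcript_tokens[i] for i in positions}
        span = max(positions) - min(positions) + 1        # distance spread of the keywords
        return len(covered) / len(keywords) + 1.0 / span  # assumed combination of the two signals

    keywords = {"splint", "finger", "tape"}
    transcript = "wrap the tape around the splint on the injured finger".split()
    print(keyword_score(keywords, transcript))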
Answer localization model: We use GPT-4o mini to generate answers to the questions and a fine-tuned RoBERTa model for answer localization.
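A minimal sketch of extractive localization with a RoBERTa question-answering model follows; the public deepset/roberta-base-squad2 checkpoint stands in for the team's fine-tuned model, the GPT-4o mini generation step is omitted, and mapping the predicted character span back to subtitle timestamps is our own assumption.

    from transformers import pipeline

    qa = pipeline("question-answering", model="deepset/roberta-base-squad2")

    # Concatenate subtitle segments and remember each segment's character range.
    subtitles = [(12.0, 17.5, "Place the splint along the finger."),
                 (17.5, 24.0, "Wrap the tape gently around the splint.")]
    context, offsets = "", []
    for start, end, text in subtitles:
        offsets.append((len(context), len(context) + len(text), start, end))
        context += text + " "

    pred = qa(question="How do I secure the splint?", context=context)
    # Any segment overlapping the predicted character span contributes its timestamps.
    hits = [(s, e) for lo, hi, s, e in offsets if lo < pred["end"] and hi > pred["start"]]
    print(pred["answer"], hits)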
Additional details/comments: We sincerely apologize for the late submission and appreciate the hard work of the organizers.
Runtag: run, Org: UNCW
Video retrieval model: BERT
Answer localization model: BERT
Additional details/comments: N/A
Runtag: mainrun, Org: UNCC
Video retrieval model: BERT
Answer localization model: BERT
Additional details/comments: None