v3_surround_glm4 — Report Generation Task

Submission Details

Organization: CSU
Track: RAG TREC Instrument for Multilingual Evaluation
Task: Report Generation Task
Date: 2025-08-20

Run Description

Document collection: English subset, Arabic subset, Chinese subset, Russian subset
Machine translation of documents: None
Write a short description of your retrieval process: The system uses pre-built FAISS indexes containing document embeddings. When a query arrives, it is encoded using a multilingual BGE model with language-specific prefixes. The encoded query vector is searched against the indexes to retrieve the top-k most similar documents across multiple language splits. Results whose similarity score falls below a minimum threshold are discarded.
Write a short description of your generation process: For generation, retrieved documents are chunked using a recursive text splitter that creates overlapping segments to preserve context. These chunks are then reranked using specialized models (Qwen reranker or FlagReranker) that compute semantic similarity with the original query. Reranking uses batch processing and caching for efficiency. The top-ranked chunks are formatted into a structured context that includes similarity scores and chunk identifiers. Chunks are ordered so that the highest-scoring ones appear at the start and end of the context. This context is combined with the original task description to form the prompt for the language model. After the model generates a response, each sentence is automatically assigned citations by matching it against the source chunks that contributed to its content. The final output is a validated JSON structure containing metadata, cited sentences, and a reference list of source documents.
Which LLM(s) were used by your system?: GLM-4.5
Open repository link: https://github.com/stuartofmine/rag
Assessing priority: 1 (highest)
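
Illustrative sketch (not taken from the submitted system): the score filtering and the "high score at the start and end" chunk ordering described above could look roughly like the following. The Chunk class, the function names, and the alternating front/back placement are assumptions made for illustration; the run description does not specify how chunks are distributed between the two ends or what threshold value is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    # Hypothetical container; field names are assumptions, not the system's schema.
    chunk_id: str
    text: str
    score: float


def filter_by_score(chunks: List[Chunk], min_score: float) -> List[Chunk]:
    """Keep only chunks whose score is at least min_score."""
    return [c for c in chunks if c.score >= min_score]


def surround_order(chunks: List[Chunk]) -> List[Chunk]:
    """Place the highest-scoring chunks at both ends of the context.

    Assumed scheme: rank chunks by descending score, then alternate
    between the front and the back, so the lowest-scoring chunks
    end up in the middle.
    """
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front: List[Chunk] = []
    back: List[Chunk] = []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    # back was filled best-first, so reverse it to put its best chunk last.
    return front + back[::-1]


if __name__ == "__main__":
    demo = [
        Chunk("c1", "alpha", 0.9),
        Chunk("c2", "beta", 0.8),
        Chunk("c3", "gamma", 0.7),
        Chunk("c4", "delta", 0.6),
        Chunk("c5", "epsilon", 0.5),
        Chunk("c6", "zeta", 0.1),
    ]
    kept = filter_by_score(demo, min_score=0.3)
    for c in surround_order(kept):
        print(c.chunk_id, c.score)
```

With the demo values, the 0.1 chunk is dropped and the remaining chunks are emitted in the order c1, c3, c5, c4, c2, so the two best-scoring chunks sit at the first and last positions.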

Evaluation Files

v3_surround_glm4.autoargue (autoargue)
v3_surround_glm4.almost-human-judgments.tsv (almost-human-judgments.tsv)
v3_surround_glm4.almost-human-scores.tsv (almost-human-scores.tsv)
v3_surround_glm4.autoargue-scores.tsv (autoargue-scores.tsv)
v3_surround_glm4.autoargue-judgments.jsonl (autoargue-judgments.jsonl)