TREC 2025 Proceedings

mllm-indelab-09-17

Submission Details

Organization
indelab
Track
Million LLM
Task
LLM Ranking Task
Date
2025-09-18

Run Description

Is the run manual or automatic?
automatic
Did you use the response metadata?
yes
Did you use any additional data or external knowledge?
no
Did you use the development set?
yes
Did you train on the development set?
yes
Provide a description of this run, including details about your answers above.
The mllm-indelab-09-17 submission is an automatic run that ranks LLMs with an ensemble of five dual-encoder neural ranking models. Query embeddings are processed through 4-head attention (384→96×4→384) followed by dense layers (384→256→192→128), while LLM representations use a learned embedding (1131→256) followed by a 4-layer tower (256→512→384→256→128). The final score is the cosine similarity between the two 128-D representations, averaged across the 5 best-performing cross-validation folds (selected post-hoc by validation nDCG@10). Training data consisted of 4.45M examples: 90% of the development-set queries per fold (~347K examples with human qrels 0-3) plus 4.06M weakly labeled examples derived from discovery response metadata (responses longer than 50 characters received qrel = 1.0, all others qrel = 0.0). Under 10-fold cross-validation, each fold trained on its 90% split of the development queries plus all weakly labeled examples, and validated on the held-out 10% of original dev queries. No additional data or external knowledge was used beyond the provided development and discovery datasets.
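The two encoder towers described above can be sketched as follows. This is a minimal PyTorch reconstruction from the stated layer dimensions only; class names, activation choices, and the single-token attention input are assumptions, not the submitted implementation.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    # 4-head self-attention over the 384-D query embedding (4 heads x 96 dims,
    # i.e. 384 -> 96x4 -> 384), then a dense tower 384 -> 256 -> 192 -> 128.
    # ReLU activations are an assumption; the run description does not specify them.
    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=384, num_heads=4, batch_first=True)
        self.tower = nn.Sequential(
            nn.Linear(384, 256), nn.ReLU(),
            nn.Linear(256, 192), nn.ReLU(),
            nn.Linear(192, 128),
        )

    def forward(self, q):  # q: (batch, seq_len, 384)
        attn_out, _ = self.attn(q, q, q)          # self-attention over query tokens
        return self.tower(attn_out.mean(dim=1))   # pool, then project to 128-D

class LLMEncoder(nn.Module):
    # Learned embedding over the 1131 LLM ids (1131 -> 256), then a 4-layer
    # tower 256 -> 512 -> 384 -> 256 -> 128.
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(1131, 256)
        self.tower = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, 384), nn.ReLU(),
            nn.Linear(384, 256), nn.ReLU(),
            nn.Linear(256, 128),
        )

    def forward(self, llm_ids):  # llm_ids: (batch,) integer LLM indices
        return self.tower(self.emb(llm_ids))  # (batch, 128)
```

Both towers end in the same 128-D space so that a single cosine similarity can compare a query to an LLM.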
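The weak-labeling rule applied to the discovery response metadata is stated exactly, so it can be written down directly; the function name and the parameterized threshold are illustrative.

```python
def weak_qrel(response_text: str, min_length: int = 50) -> float:
    """Binary weak relevance label from discovery response metadata:
    responses longer than 50 characters are labeled relevant (1.0),
    all others non-relevant (0.0)."""
    return 1.0 if len(response_text) > min_length else 0.0
```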
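The ensemble scoring step (cosine similarity between 128-D representations, averaged over the five selected folds) reduces to a few lines; a minimal numpy sketch, with hypothetical function names, assuming each selected fold contributes one query vector and one LLM vector:

```python
import numpy as np

def cosine(a, b) -> float:
    """Cosine similarity between two dense vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def ensemble_score(query_vecs, llm_vecs) -> float:
    """Average per-fold cosine similarity across the selected folds.
    query_vecs, llm_vecs: one 128-D vector per fold for the same
    (query, LLM) pair."""
    return float(np.mean([cosine(q, v) for q, v in zip(query_vecs, llm_vecs)]))
```

For each query, LLMs would then be ranked by this averaged score in descending order.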
Priority for pooling
1 (top)

Evaluation Files