TREC 2004 Robust Track Guidelines
Results due date: August 5, 2004
What to submit to NIST
Submit at most ten ad hoc runs. If you submit exactly one automatic run, then that run must use either just the description field of the topic or just the title field of the topic. If you submit exactly two automatic runs, then one must use just the description field and the other use just the title field. If you submit three or more automatic runs, then you must submit one run that uses just the description field, one run that uses just the title field, and the other runs may use any combination of fields desired.
NIST will assess documents from at least one, and probably more, of the runs. When you submit your results, specify the order in which the runs should be considered for assessing. Evaluation results will be reported using the entire set of 250 topics, the set of 50 difficult topics selected during TREC 2003, and the set of 50 new topics for TREC 2004 (651-700).
The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic, ad hoc retrieval task in TREC that provides a natural home for new participants. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.
The document collection for the Robust track is the set of documents on both TREC Disks 4 and 5 minus the Congressional Record on disk 4.
Source                # Docs    Size (MB)
Financial Times       210,158   564
Federal Register 94   55,630    395
FBIS, disk 5          130,471   470
LA Times              131,896   475
Total Collection:     528,155   1904
You may index any/all fields of the documents (this is a change from early TRECs where we asked participants not to use the "Subject" field in the LA Times documents).
A file containing 250 topic statements will be posted to the Robust section of the Tracks' page some time on June 3. This file will contain all of the topics used against this data set in previous TRECs plus 50 new topic statements. The 50 new topic statements will be numbered 651-700.
There is a distinguished set of 50 topics drawn from topics 301-450 that were used in the TREC 2003 robust track. This set of topics was selected as a set of topics known to be difficult for current automatic systems. Evaluation measures will be computed over this set of topics in isolation, in addition to evaluating over the whole set of 250 topics, and the new 50 topics (numbers 651-700). The members of this distinguished set are:
303 322 344 353 363 378 394 408 426 439
307 325 345 354 367 379 397 409 427 442
310 330 346 355 372 383 399 414 433 443
314 336 347 356 374 389 401 416 435 445
320 341 350 362 375 393 404 419 436 448
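Because results are reported over several topic sets, it can help to slice per-topic scores the same way when analysing your own runs. The following is a minimal sketch, not NIST's evaluation code: the function name mean_over and the shape of the scores dictionary are assumptions made for illustration.

```python
# The 50 difficult topics listed above (drawn from topics 301-450).
DIFFICULT_TOPICS = frozenset(int(t) for t in """
303 322 344 353 363 378 394 408 426 439
307 325 345 354 367 379 397 409 427 442
310 330 346 355 372 383 399 414 433 443
314 336 347 356 374 389 401 416 435 445
320 341 350 362 375 393 404 419 436 448
""".split())

# The 50 new topics for TREC 2004 (651-700).
NEW_TOPICS = frozenset(range(651, 701))


def mean_over(scores, topics):
    """Average per-topic scores over the given topic set.

    `scores` maps topic number -> score (a hypothetical input, for example
    average precision per topic). Topics missing from `scores` count as 0.0.
    """
    if not topics:
        raise ValueError("empty topic set")
    return sum(scores.get(t, 0.0) for t in topics) / len(topics)
```

For example, mean_over(scores, DIFFICULT_TOPICS) gives the mean over the difficult set only, and mean_over(scores, NEW_TOPICS) the mean over the new topics.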
Queries may be created automatically or manually from the topic statements. Automatic methods are those in which there is no human intervention at any stage and manual methods are everything else. You may use any/all of the topic fields when creating queries from the topic statements for most runs. However, there are two "standard" automatic runs that must be submitted if you submit any automatic runs. These standard runs facilitate comparing systems. One standard run uses just the description field of the topic statement. The other standard run uses just the title field. If you submit only one automatic run, then you must submit one of these two standard runs (either one of your choice). If you submit at least two automatic runs, then you must submit both standard runs.
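The standard-run requirement can be checked mechanically before submitting. The sketch below is one possible way to do so, assuming each automatic run is described by the set of topic fields it uses; the function name check_standard_runs and the field labels "title" and "description" are illustrative choices, not part of any NIST tool.

```python
def check_standard_runs(automatic_runs):
    """Return a list of problems with a group's set of automatic runs.

    `automatic_runs` is a list of sets, one per automatic run, naming the
    topic fields that run uses, e.g. {"title"} or {"title", "description"}.
    Manual runs are not subject to this rule and should not be passed in.
    """
    has_title_only = any(fields == {"title"} for fields in automatic_runs)
    has_desc_only = any(fields == {"description"} for fields in automatic_runs)

    problems = []
    if len(automatic_runs) == 1:
        # A single automatic run must be one of the two standard runs.
        if not (has_title_only or has_desc_only):
            problems.append("the only automatic run must be title-only or description-only")
    elif len(automatic_runs) >= 2:
        # Two or more automatic runs require both standard runs.
        if not has_title_only:
            problems.append("missing the title-only standard run")
        if not has_desc_only:
            problems.append("missing the description-only standard run")
    return problems
```

An empty list means the set of automatic runs satisfies the rule; a group submitting only manual runs passes an empty list and gets no problems back.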
The manual query construction category encompasses a wide variety of different approaches. There are intentionally few restrictions on what is permitted to accommodate as many experiments as possible. In general, the ranking submitted for a topic is expected to reflect a ranking that your system could actually produce, i.e., the result of a single query in your system (granting that that query might be quite complex and the end result of many iterations of query refinement) or the automatic fusion of different queries' results. However, it is permissible to submit a ranking produced in some other way, provided the ranking supports some specific hypothesis that is being tested and the conference paper gives explicit details regarding how the ranking was constructed.
Using some old topics in the test set means that full relevance data is available to participants and that systems have been developed using these topics. Since we cannot control how the topics were used in the past, the assumption will be that the old topics were fully exploited in any way desired in the construction of the retrieval system. The only restriction for the old topics is that the actual relevance judgments may not be used during the processing of the submitted runs. This precludes such things as true (rather than pseudo) relevance feedback, computing weights based on the known relevant set, etc. The usual ad hoc restrictions (see General TREC Guidelines) that preclude modifying system data structures in response to test topics DO apply to the 50 new topics.
Format of a Submission
A robust track submission consists of a single file with two parts. The format of the first part is the same as is given in the Welcome message, repeated here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
630 Q0 ZF08-175-870 1 4238 prise1
630 Q0 ZF08-306-044 2 4223 prise1
630 Q0 ZF09-477-757 3 4207 prise1
630 Q0 ZF08-312-422 4 4194 prise1
630 Q0 ZF08-013-262 5 4189 prise1
etc.
Each topic must have at least one document retrieved for it. Provided you have at least one document, you may return fewer than 1000 documents for a topic, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 documents per topic.
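The constraints on the first part of the file (exactly six whitespace-separated columns, at least one and at most 1000 documents per topic, no document repeated within a topic) can be checked with a short script. This is a sketch only and is not the NIST checking routine. The column order it assumes (topic, the literal Q0, document number, rank, score, run tag) follows the usual TREC result format, which these guidelines show by example but do not spell out; the function name is hypothetical.

```python
from collections import defaultdict

MAX_DOCS_PER_TOPIC = 1000


def check_ranked_lists(lines, expected_topics):
    """Check part one of a submission; return a list of error strings.

    `lines` is an iterable of ranked-list lines and `expected_topics` an
    iterable of topic numbers as strings. Blank lines are ignored.
    """
    errors = []
    docs_by_topic = defaultdict(set)
    for lineno, line in enumerate(lines, start=1):
        cols = line.split()  # any run of white space separates columns
        if not cols:
            continue
        if len(cols) != 6:
            errors.append(f"line {lineno}: expected 6 columns, found {len(cols)}")
            continue
        # Assumed column order: topic, Q0, document number, rank, score, run tag.
        topic, _q0, docno, _rank, _score, _tag = cols
        if docno in docs_by_topic[topic]:
            errors.append(f"line {lineno}: duplicate document {docno} for topic {topic}")
        docs_by_topic[topic].add(docno)

    for topic in expected_topics:
        count = len(docs_by_topic.get(topic, ()))
        if count == 0:
            errors.append(f"topic {topic}: no documents retrieved")
        elif count > MAX_DOCS_PER_TOPIC:
            errors.append(f"topic {topic}: {count} documents exceeds {MAX_DOCS_PER_TOPIC}")
    return errors
```

An empty result means the ranked lists passed these particular checks; it does not validate document numbers against the collection.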
The second part of the submission file immediately follows the end of the ranked lists. This part contains exactly 250 lines and assigns exactly one number in the range 1-250 to each of the topics. No two topics may be assigned the same number. The semantics of the number assigned is the system's prediction of the relative difficulty of the topic. A topic assigned a number closer to 1 is predicted to be easier than (and thus the system will get a better score for) a topic assigned a number closer to 250. The reason for including this information in the submission is that we would like to test systems' abilities to recognize the difficulty of a topic. The QA track had a similar task in TREC 2002, and some systems were much better at predicting than others. The format of a line for this part is
P topic-number difficulty-number
where P is the constant "P" to flag the line as being a prediction; topic-number is one of the topic numbers in the test set; and difficulty-number is an integer in the range 1-250. When sorted by difficulty-number, the P lines must provide a strict ordering of all 250 topics in the test set from easiest (1) to most difficult (250).
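The strict-ordering requirement means the difficulty numbers must form a permutation of 1 through the number of topics. The sketch below checks that, under the assumption that the P lines have already been separated from the ranked lists; the function name check_predictions is hypothetical. The upper bound is taken from the size of the topic list, which is 250 for this track.

```python
def check_predictions(lines, topics):
    """Check the P lines; return a list of error strings (empty if valid).

    `lines` is an iterable of prediction lines and `topics` a list of the
    topic numbers in the test set, as strings.
    """
    errors = []
    assigned = {}  # topic -> difficulty number
    for lineno, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 3 or cols[0] != "P":
            errors.append(f"line {lineno}: expected 'P topic-number difficulty-number'")
            continue
        topic, difficulty = cols[1], cols[2]
        if topic not in topics:
            errors.append(f"line {lineno}: unknown topic {topic}")
            continue
        if topic in assigned:
            errors.append(f"line {lineno}: topic {topic} has more than one prediction")
            continue
        if not difficulty.isdecimal() or not 1 <= int(difficulty) <= len(topics):
            errors.append(f"line {lineno}: difficulty must be an integer in 1-{len(topics)}")
            continue
        assigned[topic] = int(difficulty)

    missing = set(topics) - set(assigned)
    if missing:
        errors.append(f"{len(missing)} topic(s) have no prediction")
    if len(set(assigned.values())) != len(assigned):
        errors.append("two or more topics share a difficulty-number")
    return errors
```

If every topic has exactly one in-range number and no number is shared, the numbers necessarily cover the whole range, so sorting by difficulty-number yields the required strict ordering.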
NIST has a routine that checks for common errors in the result files including duplicate document numbers for the same topic, invalid document numbers, wrong format, and multiple tags within runs. This routine will be made available to participants to check their runs for errors prior to submitting them. Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.
Groups may submit up to ten runs to the Robust track. The 50 new topics for at least one run will be judged by NIST assessors; NIST may judge more than one run per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily. No additional judging of the old topics will be performed (you lose the benefits of having a standard test collection as soon as there exist multiple sets of relevance judgments). The judgments for the 50 new topics will be on a three-way scale of "not relevant", "relevant", and "highly relevant" so that we can continue to build ad hoc test collections with multiple relevance levels. To be compatible with the old topics that have only binary judgments, the scoring will consider both "relevant" and "highly relevant" documents to be relevant.
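The collapse from the three-way scale to binary relevance can be illustrated as follows. This is a sketch under a stated assumption: the numeric codes 0, 1, and 2 for the three levels are chosen here for illustration, since these guidelines do not specify how the judgments are encoded.

```python
# Assumed numeric encoding of the three-way scale (not given in the guidelines).
NOT_RELEVANT, RELEVANT, HIGHLY_RELEVANT = 0, 1, 2


def is_relevant(judgment):
    """Treat both "relevant" and "highly relevant" as relevant."""
    return judgment >= RELEVANT


def count_relevant(judgments):
    """Count the documents that score as relevant under the binary view."""
    return sum(1 for j in judgments if is_relevant(j))
```

With this mapping, a topic judged [0, 1, 2, 2] has three relevant documents for scoring purposes, the same as if it had been judged on a binary scale.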
NIST will score all submitted runs using the relevance judgments produced by the assessors. A large variety of scores will be reported, including at least the following:
In the interest of continuing the investigation of good evaluation measures, we will not select a single measure to be the focus measure for the track. We will collect whether or not the run was created with a specific measure in mind (and which measure, if so) with the form that is filled in when runs are submitted.
Documents available: now
Topics available: after June 3, 2004
Results due at NIST: Aug 5, 2004 (11:59pm EDT)
Qrels for new topics available: not later than Oct 1, 2004
Conference notebook papers due: late October, 2004
TREC 2004 conference: November 16-19, 2004
Last updated: Tuesday, 08-Feb-2005 09:52:03 EST
Date created: Friday, 21-May-04