TREC 2005 Robust Track Guidelines
Results due date: August 8, 2005
What to submit to NIST:
Submit at most five robust runs. If you submit exactly one automatic run, then that run must use either just the description field of the topic or just the title field of the topic. If you submit exactly two automatic runs, then one must use just the description field and the other use just the title field. If you submit three or more automatic runs, then you must submit one run that uses just the description field, one run that uses just the title field, and the other runs may use any combination of fields desired.
NIST will assess documents from at least one of the runs. When you submit your results, specify the order in which the runs should be considered for assessing.
The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. This year, the track will investigate the effectiveness obtainable on a new document set for topics that are known to be difficult on a separate document set. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.
The test document collection for the Robust track is the set of documents on the AQUAINT disks (see the TREC 2005 Welcome message for how to obtain document collections). The AQUAINT collection consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires.
A file containing 50 topic statements will be posted to the Robust section of the Tracks' page some time on June 15. The 50 topics will be selected from among the topics that had low average effectiveness across runs in previous TRECs and at least 3 relevant documents in the AQUAINT collection. This same topic set will also be used by the TREC 2005 HARD track.
Queries may be created automatically or with human assistance from the topic statements. Automatic methods are those in which there is no human intervention at any stage and manual methods are everything else. You may use any/all of the topic fields when creating queries from the topic statements for most runs. However, there are two "standard" automatic runs that must be submitted if you submit any automatic runs. These standard runs facilitate comparing systems. One standard run uses just the description field of the topic statement. The other standard run uses just the title field. If you submit only one automatic run, then you must submit one of these two standard runs (either one of your choice). If you submit at least two automatic runs, then you must submit both standard runs.
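The rule for standard automatic runs can be sketched as a small check. This is only an illustration: the function name and the representation of a run as a set of field names are assumptions made for the example, not part of the track infrastructure.

```python
def check_automatic_runs(fields_per_run):
    """Return True if a group's automatic runs satisfy the standard-run rule.

    fields_per_run is a list with one entry per automatic run; each entry is
    the set of topic fields that run used, e.g. {"title"} (hypothetical
    representation chosen for this sketch).
    """
    has_title_only = any(fields == {"title"} for fields in fields_per_run)
    has_desc_only = any(fields == {"description"} for fields in fields_per_run)

    if not fields_per_run:
        # No automatic runs submitted, so no standard run is required.
        return True
    if len(fields_per_run) == 1:
        # A single automatic run must be one of the two standard runs.
        return has_title_only or has_desc_only
    # Two or more automatic runs: both standard runs must be present.
    return has_title_only and has_desc_only
```

For example, a group submitting a title-only run, a description-only run, and a third run that uses both fields passes the check, while a group submitting a title-only run plus a run that uses both fields does not.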
The human-assisted query construction category encompasses a wide variety of different approaches. There are intentionally few restrictions on what is permitted to accommodate as many experiments as possible. In general, the ranking submitted for a topic is expected to reflect a ranking that your system could actually produce, i.e., the result of a single query in your system (granting that that query might be quite complex and the end result of many iterations of query refinement) or the automatic fusion of different queries' results. However, it is permissible to submit a ranking produced in some other way, provided the ranking supports some specific hypothesis that is being tested and the conference paper gives explicit details regarding how the ranking was constructed.

IMPORTANT NOTE: You may make explicit use of the relevance judgments for a topic on the old collection to construct your query for the new collection, but doing so (even automatically) makes the run human-assisted since the relevance judgments are human-produced. For example, automatically creating a feedback query on the old collection and using that query on the new collection as in the old routing task is a legal human-assisted Robust track run. (Remember that the assessor doing the judging for this year's Robust track will be different from the assessor who made the judgments for the original collection. Different assessors are known to have different opinions as to relevance, so while existing judgments are likely to be helpful for retrieving good documents in the new collection, that is not guaranteed.)
Format of a Submission
A robust track submission consists of a single file with two parts. The format of the first part is the same as is given in the Welcome message, repeated here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
    630 Q0 NYT19990430.0001 1 4238 prise1
    630 Q0 APW20000805.0004 2 4223 prise1
    630 Q0 XIE19971213.0003 3 4207 prise1
    630 Q0 NYT19980830.0021 4 4194 prise1
    630 Q0 APW19981105.0054 5 4289 prise1
    etc.
Each topic must have at least one document retrieved for it. Provided you have at least one document, you may return fewer than 1000 documents for a topic, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 documents per topic.
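A participant might pre-check the ranked-list part of a file with something like the sketch below. It is not NIST's checking routine. The column order (topic, the literal Q0, document number, rank, score, run tag) is assumed from the usual TREC result format, and the function name and error messages are made up for the example.

```python
from collections import defaultdict


def check_ranked_lines(lines, max_docs=1000):
    """Return a list of error strings for the ranked-list part of a run file."""
    errors = []
    docs_by_topic = defaultdict(list)
    run_tags = set()

    for line_no, line in enumerate(lines, start=1):
        columns = line.split()  # any amount of white space separates columns
        if len(columns) != 6:
            errors.append(f"line {line_no}: expected 6 columns, found {len(columns)}")
            continue
        # Assumed column order: topic, Q0, docno, rank, score, run tag.
        topic, _q0, docno, _rank, _score, tag = columns
        docs_by_topic[topic].append(docno)
        run_tags.add(tag)

    for topic, docs in docs_by_topic.items():
        if len(docs) > max_docs:
            errors.append(f"topic {topic}: more than {max_docs} documents")
        if len(set(docs)) != len(docs):
            errors.append(f"topic {topic}: duplicate document numbers")

    if len(run_tags) > 1:
        errors.append("multiple run tags within one run")
    return errors
```

An empty list means none of these particular problems were found. The sketch does not verify that every topic in the test set has at least one document, since that requires the topic list.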
Immediately following the end of the ranked lists is the second part of the submission file. This part contains exactly 50 lines and assigns exactly one number in the range 1--50 to each of the topics. No two topics may be assigned the same number. The semantics of the number assigned is the system's prediction of the relative difficulty of the topic. A topic assigned a number closer to 1 is predicted to be easier than (and thus the system will get a better score for) a topic assigned a number closer to 50. The reason for including this information in the submission is that we will again (as in TREC 2004) test systems' abilities to recognize the difficulty of a topic. The format of a line for this part is
P topic-number difficulty-number
where P is the constant "P" to flag the line as being a prediction; topic-number is one of the topic numbers in the test set; and difficulty-number is an integer in the range 1-50. When sorted by difficulty-number, the P lines must provide a strict ordering of all 50 topics in the test set from easiest (1) to most difficult (50).
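The constraints on the prediction lines amount to requiring a permutation of 1..50 over the test topics. A minimal sketch of a parser that enforces this follows; the function name and the choice to raise ValueError are assumptions made for illustration.

```python
def parse_predictions(lines, topics):
    """Parse "P topic-number difficulty-number" lines.

    Returns the topics ordered from predicted easiest (1) to predicted most
    difficult (N), or raises ValueError if the lines are not a strict ordering
    of exactly the given topics.
    """
    predictions = {}
    for line in lines:
        columns = line.split()
        if len(columns) != 3 or columns[0] != "P":
            raise ValueError(f"not a prediction line: {line!r}")
        topic, difficulty = columns[1], int(columns[2])
        if topic in predictions:
            raise ValueError(f"topic {topic} has more than one prediction")
        predictions[topic] = difficulty

    if set(predictions) != set(topics):
        raise ValueError("predictions must cover exactly the test topics")
    # Each difficulty number 1..N must be used exactly once.
    if sorted(predictions.values()) != list(range(1, len(predictions) + 1)):
        raise ValueError("difficulty numbers must be a permutation of 1..N")

    return sorted(predictions, key=predictions.get)
```

With the 50 test topics, N is 50, so the permutation check covers both the range restriction and the rule that no two topics share a number.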
NIST has a routine that checks for common errors in the result files including duplicate document numbers for the same topic, invalid document numbers, wrong format, and multiple tags within runs. This routine will be made available to participants to check their runs for errors prior to submitting them. Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.
Groups may submit up to five runs to the Robust track. The pools for the document judging will be created from both HARD and Robust track runs. NIST will guarantee to judge at least one run per Robust group. (Note that a group that submits both HARD and Robust track runs will have the minimum number of runs in both tracks judged.) During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily. The judgments will be on a three-way scale of "not relevant", "relevant", and "highly relevant" so that we can continue to build ad hoc test collections with multiple relevance levels. The scoring will consider both "relevant" and "highly relevant" documents to be relevant.
While the pools will be somewhat shallower than previous years given resource constraints, the diversity will be high. Also, HARD runs and Robust human-assisted runs are likely to be more effective than Robust track runs in previous years, so the pool quality should be fine.
NIST will score all submitted runs using the relevance judgments produced by the assessors. In addition to the standard measures reported by trec_eval, we will also report the geometric MAP measure (see the TREC 2004 Robust track overview for a discussion of this measure). The geometric MAP measure will replace the success@10 and area measures since we are using only 50 topics. The geometric MAP measure will be the "official" measure for the track.
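As a rough illustration of why geometric MAP emphasizes poorly performing topics, the sketch below takes the geometric mean of per-topic average precision. The floor applied to very small scores (0.00001 here) is an assumption of this example, needed because the logarithm of zero is undefined. The TREC 2004 Robust track overview and trec_eval are the authority on the exact definition.

```python
import math


def geometric_map(ap_by_topic, floor=1e-5):
    """Geometric mean of per-topic average precision.

    ap_by_topic maps topic -> average precision. Scores below `floor` are
    raised to `floor` (assumed value) so that the logarithm is defined.
    """
    logs = [math.log(max(ap, floor)) for ap in ap_by_topic.values()]
    return math.exp(sum(logs) / len(logs))
```

For two topics with average precision 0.25 and 1.0, the arithmetic mean is 0.625 but the geometric mean is 0.5, so the weaker topic pulls the score down more strongly.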
The quality of the run difficulty prediction will be measured by the difference between the curves produced by plotting (standard) MAP when topics are sorted by actual MAP score and when plotted by predicted order (again, see the TREC 2004 Robust track overview). Other suggestions for measures of difficulty prediction are welcome and may be reported.
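One plausible reading of this measure is sketched below. It assumes each curve is built by computing MAP over all topics, then repeatedly dropping the next topic in a given order and recomputing, and that the area is the sum of pointwise differences between the two curves. Both choices are assumptions of this example, and the TREC 2004 Robust track overview gives the actual construction, which may differ.

```python
def map_curve(ap_by_topic, drop_order):
    """MAP over the remaining topics as topics are dropped in drop_order."""
    remaining = dict(ap_by_topic)
    curve = []
    for topic in drop_order:
        curve.append(sum(remaining.values()) / len(remaining))
        del remaining[topic]
    return curve


def prediction_area(ap_by_topic, predicted_hardest_first):
    """Area between the actual-order curve and the predicted-order curve.

    A value of 0 means the predicted ordering matches the actual ordering.
    """
    actual_hardest_first = sorted(ap_by_topic, key=ap_by_topic.get)
    actual_curve = map_curve(ap_by_topic, actual_hardest_first)
    predicted_curve = map_curve(ap_by_topic, predicted_hardest_first)
    return sum(a - p for a, p in zip(actual_curve, predicted_curve))
```

Because the submission numbers topics from easiest (1) to most difficult (50), the predicted order would be reversed before being passed in as `predicted_hardest_first`.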
Documents available: now
Topics available: after June 15
Results due at NIST: Aug 8, 2005 (11:59pm EDT)
Qrels for new topics available: not later than Oct 1, 2005
Conference notebook papers due: late October, 2005
TREC 2005 conference: November 15--18, 2005
Last updated: Thursday, 26-May-2005 11:57:05 EDT
Date created: Monday, May 23, 2005