TREC 2004 Robust Track Guidelines

Summary

Results due date: August 5, 2004

What to submit to NIST
Each submission is in a single file that contains two parts. The first part contains a ranked list of at least one and no more than 1000 documents for each topic in the Robust test set run against the combined collection consisting of the documents from the Financial Times, the Federal Register 94, the LA Times, and FBIS (i.e., TREC disks 4&5, minus the Congressional Record). The Robust test set contains 250 topics: topics 301-450 (ad hoc topics from TRECs 6--8), topics 601-650 (new topics for last year's robust track), and topics 651-700 (this year's new topics). The second part contains exactly one line per topic (so, exactly 250 lines) and gives the system's prediction as to how well the topic was answered compared with the other topics. Each topic is assigned a number from 1 to 250. The topic assigned 1 is the topic the system believes it did best on. The topic assigned 2 is the topic the system believes it did next best on. The topic assigned 250 is the topic the system believes it did worst on. Each topic must be assigned exactly one number from the range 1-250, and no two topics may be assigned the same number (i.e., you must provide a strict ordering of the topics by presumed difficulty).

Submit at most ten ad hoc runs. If you submit exactly one automatic run, then that run must use either just the description field of the topic or just the title field of the topic. If you submit exactly two automatic runs, then one must use just the description field and the other use just the title field. If you submit three or more automatic runs, then you must submit one run that uses just the description field, one run that uses just the title field, and the other runs may use any combination of fields desired.

NIST will assess documents from at least one, and probably more, of the runs. When you submit your results, specify the order in which the runs should be considered for assessing. Evaluation results will be reported using the entire set of 250 topics, the set of 50 difficult topics selected during TREC 2003, and the set of 50 new topics for TREC 2004 (651-700).

Details

The goal of the Robust track is to improve the consistency of retrieval technology by focusing on poorly performing topics. In addition, the track brings back a classic, ad hoc retrieval task in TREC that provides a natural home for new participants. An ad hoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top 1000 documents for that topic.

Test Data

The document collection for the Robust track is the set of documents on both TREC Disks 4 and 5 minus the Congressional Record on disk 4.

    Source                   # Docs     Size (MB)
    Financial Times          210,158    564
    Federal Register 94       55,630    395
    FBIS, disk 5             130,471    470
    LA Times                 131,896    475

    Total Collection:        528,155    1904

You may index any/all fields of the documents (this is a change from early TRECs where we asked participants not to use the "Subject" field in the LA Times documents).

A file containing 250 topic statements will be posted to the Robust section of the Tracks' page some time on June 3. This file will contain all of the topics used against this data set in previous TRECs plus 50 new topic statements. The 50 new topic statements will be numbered 651-700.

There is a distinguished set of 50 topics drawn from topics 301-450 that were used in the TREC 2003 robust track. This set of topics was selected as a set of topics known to be difficult for current automatic systems. Evaluation measures will be computed over this set of topics in isolation, in addition to evaluating over the whole set of 250 topics, and the new 50 topics (numbers 651-700). The members of this distinguished set are:

    303  322  344  353  363  378  394  408  426  439
    307  325  345  354  367  379  397  409  427  442
    310  330  346  355  372  383  399  414  433  443
    314  336  347  356  374  389  401  416  435  445
    320  341  350  362  375  393  404  419  436  448

Queries may be created automatically or manually from the topic statements. Automatic methods are those in which there is no human intervention at any stage, and manual methods are everything else. You may use any/all of the topic fields when creating queries from the topic statements for most runs. However, there are two "standard" automatic runs that must be submitted if you submit any automatic runs. These standard runs facilitate comparing systems. One standard run uses just the description field of the topic statement. The other standard run uses just the title field. If you submit only one automatic run, then it must be one of these two standard runs (whichever you choose). If you submit at least two automatic runs, then you must submit both standard runs.
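The rule for standard automatic runs can be expressed as a small check. The following is a minimal sketch, not a NIST tool: the function name and the representation of a run as a set of field names ("title", "description", and so on) are assumptions made for illustration.

```python
def standard_runs_ok(automatic_runs):
    """Check the standard-run rule for a group's automatic runs.

    automatic_runs: list of sets, one per automatic run, naming the topic
    fields that run used (hypothetical representation), e.g.
    [{"title"}, {"description"}, {"title", "description"}].
    """
    has_title_only = any(fields == {"title"} for fields in automatic_runs)
    has_desc_only = any(fields == {"description"} for fields in automatic_runs)
    if len(automatic_runs) == 0:
        return True  # no automatic runs, so no standard runs are required
    if len(automatic_runs) == 1:
        return has_title_only or has_desc_only  # either standard run will do
    return has_title_only and has_desc_only     # two or more: both required
```

For example, a group submitting a title-only run, a description-only run, and a run using both fields satisfies the rule, while a group whose only automatic run uses both fields does not.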

The manual query construction category encompasses a wide variety of different approaches. There are intentionally few restrictions on what is permitted to accommodate as many experiments as possible. In general, the ranking submitted for a topic is expected to reflect a ranking that your system could actually produce, i.e., the result of a single query in your system (granting that that query might be quite complex and the end result of many iterations of query refinement) or the automatic fusion of different queries' results. However, it is permissible to submit a ranking produced in some other way, provided the ranking supports some specific hypothesis that is being tested and the conference paper gives explicit details regarding how the ranking was constructed.

Using some old topics in the test set means that full relevance data is available to participants and that systems have been developed using these topics. Since we cannot control how the topics were used in the past, the assumption will be that the old topics were fully exploited in any way desired in the construction of the retrieval system. The only restriction for the old topics is that the actual relevance judgments may not be used during the processing of the submitted runs. This precludes such things as true (rather than pseudo) relevance feedback, computing weights based on the known relevant set, etc. The usual ad hoc restrictions (see the General TREC Guidelines) that preclude modifying system data structures in response to test topics DO apply to the 50 new topics.

Format of a Submission

A robust track submission consists of a single file with two parts. The format of the first part is the same as is given in the Welcome message, repeated here for convenience. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.

       630 Q0 ZF08-175-870  1 4238 prise1
       630 Q0 ZF08-306-044  2 4223 prise1
       630 Q0 ZF09-477-757  3 4207 prise1
       630 Q0 ZF08-312-422  4 4194 prise1
       630 Q0 ZF08-013-262  5 4189 prise1
          etc.

where:

the first column is the topic number.
the second column is the query number within that topic. This is currently unused and should always be Q0.
the third column is the official document number of the retrieved document and is the number found in the "docno" field of the document.
the fourth column is the rank at which the document is retrieved.
the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the SCORES must reflect that ranking.
the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also run tags must contain 12 or fewer letters and numbers, with *NO* punctuation, to facilitate labeling graphs with the tags.

Each topic must have at least one document retrieved for it. Provided you have at least one document, you may return fewer than 1000 documents for a topic, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 documents per topic.
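The format rules for the ranked-list part lend themselves to a quick local check before submitting. The sketch below is an illustration only and is not the NIST checking routine; the function name, its parameters, and the error messages are assumptions. It checks the column count, the Q0 constant, the run tag, duplicate documents, non-increasing scores within a topic, and the per-topic limits. It cannot check whether document numbers are valid, since that requires the collection itself, and it does not inspect the rank column.

```python
import re
from collections import defaultdict

RUN_TAG = re.compile(r"[A-Za-z0-9]{1,12}")  # 12 or fewer letters and digits

def check_ranked_lines(lines, topics=None, max_docs=1000):
    """Return a list of error messages for the ranked-list part of a run.

    lines:  iterable of result lines (strings)
    topics: optional iterable of topic numbers (as strings) that must each
            have at least one retrieved document
    """
    errors = []
    docs_seen = defaultdict(set)  # topic -> document numbers already listed
    last_score = {}               # topic -> score on that topic's previous line
    tags = set()
    for n, line in enumerate(lines, start=1):
        cols = line.split()       # any run of white space separates columns
        if len(cols) != 6:
            errors.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        topic, query, docno, rank, score, tag = cols
        if query != "Q0":
            errors.append(f"line {n}: second column must be Q0")
        if not RUN_TAG.fullmatch(tag):
            errors.append(f"line {n}: invalid run tag {tag!r}")
        tags.add(tag)
        if docno in docs_seen[topic]:
            errors.append(f"line {n}: duplicate document {docno} for topic {topic}")
        docs_seen[topic].add(docno)
        try:
            value = float(score)
        except ValueError:
            errors.append(f"line {n}: score {score!r} is not a number")
            continue
        if topic in last_score and value > last_score[topic]:
            errors.append(f"line {n}: score increases within topic {topic}")
        last_score[topic] = value
    for topic, docs in docs_seen.items():
        if len(docs) > max_docs:
            errors.append(f"topic {topic}: more than {max_docs} documents")
    for topic in topics or []:
        if topic not in docs_seen:
            errors.append(f"topic {topic}: no documents retrieved")
    if len(tags) > 1:
        errors.append("multiple run tags in one run")
    return errors
```

Calling check_ranked_lines on the sample lines shown earlier returns an empty list; a line with a missing column or a score that rises within a topic produces one message each.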

Immediately following the end of the ranked lists is the second part of the submission file. This part contains exactly 250 lines and assigns exactly one number in the range 1--250 to each of the topics. No two topics may be assigned the same number. The number assigned is the system's prediction of the relative difficulty of the topic. A topic assigned a number closer to 1 is predicted to be easier than (and thus the system will get a better score for) a topic assigned a number closer to 250. The reason for including this information in the submission is that we would like to test systems' abilities to recognize the difficulty of a topic. The QA track had a similar task in TREC 2002, and some systems were much better at predicting than others. The format of a line for this part is

  		P topic-number difficulty-number

where P is the constant "P" to flag the line as being a prediction; topic-number is one of the topic numbers in the test set; and difficulty-number is an integer in the range 1-250. When sorted by difficulty-number, the P lines must provide a strict ordering of all 250 topics in the test set from easiest (1) to most difficult (250).
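The constraints on the prediction lines amount to requiring that the difficulty numbers form a permutation of 1..250 over the test set. A minimal sketch of such a check follows; it is not the NIST checking routine, and the names check_predictions and ROBUST_TOPICS are assumptions made for illustration.

```python
# The 250 topics of the test set: 301-450 and 601-700.
ROBUST_TOPICS = list(range(301, 451)) + list(range(601, 701))

def check_predictions(lines, topics=ROBUST_TOPICS):
    """Return error messages for the 'P topic-number difficulty-number' lines."""
    errors = []
    valid = set(topics)
    assigned = {}  # topic number -> difficulty number
    for n, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 3 or cols[0] != "P":
            errors.append(f"line {n}: expected 'P topic-number difficulty-number'")
            continue
        try:
            topic, number = int(cols[1]), int(cols[2])
        except ValueError:
            errors.append(f"line {n}: topic and difficulty must be integers")
            continue
        if topic not in valid:
            errors.append(f"line {n}: topic {topic} is not in the test set")
        elif topic in assigned:
            errors.append(f"line {n}: topic {topic} appears more than once")
        else:
            assigned[topic] = number
    missing = valid - set(assigned)
    if missing:
        errors.append(f"{len(missing)} topics have no prediction line")
    # A strict ordering means the numbers are exactly 1..N, each used once.
    if sorted(assigned.values()) != list(range(1, len(valid) + 1)):
        errors.append("difficulty numbers are not a permutation of 1..N")
    return errors
```

A file that lists every topic once with distinct numbers 1 through 250 yields an empty list; reusing a difficulty number or omitting a topic is reported.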

NIST has a routine that checks for common errors in the result files including duplicate document numbers for the same topic, invalid document numbers, wrong format, and multiple tags within runs. This routine will be made available to participants to check their runs for errors prior to submitting them. Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.

Judging

Groups may submit up to ten runs to the Robust track. The 50 new topics for at least one run will be judged by NIST assessors; NIST may judge more than one run per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily. No additional judging of the old topics will be performed (you lose the benefits of having a standard test collection as soon as there exist multiple sets of relevance judgments). The judgments for the 50 new topics will be on a three-way scale of "not relevant", "relevant", and "highly relevant" so that we can continue to build ad hoc test collections with multiple relevance levels. To be compatible with the old topics that have only binary judgments, the scoring will consider both "relevant" and "highly relevant" documents to be relevant.

Scoring

NIST will score all submitted runs using the relevance judgments produced by the assessors. A large variety of scores will be reported, including at least the following:

trec_eval output for all 250 topics together;
trec_eval output for the set of 50 distinguished topics only;
trec_eval output for the set of 50 new topics only;
the count of the number of topics with no relevant retrieved in the top 10 ranks computed over the topic sets above;
the area under the curve when MAP (mean average precision) of the worst X topics is plotted against X. X will range from 1 to .25*number-of-topics (i.e., 12 for sets with 50 topics, 62 for the 250 topic set). Worst topics---topics with the smallest average precision scores---will be defined with respect to the individual run being scored;
Kendall tau correlations between the topic ranking produced when sorting by difficulty-numbers and the topic ranking produced when sorting by an evaluation measure.
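The last three measures in this list can be illustrated in a few lines. The sketch below is not the NIST scoring code, and all function names are assumptions. Two details are not specified above and are assumed here: the area is computed with the trapezoidal rule at unit spacing in X, and the Kendall tau is the simple form in which tied pairs count as neither concordant nor discordant.

```python
def no_relevant_in_top10(ranked, relevant):
    """Count topics with no relevant document in the top 10 ranks.

    ranked:   topic -> list of document numbers in rank order
    relevant: topic -> set of relevant document numbers
    """
    return sum(
        1 for topic, docs in ranked.items()
        if not any(d in relevant.get(topic, set()) for d in docs[:10])
    )

def area_worst_topics(ap_by_topic):
    """Area under the MAP(X) vs. X curve for the worst X topics.

    ap_by_topic: topic -> average precision for one run.
    X ranges from 1 to .25 * number-of-topics, rounded down.
    """
    scores = sorted(ap_by_topic.values())  # worst topics first
    limit = len(scores) // 4               # 12 for 50 topics, 62 for 250
    # map_x[i] is the mean average precision of the worst i+1 topics
    map_x = [sum(scores[:x]) / x for x in range(1, limit + 1)]
    # Trapezoidal rule with unit spacing (an assumption, see lead-in)
    return sum((a + b) / 2 for a, b in zip(map_x, map_x[1:]))

def kendall_tau(difficulty_number, measured_score):
    """Kendall tau between predicted difficulty and a measured score.

    difficulty_number: topic -> number (1 = predicted easiest)
    measured_score:    topic -> evaluation score (higher = better)
    """
    topics = sorted(difficulty_number)
    concordant = discordant = 0
    for i, a in enumerate(topics):
        for b in topics[i + 1:]:
            # A smaller difficulty number should go with a larger score.
            predicted = difficulty_number[b] - difficulty_number[a]
            measured = measured_score[a] - measured_score[b]
            if predicted * measured > 0:
                concordant += 1
            elif predicted * measured < 0:
                discordant += 1
    pairs = len(topics) * (len(topics) - 1) // 2
    return (concordant - discordant) / pairs
```

A run whose predictions order the topics exactly as the measured scores do gets a tau of 1, and a fully reversed ordering gets -1.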

In the interest of continuing the investigation of good evaluation measures, we will not select a single measure to be the focus measure for the track. We will collect whether or not the run was created with a specific measure in mind (and which measure, if so) with the form that is filled in when runs are submitted.

Timetable

Documents available: 			now
Topics available:			after June 3, 2004
Results due at NIST:			Aug 5, 2004 (11:59pm EDT)
Qrels for new topics available:		not later than Oct 1, 2004
Conference notebook papers due:		late October, 2004
TREC 2004 conference:			November 16-19, 2004

Date created: Friday, 21-May-04