TREC 2004 Terabyte Track Guidelines


Overview

The goal of the Terabyte Track is to develop an evaluation methodology for terabyte-scale document collections. This year's track uses a 426GB collection of Web data from the .gov domain. While this collection is less than a full terabyte in size, it is considerably larger than the collections used in previous TREC tracks. In future years, we plan to expand the collection using data from other sources.

The task will be classic adhoc retrieval, similar to the current Robust Retrieval Task and to the adhoc and VLC tasks from earlier TREC conferences. An adhoc task in TREC investigates the performance of systems that search a static set of documents using previously-unseen topics. For each topic, participants create a query and submit a ranking of the top documents for that topic (10,000 for this track). NIST will create and assess 50 topics for the track.

In addition to the top 10,000 documents, we will be collecting information about each system and each run, including hardware characteristics and performance measurements. Be sure to record the required information when you generate your experimental runs, since it will be requested on the submission form.

Collection

This year's track will use a collection of Web data crawled from Web sites in the .gov domain during early 2004. This collection ("GOV2") contains a large proportion of the crawlable pages in .gov, including HTML and text pages, plus the extracted text of PDF, Word and PostScript files. The collection is 426GB in size and contains 25 million documents.

The GOV2 collection may be ordered from CSIRO in Australia. The collection is distributed on a hard drive, formatted for Linux or Windows, for a cost of A$1200 (about US$800). The cost includes the hard drive, which is yours to keep.

Topics

The 50 topics specific to the Terabyte Track will be posted to the Terabyte section of the track's page on August 3, 2004.

Queries

Queries may be created automatically or manually from the topic statements. Automatic methods are those in which there is no human intervention at any stage, and manual methods are everything else. For most runs, you may use any or all of the topic fields when creating queries from the topic statements. However, each group submitting an automatic run must submit an automatic run that uses just the title field of the topic statement.

Web-specific techniques, including link analysis, anchor text and document structure, may be used but must be reported on the submission form.
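
As an illustration of automatic title-only query construction, the sketch below extracts the title field from each topic statement. It assumes the usual TREC topic layout with <top>, <num>, <title>, <desc> and <narr> fields; the file name and field patterns are assumptions, not part of the official topic release.

    # Sketch: build title-only queries from a TREC-style topic file.
    # Assumes the conventional <top>/<num>/<title>/<desc>/<narr> layout;
    # adjust the patterns if the released topics differ.
    import re

    def title_queries(topic_file):
        # Return {topic number: title text} for each <top> block in the file.
        with open(topic_file, encoding="utf-8") as f:
            text = f.read()
        queries = {}
        for block in re.findall(r"<top>(.*?)</top>", text, re.DOTALL):
            num = re.search(r"<num>\s*(?:Number:)?\s*(\d+)", block)
            title = re.search(r"<title>\s*(.*?)\s*(?=<desc>|$)", block, re.DOTALL)
            if num and title:
                queries[int(num.group(1))] = " ".join(title.group(1).split())
        return queries

    # Example (file name is a placeholder): queries = title_queries("topics.txt")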

Submissions

An experimental run will consist of the top 10,000 documents for each topic, along with associated performance and system information. We are requiring 10,000 documents, since we believe this information may help us to better understand the evaluation process.

The submission form requires each group to report details of the hardware configuration and various performance numbers, including the number of processors, total RAM (GB), on-disk index size (GB), indexing time (elapsed time in minutes), average search time (seconds), and hardware cost. For the number of processors, report the total number of CPUs in the system. For example, if your system is a cluster of eight dual-processor machines, you would report 16. For the hardware cost, provide an estimate in US dollars of the cost at the time of purchase.

Some groups may subset the collection before indexing, removing selected pages to reduce its size. The submission form asks for the fraction of pages indexed. If you did not subset the collection before indexing, report 100%.

For search time, report the time to return the top 20 documents, not the time to return the top 10,000. It is acceptable to execute your system twice for each query, once to generate the top 10,000 documents and once to measure the execution time for the top 20, provided that the top 20 results are the same in both cases.
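
The sketch below shows one way to take this measurement, assuming a hypothetical search(query, k) function standing in for your own engine's API; it is offered only as an illustration of the two-execution approach described above.

    # Sketch: time the top-20 retrieval separately from the top-10,000 run.
    import time

    def timed_top20(search, query):
        # search(query, k) is a hypothetical stand-in for your engine's API,
        # returning a ranked list of (docno, score) pairs.
        full = search(query, k=10000)          # ranking submitted to NIST
        start = time.perf_counter()
        top20 = search(query, k=20)            # separately timed top-20 run
        elapsed = time.perf_counter() - start  # seconds, as reported on the form
        # The guidelines require the top 20 to be the same in both executions.
        assert [d for d, _ in top20] == [d for d, _ in full[:20]]
        return full, elapsed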

Format of a Submission

A Terabyte Track submission consists of a single ASCII text file in the format used for all TREC adhoc submissions, which we repeat here for convenience. Columns are separated by white space; the width of the columns is not important, but each line must contain exactly six columns with at least one space between them.

       630 Q0 ZF08-175-870  1 4238 prise1
       630 Q0 ZF08-306-044  2 4223 prise1
       630 Q0 ZF09-477-757  3 4207 prise1
       630 Q0 ZF08-312-422  4 4194 prise1
       630 Q0 ZF08-013-262  5 4189 prise1
          etc.

where:

  • the first column is the topic number.
  • the second column is the query number within that topic. This is currently unused and should always be Q0.
  • the third column is the official document number of the retrieved document and is the number found in the "docno" field of the document.
  • the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the SCORES must reflect that ranking.
  • the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since we often compare across years (for graphs and such) and having the same name show up for both years is confusing. Also, run tags must contain 12 or fewer letters and numbers, with *NO* punctuation, to facilitate labeling graphs with the tags.

Each topic must have at least one document retrieved for it. Provided you have at least one document, you may return fewer than 10,000 documents for a topic, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 10,000 documents per topic.
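
To make the format concrete, the sketch below writes a run file with the required six columns from an in-memory ranking. The results data structure, the run tag and the file name are placeholders, not prescribed by the track.

    def write_run(results, run_tag, path):
        # results maps topic number -> list of (docno, score); run_tag is your
        # unique identifier (12 or fewer letters and digits, no punctuation).
        assert len(run_tag) <= 12 and run_tag.isalnum()
        with open(path, "w") as out:
            for topic in sorted(results):
                ranked = sorted(results[topic], key=lambda ds: ds[1], reverse=True)
                for rank, (docno, score) in enumerate(ranked[:10000], start=1):
                    out.write(f"{topic} Q0 {docno} {rank} {score} {run_tag}\n")

    # Example (illustrative values only):
    # write_run({630: [("ZF08-175-870", 4238)]}, "grpXrun1", "myrun.txt")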

Judging

Groups may submit up to five runs to the Terabyte track. At least one run will be judged by NIST assessors; NIST may judge more than one run per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will choose the run to assess arbitrarily. The judgments will be on a three-way scale of "not relevant", "relevant", and "highly relevant".

Scoring

NIST will score all submitted runs using the relevance judgments produced by the assessors. No single measure will form the focus for the track; instead, a variety of scores will be reported, including trec_eval output. Other measures have been proposed on the track mailing list, and as many of these as possible will be reported.
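
For orientation only, here is a sketch of one simple measure, precision at 20, computed from a run file in the format above and a qrels file in the standard TREC format (topic, iteration, docno, judgment). Official scores will come from NIST's evaluation, not from code like this.

    # Sketch: precision at 20 from a run file and a TREC-format qrels file.
    from collections import defaultdict

    def precision_at_20(run_path, qrels_path):
        # Judgments greater than zero ("relevant", "highly relevant") count as relevant.
        relevant = defaultdict(set)
        with open(qrels_path) as f:
            for line in f:
                topic, _, docno, judgment = line.split()
                if int(judgment) > 0:
                    relevant[topic].add(docno)

        retrieved = defaultdict(list)
        with open(run_path) as f:
            for line in f:
                topic, _, docno, rank, score, tag = line.split()
                retrieved[topic].append((float(score), docno))

        # Rank by descending score, as the evaluation routines do, and score the top 20.
        return {topic: sum(d in relevant[topic]
                           for _, d in sorted(docs, reverse=True)[:20]) / 20
                for topic, docs in retrieved.items()}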

Timetable

       Documents available:             now
       Topics available:                August 3, 2004
       Results due at NIST:             Sept 8, 2004 (11:59pm EDT)
       Conference notebook papers due:  late October, 2004
       TREC 2004 conference:            November 16-19, 2004

Last updated: Tuesday, 15-June-04
Date created: Tuesday, 15-June-04
claclarke@plg.uwaterloo.ca