Text REtrieval Conference (TREC) QA Data

Question Answering Collections

Question answering systems return an actual answer, rather than a ranked list of documents, in response to a question. TREC has had a question answering track since 1999; in each track the task was defined such that the systems were to retrieve small snippets of text that contained an answer for open-domain, closed-class questions (i.e., fact-based, short-answer questions that can be drawn from any domain). This page summarizes the tasks that were used in the QA tracks and describes the available data sets. For more details about a particular task, see the question answering track overview papers in the TREC proceedings.

Creating the true equivalent of a standard retrieval test collection is an open problem. In a retrieval test collection, the unit that is judged, the document, has a unique identifier, and it is trivial to decide whether a document retrieved in a new retrieval run is the same document that has been judged. For question answering, the unit that is judged is the entire string that is returned by the system. Different QA runs very seldom return exactly the same answer strings, and it is very difficult to determine automatically whether the difference between a new string and a judged string is significant with respect to the correctness of the answer. A partial solution to this problem is to use so-called answer patterns and accept a string as correct if it matches an answer pattern for the question. A description of the use of answer patterns appeared in the SIGIR-2000 proceedings:

Voorhees, E. & Tice, D. "Building a Question Answering Test Collection", Proceedings of SIGIR-2000, July 2000, pp. 200-207.

Answer patterns are provided below for the TREC QA collections.
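
As a rough illustration of pattern-based judging, the sketch below accepts an answer string if it matches any pattern for its question. It assumes the patterns can be treated as regular expressions and that matching is case-insensitive; the function name, the question number, and the patterns themselves are hypothetical and are not taken from the TREC pattern files.

```python
import re


def matches_any_pattern(answer_string, patterns):
    """Return True if the answer string matches at least one answer pattern."""
    return any(
        re.search(pattern, answer_string, flags=re.IGNORECASE)
        for pattern in patterns
    )


# Hypothetical patterns for an illustrative question; not from the TREC files.
patterns_by_qid = {"1": [r"Mount\s+Everest", r"Mt\.?\s+Everest"]}

accepted = matches_any_pattern("the summit of Mt. Everest", patterns_by_qid["1"])
rejected = matches_any_pattern("the summit of K2", patterns_by_qid["1"])
```

A pattern match says nothing about whether the returned document supports the answer, which is why this kind of check can only approximate the human judgments.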

A submission for the (main) QA task in each TREC was a ranked list of up to 5 responses per question. The format of a response was

qid Q0 docno rank score tag answer-string

where qid is the question number

Q0 is the literal Q0

docno is the id of a document that supports the answer

rank (1-5) is the rank of this response for this question

score is a system-dependent indication of the quality of the response

tag is the identifier for the system

and answer-string is the text snippet returned as the answer. Only the answer string may contain embedded white space, and it may not contain a newline.
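
The response format above could be read with a sketch like the following. It assumes fields are separated by runs of white space and splits at most six times so that embedded white space in the answer string is preserved. The `Response` type, the `parse_response` name, and the sample line (including its docno and tag) are hypothetical. The score is kept as text because the page describes it only as system-dependent.

```python
from typing import NamedTuple


class Response(NamedTuple):
    qid: str
    docno: str
    rank: int
    score: str  # system-dependent, so left uninterpreted
    tag: str
    answer: str


def parse_response(line):
    """Parse one 'qid Q0 docno rank score tag answer-string' line."""
    fields = line.rstrip("\r\n").split(None, 6)
    if len(fields) < 6:
        raise ValueError("expected at least 6 fields: %r" % line)
    qid, q0, docno, rank, score, tag = fields[:6]
    if q0 != "Q0":
        raise ValueError("second field must be the literal Q0: %r" % line)
    # The answer string may be absent (empty) and may contain spaces.
    answer = fields[6].strip() if len(fields) == 7 else ""
    return Response(qid, docno, int(rank), score, tag, answer)


# Hypothetical sample line, for illustration only.
sample = parse_response("1 Q0 DOC-0001 2 0.75 mysys Mount Everest in Nepal\n")
```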

The judgment set for a task contains all unique [docno,answer-string] pairs from all submissions to the track. For TREC 2001 only, the docno may also be the string "NIL", in which case the answer-string is empty and the response indicates the system's belief that there is no correct response in the document collection.

The format of a judgment set is

qid docno judgment answer-string

where the fields are as in the submissions and judgment is -1 for wrong, 1 for correct, and 2 for unsupported. "Unsupported" means that the string contains a correct response, but the document returned with that string does not allow one to recognize that it is a correct response. TREC-8 runs were judged only correct (1) or incorrect (-1). A very detailed description of how answer strings were judged is given in the paper "The TREC-8 Question Answering Track Evaluation" in the TREC-8 proceedings. The "NIL" responses are not included in the judgment set for TREC 2001.
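
A judgment set in this format could be loaded into a lookup table as sketched below. The sketch assumes white-space-separated fields and one judgment per line; the function names and the sample lines (including their docnos) are hypothetical.

```python
JUDGMENT_LABELS = {-1: "wrong", 1: "correct", 2: "unsupported"}


def parse_judgment(line):
    """Parse one 'qid docno judgment answer-string' line."""
    fields = line.rstrip("\r\n").split(None, 3)
    if len(fields) < 3:
        raise ValueError("expected at least 3 fields: %r" % line)
    qid, docno, code = fields[:3]
    judgment = int(code)
    if judgment not in JUDGMENT_LABELS:
        raise ValueError("unknown judgment code: %r" % code)
    answer = fields[3].strip() if len(fields) == 4 else ""
    return qid, docno, judgment, answer


def load_judgments(lines):
    """Map each (qid, docno, answer-string) triple to its judgment code."""
    judged = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, docno, judgment, answer = parse_judgment(line)
        judged[(qid, docno, answer)] = judgment
    return judged


# Hypothetical sample judgments, for illustration only.
judged = load_judgments([
    "1 DOC-0001 1 Mount Everest\n",
    "1 DOC-0002 2 Mount Everest in Nepal\n",
    "2 DOC-0003 -1 in 1999\n",
])
```

A new run's response is covered by the table only when its exact [docno,answer-string] pair was judged, which is the limitation that motivates answer patterns.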

Some potential participants in the TREC QA tracks do not have ready access to a document retrieval system. To facilitate participation by these groups, TREC provided a ranking of the top X documents (X=200 for TREC-8, 1000 otherwise) as ranked by either the AT&T version of SMART (TREC-8 and TREC-9) or PRISE (TREC 2001) when using the question as a query. TREC also provided the full text of the top Y documents (Y=200 for TREC-8, 50 otherwise) from this same ranking. These rankings were provided as a service only; documents containing a correct answer were not always included in the ranking. Links to both the rankings and the document texts are provided below. The document texts are password-protected to comply with the licensing agreements with the document providers. To gain access to the texts, send details of when you obtained the TREC and TIPSTER disks to the TREC manager [email protected] and ask for the access sequence to the document texts.

The QA task runs were evaluated using mean reciprocal rank (MRR). The score for an individual question was the reciprocal of the rank at which the first correct answer was returned, or 0 if no correct response was returned. The score for the run was then the mean over the set of questions in the test. The number of questions for which no correct response was returned was also reported. Starting in TREC-9, two versions of the scores were reported: "strict" evaluation, where unsupported responses were counted as wrong, and "lenient" evaluation, where unsupported responses were counted as correct. A Perl script that uses the answer patterns described above to judge responses and then calculates MRR and the number of questions with no correct response returned is included with the collection data (Perl script not yet available for 2001). (There is no distinction between strict and lenient evaluation with pattern-judged runs since the patterns cannot detect unsupported answers.)
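
The scoring described above can be sketched as follows. This is not the Perl script distributed with the collections; it is a minimal illustration that assumes each question's responses have already been judged with the codes -1 (wrong), 1 (correct), and 2 (unsupported). The function names and the sample run are hypothetical.

```python
def reciprocal_rank(judged_responses, strict=True):
    """Score one question from a list of (rank, judgment) pairs."""
    # Strict: only "correct" (1) counts. Lenient: "unsupported" (2) counts too.
    accepted = {1} if strict else {1, 2}
    correct_ranks = [rank for rank, judgment in judged_responses
                     if judgment in accepted]
    return 1.0 / min(correct_ranks) if correct_ranks else 0.0


def mean_reciprocal_rank(run, strict=True):
    """Return (MRR, number of questions with no correct response)."""
    if not run:
        raise ValueError("run contains no questions")
    scores = [reciprocal_rank(responses, strict) for responses in run.values()]
    unanswered = sum(1 for score in scores if score == 0.0)
    return sum(scores) / len(scores), unanswered


# Hypothetical judged run: question id -> [(rank, judgment), ...]
run = {
    "1": [(1, -1), (2, 1)],   # first correct answer at rank 2
    "2": [(1, 2), (2, -1)],   # unsupported answer at rank 1
    "3": [(1, -1)],           # no correct answer
}
strict_mrr, strict_missed = mean_reciprocal_rank(run, strict=True)
lenient_mrr, lenient_missed = mean_reciprocal_rank(run, strict=False)
```

In this sample, question 2 scores 0 under strict evaluation and 1 under lenient evaluation, so the two MRR values differ.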

TREC-8 (1999) QA Data

TREC-9 (2000) QA Data

TREC 2001 QA Data

TREC 2002 QA Data

TREC 2003 QA Data

TREC 2004 QA Data

Additional QA Resources

Date created: Tuesday, 23-April-02