TREC 2007 Question Answering Track Guidelines

I. Summary:

The goal of the TREC QA track is to foster research on systems that retrieve answers rather than documents in response to a question. The focus is on systems that can function in unrestricted domains. The TREC 2007 QA track will consist of two tasks:

Main task
Complex interactive QA (ciQA) task

The two tasks are independent, and participants in the QA track may do one or both tasks.

The primary goal of the complex interactive QA (ciQA) task is to promote the development of interactive systems capable of addressing complex information needs. For information about the complex interactive QA task, please visit the ciQA homepage.

The main task is the same as in TREC 2006, in that the test set will consist of question series where each series asks for information regarding a particular target. As in TREC 2006, the targets will include people, organizations, events and other entities. Each question series will consist of some factoid and some list questions and will end with exactly one "Other" question. The answer to the "Other" question is to be interesting information about the target that is not covered by the preceding questions in the series. The runs will be evaluated using the same methodology as in TREC 2006, except that the official scores for the "Other" questions will be computed using multiple assessors' judgments of the importance of information nuggets.

The main 2007 task differs from the 2006 task in that questions will be asked over both blog documents and newswire articles, rather than just newswire. A blog document is defined to be a blog post and its follow-up comments (a permalink). The blog collection contains well-formed English as well as badly-formed English and spam, and mining blogs for answers will introduce significant new challenges in at least two aspects that are very important for functional QA systems: 1) being able to handle language that is not well-formed, and 2) dealing with discourse structures that are more informal and less reliable than newswire.

The question set for the main task will be available on the TREC web site on July 13 (by 12 noon EDT). YOU ARE REQUIRED TO FREEZE YOUR SYSTEM BEFORE DOWNLOADING THE QUESTIONS. No changes of any sort can be made to any component of your system or any resource used by your system between the time you fetch the questions and the time you submit your results to NIST. All runs for the main task must be completely automatic; no manual intervention of any sort is allowed. All targets must be processed from the same initial state (i.e., your system may not adapt to targets that have already been processed). Results for the main task are due at NIST on or before July 29, 2007. You may submit up to three runs for the main task; all submitted runs will be judged.

II. Document set

Answers for all questions in the test set will be drawn from the Blog06 and AQUAINT-2 document collections. (If no answer is found in the document collections, as will be the case for some factoid questions, then the "answer" will be "NIL".)

The Blog06 corpus is distributed by the University of Glasgow and is the same collection as was used in the TREC 2006 Blog Track. Each document in the permalinks collection is the raw HTML content from the Web wrapped between a <DOC>...</DOC> pair. Just after <DOC>, there are some informational metadata tags, including the <DOCNO> section which contains the document id. More details can be found in the Blog06 README.

The AQUAINT-2 collection consists of newswire articles that are roughly contemporaneous with the Blog06 collection. The AQUAINT-2 collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period of October 2004 - March 2006. Articles are in English and come from a variety of sources including Agence France Presse, Central News Agency (Taiwan), Xinhua News Agency, Los Angeles Times-Washington Post News Service, New York Times, and The Associated Press. The AQUAINT-2 DTD describes the document format. Document ids identify both the source newswire service and the date when the article was posted to the newswire service.

Participants in the main task must purchase the Blog06 collection from the University of Glasgow. (See http://ir.dcs.gla.ac.uk/test_collections/.) The AQUAINT-2 collection is being distributed by NIST to TREC 2007 participants at no cost. Participants who would like to receive the AQUAINT-2 collection must submit two forms to NIST before receiving the collection:

Agreement Concerning Dissemination of TREC Results
AQUAINT-2 User Agreement Forms

These forms and instructions for submitting them to NIST are available under the Forms section of the TREC web site.

III. Details of main task

Except for using different document collections, this year's main QA task is essentially the same as the main task in TREC 2006.

Test set of questions

The test set consists of a series of questions for each of a set of targets. Each series is an abstraction of a user session with a QA system; therefore, some questions may depend on knowing answers to previous questions in the same series (for example, Series 197 from the TREC 2006 QA test set).

There will be approximately 70 question series in the test set. Each series will contain several factoid questions, 1-2 list questions, and a question called "other". The "other" question is to be interpreted as "Give additional information about the target that would be interesting to the user but that the user has not explicitly asked for." As in previous years of the QA track, we will assume that the user is an "average" adult reader of American newspapers.

Time-dependent questions are those for which the correct answer can vary depending on the timeframe that is assumed. When a question is phrased in the present tense, the implicit timeframe will be the date of the last document in the document collection; thus, systems will be required to return the most up-to-date answer supported by the document collection. When a question is phrased in the past tense, either the question will explicitly specify the timeframe (e.g., "What cruise line attempted to take over NCL in December 1999?") or the timeframe will be implicit in the question series. For example, if the target is the event "France wins World Cup in soccer" and the question is "Who was the coach of the French team?" then the correct answer must be "Aime Jacquet" (the name of the coach of the French team in 1998 when France won the World Cup), and not just the name of any past or current coach of the French team.

The question set will be in the XML format given in the TREC 2006 QA test set. The format will explicitly tag the target as the target, as well as the type of each question in the series (type is one of FACTOID, LIST, and OTHER). Each question will have an id of the form X.Y where X is the target number and Y is the number of the question in the series (so question 3.4 is the fourth question of the third target). Factoid questions are questions that seek short, fact-based answers such as have been used in previous TREC QA tracks. Some factoid questions may not have an answer in the document collection. List questions are requests for a set of instances of a specified type. Factoid and list questions require exact answers to be returned. Responses to the "other" question need not be exact, though excessive length will be penalized.

The document collection might not contain a correct answer for some of the factoid questions. In this case, the correct response is the string "NIL". A question will be assumed to have no correct answer in the collection if our assessors do not find an answer during the answer verification phase AND no participant returns a correct response. (If a system returns a right answer that is unsupported, NIL will still be the correct response for that question.)

The questions used in previous tracks are in the Data/QA section of the TREC web site. This section of the web site contains a variety of other data from previous QA tracks that you may use to develop your QA system. This data includes judgment files, answer patterns, top ranked document lists, and sentence files. See the web page for a description of each of these resources.

Submission format

A submission consists of exactly one response for each question. The definition of a response varies for the different question types.

Factoid response

For factoid questions, a response is a single [answer-string, docid] pair or the string "NIL". The "NIL" string will be judged correct if there is no answer known to exist in the document collection; otherwise it will be judged as incorrect. If an [answer-string, docid] pair is given as a response, the answer-string must contain nothing other than the answer, and the docid must be the id of a document in the collection that supports answer-string as an answer. The id of a Blog06 document is the DOCNO element from a permalinks file. The id of an AQUAINT-2 document is the "id" attribute of a DOC element. The answer-string does not have to appear literally in a document in order for the document to support it as being the correct answer. Some additional axiomatic knowledge and temporal inferencing can be used, including:

Temporal ordering of months, days of the week, etc.
Calculation of specific dates from relative temporal expressions, possibly using a universal calendar. For example, "August 12, 2000" is a supported correct answer to "On what date did the Kursk sink?" if it is supported by a document dated August 13, 2000, containing the phrase "the Kursk sank Saturday..."
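
As an illustration of the second kind of inference, the following sketch resolves a bare weekday name against a document date. It is only a sketch: the function name resolve_weekday and the rule of choosing the most recent matching day on or before the document date are assumptions made for this example, not part of the assessment procedure.

```python
from datetime import date, timedelta

# Python's date.weekday() numbers Monday as 0 and Sunday as 6.
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(doc_date, weekday_name):
    """Return the most recent date on or before doc_date that falls on
    the named weekday (an assumed reading of phrases like "sank Saturday")."""
    target = WEEKDAYS.index(weekday_name.lower())
    days_back = (doc_date.weekday() - target) % 7
    return doc_date - timedelta(days=days_back)


# Kursk example: document dated August 13, 2000, phrase "sank Saturday".
kursk_date = resolve_weekday(date(2000, 8, 13), "Saturday")
```

Under these assumptions, a document dated August 13, 2000 (a Sunday) that mentions "Saturday" resolves to August 12, 2000.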

It may be the case that different documents will support contradictory answers as being correct. A response is said to be locally correct when the supporting document supports the answer-string as being correct, and the answer-string contains exactly the answer. Locally correct answers are assumed to be globally correct unless there is a better, contradictory answer supported in the document collection. The assessor may use a number of criteria in determining that one locally correct answer is better than another, including recency of the supporting document, the amount of support provided by each supporting document, the number of distinct sources that support the answer as being correct, and the credibility or authoritativeness of the source. The assessor will mark as globally correct one or more of the most credible of the known locally correct answers. "Global" correctness is defined with respect to the document collection, and not necessarily with respect to the real world.

An answer string must contain a complete, exact answer and nothing else. Support, correctness, and exactness will be in the opinion of the assessor. Responses will be judged by human assessors who will assign one of five possible judgments to a response:

incorrect: the answer-string does not contain a correct answer or the answer is not responsive;
unsupported: the answer-string contains a correct answer but the document returned does not support that answer;
non-exact: the answer-string contains a correct answer and the document supports that answer, but the string contains more than just the answer (or is missing bits of the answer);
locally correct: the answer-string consists of exactly a correct answer and that answer is supported by the document returned, but the document collection contains a contradictory answer that the assessor believes is better;
globally correct: the answer-string consists of exactly a correct answer, that answer is supported by the document returned, and the document collection does not contain a contradictory answer that the assessor believes is better.

Being "responsive" means such things as including units for quantitative responses (e.g., $20 instead of 20) and answering with regard to a famous entity itself rather than its replicas or imitations. (The paper "The TREC-8 Question Answering Track Evaluation" in the TREC-8 proceedings contains a very detailed discussion of responsiveness.)

List response

For list questions, a response is an unordered, non-empty set of [answer-string, docid] pairs, where each pair is called an instance. The interpretation of the pair is the same as for factoid questions, and each pair will be judged in the same way as for a factoid question. Note that this means that if an answer-string actually contains multiple answers for the question it will be marked non-exact and will thus hurt the question's precision score. In addition to judging the individual instances in a response, the assessors will also mark a subset of the instances judged globally correct as being distinct. If multiple globally correct instances contain answer strings that are conceptually identical, the assessor will mark exactly one arbitrary instance as distinct, and the others will not be marked as distinct. Scores will be computed using the number of globally correct, distinct instances in the set.

Other response

A response for the "other" question that ends each target's series is also an unordered, non-empty set of [answer-string, docid] pairs, but the interpretation of the pairs differs somewhat from a list response. A pair in an "other" response is assumed to represent a new (i.e., not part of the question series nor previously reported as an answer to an earlier question in the series), interesting fact about the target. Individual pairs will not be judged for these questions. Instead, the assessor will construct a list of desirable information nuggets about the target, and count the number of distinct, desirable nuggets that occur in the response as a whole. There is no expectation of an exact answer to "other" questions, but responses will be penalized for excessive length.

File format

A submission file for the main task comprises the responses to all the questions in the main task and must contain a response for each question in the main task. Each line in the file must have the form

          qid run-tag docid answer-string
where qid           is the question number (of the form X.Y),
      run-tag       is the run id,
      docid         is the id of the supporting document or
                    the string "NIL" (no quotes) if the
                    question is a factoid question and
                    there is no answer in the collection,
and
      answer-string is a text string with no embedded
                    newlines or is empty if docid is NIL

Any amount of white space may be used to separate columns, as long as there is some white space between columns and every column is present (modulo empty answer-string when docid is NIL). Answer-string cannot contain any line breaks, but should be immediately followed by exactly one line break. Other white space is allowed in answer-string. The total length of all answer-strings for each question cannot exceed 7000 non-white-space characters. The "run tag" should be a unique identifier for your group AND for the method that produced the run.

The first few lines of an example submission might look like this:

    1.1  nistqa07  BLOG06-20051223-047-0003667011  $10.75
    1.2  nistqa07  NIL
    1.3  nistqa07  BLOG06-20051208-156-0018779394  Jim Moran
    1.3  nistqa07  LTW_ENG_20051024.0064           Lisa Yockelson
    1.3  nistqa07  AFP_ENG_20050901.0520           Mary Landrieu
    1.3  nistqa07  APW_ENG_20051121.0020           Bogota
    1.3  nistqa07  CNA_ENG_20060326.0026           John Cornyn
    1.3  nistqa07  NYT_ENG_20041102.0232           Betty Castor
    1.4  nistqa07  AFP_ENG_20051001.0239           Ilamatepec
    1.4  nistqa07  BLOG06-20051209-014-0011016013  Pacaya
    2.1  nistqa07  BLOG06-20051230-020-0031205167  Super Size Me

Each group may submit up to three runs. All submitted runs will be judged.
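
Groups may find it useful to check the line format themselves before submitting. The following is a minimal sketch of such a check, not the NIST checking routine: the function names parse_submission_line and questions_over_limit are hypothetical, and the sketch tests only the column layout, the X.Y form of the question id, and the 7000-character limit per question.

```python
import re
from collections import defaultdict

QID_PATTERN = re.compile(r"^\d+\.\d+$")  # question ids have the form X.Y
MAX_CHARS_PER_QUESTION = 7000            # non-white-space characters


def parse_submission_line(line):
    """Split one line into (qid, run_tag, docid, answer_string)."""
    # The first three columns are white-space delimited; the rest of the
    # line is the answer-string, which may contain white space itself.
    parts = line.strip().split(None, 3)
    if len(parts) < 3:
        raise ValueError("expected qid, run-tag and docid: %r" % line)
    qid, run_tag, docid = parts[:3]
    answer = parts[3] if len(parts) == 4 else ""
    if not QID_PATTERN.match(qid):
        raise ValueError("qid is not of the form X.Y: %r" % qid)
    if docid != "NIL" and not answer:
        raise ValueError("missing answer-string for question %s" % qid)
    return qid, run_tag, docid, answer


def questions_over_limit(records):
    """Return the qids whose answer-strings together exceed the limit."""
    totals = defaultdict(int)
    for qid, _run_tag, _docid, answer in records:
        totals[qid] += len("".join(answer.split()))
    return sorted(q for q, n in totals.items() if n > MAX_CHARS_PER_QUESTION)
```

The sketch does not verify that a docid exists in either collection or that every question has a response; those checks need the question set and the collections.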

Scoring

The different types of questions (factoid, list, other) have different scoring metrics. Each of the three scores has a range of [0.0, 1.0] with 1 being the high score. We will compute the factoid-score, list-score, and other-score for each series. The per-series combined weighted score will be a simple average of these three scores for questions in the series:

    combined weighted score = (factoid-score + list-score + other-score)/3

The final score for a run will be the mean of the per-series combined weighted scores.
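
The two averaging steps can be sketched as follows. The function names series_score and run_score are hypothetical, and the sketch assumes the three component scores for each series have already been computed.

```python
def series_score(factoid_score, list_score, other_score):
    """Simple average of the three component scores for one series."""
    return (factoid_score + list_score + other_score) / 3.0


def run_score(per_series_scores):
    """Mean of the per-series combined scores over all series in a run."""
    if not per_series_scores:
        raise ValueError("a run must contain at least one series")
    return sum(per_series_scores) / len(per_series_scores)


# Illustrative values only, not taken from any real run.
example = run_score([series_score(0.5, 0.25, 0.75), series_score(1.0, 1.0, 1.0)])
```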

Factoid-score

For factoid questions, the response will be judged as "incorrect", "unsupported", "non-exact", "locally correct", or "globally correct". All judgments are in the opinion of the NIST assessor. The factoid-score for a series is the fraction of factoid questions in the series judged to be "globally correct".
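
A minimal sketch of this computation, assuming the assessor's judgments for one series are available as a list of the five labels above (the function name factoid_score is hypothetical):

```python
JUDGMENTS = {"incorrect", "unsupported", "non-exact",
             "locally correct", "globally correct"}


def factoid_score(judgments):
    """Fraction of a series' factoid questions judged globally correct."""
    if not judgments:
        raise ValueError("series has no factoid judgments")
    unknown = set(judgments) - JUDGMENTS
    if unknown:
        raise ValueError("unknown judgment labels: %s" % sorted(unknown))
    # Only "globally correct" counts; "locally correct" earns no credit.
    return judgments.count("globally correct") / len(judgments)
```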

List-score

The response to a list question is a non-null, unordered set of [answer-string, docid] pairs, where each pair is called an instance. An individual instance is interpreted as for factoid questions and will be judged in the same way. The final answer set for a list question will be created from the union of the distinct, globally correct responses returned by all participants plus the set of answers found by the NIST assessor during question development. An individual list question will be scored by first computing instance recall (IR) and instance precision (IP) using the final answer set, and combining those scores using the F measure with recall and precision equally weighted.
That is,

 
    IR = # instances judged correct & distinct/|final answer set|
    IP = # instances judged correct & distinct/# instances returned
    F = (2*IP*IR)/(IP+IR)

The list-score for a series is the mean of the F scores of the list questions in the series.
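
The F computation for a single list question can be sketched as below. The function name list_f and its argument names are hypothetical; returning 0.0 when no instance is judged correct and distinct is an assumption that avoids dividing by zero.

```python
def list_f(num_correct_distinct, num_returned, final_answer_set_size):
    """F measure with instance recall and precision equally weighted."""
    if num_returned <= 0 or final_answer_set_size <= 0:
        raise ValueError("counts must be positive")
    ir = num_correct_distinct / final_answer_set_size  # instance recall
    ip = num_correct_distinct / num_returned           # instance precision
    if ip + ir == 0:
        return 0.0  # assumed: nothing correct and distinct scores zero
    return (2 * ip * ir) / (ip + ir)
```

For example, a response of 6 instances, 3 of them judged correct and distinct, against a final answer set of 4 gives IR = 0.75, IP = 0.5, and F = 0.6.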

Other-score

The response for an "other" question is syntactically the same as for a list question: a non-null, unordered set of [answer-string, docid] pairs. The interpretation of this set is different, however. For each "other" question, the assessor will create a list of acceptable information nuggets about the target from the union of the returned responses and the information discovered during question development. This list will NOT contain the answers to previous questions in the target's series: systems are expected to report only new information for the "other" question. There may be many other facts related to the target that, while true, are not considered acceptable by the assessor (because they are not interesting, for example). These items will not be on the list at all (and thus including them in a response will be penalized). All decisions regarding acceptability are in the opinion of the assessor. Once the list of acceptable nuggets is created, the assessor will view the response for one question from one system in its entirety and mark the nuggets contained in it. Each nugget that is present will be matched only once.

Some of the acceptable nuggets will be deemed vital, while other nuggets on the list are merely okay. A score for each "other" question will be computed using multiple assessors' judgments of whether a nugget is vital or okay. Each nugget will be assigned a weight equal to the number of assessors who judged it to be vital; nugget weights will then be normalized so that the maximum weight of nuggets for each "other" question is 1. See (Lin and Demner-Fushman, HLT/NAACL 2006) for details.
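
The weighting step can be sketched as follows, assuming the number of assessors who judged each nugget vital is available as a mapping from a nugget identifier to a vote count. The function name nugget_weights and the nugget identifiers are hypothetical, and returning all-zero weights when no nugget received a vital vote is an assumption of this sketch.

```python
def nugget_weights(vital_votes):
    """Normalize vital-vote counts so the largest weight is 1."""
    max_votes = max(vital_votes.values())
    if max_votes == 0:
        # Assumed behavior: no nugget was judged vital by any assessor.
        return {nugget: 0.0 for nugget in vital_votes}
    return {nugget: votes / max_votes for nugget, votes in vital_votes.items()}


# Hypothetical votes for three nuggets of one "other" question.
weights = nugget_weights({"n1": 9, "n2": 3, "n3": 0})
```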

An individual "other" question will be scored using nugget recall (NR) and an approximation to nugget precision (NP) based on length. These scores will be combined using the F measure with recall three times as important as precision. In particular,

  NR = sum of weights of nuggets returned in response / sum of weights of all nuggets in nugget list
  NP is defined using
      allowance = 100*(# nuggets returned)
      length = total # non-white-space characters in answer strings
      NP = 1 if length < allowance
           else 1-[(length-allowance)/length]
  F = (10*NP*NR)/(9*NP + NR)

The other-score for a series is simply the F score of its "other" question.
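
The definitions above can be sketched in code as follows. The function name other_f and its arguments are hypothetical: matched_weights holds the normalized weights of the nuggets the assessor found in the response, all_weights holds the weights of every nugget on the list, and length is the number of non-white-space characters in the response. Returning 0.0 for an empty response or an all-zero nugget list is an assumption of this sketch.

```python
def other_f(matched_weights, all_weights, length):
    """F measure with nugget recall three times as important as precision."""
    total = sum(all_weights)
    if total == 0 or length == 0:
        return 0.0  # assumed: nothing to recall, or nothing returned
    nr = sum(matched_weights) / total        # nugget recall
    allowance = 100 * len(matched_weights)   # 100 characters per nugget returned
    if length < allowance:
        np_ = 1.0                            # within the length allowance
    else:
        np_ = 1.0 - (length - allowance) / length
    if np_ == 0 and nr == 0:
        return 0.0
    return (10 * np_ * nr) / (9 * np_ + nr)
```

For example, matching nuggets with weights 1.0 and 0.5 out of a list weighing 2.0 in total gives NR = 0.75; with a response of 400 characters the allowance is 200, so NP = 0.5 and F = 3.75/5.25.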

Document lists

As a service to the track, for each question series, NIST will provide the ranking of the top 1000 documents retrieved by the PRISE search engine when using the target as the query for each of the two document collections (Blog06 and AQUAINT-2). NIST will not provide document lists for individual questions. Because the Blog06 collection is very different from the AQUAINT-2 collection, PRISE will be run separately over each of the two collections and a separate ranking of the top 1000 documents will be returned for each collection. Note that this is a service only, provided as a convenience for groups that do not wish to implement their own document retrieval system. There is no guarantee that these rankings will contain all the documents that actually answer the questions in a series, even if such documents exist. The document lists will be in the same format as previous years:

    qnum rank docid rsv
where
    qnum is the target number 
    rank is the rank at which PRISE retrieved the doc
        (most similar doc is rank 1)
    docid is the document identifier
    rsv is the "retrieval status value" or similarity
         between the query and doc where larger is better
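
A minimal sketch of reading such a list, assuming one white-space-separated record per line as described above. The names RankedDoc, parse_doclist_line, and top_docids are hypothetical, and the target number is kept as a string rather than converted.

```python
from typing import NamedTuple


class RankedDoc(NamedTuple):
    qnum: str    # target number
    rank: int    # 1 is the most similar document
    docid: str
    rsv: float   # retrieval status value, larger is better


def parse_doclist_line(line):
    """Parse one "qnum rank docid rsv" record."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError("expected 4 fields: %r" % line)
    qnum, rank, docid, rsv = fields
    return RankedDoc(qnum, int(rank), docid, float(rsv))


def top_docids(lines, k=50):
    """Map each target number to its top-k docids in rank order."""
    by_target = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        doc = parse_doclist_line(line)
        if doc.rank <= k:
            by_target.setdefault(doc.qnum, []).append(doc)
    return {q: [d.docid for d in sorted(docs, key=lambda d: d.rank)]
            for q, docs in by_target.items()}
```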

NIST will also provide the full text of the top 50 documents per target per document collection (as given from the above rankings). Participants who would like to receive the text of the top documents from a collection must submit or have already submitted to NIST a signed user agreement form for that collection. The user agreement form for AQUAINT-2 is under the Forms section of the TREC web site. The user agreement form for Blog06 is located at http://ir.dcs.gla.ac.uk/test_collections/OrganisationApplication_blog.html.

IV. Submission deadlines

The deadline for all results for the main task is 11:59 p.m. EDT on July 29, 2007. This is a firm deadline. NIST will post the questions on the web site by noon EDT on July 13 and results will have to be submitted to NIST by 11:59 p.m. July 29. However, there may be participants for whom this 16-day period will not work. In that case, the participant must make prior arrangements with Hoa Dang ([email protected]) to choose another 16-day period that begins after July 1 and ends before July 29. Results must be received 16 days after you receive the questions or by July 29, whichever is earlier. Late submissions will be discarded.

Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the main TREC 2007 participants' mailing list in late June. At that time, NIST will release a routine that checks for common errors in result files including such things as invalid document numbers, wrong formats, missing data, etc. You should check your runs with this script before submitting them to NIST because if the automatic submission procedure detects any errors, it will not accept the submission.

Restrictions

No part of the system can be changed in any manner between the time the test questions are downloaded and the time the results are submitted to NIST. No manual processing of questions, answers or any other part of a system is allowed: all processing must be fully automatic. Targets must be processed independently. Questions within a series must be processed in order, without looking ahead. That is, your system may use the information in the questions and system-produced answers of earlier questions in a series to answer later questions in the series, but the system may not look at a later question in the series before answering the current question. This requirement models (some of) the type of processing a system would require to process a dialog with the user.
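
One way to structure a driver loop that respects these restrictions is sketched below. Everything here is hypothetical: process_run and the answer_question callable stand in for a participant's own system, and each series is assumed to be supplied already in question order.

```python
def process_run(series_by_target, answer_question):
    """Answer every question, one target at a time, in series order.

    series_by_target maps a target to a list of (qid, question_text)
    pairs; answer_question(target, question_text, history) is supplied
    by the caller and returns the system's response.
    """
    responses = {}
    for target, questions in series_by_target.items():
        history = []  # fresh state: nothing carries over between targets
        for qid, text in questions:
            # Only earlier questions and system answers are visible here;
            # later questions in the series are never passed in.
            answer = answer_question(target, text, tuple(history))
            history.append((qid, text, answer))
            responses[qid] = answer
    return responses
```

Passing the history as an immutable tuple is a design choice of the sketch: it makes it harder for the answering component to alter the record of earlier answers.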

V. Timetable

Blog06 and AQUAINT-2 collections available:	now
Main task questions available:	July 13, 2007
Top ranked documents for main task available:	July 13, 2007
Results for main task due at NIST:	5:00 PM (EDT) July 31, 2007 (Extended deadline)
Evaluated results from NIST:	October 3, 2007
TREC 2007 Conference:	November 6-9, 2007

Date created: Wednesday, 09-May-07