TREC 2006 Question Answering Track Guidelines

I. Summary:

The goal of the TREC QA track is to foster research on systems that retrieve answers rather than documents in response to a question. The focus is on systems that can function in unrestricted domains. The TREC 2006 QA track will consist of two tasks: the main task and the complex interactive QA (ciQA) task.

The two tasks are independent, and participants in the QA track may do one or both tasks.

The main task is the same as in TREC 2005, in that the test set will consist of question series where each series asks for information regarding a particular target. As in TREC 2005, the targets will include people, organizations, events and other entities. Each question series will consist of some factoid and some list questions and will end with exactly one "Other" question. The answer to the "Other" question is to be interesting information about the target that is not covered by the preceding questions in the series. The runs will be evaluated using the same methodology as in TREC 2005, except that the per-series score will give equal weight to each of the three types of questions (FACTOID, LIST, OTHER). Secondary scores for the "Other" questions will also be computed using multiple assessors' judgments of the importance of information nuggets, as inspired by (Lin and Demner-Fushman, 2006), but these secondary scores are experimental and will not be used to calculate the per-series scores.

The complex interactive QA task is a blend of the TREC 2005 relationship QA task and the TREC 2005 HARD track, which focused on single-iteration clarification dialogs. For information about the complex interactive QA task, please visit the ciQA homepage, which includes a detailed task description, timetable, and sample data.

The question set for the main task will be available on the TREC web site on July 26 (by 12 noon EDT). YOU ARE REQUIRED TO FREEZE YOUR SYSTEM BEFORE DOWNLOADING THE QUESTIONS. No changes of any sort can be made to any component of your system or any resource used by your system between the time you fetch the questions and the time you submit your results to NIST. All runs for the main task must be completely automatic; no manual intervention of any sort is allowed. All targets must be processed from the same initial state (i.e., your system may not adapt to targets that have already been processed). Results for the main task are due at NIST on or before August 2, 2006. You may submit up to three runs for the main task; all submitted runs will be judged.

II. Document set

The document set for the 2006 QA tasks is the same as was used in the past four QA tracks. It consists of the set of documents on the AQUAINT disk set. See the TREC 2006 Welcome email message for how to obtain the AQUAINT disks. The AQUAINT collection consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires. The following is a sample from the collection.

<DOC>
<DOCNO> NYT19990430.0001 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1999-04-30 00:01 </DATE_TIME>
<HEADER>
A8974 &Cx1f; taf-z
u s &Cx13; &Cx11; BC-BOX-TVSPORTS-COLUMN-N 04-30 0809
</HEADER>
<BODY>
<SLUG> BC-BOX-TVSPORTS-COLUMN-NYT </SLUG>
<HEADLINE>
SPORTS COLUMN: A MARCIANO DOCUDRAMA GETS MUCH OF IT WRONG
</HEADLINE>

(ATTN: Iowa) (rk)
By RICHARD SANDOMIR
c.1999 N.Y. Times News Service
<TEXT>
<P>
Toward the end of ``Rocky Marciano,'' a coming Showtime
docudrama, the retired boxer flies to Denver on the final day of
his life to visit Joe Louis in a psychiatric hospital. When he
leaves, Marciano hands the hospital administrator a bag full of
cash to upgrade Louis' care.
</P>
...
</TEXT>
</BODY>
<TRAILER>
NYT-04-30-99 0001EDT &QL;
</TRAILER>
</DOC>
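A minimal sketch, assuming Python, of pulling the document id and timestamp out of a record like the sample above. The records contain entities such as &Cx1f; that a strict XML parser would likely reject, so the sketch uses regular expressions instead. The function name extract_doc_fields and the shortened SAMPLE record are illustrative only.

```python
import re

# Shortened copy of the sample record; only the fields used below are kept.
SAMPLE = """<DOC>
<DOCNO> NYT19990430.0001 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1999-04-30 00:01 </DATE_TIME>
</DOC>"""


def extract_doc_fields(record):
    """Return (docno, date_time) for one <DOC> record; a missing field is None."""
    def field(tag):
        match = re.search(r"<%s>\s*(.*?)\s*</%s>" % (tag, tag), record, re.DOTALL)
        return match.group(1) if match else None

    return field("DOCNO"), field("DATE_TIME")


docno, date_time = extract_doc_fields(SAMPLE)
print(docno, date_time)
```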

III. Details of main task

Test set of questions

The test set consists of a series of questions for each of a set of targets. Each series is an abstraction of a user session with a QA system; therefore, some questions may depend on knowing answers to previous questions in the same series (see the series for the example "Shiite" target below).

The targets and questions were developed by NIST assessors after searching the document set to determine if there was an appropriate amount of information about the target. There will be 75 targets in the test set. Each series will contain several factoid questions, zero to two list questions, and a question called "other". The "other" question is to be interpreted as "Give additional information about the target that would be interesting to the user but that the user has not explicitly asked for." As in the TREC 2005 track, we will assume that the user is an "average" adult reader of American newspapers.

The question set will be XML-formatted so that the target of each series and the type of each question in the series (one of FACTOID, LIST, and OTHER) are explicitly tagged. Each question will have an id of the form X.Y where X is the target number and Y is the number of the question in the series (so question 3.4 is the fourth question of the third target). Factoid questions are questions that seek short, fact-based answers such as have been used in previous TREC QA tracks. List questions are requests for a set of instances of a specified type. Factoid and list questions require exact answers to be returned. Responses to the "other" question need not be exact, though excessive length will be penalized.

The document collection may not contain a correct answer for some of the factoid questions. In this case, the correct response is the string "NIL". A question will be assumed to have no correct answer in the collection if our assessors do not find an answer during the answer verification phase AND no participant returns a correct response. (If a system returns a right answer that is unsupported, NIL will still be the correct response for that question.)

Each question is tagged as to its type using the XML format below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE trec2006qa [



<!ELEMENT trecqa       (target+)           >
<!ELEMENT target        (qa+)                 >
<!ELEMENT qa              (q, as)               >
<!ELEMENT q               (#PCDATA)    >
<!ELEMENT as             (a*, nugget*)     >

<!ATTLIST trecqa
                                      year CDATA #REQUIRED
                                      task CDATA #REQUIRED>

<!ATTLIST target        id    ID           #REQUIRED
                                     text CDATA  #REQUIRED>

<!ATTLIST q               id    ID            #REQUIRED
                                     type       (FACTOID|LIST|OTHER)   #REQUIRED>

<!ATTLIST a               src   CDATA   #REQUIRED
                                     regex CDATA #IMPLIED >

<!ATTLIST nugget      id    ID             #REQUIRED
                                     type     (VITAL|OKAY)                     #REQUIRED>
]>

<trecqa year="2006" task="main">

<target id="1" text="Shiite">
       <qa>
             <q id = "1.1" type="FACTOID">
                         Who was the first Imam of the Shiite sect of Islam?
       </q>
       </qa>

       <qa>
             <q id = "1.2" type="FACTOID">
                         Where is his tomb?
       </q>
       </qa>

       <qa>
             <q id = "1.3" type="FACTOID">
                         What was this person's relationship to the Prophet Mohammad?
       </q>
       </qa>

       <qa>
             <q id = "1.4" type="FACTOID">
                         Who was the third Imam of Shiite Muslims?
       </q>
       </qa>

       <qa>
             <q id = "1.5" type="FACTOID">
                         When did he die?
       </q>
       </qa>

       <qa>
             <q id = "1.6" type="FACTOID">
                         What portion of Muslims are Shiite?
       </q>
       </qa>

       <qa>
             <q id = "1.7" type="LIST">
                         What Shiite leaders were killed in Pakistan?
       </q>
       </qa>

       <qa>
             <q id="1.8" type="OTHER">
                         Other
       </q>
       </qa>
</target>

<target id="2" text="skier Alberto Tomba">
       <qa>

            <q id="2.1" type="FACTOID">
                        How many Olympic gold medals did he win?
       </q>
       </qa>

       <qa>
            <q id="2.2" type="FACTOID">
                        What nationality is he?
       </q>
       </qa>

       <qa>
            <q id="2.3" type="FACTOID">
                        What is his nickname?
       </q>
       </qa>

       <qa>
            <q id="2.4" type="OTHER">
                        Other
       </q>
       </qa>

</target>

<target id="3" text="Kama Sutra">
       <qa>
            <q id ="3.1" type="FACTOID">
                         Who wrote it?
       </q>
       </qa>

       <qa>
            <q id="3.2" type="FACTOID">
                         When was it written?
       </q>
       </qa>

       <qa>
            <q id="3.3" type="FACTOID">
                         What is it about?
       </q>
       </qa>

       <qa>
            <q id="3.4" type="OTHER">
        Other
       </q>
       </qa>

</target>

</trecqa>
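A minimal sketch, assuming Python and its standard xml.etree.ElementTree module, of reading a question file of this shape into per-target series. The SAMPLE string is a shortened copy of the second target above without the DTD, and the function name read_series is illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened copy of one target from the example; the DTD is omitted.
SAMPLE = """<trecqa year="2006" task="main">
<target id="2" text="skier Alberto Tomba">
  <qa><q id="2.1" type="FACTOID"> How many Olympic gold medals did he win? </q></qa>
  <qa><q id="2.4" type="OTHER"> Other </q></qa>
</target>
</trecqa>"""


def read_series(xml_text):
    """Yield (target_id, target_text, questions) in file order.

    Each question is a (qid, qtype, text) tuple with surrounding
    white space stripped from the question text.
    """
    root = ET.fromstring(xml_text)
    for target in root.iter("target"):
        questions = [
            (q.get("id"), q.get("type"), (q.text or "").strip())
            for q in target.iter("q")
        ]
        yield target.get("id"), target.get("text"), questions


series = list(read_series(SAMPLE))
for target_id, target_text, questions in series:
    print(target_id, target_text, questions)
```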

The questions used in previous tracks are in the Data/QA section of the TREC web site. This section of the web site contains a variety of other data from previous QA tracks that you may use to develop your QA system. This data includes judgment files, answer patterns, top ranked document lists, and sentence files. See the web site for a description of each of these resources. The draft of the TREC 2005 QA track overview paper contains an analysis of the TREC 2005 evaluation. The main task in TREC 2005 was essentially the same as this year's main task, and participants are encouraged to read that discussion of the evaluation methodology.

Submission format

A submission consists of exactly one response for each question. The definition of a response varies for the different question types.

Factoid response

For factoid questions, a response is a single [answer-string, docid] pair or the string "NIL". The "NIL" string will be judged correct if there is no answer known to exist in the document collection; otherwise it will be judged incorrect. If an [answer-string, docid] pair is given as a response, the answer-string must contain a complete, exact answer and nothing else, and the docid must be the id of a document in the collection that supports the answer-string as an answer. As with correctness, exactness will be in the opinion of the assessor.

Time-dependent questions

Time-dependent questions are those for which the correct answer can vary depending on the timeframe that is assumed. When a question is phrased in the present tense, the implicit timeframe will be the date of the last document in the AQUAINT document collection; thus, systems will be required to return the most up-to-date answer supported by the document collection. This requirement is a departure from past TREC QA evaluations, in which an answer was judged as correct even if it was actually not true, as long as the document provided with the answer said that it was true. With the introduction of EVENT targets, however, it is not uncommon for reports published during or immediately after an event to contain conflicting answers to a question (e.g., "Which country had the highest death toll from Hurricane Mitch?"). In this case, the assessor may judge an answer as incorrect even if the supporting document says that it is correct, if later documents state otherwise.

When the question is phrased in the past tense then either the question will explicitly specify the time frame (e.g., "What cruise line attempted to take over NCL in December 1999?") or else the time frame will be implicit in the question series. For example, if the target is the event "France wins World Cup in soccer" and the question is "Who was the coach of the French team?" then the correct answer must be "Aime Jacquet" (the name of the coach of the French team in 1998 when France won the World Cup), and not just the name of any past or current coach of the French team.

The document must support the answer as being the correct answer, but certain additional axiomatic knowledge and temporal inferencing can be used, including:

temporal ordering of months (January-December), days of the week (Monday-Sunday)
calculation of specific dates from relative temporal expressions, possibly using a universal calendar. For example, "August 12, 2000" is a supported correct answer to "On what date did the Kursk sink?" if it is supported by a document dated August 13, 2000, containing the phrase "the Kursk sank Saturday..."
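The Kursk example can be sketched in Python as follows. This is only an illustration of the kind of calendar inference that is permitted, under the assumption that a bare weekday name in a past-tense sentence refers to the most recent such day strictly before the document date. The function name most_recent_weekday is illustrative only.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def most_recent_weekday(doc_date, weekday_name):
    """Return the latest date strictly before doc_date that falls on weekday_name."""
    target = WEEKDAYS.index(weekday_name.lower())
    # date.weekday() numbers Monday as 0 and Sunday as 6, matching WEEKDAYS.
    days_back = (doc_date.weekday() - target) % 7
    if days_back == 0:
        days_back = 7  # assumption: the same weekday means one week earlier
    return doc_date - timedelta(days=days_back)


# Document dated August 13, 2000, containing "the Kursk sank Saturday..."
print(most_recent_weekday(date(2000, 8, 13), "Saturday"))
```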

Responses will be judged by human assessors who will assign one of five possible judgments to a response:

incorrect: the answer-string does not contain a correct answer or the answer is not responsive;
unsupported: the answer-string contains a correct answer but the document returned does not support that answer;
non-exact: the answer-string contains a correct answer and the document supports that answer, but the string contains more than just the answer (or is missing bits of the answer);
locally correct: the answer-string consists of exactly a correct answer and that answer is supported by the document returned, but a later document contradicts the answer;
globally correct: the answer-string consists of exactly a correct answer, that answer is supported by the document returned, and there are no later documents that contradict the answer.

Being "responsive" means such things as including units for quantitative responses (e.g., $20 instead of 20) and answering with regard to a famous entity itself rather than its replicas or imitations. (The paper "The TREC-8 Question Answering Track Evaluation" in the TREC-8 proceedings contains a very detailed discussion of responsiveness.) Only the judgment of "globally correct" will be accepted for scoring purposes.

List response

For list questions, a response is an unordered, non-empty set of [answer-string, docid] pairs, where each pair is called an instance. The interpretation of the pair is the same as for factoid questions, and each pair will be judged in the same way as for a factoid question. Note that this means that if an answer-string actually contains multiple answers for the question it will be judged "non-exact" and will thus hurt the question's precision score. In addition to judging the individual pairs in a response, the assessors will also mark a subset of the pairs judged correct as being distinct. If a single instance is correct but listed multiple times, the assessor will mark exactly one arbitrary instance as distinct, and the others will not be marked as distinct. Scores will be computed using the number of correct, distinct instances in the set.

Other response

A response for the "other" question that ends each target's series is also an unordered, non-empty set of [answer-string, docid] pairs, but the interpretation of the pairs differs somewhat from a list response. A pair in an "other" response is assumed to represent a new (i.e., not part of the question series nor previously reported as an answer to an earlier question in the series), interesting fact about the target. Individual pairs will not be judged for these questions. Instead, the assessor will construct a list of desirable information nuggets about the target, and count the number of distinct, desirable nuggets that occur in the response as a whole. There is no expectation of an exact answer to "other" questions, but responses will be penalized for excessive length.

File format

A submission file for the main task comprises the responses to all the questions in the main task and must contain a response for each question in the main task. Each line in the file must have the form

    qid run-tag docid answer-string
where
    qid           is the question number (of the form X.Y),
    run-tag       is the run id,
    docid         is the id of the supporting document, or the string
                  "NIL" (no quotes) if the question is a factoid
                  question and there is no answer in the collection, and
    answer-string is a text string with no embedded newlines, or is
                  empty if docid is NIL

Any amount of white space may be used to separate columns, as long as there is some white space between columns and every column is present (modulo empty answer-string when docid is NIL). Answer-string cannot contain any line breaks, but should be immediately followed by exactly one line break. Other white space is allowed in answer-string. The total length of all answer-strings for each question cannot exceed 7000 non-white-space characters. The "run tag" should be a unique identifier for your group AND for the method that produced the run.

The first few lines of an example submission might look like this:

    1.1  nistqa06  NYT19990326.0303        Nicole Kidman
    1.2  nistqa06  NIL
    1.3  nistqa06  APW20000908.0100        Godiva
    1.3  nistqa06  NYT19981215.0192        Wittamer of Brussels
    1.3  nistqa06  NYT19990209.0202        Nirvana
    1.3  nistqa06  NYT19990210.0179        Leonidas
    1.3  nistqa06  NYT20000211.0296        Guylian
    1.3  nistqa06  NYT20000824.0034        Les Cygnes
    1.3  nistqa06  NYT20000927.0157        Callebaut
    1.4  nistqa06  NYT19990421.0438        fringe rock music genres
    1.4  nistqa06  NYT19990430.0345        gloomy subculture
    2.1  nistqa06  NIL
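A minimal sketch, assuming Python, of checking lines like those above against the stated column rules and the 7000-character limit. It is not the checking routine NIST will release. The function names parse_line and check_lengths are illustrative only.

```python
from collections import defaultdict

# Limit on non-white-space characters across all answer-strings of one question.
MAX_CHARS = 7000


def parse_line(line):
    """Split one submission line into (qid, run_tag, docid, answer_string)."""
    parts = line.split(None, 3)  # any amount of white space separates columns
    if len(parts) < 3:
        raise ValueError("expected at least qid, run-tag and docid: %r" % line)
    qid, run_tag, docid = parts[:3]
    answer = parts[3].strip() if len(parts) == 4 else ""
    if docid != "NIL" and not answer:
        raise ValueError("answer-string is required unless docid is NIL: %r" % line)
    return qid, run_tag, docid, answer


def check_lengths(lines):
    """Return the ids of questions whose answer-strings exceed MAX_CHARS in total."""
    totals = defaultdict(int)
    for line in lines:
        qid, _, _, answer = parse_line(line)
        totals[qid] += len("".join(answer.split()))  # count non-white-space only
    return [qid for qid, total in totals.items() if total > MAX_CHARS]


print(parse_line("1.3  nistqa06  NYT19981215.0192        Wittamer of Brussels"))
```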

Each group may submit up to three runs. All submitted runs will be judged.

Scoring

The different types of questions have different scoring metrics. Each of the three (factoid, list, other) scores has a range of 0-1 with 1 being the high score. We will compute the factoid-score, list-score, and other-score for each series. The combined weighted score of a single series will be a simple weighted average of these three scores for questions in the series, with equal weight given to each question type:

    combined-score = 1/3*factoid-score + 1/3*list-score + 1/3*other-score

The final score for a run will be the mean of the per-series combined weighted scores.
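A minimal sketch in Python of the two averaging steps, assuming the equal weighting of the three question types stated in the Summary. How a series with no list questions is scored is not stated here, so the sketch simply expects all three scores. The function names are illustrative only.

```python
def series_score(factoid_score, list_score, other_score):
    """Combine the three per-series scores with equal weights (1/3 each)."""
    return (factoid_score + list_score + other_score) / 3.0


def run_score(per_series_scores):
    """Mean of the per-series combined scores of one run."""
    return sum(per_series_scores) / len(per_series_scores)


scores = [series_score(0.6, 0.3, 0.0), series_score(0.5, 0.5, 0.5)]
print(run_score(scores))
```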

Factoid-score

For factoid questions, the response will be judged as "incorrect", "unsupported", "non-exact", "locally correct", or "globally correct". All judgments are in the opinion of the NIST assessor. The factoid-score for a series is the fraction of factoid questions in the series judged to be "globally correct".
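As a small illustration in Python, the factoid-score can be computed from the list of judgments for a series. Note that "locally correct" does not count. The function name and the example judgments are illustrative only.

```python
def factoid_score(judgments):
    """Fraction of a series' factoid judgments that are 'globally correct'."""
    if not judgments:
        raise ValueError("series has no factoid questions")
    hits = sum(1 for judgment in judgments if judgment == "globally correct")
    return hits / len(judgments)


print(factoid_score(["globally correct", "locally correct",
                     "incorrect", "globally correct"]))
```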

List-score

The response to a list question is a non-null, unordered set of [answer-string, docid] pairs, where each pair is called an instance. An individual instance is interpreted as for factoid questions and will be judged in the same way. The final answer set for a list question will be created from the union of the distinct, correct responses returned by all participants plus the set of answers found by the NIST assessor during question development. An individual list question will be scored by first computing instance recall (IR) and instance precision (IP) using the final answer set, and combining those scores using the F measure with recall and precision equally weighted.
That is,

 
    IR = # instances judged correct & distinct/|final answer set|
    IP = # instances judged correct & distinct/# instances returned
    F = (2*IP*IR)/(IP+IR)

The list-score for a series is the mean of the F scores of the list questions in the series.
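The IR, IP and F definitions above can be sketched in Python as follows. Returning 0 when no instance is correct (where the F formula would divide by zero) is an assumption of this sketch, as is the function name list_f.

```python
def list_f(num_correct_distinct, num_returned, final_answer_set_size):
    """F measure for one list question, with recall and precision equally weighted."""
    instance_recall = num_correct_distinct / final_answer_set_size
    instance_precision = num_correct_distinct / num_returned
    if instance_precision + instance_recall == 0:
        return 0.0  # assumption: no correct instance gives F = 0
    return (2 * instance_precision * instance_recall) / (
        instance_precision + instance_recall
    )


# 3 correct, distinct instances out of 7 returned; final answer set of 10.
print(list_f(3, 7, 10))
```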

Other-score

The response for an "other" question is syntactically the same as for a list question: a non-null, unordered set of [answer-string, docid] pairs. The interpretation of this set is different, however. For each "other" question, the assessor will create a list of acceptable information nuggets about the target from the union of the returned responses and the information discovered during question development. This list will NOT contain the answers to previous questions in the target's series: systems are expected to report only new information for the "other" question. Some of the nuggets will be deemed vital, while the remaining nuggets on the list are okay. There may be many other facts related to the target that, while true, are not considered acceptable by the assessor (because they are not interesting, for example). These items will not be on the list at all (and thus including them in a response will be penalized). All decisions regarding acceptability are in the opinion of the assessor. Once the list of acceptable nuggets is created, the assessor will view the response for one question from one system in its entirety and mark the vital and okay nuggets contained in it. Each nugget that is present will be matched only once. An individual "other" question will be scored using nugget recall (NR) and an approximation to nugget precision (NP) based on length. These scores will be combined using the F measure with recall three times as important as precision. In particular,

    NR = # vital nuggets returned in response/# vital nuggets
    NP is defined using
        allowance = 100*(# vital+okay nuggets returned)
        length    = total # non-white-space characters in answer strings
        NP        = 1 if length < allowance
                    else 1-[(length-allowance)/length]
    F = (10*NP*NR)/(9*NP + NR)

The other-score for a series is simply the F score of its "other" question.
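A minimal Python sketch of the NR, NP and F computation defined above. It assumes the response is non-empty (length greater than 0) and that the nugget list contains at least one vital nugget. Returning 0 when both NP and NR are 0 is an assumption of this sketch, as is the function name other_f.

```python
def other_f(vital_returned, okay_returned, vital_total, length):
    """F measure for one "other" question (recall weighted three times precision).

    length is the total number of non-white-space characters in the answer strings.
    """
    nugget_recall = vital_returned / vital_total
    allowance = 100 * (vital_returned + okay_returned)
    if length < allowance:
        nugget_precision = 1.0
    else:
        nugget_precision = 1.0 - (length - allowance) / length
    if nugget_precision == 0 and nugget_recall == 0:
        return 0.0  # assumption: avoids dividing by zero in the F formula
    return (10 * nugget_precision * nugget_recall) / (
        9 * nugget_precision + nugget_recall
    )


# 2 of 4 vital nuggets and 1 okay nugget matched in a 500-character response.
print(other_f(2, 1, 4, 500))
```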

Additional other-score

The official score for an "other" question is based on a single assessor's judgment of whether a nugget is vital or okay. However, secondary scores for each "other" question will also be computed using multiple assessors' judgments of the importance of information nuggets, as inspired by:

Jimmy Lin and Dina Demner-Fushman. Will Pyramids Built of Nuggets Topple Over? Proceedings of HLT/NAACL 2006.

Multiple judgments of vital/okay may be used to create a nugget pyramid, from which a single F-score is computed using the weighted nuggets (micro-average). Alternatively, a set of F-scores can be computed, where each F-score is based on a single assessor's judgment of vital/okay, and the F-scores can be averaged over the various assessors (macro-average). Both micro- and macro-average F-scores for each "other" question will be computed using the multiple judgments of vital/okay.

Document lists

As a service to the track, NIST will provide the ranking of the top 1000 documents retrieved by the PRISE search engine when using the target as the query, and the full text of the top 50 documents per target (as given from that same ranking). NIST will not provide document lists for individual questions. Note that this is a service only, provided as a convenience for groups that do not wish to implement their own document retrieval system. There is no guarantee that this ranking will contain all the documents that actually answer the questions in a series, even if such documents exist. The document lists will be in the same format as previous years:

    qnum rank docid rsv
where
    qnum is the target number 
    rank is the rank at which PRISE retrieved the doc
        (most similar doc is rank 1)
    docid is the document identifier
    rsv is the "retrieval status value" or similarity
         between the query and doc where larger is better
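A minimal sketch, assuming Python and the four white-space-separated columns described above, of reading a document list and selecting the top-ranked documents for one target. The function names and the example lines (including their rsv values) are made up for illustration.

```python
def parse_ranking_line(line):
    """Split one document-list line into (qnum, rank, docid, rsv)."""
    qnum, rank, docid, rsv = line.split()
    return int(qnum), int(rank), docid, float(rsv)


def top_docids(lines, target, k):
    """Return the docids of the k best-ranked documents for one target number."""
    rows = [parse_ranking_line(line) for line in lines if line.strip()]
    rows = sorted((row for row in rows if row[0] == target), key=lambda row: row[1])
    return [row[2] for row in rows[:k]]


# Made-up example lines; real rsv values will differ.
EXAMPLE = [
    "1 2 APW20000908.0100 11.0",
    "1 1 NYT19990430.0001 12.5",
    "2 1 NYT19990326.0303 9.75",
]
print(top_docids(EXAMPLE, 1, 2))
```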

IV. Submission deadlines

The deadline for all results for the main task is 11:59 p.m. EDT on August 2, 2006. This is a firm deadline. Each participant will have one week between the time that the questions are released and the time that the results are due back at NIST. The primary week will be July 26-August 2. That is, NIST will post the questions on the web site by noon EDT on July 26 and results will have to be submitted to NIST by 11:59 p.m. August 2 (so you get a week plus twelve hours). However, there may be participants for whom that week will not work. In this case, the participant must make prior arrangements with Ellen Voorhees to choose another one-week period that begins after July 5 and ends before August 2. Results must be received one week after you receive the questions or by August 2, whichever is earlier. Late submissions will be discarded.

Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the main TREC 2006 participants' mailing list in early July. At that time, NIST will release a routine that checks for common errors in result files including such things as invalid document numbers, wrong formats, missing data, etc. You should check your runs with this script before submitting them to NIST because if the automatic submission procedure detects any errors, it will not accept the submission.

Restrictions

No part of the system can be changed in any manner between the time the test questions are downloaded and the time the results are submitted to NIST. For the main task, no manual processing of questions, answers or any other part of a system is allowed: all processing must be fully automatic. Targets must be processed independently. Questions within a series must be processed in order, without looking ahead. That is, your system may use the information in the questions and system-produced answers of earlier questions in a series to answer later questions in the series, but the system may not look at a later question in the series before answering the current question. This requirement models (some of) the type of processing a system would require to process a dialog with the user.

V. Timetable

Documents available:	now
Main task questions available:	July 26, 2006
Top ranked documents for main task available:	July 26, 2006
Results for main task due at NIST:	August 2, 2006
Evaluated results from NIST:	October 10, 2006
Conference workbook papers to NIST:	October 25, 2006 (by 6:00 a.m. EDT)
TREC 2006 Conference:	November 14-17, 2006

Date created: Wednesday, 03-May-06