TREC 2006 Question Answering Track Guidelines
I. Summary
The goal of the TREC QA track is to foster research on systems that retrieve answers rather than documents in response to a question. The focus is on systems that can function in unrestricted domains. The TREC 2006 QA track will consist of two tasks: the main question answering task and the complex interactive QA (ciQA) task.
The two tasks are independent, and participants in the QA track may do one or both tasks.
The main task is the same as in TREC 2005, in that the test set will consist of question series where each series asks for information regarding a particular target. As in TREC 2005, the targets will include people, organizations, events and other entities. Each question series will consist of some factoid and some list questions and will end with exactly one "Other" question. The answer to the "Other" question is to be interesting information about the target that is not covered by the preceding questions in the series. The runs will be evaluated using the same methodology as in TREC 2005, except that the per-series score will give equal weight to each of the three types of questions (FACTOID, LIST, OTHER). Secondary scores for the "Other" questions will also be computed using multiple assessors' judgments of the importance of information nuggets, as inspired by (Lin and Demner-Fushman, 2006), but these secondary scores are experimental and will not be used to calculate the per-series scores.
The complex interactive QA task is a blend of the TREC 2005 relationship QA task and the TREC 2005 HARD track, which focused on single-iteration clarification dialogs. For information about the complex interactive QA task, please visit the ciQA homepage, which includes a detailed task description, timetable, and sample data.
The question set for the main task will be available on the TREC web site on July 26 (by 12 noon EDT). YOU ARE REQUIRED TO FREEZE YOUR SYSTEM BEFORE DOWNLOADING THE QUESTIONS. No changes of any sort can be made to any component of your system or any resource used by your system between the time you fetch the questions and the time you submit your results to NIST.

All runs for the main task must be completely automatic; no manual intervention of any sort is allowed. All targets must be processed from the same initial state (i.e., your system may not adapt to targets that have already been processed).

Results for the main task are due at NIST on or before August 2, 2006. You may submit up to three runs for the main task; all submitted runs will be judged.
II. Document set
The document set for the 2006 QA tasks is the same as was used in the past four QA tracks. It consists of the set of documents on the AQUAINT disk set. See the TREC 2006 Welcome email message for how to obtain the AQUAINT disks. The AQUAINT collection consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires. The following is a sample from the collection.
<DOCNO> NYT19990430.0001 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1999-04-30 00:01 </DATE_TIME>
A8974 &Cx1f; taf-z
u s &Cx13; &Cx11; BC-BOX-TVSPORTS-COLUMN-N 04-30 0809
<SLUG> BC-BOX-TVSPORTS-COLUMN-NYT </SLUG>
SPORTS COLUMN: A MARCIANO DOCUDRAMA GETS MUCH OF IT WRONG
(ATTN: Iowa) (rk)
By RICHARD SANDOMIR
c.1999 N.Y. Times News Service
Toward the end of ``Rocky Marciano,'' a coming Showtime
docudrama, the retired boxer flies to Denver on the final day of
his life to visit Joe Louis in a psychiatric hospital. When he
leaves, Marciano hands the hospital administrator a bag full of
cash to upgrade Louis' care.
NYT-04-30-99 0001EDT &QL;
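For groups building their own collection tooling, document ids can be pulled out of AQUAINT files with a short script. This is only an illustrative sketch (the function name is ours); it assumes the standard TREC SGML layout in which each document's id appears inside a <DOCNO> ... </DOCNO> tag, as in the sample above.

```python
import re

# Matches the id inside <DOCNO> ... </DOCNO>; whitespace padding around the
# id (as in the sample document above) is tolerated.
DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")

def docnos(text):
    """Return every document id appearing in the given file contents."""
    return DOCNO_RE.findall(text)
```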
III. Details of main task
Test set of questions
The test set consists of a series of questions for each of a set of targets. Each series is an abstraction of a user session with a QA system; therefore, some questions may depend on knowing answers to previous questions in the same series (see the series for the example "Shiite" target below).
The targets and questions were developed by NIST assessors after searching the document set to determine if there was an appropriate amount of information about the target. There will be 75 targets in the test set. Each series will contain several factoid questions, 0--2 list questions, and a question called "other". The "other" question is to be interpreted as "Give additional information about the target that would be interesting to the user but that the user has not explicitly asked for." As in the TREC 2005 track, we will assume that the user is an "average" adult reader of American newspapers.
The question set will be XML-formatted such that the format will explicitly tag the target as well as the type of each question in the series (type is one of FACTOID, LIST, and OTHER). Each question will have an id of the form X.Y where X is the target number and Y is the number of the question in the series (so question 3.4 is the fourth question of the third target). Factoid questions are questions that seek short, fact-based answers such as have been used in previous TREC QA tracks. Some factoid questions may not have an answer in the document collection. List questions are requests for a set of instances of a specified type. Factoid and list questions require exact answers to be returned. Responses to the "other" question need not be exact, though excessive length will be penalized.
The document collection may not contain a correct answer for some of the factoid questions. In this case, the correct response is the string "NIL". A question will be assumed to have no correct answer in the collection if our assessors do not find an answer during the answer verification phase AND no participant returns a correct response. (If a system returns a right answer that is unsupported, NIL will still be the correct response for that question.)
Each question is tagged as to its type using the XML format below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE trecqa [
<!--
TREC 2006 QA track data definition
Written 2004-05-11 by Jochen L. Leidner <email@example.com.>
Modified 2004-05-21 by Ellen Voorhees <firstname.lastname@example.org>
This file has been released in the public domain.
-->
<!ELEMENT trecqa (target+) >
<!ELEMENT target (qa+) >
<!ELEMENT qa (q, as) >
<!ELEMENT q (CDATA) > <!-- don't want text of question parsed -->
<!ELEMENT as (a*, nugget*) >
<!ATTLIST trecqa year CDATA #REQUIRED
                 task CDATA #REQUIRED>
<!ATTLIST target id ID #REQUIRED
                 text CDATA #REQUIRED>
<!ATTLIST q id ID #REQUIRED
            type (FACTOID|LIST|OTHER) #REQUIRED>
<!ATTLIST a src CDATA #REQUIRED
            regex CDATA #IMPLIED >
<!ATTLIST nugget id ID #REQUIRED
                 type (VITAL|OKAY) #REQUIRED>
<!-- ====================================================================== -->
]>
<trecqa year="2006" task="main">
<target id="1" text="Shiite">
<q id="1.1" type="FACTOID">
Who was the first Imam of the Shiite sect of Islam?
</q>
<q id="1.2" type="FACTOID">
Where is his tomb?
</q>
<q id="1.3" type="FACTOID">
What was this person's relationship to the Prophet Mohammad?
</q>
<q id="1.4" type="FACTOID">
Who was the third Imam of Shiite Muslims?
</q>
<q id="1.5" type="FACTOID">
When did he die?
</q>
<q id="1.6" type="FACTOID">
What portion of Muslims are Shiite?
</q>
<q id="1.7" type="LIST">
What Shiite leaders were killed in Pakistan?
</q>
<q id="1.8" type="OTHER">
</q>
</target>
<target id="2" text="skier Alberto Tomba">
<q id="2.1" type="FACTOID">
How many Olympic gold medals did he win?
</q>
<q id="2.2" type="FACTOID">
What nationality is he?
</q>
<q id="2.3" type="FACTOID">
What is his nickname?
</q>
<q id="2.4" type="OTHER">
</q>
</target>
<target id="3" text="Kama Sutra">
<q id="3.1" type="FACTOID">
Who wrote it?
</q>
<q id="3.2" type="FACTOID">
When was it written?
</q>
<q id="3.3" type="FACTOID">
What is it about?
</q>
<q id="3.4" type="OTHER">
</q>
</target>
</trecqa>
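A question file in this format can be read with nothing more than the standard library. The sketch below is ours, not NIST's; it assumes a well-formed file (closing </q>, </target>, and </trecqa> tags present) with the tag and attribute names defined in the DTD above.

```python
import xml.etree.ElementTree as ET

def parse_questions(xml_text):
    """Return a list of (target_text, [(qid, qtype, question_text), ...])."""
    root = ET.fromstring(xml_text)
    targets = []
    for target in root.findall("target"):
        # Each <q> carries its id and type as attributes; the question
        # itself is the element's text content.
        qs = [(q.get("id"), q.get("type"), (q.text or "").strip())
              for q in target.findall("q")]
        targets.append((target.get("text"), qs))
    return targets
```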
The questions used in previous tracks are in the Data/QA section of the TREC web site. This section of the web site contains a variety of other data from previous QA tracks that you may use to develop your QA system. This data includes judgment files, answer patterns, top ranked document lists, and sentence files. See the web site for a description of each of these resources. The draft of the TREC 2005 QA track overview paper contains an analysis of the TREC 2005 evaluation. The main task in TREC 2005 was essentially the same as this year's main task, and participants are encouraged to read that discussion of the evaluation methodology.
A submission consists of exactly one response for each question. The definition of a response varies for the different question types.

Factoid response
For factoid questions, a response is a single [answer-string, docid] pair or the string "NIL". The "NIL" string will be judged correct if there is no answer known to exist in the document collection; otherwise it will be judged as incorrect. If an [answer-string, docid] pair is given as a response, the answer-string must contain nothing other than the answer, and the docid must be the id of a document in the collection that supports answer-string as an answer. An answer string must contain a complete, exact answer and nothing else. As with correctness, exactness will be in the opinion of the assessor.
Time-dependent questions
Time-dependent questions are those for which the correct answer can vary depending on the timeframe that is assumed. When a question is phrased in the present tense, the implicit timeframe will be the date of the last document in the AQUAINT document collection; thus, systems will be required to return the most up-to-date answer supported by the document collection. This requirement is a departure from past TREC QA evaluations, in which an answer was judged as correct even if it was actually not true, as long as the document provided with the answer said that it was true. With the introduction of EVENT targets, however, it is not uncommon for reports published during or immediately after an event to contain conflicting answers to a question (e.g., "Which country had the highest death toll from Hurricane Mitch?"). In this case, the assessor may judge an answer as incorrect even if the supporting document says that it is correct, if later documents state otherwise.
When the question is phrased in the past tense then either the question will explicitly specify the time frame (e.g., "What cruise line attempted to take over NCL in December 1999?") or else the time frame will be implicit in the question series. For example, if the target is the event "France wins World Cup in soccer" and the question is "Who was the coach of the French team?" then the correct answer must be "Aime Jacquet" (the name of the coach of the French team in 1998 when France won the World Cup), and not just the name of any past or current coach of the French team.
The document must support the answer as being the correct answer, but certain additional axiomatic knowledge and temporal inferencing can be used, including:
Responses will be judged by human assessors who will assign one of five possible judgments to a response: "incorrect", "unsupported", "non-exact", "locally correct", or "globally correct".
List response
For list questions, a response is an unordered, non-empty set of [answer-string, docid] pairs, where each pair is called an instance. The interpretation of the pair is the same as for factoid questions, and each pair will be judged in the same way as for a factoid question. Note that this means that if an answer-string actually contains multiple answers for the question it will be marked inexact and will thus hurt the question's precision score. In addition to judging the individual pairs in a response, the assessors will also mark a subset of the pairs judged correct as being distinct. If a single instance is correct but listed multiple times, the assessor will mark exactly one arbitrary instance as distinct, and the others will not be marked as distinct. Scores will be computed using the number of correct, distinct instances in the set.

Other response
A response for the "other" question that ends each target's series is also an unordered, non-empty set of [answer-string, docid] pairs, but the interpretation of the pairs differs somewhat from a list response. A pair in an "other" response is assumed to represent a new (i.e., not part of the question series nor previously reported as an answer to an earlier question in the series), interesting fact about the target. Individual pairs will not be judged for these questions. Instead, the assessor will construct a list of desirable information nuggets about the target, and count the number of distinct, desirable nuggets that occur in the response as a whole. There is no expectation of an exact answer to "other" questions, but responses will be penalized for excessive length.

File format
A submission file for the main task comprises the responses to all the questions in the main task and must contain a response for each question. Each line in the file must have the form

qid run-tag docid answer-string

where
qid is the question number (of the form X.Y)
run-tag is the run id
docid is the id of the supporting document, or the string "NIL" (no quotes) if the question is a factoid question and there is no answer in the collection
answer-string is a text string with no embedded newlines, or is empty if docid is NIL

Any amount of white space may be used to separate columns, as long as there is some white space between columns and every column is present (modulo the empty answer-string when docid is NIL). Answer-string cannot contain any line breaks, but should be immediately followed by exactly one line break. Other white space is allowed in answer-string. The total length of all answer-strings for each question cannot exceed 7000 non-white-space characters. The run tag should be a unique identifier for your group AND for the method that produced the run.
The first few lines of an example submission might look like this:

1.1 nistqa06 NYT19990326.0303 Nicole Kidman
1.2 nistqa06 NIL
1.3 nistqa06 APW20000908.0100 Godiva
1.3 nistqa06 NYT19981215.0192 Wittamer of Brussels
1.3 nistqa06 NYT19990209.0202 Nirvana
1.3 nistqa06 NYT19990210.0179 Leonidas
1.3 nistqa06 NYT20000211.0296 Guylian
1.3 nistqa06 NYT20000824.0034 Les Cygnes
1.3 nistqa06 NYT20000927.0157 Callebaut
1.4 nistqa06 NYT19990421.0438 fringe rock music genres
1.4 nistqa06 NYT19990430.0345 gloomy subculture
2.1 nistqa06 NIL

Each group may submit up to three runs. All submitted runs will be judged.
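A basic format check for submission lines can be sketched as below. This is only in the spirit of the checking script NIST will release (which also validates document numbers and other details); the regular expression and function name are our own assumptions.

```python
import re

# One submission line: qid (X.Y), run tag, docid, optional answer-string.
LINE_RE = re.compile(r"^\s*(\d+\.\d+)\s+(\S+)\s+(\S+)(?:\s+(.*))?$")

def check_line(line):
    """Parse one submission line; raise ValueError if it is malformed."""
    m = LINE_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("malformed line: %r" % line)
    qid, run_tag, docid, answer = m.groups()
    # Per the format rules, the answer-string must be empty when docid is NIL.
    if docid == "NIL" and answer:
        raise ValueError("answer-string must be empty when docid is NIL")
    return qid, run_tag, docid, answer or ""
```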
The different types of questions have different scoring metrics. Each of the three (factoid, list, other) scores has a range of 0-1, with 1 being the high score. We will compute the factoid-score, list-score, and other-score for each series. The combined weighted score of a single series will be a simple weighted average of these three scores for questions in the series.
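The per-series arithmetic can be sketched as follows (function names are ours; list-score and other-score are defined in the sections that follow, and we assume a series that contains all three question types):

```python
def factoid_score(judgments):
    """Fraction of the series' factoid questions judged "globally correct"."""
    return sum(j == "globally correct" for j in judgments) / float(len(judgments))

def series_score(factoid, list_, other):
    """Equal-weight average of the three per-type scores for one series."""
    return (factoid + list_ + other) / 3.0
```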
Factoid-score
For factoid questions, the response will be judged as "incorrect", "unsupported", "non-exact", "locally correct", or "globally correct". All judgments are in the opinion of the NIST assessor. The factoid-score for a series is the fraction of factoid questions in the series judged to be "globally correct".

List-score
The response to a list question is a non-null, unordered
set of [answer-string, docid] pairs, where each
pair is called an instance. An individual instance is interpreted as
for factoid questions and will be judged in the same way.
The final answer set for a list question will be created from
the union of the distinct, correct responses returned by all
participants plus the set of answers found by the NIST assessor
during question development. An individual list question will be scored
by first computing instance recall (IR) and instance precision (IP)
using the final answer set, and combining those scores using
the F measure with recall and precision equally weighted.
IR = # instances judged correct & distinct / |final answer set|
IP = # instances judged correct & distinct / # instances returned
F = (2*IP*IR)/(IP+IR)

The list-score for a series is the mean of the F scores of the list questions in the series.

Other-score
The response for an "other" question is syntactically the same
as for a list question: a non-null, unordered set
of [answer-string, docid] pairs. The interpretation of this
set is different, however. For each "other" question, the
assessor will create a list of acceptable information nuggets
about the target from the union of the returned responses and
the information discovered during question development.
This list will NOT contain the answers to previous questions in
the target's series: systems are expected to report only new
information for the "other" question. Some of the nuggets will
be deemed vital, while the remaining nuggets on the list are
okay. There may be many other facts related to the target that,
while true, are not considered acceptable by the assessor (because they
are not interesting, for example). These items will not be on the
list at all (and thus including them in a response will be penalized).
All decisions regarding acceptability are in the opinion of the assessor.
Once the list of acceptable nuggets is created, the assessor will view the
response for one question from one system in its entirety and
mark the vital and okay nuggets contained in it.
Each nugget that is present will be matched only once.
An individual "other" question will be scored using
nugget recall (NR) and an approximation to nugget precision (NP) based
on length. These scores will be combined using the F measure with
recall three times as important as precision.
NR = # vital nuggets returned in response / # vital nuggets

NP is defined using
allowance = 100*(# vital + okay nuggets returned)
length = total # non-white-space characters in answer strings
NP = 1 if length < allowance,
     else 1 - [(length - allowance)/length]

F = (10*NP*NR)/(9*NP + NR)

The other-score for a series is simply the F score of its "other" question.

Additional other-score
The official score for an "other" question is based on a single assessor's judgment of whether a nugget is vital or okay. However, secondary scores for each "other" question will also be computed using multiple assessors' judgments of the importance of information nuggets, as inspired by (Lin and Demner-Fushman, 2006). These secondary scores are experimental and will not be used to calculate the per-series scores.
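The official list- and other-score formulas defined in the sections above can be sketched in code (function and parameter names are ours):

```python
def list_f(correct_distinct, returned, answer_set_size):
    """F score for one list question: IR and IP equally weighted."""
    ir = correct_distinct / float(answer_set_size)   # instance recall
    ip = correct_distinct / float(returned)          # instance precision
    if ir + ip == 0.0:
        return 0.0
    return (2 * ip * ir) / (ip + ir)

def other_f(vital_returned, okay_returned, total_vital, length):
    """F score for one "other" question: recall weighted 3x precision."""
    nr = vital_returned / float(total_vital)         # nugget recall
    allowance = 100 * (vital_returned + okay_returned)
    if length < allowance:
        np_ = 1.0                                    # length-based precision
    else:
        np_ = 1.0 - (length - allowance) / float(length)
    if 9 * np_ + nr == 0.0:
        return 0.0
    return (10 * np_ * nr) / (9 * np_ + nr)
```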
As a service to the track, NIST will provide the ranking of the
top 1000 documents retrieved by the PRISE search engine
when using the target as the query, and the full text
of the top 50 documents per target (as given from
that same ranking). NIST will not provide document lists
for individual questions. Note that this is a service only,
provided as a convenience for groups that do not wish to
implement their own document retrieval system. There is no
guarantee that this ranking will contain all the documents that
actually answer the questions in a series, even if such documents exist.
The document lists will be in the same format as previous years:
qnum rank docid rsv

where
qnum is the target number
rank is the rank at which PRISE retrieved the doc (most similar doc is rank 1)
docid is the document identifier
rsv is the "retrieval status value", or similarity between the query and doc, where larger is better
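Reading a ranking in this four-column format is straightforward; the sketch below (function name and in-memory layout are our choices) groups the entries by target and sorts each group by rank.

```python
def parse_ranking(lines):
    """Map target number -> list of (rank, docid, rsv), in rank order."""
    rankings = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        qnum, rank, docid, rsv = fields
        rankings.setdefault(qnum, []).append((int(rank), docid, float(rsv)))
    for docs in rankings.values():
        docs.sort()  # most similar document (rank 1) first
    return rankings
```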
IV. Submission deadlines
The deadline for all results for the main task is 11:59 p.m. EDT on August 2, 2006. This is a firm deadline. Each participant will have one week between the time that the questions are released and the time that the results are due back at NIST. The primary week will be July 26-August 2. That is, NIST will post the questions on the web site by noon EDT on July 26 and results will have to be submitted to NIST by 11:59 p.m. August 2 (so you get a week plus twelve hours). However, there may be participants for whom that week will not work. In this case, the participant must make prior arrangements with Ellen Voorhees (email@example.com) to choose another one-week period that begins after July 5 and ends before August 2. Results must be received one week after you receive the questions or by August 2, whichever is earlier. Late submissions will be discarded.
Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the main TREC 2006 participants' mailing list in early July. At that time, NIST will release a routine that checks for common errors in result files including such things as invalid document numbers, wrong formats, missing data, etc. You should check your runs with this script before submitting them to NIST because if the automatic submission procedure detects any errors, it will not accept the submission.
No part of the system can be changed in any manner between the time the test questions are downloaded and the time the results are submitted to NIST. For the main task, no manual processing of questions, answers or any other part of a system is allowed: all processing must be fully automatic. Targets must be processed independently. Questions within a series must be processed in order, without looking ahead. That is, your system may use the information in the questions and system-produced answers of earlier questions in a series to answer later questions in the series, but the system may not look at a later question in the series before answering the current question. This requirement models (some of) the type of processing a system would require to process a dialog with the user.
Last updated: Thursday, 15-Jun-2006 16:03:45 MDT
Date created: Wednesday, 03-May-06