TREC 2005 Question Answering Track Guidelines
I. Summary
II. Document set
III. Submission deadlines
IV. Task details
V. Timetable
I. Summary
The goal of the TREC QA track is to foster research on systems that retrieve answers rather than documents in response to a question. The focus is on systems that can function in unrestricted domains. The TREC 2005 QA track will consist of three tasks:
The main task is essentially the same as the single task from 2004, in that the test set will consist of a set of question series where each series asks for information regarding a particular target. As in TREC 2004, the targets will include people, organizations, and other entities; unlike TREC 2004, the target can also be an event. Events were added because the document set from which the answers are to be drawn consists of newswire articles. (The 2005 corpus will again be the AQUAINT corpus.) Each question series will consist of some factoid and some list questions and will end with exactly one "Other" question. The answer to the "Other" question is to be interesting information about the target that is not covered by the preceding questions in the series. The runs will be evaluated using the same methodology as in TREC 2004, though the primary measure will be the per-series combined score.
The document ranking task will be to submit, for a subset of the questions in the main task, a ranked list of <=1000 documents for each question. To address the concern regarding document retrieval and QA, we are requiring that all submissions to the main task include a ranked list of documents for each question in this subset. This ranked list should be the set of documents used by your system to create the answer. Furthermore, we are encouraging groups whose primary emphasis is document retrieval, not QA, to also submit document rankings for the questions. (You may submit a document ranking and not an answer file, but if you submit an answer file to the main task you must also submit a document ranking.) Judging and scoring of the document rankings will be done using trec_eval as for standard ad hoc retrieval, except that a document will be judged as relevant if it contains an answer to the question.
The relationship task will be the same task as was performed in the AQUAINT relationship pilot. A description of the AQUAINT relationship pilot can be found on the "Additional Question Answering Resource" page of the Data/QA section of the TREC web site (http://trec.nist.gov/data/qa/add_qaresources.html). The task in the pilot was such that systems were given TREC-like topic statements that ended with a question asking for evidence for a particular relationship. The initial part of the topic statement set the context for the question. The question was either a yes/no question, which was understood to be a request for evidence supporting the answer, or an explicit request for the evidence itself. The system response was a set of information nuggets that were evaluated using the same scheme as definition questions.
Participants in the QA track may do one, two, or all three tasks. The relationship task is independent from the other two tasks. However, if you participate in the main task then you are required to participate in the document ranking task as well. All runs for the main task and document ranking task must be completely automatic; no manual intervention of any sort is allowed. All targets must be processed from the same initial state (i.e., your system may not adapt to targets that have already been processed). For the relationship task, some manual processing is allowed, but when you submit your results you must describe what manual processing was done (if any).
Results for the main and document ranking tasks are due at NIST on or
before July 27, 2005. Results for the relationship task are due at
NIST on or before August 31, 2005. You may submit up to three runs
each for the main task and document ranking task, and up to two runs
for the relationship task. All submitted runs will be judged for the
main task and relationship task. At least one run will be judged for
the document ranking task; please designate the ranking of the runs
for the document ranking task at the time of submission.
II. Document set
The document set for all three tasks is the same as was used in the past three QA tracks. It consists of the set of documents on the AQUAINT disk set. See the Welcome message (in the email archive) for how to obtain the AQUAINT disks. The AQUAINT collection consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires. The following is a sample from the collection.
<DOCNO> NYT19990430.0001 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1999-04-30 00:01 </DATE_TIME>
A8974 &Cx1f; taf-z
u s &Cx13; &Cx11; BC-BOX-TVSPORTS-COLUMN-N 04-30 0809
<SLUG> BC-BOX-TVSPORTS-COLUMN-NYT </SLUG>
SPORTS COLUMN: A MARCIANO DOCUDRAMA GETS MUCH OF IT WRONG
(ATTN: Iowa) (rk)
By RICHARD SANDOMIR
c.1999 N.Y. Times News Service
Toward the end of ``Rocky Marciano,'' a coming Showtime
docudrama, the retired boxer flies to Denver on the final day of
his life to visit Joe Louis in a psychiatric hospital. When he
leaves, Marciano hands the hospital administrator a bag full of
cash to upgrade Louis' care.
NYT-04-30-99 0001EDT &QL;
III. Submission deadlines
The deadline for all results for the main and document ranking tasks is 11:59 p.m. EDT on July 27, 2005. This is a firm deadline. Each participant will have one week between the time that the questions are released and the time that the results are due back at NIST. The primary week will be July 20--27. That is, NIST will post the questions on the web site by noon EDT on July 20 and results will have to be submitted to NIST by 11:59 p.m. July 27 (so you get a week plus twelve hours). However, there may be participants for whom that week will not work. In this case, the participant must make prior arrangements with Hoa Dang (email@example.com) to choose another one-week period that begins after July 5 and ends before July 27. Results must be received one week after you receive the questions or by July 27, whichever is earlier. Late submissions will be discarded.
The deadline for all results for the relationship task is 11:59 p.m. EDT on August 31, 2005. This is also a firm deadline. Because the judging for the relationship task will be done after the judging for the main and document ranking tasks, we can allow more time for participants to process the test data before submitting their results.
Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the main TREC 2005 participants' mailing list in early July. At that time, NIST will release a routine that checks for common errors in result files including such things as invalid document numbers, wrong formats, missing data, etc. You should check your runs with this script before submitting them to NIST because if the automatic submission procedure detects any errors, it will not accept the submission. Participants submitting a run for the main task must submit a corresponding run for the document ranking task at the same time.
No part of the system can be changed in any manner between the time the test questions are downloaded and the time the results are submitted to NIST. For the main task and document ranking task, no manual processing of questions, answers or any other part of a system is allowed: all processing must be fully automatic. Targets must be processed independently. Questions within a series must be processed in order, without looking ahead. That is, your system may use the information in the questions and system-produced answers of earlier questions in a series to answer later questions in the series, but the system may not look at a later question in the series before answering the current question. This requirement models (some of) the type of processing a system would require to process a dialog with the user.
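The ordering constraint can be pictured as a simple driver loop. The following is only a sketch under stated assumptions: run_series, the (qid, text) question tuples, and the answer callback are hypothetical names chosen for illustration, not part of any NIST software.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration: a question is (qid, text), and an
# answerer maps (target, question text, history so far) to an answer string.
Question = Tuple[str, str]
History = List[Tuple[str, str]]
Answerer = Callable[[str, str, History], str]


def run_series(target: str, questions: List[Question], answer: Answerer) -> Dict[str, str]:
    """Answer one target's questions strictly in order, with no look-ahead."""
    history: History = []            # fresh state for every target
    responses: Dict[str, str] = {}
    for qid, text in questions:      # only the current question is visible
        response = answer(target, text, list(history))
        responses[qid] = response
        history.append((text, response))  # earlier Q/A may inform later ones
    return responses
```

Because the history list is created inside the function, nothing learned from one target carries over to the next, and the answerer never receives a question that comes later in the series.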
IV. Task details
1. Main Task
The question set will be XML-formatted; the format explicitly tags the target as well as the type of each question in the series (type is one of FACTOID, LIST, and OTHER). Each question will have an id of the form X.Y where X is the target number and Y is the number of the question in the series (so question 3.4 is the fourth question of the third target). Factoid questions are questions that seek short, fact-based answers such as have been used in previous TREC QA tracks. Some factoid questions may not have an answer in the document collection. List questions are requests for a set of instances of a specified type. Factoid and list questions require exact answers to be returned. Responses to the "other" question need not be exact, though excessive length will be penalized.
The document collection may not contain a correct answer for some of the factoid questions. In this case, the correct response is the string "NIL". A question will be assumed to have no correct answer in the collection if our assessors do not find an answer during the answer verification phase AND no participant returns a correct response. (If a system returns a right answer that is unsupported, NIL will still be the correct response for that question.)
Test set of questions
The test set consists of a series of questions for each of a set of targets. The targets and questions were developed by NIST assessors after searching the document set to determine if there was an appropriate amount of information about the target. There will be 75 targets in the test set. Each series will contain several factoid questions, 0--2 list questions, and a question called "other". The "other" question is to be interpreted as "Give additional information about the target that would be interesting to the user but that the user has not explicitly asked for." As in the TREC 2004 track, we will assume that the user is an "average" adult reader of American newspapers.
Each question is tagged as to its type using the XML format below:
<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE trec2004qa [
<!--
TREC 2004 QA track data definition
Written 2004-05-11 by Jochen L. Leidner <firstname.lastname@example.org.>
Modified 2004-05-21 by Ellen Voorhees <email@example.com>
This file has been released in the public domain.
-->
<!ELEMENT trecqa (target+) >
<!ELEMENT target (qa+) >
<!ELEMENT qa (q, as) >
<!ELEMENT q (CDATA) > <!-- don't want text of question parsed -->
<!ELEMENT as (a*, nugget*) >
<!ATTLIST trecqa year CDATA #REQUIRED
                 task CDATA #REQUIRED>
<!ATTLIST target id ID #REQUIRED
                 text CDATA #REQUIRED>
<!ATTLIST q id ID #REQUIRED
            type (FACTOID|LIST|OTHER) #REQUIRED>
<!ATTLIST a src CDATA #REQUIRED
            regex CDATA #IMPLIED >
<!ATTLIST nugget id ID #REQUIRED
                 type (VITAL|OKAY) #REQUIRED>
<!-- ====================================================================== -->
]>
<trecqa year="2004" task="main">
<target id="1" text="AmeriCorps">
<q id = "1.1" type="FACTOID">
When was AmeriCorps founded?
</q>
<q id = "1.2" type="FACTOID">
How many volunteers work for it?
</q>
<q id = "1.3" type="LIST">
What activities are its volunteers involved in?
</q>
<q id="1.4" type="OTHER">
</q>
</target>
<target id="2" text="skier Alberto Tomba">
<q id="2.1" type="FACTOID">
How many Olympic gold medals did he win?
</q>
<q id="2.2" type="FACTOID">
What nationality is he?
</q>
<q id="2.3" type="FACTOID">
What is his nickname?
</q>
<q id="2.4" type="OTHER">
</q>
</target>
<target id="3" text="Kama Sutra">
<q id ="3.1" type="FACTOID">
Who wrote it?
</q>
<q id="3.2" type="FACTOID">
When was it written?
</q>
<q id="3.3" type="FACTOID">
What is it about?
</q>
<q id="3.4" type="OTHER">
</q>
</target>
</trecqa>
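A question file of this shape can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree and assumes the file is well-formed, with every q and target element closed. SAMPLE is a shortened stand-in written for this example, and load_series is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed stand-in for a question file (an assumption for this sketch).
SAMPLE = """<trecqa year="2004" task="main">
  <target id="1" text="AmeriCorps">
    <q id="1.1" type="FACTOID">When was AmeriCorps founded?</q>
    <q id="1.3" type="LIST">What activities are its volunteers involved in?</q>
    <q id="1.4" type="OTHER"></q>
  </target>
</trecqa>"""


def load_series(xml_text):
    """Return a list of (target_id, target_text, [(qid, qtype, question), ...])."""
    root = ET.fromstring(xml_text)
    series = []
    for target in root.iter("target"):
        # iter("q") finds q elements at any depth below the target,
        # so it works whether or not they are wrapped in another element.
        questions = [
            (q.get("id"), q.get("type"), (q.text or "").strip())
            for q in target.iter("q")
        ]
        series.append((target.get("id"), target.get("text"), questions))
    return series
```

Each question id has the form X.Y, so the part before the dot should always equal the id of the enclosing target; that makes a cheap consistency check when loading the file.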
The questions used in previous tracks are in the Data/QA section of the TREC web site. This section of the web site contains a variety of other data from previous QA tracks that you may use to develop your QA system. This data includes judgment files, answer patterns, top ranked document lists, and sentence files. See the web site for a description of each of these resources. A draft of the TREC 2004 QA track overview paper, which contains an analysis of the TREC 2004 evaluation, can be found in the QA section on the tracks page in the active participants' section of the TREC web site (http://trec.nist.gov/act_part/tracks.html). The TREC 2004 task was essentially the same as this year's main task, and participants are strongly encouraged to read that discussion of the evaluation methodology.
A submission consists of exactly one response for each question. The definition of a response varies for the different question types.
For factoid questions, a response is a single [answer-string, docid] pair or the string "NIL". The "NIL" string will be judged correct if there is no answer known to exist in the document collection; otherwise it will be judged as incorrect. If an [answer-string, docid] pair is given as a response, the answer-string must contain nothing other than the answer, and the docid must be the id of a document in the collection that supports answer-string as an answer. An answer string must contain a complete, exact answer and nothing else. As with correctness, exactness will be in the opinion of the assessor.
Responses will be judged by human assessors who will assign one
of four possible judgments to a response: incorrect, unsupported,
non-exact, or correct.
For list questions, a response is an unordered, non-empty set of [answer-string, docid] pairs, where each pair is called an instance. There is no limit on the number of pairs that may be returned for a list question. The interpretation of the pair is the same as for factoid questions, and each pair will be judged in the same way as for a factoid question. Note that this means that if an answer-string actually contains multiple answers for the question it will be marked inexact and will thus hurt the question's precision score. In addition to judging the individual pairs in a response, the assessors will also mark a subset of the pairs judged correct as being distinct. If a single instance is correct but listed multiple times, the assessor will mark exactly one arbitrary instance as distinct, and the others will not be marked as distinct. Scores will be computed using the number of correct, distinct instances in the set.
A response for the "other" question that ends each target's series is also an unordered, non-empty, unbounded set of [answer-string, docid] pairs, but the interpretation of the pairs differs somewhat from a list response. A pair in an "other" response is assumed to represent a new (i.e., not part of the question series nor previously reported as an answer to an earlier question in the series), interesting fact about the target. Individual pairs will not be judged for these questions. Instead, the assessor will construct a list of desirable information nuggets about the target, and count the number of distinct, desirable nuggets that occur in the response as a whole. Responses to these questions will be penalized for excessive length, but there is no expectation of an exact answer.
You are required to submit a run for the
document ranking task with each submission for the main task.
A submission file for the main task consists of two parts,
separated by a single blank line. The first part comprises the document
rankings for the questions in the document ranking task (following the
submission format for the document ranking task); the second part
comprises the responses to all the questions in the main task and must
contain a response for each question in the main task. Each line in
the second part of the submission must have the form
qid run-tag docid answer-string
where qid is the question number (of the form X.Y), run-tag is the run id, docid is the id of the supporting document or the string "NIL" (no quotes) if the question is a factoid question and there is no answer in the collection, and answer-string is a text string with no embedded newlines or is empty if docid is NIL.
Any amount of white space may be used to separate columns, as long as there is some white space between columns and every column is present (modulo empty answer-string when docid is NIL). Answer-string cannot contain any line breaks, but should be immediately followed by exactly one line break. Other white space is allowed in answer-string. There is no limit imposed on the length of answer-string. The "run tag" should be a unique identifier for your group AND for the method that produced the run. The run tag for the main task must be constructed by appending the string "M" to the run tag for the document ranking task from the first part of the submission file.
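The line format above can be checked locally before submission. The following is a minimal sketch, not the NIST checking routine; parse_answer_line and main_run_tag are hypothetical helper names, and the checks cover only the rules stated in this section.

```python
def parse_answer_line(line):
    """Split one answer line into (qid, run_tag, docid, answer_string)."""
    # At most four columns: the answer-string keeps its inner white space.
    fields = line.rstrip("\n").split(None, 3)
    if len(fields) < 3:
        raise ValueError("expected at least qid, run-tag and docid: %r" % line)
    qid, run_tag, docid = fields[:3]
    answer = fields[3].strip() if len(fields) == 4 else ""
    target, _, number = qid.partition(".")
    if not (target.isdigit() and number.isdigit()):
        raise ValueError("qid must have the form X.Y: %r" % qid)
    if not answer and docid != "NIL":
        raise ValueError("answer-string may be empty only when docid is NIL")
    return qid, run_tag, docid, answer


def main_run_tag(ranking_run_tag):
    """Main-task run tag: the document ranking run tag with "M" appended."""
    return ranking_run_tag + "M"
```

Splitting with a maximum of three splits is the design choice that matters here: any amount of white space may separate the first three columns, while white space inside the answer-string is preserved.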
The first few lines of the second part of an example submission might
look like this:
1.1 nistqa05M NYT19990326.0303 Nicole Kidman
1.2 nistqa05M NIL
1.3 nistqa05M APW20000908.0100 Godiva
1.3 nistqa05M NYT19981215.0192 Wittamer of Brussels
1.3 nistqa05M NYT19990209.0202 Nirvana
1.3 nistqa05M NYT19990210.0179 Leonidas
1.3 nistqa05M NYT20000211.0296 Guylian
1.3 nistqa05M NYT20000824.0034 Les Cygnes
1.3 nistqa05M NYT20000927.0157 Callebaut
1.4 nistqa05M NYT19990421.0438 fringe rock music genres
1.4 nistqa05M NYT19990430.0345 gloomy subculture
2.1 nistqa05M NIL
where the answers for run "nistqa05M" in the main task have been constructed from the document rankings in run "nistqa05" in the document ranking task. Each group may submit up to three runs. Please rank each run at the time of submission. All submitted runs for the main task will be judged, but only the top ranked run will be guaranteed to be judged for the document ranking task.
Scoring
The different types of questions have different scoring metrics. Each of the three (factoid,
list, other) scores has a range of 0-1 with 1 being the high score.
We will compute the factoid-score, list-score, and other-score for
each series. The combined weighted score of a single series will be a
simple weighted average of these three scores for questions in the
series.
For factoid questions, the response will be judged as "incorrect", "unsupported", "non-exact", or "correct". All judgments are in the opinion of the NIST assessor. The factoid-score for a series is the fraction of factoid questions in the series judged to be "correct".
The response to a list question is a non-null, unordered,
and unbounded set of [answer-string, docid] pairs, where each
pair is called an instance. An individual instance is interpreted as
for factoid questions and will be judged in the same way.
The final answer set for a list question will be created from
the union of the distinct, correct responses returned by all
participants plus the set of answers found by the NIST assessor
during question development. An individual list question will be scored
by first computing instance recall (IR) and instance precision (IP)
using the final answer set, and combining those scores using
the F measure with recall and precision equally weighted.
IR = # instances judged correct & distinct / |final answer set|
IP = # instances judged correct & distinct / # instances returned
F = (2*IP*IR)/(IP+IR)
The list-score for a series is the mean of the F scores of the list questions in the series.
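As a worked illustration of the list-question formulas, here is a small sketch. The function names are hypothetical, and the handling of the zero case (F taken as 0 when no returned instance is correct and distinct) is an assumption, since the formula itself is undefined when IP + IR is 0.

```python
def list_question_f(correct_distinct, returned, final_answer_set_size):
    """F score for one list question from instance recall and precision."""
    if correct_distinct == 0:
        return 0.0  # assumed convention: F is 0 when IP + IR would be 0
    ir = correct_distinct / final_answer_set_size   # instance recall
    ip = correct_distinct / returned                # instance precision
    return (2 * ip * ir) / (ip + ir)


def list_score(f_scores):
    """List-score for a series: mean F over its list questions."""
    if not f_scores:
        # How a series with no list questions is treated is not specified here.
        raise ValueError("series has no list questions")
    return sum(f_scores) / len(f_scores)
```

For example, a response with 6 instances of which 3 are judged correct and distinct, against a final answer set of 4, has IR = 0.75 and IP = 0.5, giving F = 0.6.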
The response for an "other" question is syntactically the same
as for a list question: a non-null, unordered, and unbounded set
of [answer-string, docid] pairs. The interpretation of this
set is different, however. For each "other" question, the
assessor will create a list of acceptable information nuggets
about the target from the union of the returned responses and
the information discovered during question development.
This list will NOT contain the answers to previous questions in
the target's series: systems are expected to report only new
information for the "other" question. Some of the nuggets will
be deemed essential, while the remaining nuggets on the list are
acceptable. There may be many other facts related to the target that,
while true, are not considered acceptable by the assessor (because they
are not interesting, for example). These items will not be on the
list at all (and thus including them in a response will be penalized).
All decisions regarding acceptability are in the opinion of the assessor.
Once the list of acceptable nuggets is created, the assessor will view the
response for one question from one system in its entirety and
mark the essential and acceptable nuggets contained in it.
Each nugget that is present will be matched only once.
An individual "other" question will be scored using
nugget recall (NR) and an approximation to nugget precision (NP) based
on length. These scores will be combined using the F measure with
recall three times as important as precision.
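The "other" score can be sketched in a few lines, following the definitions of NR, NP, and F stated for this task (an allowance of 100 non-white-space characters per matched nugget, and F with recall weighted three times as heavily as precision). The function names are hypothetical, and returning 0 when no essential nugget is matched is an assumed convention for the case where the formula's numerator vanishes.

```python
def non_whitespace_length(answer_strings):
    """Total number of non-white-space characters across the answer strings."""
    return sum(len("".join(s.split())) for s in answer_strings)


def other_question_f(essential_returned, acceptable_returned, essential_total, length):
    """F(beta=3) for one "other" question; assumes essential_total > 0."""
    nr = essential_returned / essential_total              # nugget recall
    if nr == 0:
        return 0.0  # assumed convention: no essential nugget matched
    allowance = 100 * (essential_returned + acceptable_returned)
    if length < allowance:
        np_ = 1.0                                          # within the allowance
    else:
        np_ = 1.0 - (length - allowance) / length          # penalize excess length
    return (10 * np_ * nr) / (9 * np_ + nr)
```

For instance, matching 2 of 4 essential nuggets and 1 acceptable nugget in a response of 600 non-white-space characters gives NR = 0.5, an allowance of 300, NP = 0.5, and F = 0.5. The same nuggets in 200 characters give NP = 1 and a higher F.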
NR = # essential nuggets returned in response / # essential nuggets
NP is defined using
allowance = 100*(# essential+acceptable nuggets returned)
length = total # non-white-space characters in answer strings
NP = 1 if length < allowance
     else 1-[(length-allowance)/length]
F = (10*NP*NR)/(9*NP + NR)
The other-score for a series is simply the F score of its "other" question.
Document lists
As a service to the track, NIST will provide the ranking of the
top 1000 documents retrieved by the PRISE search engine
when using the target as the query, and the full text
of the top 50 documents per target (as given from
that same ranking). NIST will not provide document lists
for individual questions. Note that this is a service only,
provided as a convenience for groups that do not wish to
implement their own document retrieval system. There is no
guarantee that this ranking will contain all the documents that
actually answer the questions in a series, even if such documents exist.
The document lists will be in the same format as previous years:
qnum rank docid rsv
where
qnum is the target number
rank is the rank at which PRISE retrieved the doc (most similar doc is rank 1)
docid is the document identifier
rsv is the "retrieval status value" or similarity between the query and doc where larger is better
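Reading such a list takes only a few lines. This sketch assumes one white-space-separated record per line in the four-column order given above; read_document_list is a hypothetical helper name.

```python
from collections import defaultdict


def read_document_list(lines):
    """Group "qnum rank docid rsv" lines into {qnum: [(rank, docid, rsv), ...]}."""
    by_target = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qnum, rank, docid, rsv = line.split()
        by_target[qnum].append((int(rank), docid, float(rsv)))
    for ranking in by_target.values():
        ranking.sort()  # rank 1 (most similar document) first
    return dict(by_target)
```

Since the lists are per target rather than per question, a system answering question X.Y would look up the entry for target X.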
2. Document ranking task
Test set of questions
The test set for the document ranking task will be a list of question numbers (of the form X.Y) for some subset of the questions from the main task. The number of questions will be determined in part by how long assessing will take, but will be approximately 50 individual questions. Instead of returning an exact answer for each question, systems will return a ranked list of <=1000 documents that contain an answer for each question.
Participants in the main task MUST also participate in this document ranking task, though you may participate in the document ranking task without participating in the main task. If your QA system doesn't produce a ranked list of documents as an initial step, or uses multiple rounds of retrieval, or operates in some other way such that it is not clear what the ranked list of documents should be, you must still submit some list. Since the purpose of these lists is to create document pools both to get a better understanding of the number of instances of correct answers in the collection and to support research on whether some document retrieval techniques are better than others in support of QA, we want the list you produce to be as close as possible to the order in which your system considers the documents.
Submission format
A document ranking submission consists of a single file with 6 columns per line. White space is used to separate columns. The width of the columns is not important, but it is important to have exactly six columns per line with at least one space between the columns.
1.1 Q0 ZF08-175-870 1 4238 prise1
1.1 Q0 ZF08-306-044 2 4223 prise1
1.1 Q0 ZF09-477-757 3 4207 prise1
1.1 Q0 ZF08-312-422 4 4194 prise1
1.1 Q0 ZF08-013-262 5 4189 prise1
etc.
Each question must have at least one document retrieved for it. Provided you have at least one document, you may return fewer than 1000 documents for a question, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 documents per question.
NIST has a routine that checks for common errors in the result files including duplicate document numbers for the same question, invalid document numbers, wrong format, and multiple tags within runs. This routine will be made available to participants to check their runs for errors prior to submitting them. Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.
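Until the official routine is available, some of these checks can be approximated locally. The sketch below is not the NIST script. It assumes the six columns are, in order, question id, the literal Q0, document id, rank, score, and run tag, which is inferred from the example lines rather than stated in the text. It does not check document ids against the collection.

```python
from collections import defaultdict

MAX_DOCS = 1000  # upper bound on documents per question stated in the guidelines


def check_ranking_lines(lines):
    """Return a list of problems found in document ranking lines (empty if none)."""
    problems = []
    seen = defaultdict(set)   # qid -> docids already listed
    run_tags = set()
    for n, line in enumerate(lines, start=1):
        cols = line.split()
        if len(cols) != 6:
            problems.append("line %d: expected 6 columns, found %d" % (n, len(cols)))
            continue
        # Column order is an assumption based on the example lines.
        qid, _q0, docid, _rank, _score, run_tag = cols
        if docid in seen[qid]:
            problems.append("line %d: duplicate docid %s for question %s" % (n, docid, qid))
        seen[qid].add(docid)
        run_tags.add(run_tag)
    for qid, docids in seen.items():
        if len(docids) > MAX_DOCS:
            problems.append("question %s: more than %d documents" % (qid, MAX_DOCS))
    if len(run_tags) > 1:
        problems.append("multiple run tags: %s" % ", ".join(sorted(run_tags)))
    return problems
```

An empty result only means that none of these particular problems was found; the official routine remains the authority on whether a file will be accepted.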
Groups may submit up to three runs to the document ranking task. At least one run will be judged by NIST assessors; NIST may judge more than one run per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will arbitrarily choose the run to assess.
Scoring
For some subset of the questions, NIST will pool the document lists, and the assessors will judge each document in the pool as "contains an answer" or "does not contain an answer". We will then score the submitted runs using trec_eval and treating the contains-answer documents as the relevant documents. Note that unlike other QA evaluations, trec_eval (more specifically, MAP) rewards recall, so retrieving more documents with the same answer will earn a higher MAP score than retrieving a single document with that answer.
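To see why MAP rewards recall, consider a simplified computation of uninterpolated average precision. This is a sketch of the standard measure, not a replacement for trec_eval; the function names are hypothetical, and averaging over every judged question (including any with no contains-answer documents) is an assumption made to keep the example short.

```python
def average_precision(ranked_docids, relevant_docids):
    """Uninterpolated average precision for one question."""
    relevant = set(relevant_docids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, start=1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(rankings, judgments):
    """Mean of average precision over the judged questions."""
    scores = [
        average_precision(rankings.get(qid, []), relevant)
        for qid, relevant in judgments.items()
    ]
    return sum(scores) / len(scores)
```

With two contains-answer documents d1 and d3, the ranking d1, d2, d3, d4 scores (1/1 + 2/3)/2, about 0.83, whereas a run that returns only d1 scores 0.5, even though both runs place an answer-bearing document at rank 1.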
3. Relationship task
AQUAINT defined a "relationship" as the ability of one entity to influence another, including both the means to influence and the motivation for doing so. Eight spheres of influence were noted including financial, movement of goods, family ties, communication pathways, organizational ties, co-location, common interests, and temporal. Recognition of when support for a suspected tie is lacking and determining whether the lack is because the tie doesn't exist or is being hidden/missed is a major concern. The analyst needs sufficient information to establish confidence in any support given. The particular relationships of interest depend on the context.
Test set of topics
In the relationship task, we will use TREC-like topic statements to set a context. The topic will be specific about the type of relationship being sought with respect to the spheres of influence mentioned above. The topic will end with a question that is either a yes/no question, which is to be understood as a request for evidence supporting the answer, or a request for the evidence itself. The system response is a set of information nuggets that provides evidence for the answer. We anticipate having 25 topics in this task.
The topics and their list of evidence (information nuggets) from the AQUAINT 2004 pilot can be downloaded from the bottom of the page http://trec.nist.gov/data/qa/add_qaresources.html.
Submission format
For each topic, the submission file should contain one or more lines of the form
topic-number run-tag doc-id evidence-string
run-tag is a string that is used as a unique identifier for your run. Please limit it to no more than 12 characters, and it may not contain embedded white space. evidence-string is the piece of evidence derived (extracted/concluded/etc.) from the given document. It may contain embedded white space but may NOT contain embedded newlines. The response for all the topics should be contained in a single file. Please include a response for all topics, even if the response is just a place-holder response like:
5 RUNTAG NYT20000101.0001 don't know
Evidence-strings have no length restrictions, but excessive length in a response is penalized in the scoring. Some manual processing is allowed, but when you submit your results you must describe what manual processing was done (if any). Each group may submit at most two runs. All runs will be judged. Since judging will be done after the judging for the main and document ranking tasks, we will accept submissions for the relationship task until 11:59 p.m. EDT on August 31, 2005.
Scoring
The system response will be evaluated as in the AQUAINT definition pilot (see the bottom of http://trec.nist.gov/data/qa/add_qaresources.html for details). We will use F(beta=3) rather than F(beta=5) as the official score. [F(beta=3) penalizes excessive length somewhat more than does F(beta=5).]
Instructions given to analysts for creating topics
The purpose of this task is to create test data to evaluate how well computer systems are able to locate certain kinds of ``evidence'' for relationships between various entities. Evidence for a relationship includes both the means to influence something and the motivation for doing so. For this task, we will consider only evidence that can be found in the AQUAINT corpus, which is a collection of newspaper articles from the Associated Press and New York Times (covering the time period of 1998--2000), and from the English portion of the Xinhua Newswire (covering 1996--2000).
Your task is to create mini-scenarios (called ``topics'' below) that have relevant information contained in the document collection. Each topic statement should set the context for a question that asks for a specific type of evidence, where the type is based on a kind of influence: financial, communication pathway, movement of goods, temporal connection, co-location, organizational ties, family ties, common interests. We hope to create approximately 30 topics total. Across the 30 topics, we would like to cover as many different kinds of influence as the corpus will allow.
The question contained within the topic statement can be either a yes/no question (in which case the evidence should support the answer), or a question that specifically asks for a list of evidence. Ideally, we would like to have some questions that can be answered in the positive (i.e., the relationship exists), some that can be answered in the negative (this is likely to be difficult to find), and some where there is not really enough evidence to conclude either way. There can only be a few questions of this last type because otherwise we won't learn much about how well the systems can do.
To create the topics, you will search the document collection using our search engine called ``WebPrise''. You will get a short introduction on using WebPrise before you begin. You can issue as many queries and look at as many documents as you need to feel that you have a good idea of what the collection contains for a given area. Once you believe you have a viable candidate for a topic, you will use a text editor to type in the topic, the document ids of documents containing evidence, and a short gloss of the evidence itself.
Last updated: Friday, 24-Jun-2005 13:37:18 MDT
Date created: Thursday, 22-May-05