TREC 2005 Question Answering Track Guidelines


Contents:
I. Summary
II. Document set
III. Submission deadlines
IV. Task details
1. Main Task
2. Document ranking task
3. Relationship task

V. Timetable

I. Summary:

The goal of the TREC QA track is to foster research on systems that retrieve answers rather than documents in response to a question. The focus is on systems that can function in unrestricted domains. The TREC 2005 QA track will consist of three tasks:

Main task
Document ranking task (required if participating in main task)
Relationship task

The main task is essentially the same as the single task from 2004, in that the test set will consist of a set of question series where each series asks for information regarding a particular target. As in TREC 2004, the targets will include people, organizations, and other entities; unlike TREC 2004, the target can also be an event. Events were added because the document set from which the answers are to be drawn consists of newswire articles. (The 2005 corpus will again be the AQUAINT corpus.) Each question series will consist of some factoid and some list questions and will end with exactly one "Other" question. The answer to the "Other" question is to be interesting information about the target that is not covered by the preceding questions in the series. The runs will be evaluated using the same methodology as in TREC 2004, though the primary measure will be the per-series combined score.

The document ranking task will be to submit, for a subset of the questions in the main task, a ranked list of <=1000 documents for each question. To address the concern regarding document retrieval and QA, we are requiring that all submissions to the main task include a ranked list of documents for each question in this subset. This ranked list should be the set of documents used by your system to create the answer. Furthermore, we are encouraging groups whose primary emphasis is document retrieval, not QA, to also submit document rankings for the questions. (You may submit a document ranking and not an answer file, but if you submit an answer file to the main task you must also submit a document ranking.) Judging and scoring of the document rankings will be done using trec_eval as for standard ad hoc retrieval, except that a document will be judged as relevant if it contains an answer to the question.

The relationship task will be the same task as was performed in the AQUAINT relationship pilot. A description of the AQUAINT relationship pilot can be found on the "Additional Question Answering Resource" page of the Data/QA section of the TREC web site (http://trec.nist.gov/data/qa/add_qaresources.html). The task in the pilot was such that systems were given TREC-like topic statements that ended with a question asking for evidence for a particular relationship. The initial part of the topic statement set the context for the question. The question was either a yes/no question, which was understood to be a request for evidence supporting the answer, or an explicit request for the evidence itself. The system response was a set of information nuggets that were evaluated using the same scheme as definition questions.

Participants in the QA track may do one, two, or all three tasks. The relationship task is independent from the other two tasks. However, if you participate in the main task then you are required to participate in the document ranking task as well. All runs for the main task and document ranking task must be completely automatic; no manual intervention of any sort is allowed. All targets must be processed from the same initial state (i.e., your system may not adapt to targets that have already been processed). For the relationship task, some manual processing is allowed, but when you submit your results you must describe what manual processing was done (if any).

Results for the main and document ranking tasks are due at NIST on or before July 27, 2005. Results for the relationship task are due at NIST on or before August 31, 2005. You may submit up to three runs each for the main task and document ranking task, and up to two runs for the relationship task. All submitted runs will be judged for the main task and relationship task. At least one run will be judged for the document ranking task; please designate the ranking of the runs for the document ranking task at the time of submission.

The question set for the main task, list of questions for the document ranking task, and topics for the relationship task will be available on the TREC web site on July 20 (by 12 noon EDT). YOU ARE REQUIRED TO FREEZE YOUR SYSTEM BEFORE DOWNLOADING THE QUESTIONS. No changes of any sort can be made to any component of your system or any resource used by your system between the time you fetch the questions and the time you submit your results to NIST.

II. Document set

The document set for all three tasks is the same as was used in the past three QA tracks. It consists of the set of documents on the AQUAINT disk set. See the Welcome message (in the email archive) for how to obtain the AQUAINT disks. The AQUAINT collection consists of 1,033,461 documents taken from the New York Times, the Associated Press, and the Xinhua News Agency newswires. The following is a sample from the collection.

<DOC>
<DOCNO> NYT19990430.0001 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 1999-04-30 00:01 </DATE_TIME>
<HEADER>
A8974 &Cx1f; taf-z
u s &Cx13; &Cx11; BC-BOX-TVSPORTS-COLUMN-N 04-30 0809
</HEADER>
<BODY>
<SLUG> BC-BOX-TVSPORTS-COLUMN-NYT </SLUG>
<HEADLINE>
SPORTS COLUMN: A MARCIANO DOCUDRAMA GETS MUCH OF IT WRONG
</HEADLINE>

(ATTN: Iowa) (rk)
By RICHARD SANDOMIR
c.1999 N.Y. Times News Service
<TEXT>
<P>
Toward the end of ``Rocky Marciano,'' a coming Showtime
docudrama, the retired boxer flies to Denver on the final day of
his life to visit Joe Louis in a psychiatric hospital. When he
leaves, Marciano hands the hospital administrator a bag full of
cash to upgrade Louis' care.
</P>
...
</TEXT>
</BODY>
<TRAILER>
NYT-04-30-99 0001EDT &QL;
</TRAILER>
</DOC>
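Because the header of the sample contains entities such as &Cx1f; that an XML parser would reject, a pattern-based reader may be easier to start with. The following is a minimal sketch, assuming one document per string; the names extract_doc, DOCNO_RE, and PARA_RE are illustrative and not part of any NIST tool, and the sample text is abridged from the document above.

```python
import re

# Abridged copy of the sample document shown above.
SAMPLE_DOC = """<DOC>
<DOCNO> NYT19990430.0001 </DOCNO>
<BODY>
<TEXT>
<P>
Toward the end of ``Rocky Marciano,'' a coming Showtime
docudrama, the retired boxer flies to Denver on the final day of
his life to visit Joe Louis in a psychiatric hospital.
</P>
</TEXT>
</BODY>
</DOC>"""

DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")
PARA_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def extract_doc(raw):
    """Return (docno, paragraphs) for one <DOC> ... </DOC> block."""
    match = DOCNO_RE.search(raw)
    if match is None:
        raise ValueError("no <DOCNO> field found")
    # Collapse the hard line breaks inside each paragraph into single spaces.
    paragraphs = [" ".join(p.split()) for p in PARA_RE.findall(raw)]
    return match.group(1), paragraphs


docno, paragraphs = extract_doc(SAMPLE_DOC)
```

The docno value extracted this way is the identifier to report as the docid in a response.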

III. Submission deadlines

The deadline for all results for the main and document ranking tasks is 11:59 p.m. EDT on July 27, 2005. This is a firm deadline. Each participant will have one week between the time that the questions are released and the time that the results are due back at NIST. The primary week will be July 20--27. That is, NIST will post the questions on the web site by noon EDT on July 20 and results will have to be submitted to NIST by 11:59 pm July 27 (so you get a week plus twelve hours). However, there may be participants for whom that week will not work. In this case, the participant must make prior arrangements with Hoa Dang ([email protected]) to choose another one-week period that begins after July 5 and ends before July 27. Results must be received one week after you receive the questions or by July 27, whichever is earlier. Late submissions will be discarded.

The deadline for all results for the relationship task is 11:59 p.m. EDT on August 31, 2005. This is also a firm deadline. Because the judging for the relationship task will be done after the judging for the main and document ranking tasks, we can allow more time for participants to process the test data before submitting their results.

Results are submitted to NIST using an automatic submission procedure. Details about the submission procedure will be emailed to the main TREC 2005 participants' mailing list in early July. At that time, NIST will release a routine that checks for common errors in result files including such things as invalid document numbers, wrong formats, missing data, etc. You should check your runs with this script before submitting them to NIST because if the automatic submission procedure detects any errors, it will not accept the submission. Participants submitting a run for the main task must submit a corresponding run for the document ranking task at the same time.

Restrictions
No part of the system can be changed in any manner between the time the test questions are downloaded and the time the results are submitted to NIST. For the main task and document ranking task, no manual processing of questions, answers or any other part of a system is allowed: all processing must be fully automatic. Targets must be processed independently. Questions within a series must be processed in order, without looking ahead. That is, your system may use the information in the questions and system-produced answers of earlier questions in a series to answer later questions in the series, but the system may not look at a later question in the series before answering the current question. This requirement models (some of) the type of processing a system would require to process a dialog with the user.
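One way to respect the ordering restriction is to drive the system from a loop that exposes only the current question and the earlier question/answer pairs. The sketch below is an illustration under that assumption; run_series and answer_fn are hypothetical names, and answer_fn stands in for whatever your system does to answer one question.

```python
from typing import Callable, Dict, List, Tuple

Question = Tuple[str, str, str]            # (qid, type, text)
History = List[Tuple[Question, object]]    # earlier questions with the system's own answers


def run_series(target: str,
               questions: List[Question],
               answer_fn: Callable[[str, Question, History], object]) -> Dict[str, object]:
    """Answer one series in order, never exposing later questions."""
    history: History = []      # created fresh for every target: nothing carries over
    responses: Dict[str, object] = {}
    for question in questions:
        # answer_fn sees the target, the current question, and a copy of the history only.
        response = answer_fn(target, question, list(history))
        responses[question[0]] = response
        history.append((question, response))
    return responses


# Toy stand-in for a QA system: it reports how many earlier questions it could see.
def count_history(target, question, history):
    return len(history)


demo = run_series("AmeriCorps",
                  [("1.1", "FACTOID", "When was AmeriCorps founded?"),
                   ("1.2", "FACTOID", "How many volunteers work for it?")],
                  count_history)
```

Calling run_series once per target, with no shared state between calls, also keeps the targets independent of one another.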

IV. Task details

1. Main Task

The question set will be XML-formatted so that the target and the type of each question in the series are explicitly tagged (type is one of FACTOID, LIST, and OTHER). Each question will have an id of the form X.Y where X is the target number and Y is the number of the question in the series (so question 3.4 is the fourth question of the third target). Factoid questions are questions that seek short, fact-based answers such as have been used in previous TREC QA tracks. Some factoid questions may not have an answer in the document collection. List questions are requests for a set of instances of a specified type. Factoid and list questions require exact answers to be returned. Responses to the "other" question need not be exact, though excessive length will be penalized.

The document collection may not contain a correct answer for some of the factoid questions. In this case, the correct response is the string "NIL". A question will be assumed to have no correct answer in the collection if our assessors do not find an answer during the answer verification phase AND no participant returns a correct response. (If a system returns a right answer that is unsupported, NIL will still be the correct response for that question.)

Test set of questions

The test set consists of a series of questions for each of a set of targets. The targets and questions were developed by NIST assessors after searching the document set to determine if there was an appropriate amount of information about the target. There will be 75 targets in the test set. Each series will contain several factoid questions, 0--2 list questions, and a question called "other". The "other" question is to be interpreted as "Give additional information about the target that would be interesting to the user but that the user has not explicitly asked for." As in the TREC 2004 track, we will assume that the user is an "average" adult reader of American newspapers.

Each question is tagged as to its type using the XML format below:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE trec2004qa [



<!ELEMENT trecqa       (target+)           >
<!ELEMENT target        (qa+)                 >
<!ELEMENT qa              (q, as)               >
<!ELEMENT q               (CDATA)    > 
<!ELEMENT as             (a*, nugget*)     >

<!ATTLIST trecqa
                                      year CDATA #REQUIRED
                                      task CDATA #REQUIRED>

<!ATTLIST target        id    ID           #REQUIRED
                                     text CDATA  #REQUIRED>

<!ATTLIST q               id    ID            #REQUIRED
                                     type       (FACTOID|LIST|OTHER)   #REQUIRED>

<!ATTLIST a               src   CDATA   #REQUIRED
                                     regex CDATA #IMPLIED >

<!ATTLIST nugget      id    ID             #REQUIRED
                                     type     (VITAL|OKAY)                     #REQUIRED>
]>

<trecqa year="2004" task="main">

<target id="1" text="AmeriCorps">
       <qa>
             <q id = "1.1" type="FACTOID">
                         When was AmeriCorps founded?
       </q>
       </qa>

       <qa>
             <q id = "1.2" type="FACTOID">
                         How many volunteers work for it?
       </q>
       </qa>

       <qa>
             <q id = "1.3" type="LIST">
                         What activities are its volunteers involved in?
       </q>
       </qa>

       <qa>
             <q id="1.4" type="OTHER">
                         Other
       </q>
       </qa>
</target>

<target id="2" text="skier Alberto Tomba">
       <qa>

            <q id="2.1" type="FACTOID">
                        How many Olympic gold medals did he win?
       </q>
       </qa>

       <qa>
            <q id="2.2" type="FACTOID">
                        What nationality is he?
       </q>
       </qa>

       <qa>
            <q id="2.3" type="FACTOID">
                        What is his nickname?
       </q>
       </qa>

       <qa>
            <q id="2.4" type="OTHER">
                        Other
       </q>
       </qa>

</target>

<target id="3" text="Kama Sutra">
       <qa>
            <q id ="3.1" type="FACTOID">
                         Who wrote it?
       </q>
       </qa>

       <qa>
            <q id="3.2" type="FACTOID">
                         When was it written?
       </q>
       </qa>

       <qa>
            <q id="3.3" type="FACTOID">
                         What is it about?
       </q>
       </qa>

       <qa>
            <q id="3.4" type="OTHER">
        Other
       </q>
       </qa>

</target>

</trecqa>
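A question file in this shape can be read with a standard XML parser. The following is a minimal sketch using Python's xml.etree.ElementTree; read_series is an illustrative name, and the embedded sample is an abridged copy of the example above with the DTD omitted.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the sample above (DTD omitted).
SAMPLE = """<trecqa year="2004" task="main">
<target id="1" text="AmeriCorps">
  <qa><q id="1.1" type="FACTOID"> When was AmeriCorps founded? </q></qa>
  <qa><q id="1.3" type="LIST"> What activities are its volunteers involved in? </q></qa>
  <qa><q id="1.4" type="OTHER"> Other </q></qa>
</target>
</trecqa>"""


def read_series(xml_text):
    """Return a list of (target_id, target_text, questions) tuples.

    Each entry in questions is (qid, type, text), in file order.
    """
    root = ET.fromstring(xml_text)
    series = []
    for target in root.findall("target"):
        questions = [
            (q.get("id"), q.get("type"), (q.text or "").strip())
            for q in target.findall("qa/q")
        ]
        series.append((target.get("id"), target.get("text"), questions))
    return series


all_series = read_series(SAMPLE)
```

Keeping the questions in file order matters, because the questions of a series must be answered in sequence.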

The questions used in previous tracks are in the Data/QA section of the TREC web site. This section of the web site contains a variety of other data from previous QA tracks that you may use to develop your QA system. This data includes judgment files, answer patterns, top ranked document lists, and sentence files. See the web site for a description of each of these resources. A draft of the TREC 2004 QA track overview paper, containing an analysis of the TREC 2004 evaluation, can be found in the QA section on the tracks page in the active participants' section of the TREC web site (http://trec.nist.gov/act_part/tracks.html). The TREC 2004 task was essentially the same as this year's main task, and participants are strongly encouraged to read that discussion of the evaluation methodology.

Submission format

A submission consists of exactly one response for each question. The definition of a response varies for the different question types.

For factoid questions, a response is a single [answer-string, docid] pair or the string "NIL". The "NIL" string will be judged correct if there is no answer known to exist in the document collection; otherwise it will be judged as incorrect. If an [answer-string, docid] pair is given as a response, the answer-string must contain nothing other than the answer, and the docid must be the id of a document in the collection that supports answer-string as an answer. An answer string must contain a complete, exact answer and nothing else. As with correctness, exactness will be in the opinion of the assessor.

Responses will be judged by human assessors who will assign one of four possible judgments to a response:

incorrect: the answer-string does not contain a correct answer or the answer is not responsive;
unsupported: the answer-string contains a correct answer but the document returned does not support that answer;
non-exact: the answer-string contains a correct answer and the document supports that answer, but the string contains more than just the answer (or is missing bits of the answer);
correct: the answer-string consists of exactly a correct answer and that answer is supported by the document returned.

Being "responsive" means such things as including units for quantitative responses (e.g., $20 instead of 20) and answering with regard to a famous entity itself rather than its replicas or imitations. (The paper "The TREC-8 Question Answering Track Evaluation" in the TREC-8 proceedings contains a very detailed discussion of responsiveness.) Answers correctly drawn from documents that are themselves in error will be judged correct. Only the judgment of "correct" will be accepted for scoring purposes.

For list questions, a response is an unordered, non-empty set of [answer-string, docid] pairs, where each pair is called an instance. There is no limit on the number of pairs that may be returned for a list question. The interpretation of the pair is the same as for factoid questions, and each pair will be judged in the same way as for a factoid question. Note that this means that if an answer-string actually contains multiple answers for the question it will be marked non-exact and will thus hurt the question's precision score. In addition to judging the individual pairs in a response, the assessors will also mark a subset of the pairs judged correct as being distinct. If a single instance is correct but listed multiple times, the assessor will mark exactly one arbitrary instance as distinct, and the others will not be marked as distinct. Scores will be computed using the number of correct, distinct instances in the set.

A response for the "other" question that ends each target's series is also an unordered, non-empty, unbounded set of [answer-string, docid] pairs, but the interpretation of the pairs differs somewhat from a list response. A pair in an "other" response is assumed to represent a new (i.e., not part of the question series nor previously reported as an answer to an earlier question in the series), interesting fact about the target. Individual pairs will not be judged for these questions. Instead, the assessor will construct a list of desirable information nuggets about the target, and count the number of distinct, desirable nuggets that occur in the response as a whole. Responses to these questions will be penalized for excessive length, but there is no expectation of an exact answer.

You are required to submit a run for the document ranking task with each submission for the main task. A submission file for the main task consists of two parts, separated by a single blank line. The first part comprises the document rankings for the questions in the document ranking task (following the submission format for the document ranking task); the second part comprises the responses to all the questions in the main task and must contain a response for each question in the main task. Each line in the second part of the submission must have the form

          qid run-tag docid answer-string
where qid           is the question number (of the form X.Y),
      run-tag       is the run id,
      docid         is the id of the supporting document or
                    the string "NIL" (no quotes) if the
                    question is a factoid question and
                    there is no answer in the collection,
and
      answer-string is a text string with no embedded
                    newlines or is empty if docid is NIL

Any amount of white space may be used to separate columns, as long as there is some white space between columns and every column is present (modulo empty answer-string when docid is NIL). Answer-string cannot contain any line breaks, but should be immediately followed by exactly one line break. Other white space is allowed in answer-string. There is no limit imposed on the length of answer-string. The "run tag" should be a unique identifier for your group AND for the method that produced the run. The run tag for the main task must be constructed by appending the string "M" to the run tag for the document ranking task from the first part of the submission file.

The first few lines of the second part of an example submission might look like this:

    1.1  nistqa05M  NYT19990326.0303        Nicole Kidman
    1.2  nistqa05M  NIL
    1.3  nistqa05M  APW20000908.0100        Godiva
    1.3  nistqa05M  NYT19981215.0192        Wittamer of Brussels
    1.3  nistqa05M  NYT19990209.0202        Nirvana
    1.3  nistqa05M  NYT19990210.0179        Leonidas
    1.3  nistqa05M  NYT20000211.0296        Guylian
    1.3  nistqa05M  NYT20000824.0034        Les Cygnes
    1.3  nistqa05M  NYT20000927.0157        Callebaut
    1.4  nistqa05M  NYT19990421.0438        fringe rock music genres
    1.4  nistqa05M  NYT19990430.0345        gloomy subculture
    2.1  nistqa05M  NIL

where the answers for run "nistqa05M" in the main task have been constructed from the document rankings in run "nistqa05" in the document ranking task. Each group may submit up to three runs. Please rank each run at the time of submission. All submitted runs for the main task will be judged, but only the top ranked run will be guaranteed to be judged for the document ranking task.
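The line format for the second part of the submission file can be produced and read back with a few lines of code. The following is a minimal sketch, assuming single spaces as column separators on output; format_response and parse_response are illustrative names and are not the NIST checking routine.

```python
def format_response(qid, run_tag, docid, answer=""):
    """Build one line of the form: qid run-tag docid answer-string."""
    if "\n" in answer or "\r" in answer:
        raise ValueError("answer-string may not contain line breaks")
    if docid == "NIL":
        # NIL responses carry an empty answer-string.
        return f"{qid} {run_tag} NIL"
    if not answer.strip():
        raise ValueError("answer-string is required unless docid is NIL")
    return f"{qid} {run_tag} {docid} {answer.strip()}"


def parse_response(line):
    """Split one response line into (qid, run_tag, docid, answer)."""
    # At most three splits, so white space inside answer-string is preserved.
    parts = line.split(None, 3)
    if len(parts) < 3:
        raise ValueError("expected at least qid, run-tag and docid")
    qid, run_tag, docid = parts[:3]
    answer = parts[3].strip() if len(parts) == 4 else ""
    if docid != "NIL" and not answer:
        raise ValueError("answer-string is required unless docid is NIL")
    return qid, run_tag, docid, answer


line = format_response("1.3", "nistqa05M", "APW20000908.0100", "Godiva")
```

Splitting on the first three runs of white space only is what allows answer strings such as "Wittamer of Brussels" to survive a round trip.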

Scoring

The different types of questions have different scoring metrics. Each of the three (factoid, list, other) scores has a range of 0-1 with 1 being the high score. We will compute the factoid-score, list-score, and other-score for each series. The combined weighted score of a single series will be a simple weighted average of these three scores for questions in the series:

The final score for a run will be the mean of the per-series combined weighted scores. NIST will report the individual per-series scores as well as the final score for each run.
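The two-level averaging can be sketched as follows. The weights are left as a parameter because they are not stated in the text above; the default of 0.5 for factoid and 0.25 each for list and other is an assumption made for illustration only, and series_score and run_score are illustrative names.

```python
def series_score(factoid, list_, other, weights=(0.5, 0.25, 0.25)):
    """Weighted average of the three per-series component scores.

    The default weights are an assumption for illustration; substitute
    the official weights when computing real scores.
    """
    w_factoid, w_list, w_other = weights
    total = w_factoid + w_list + w_other
    return (w_factoid * factoid + w_list * list_ + w_other * other) / total


def run_score(per_series_scores):
    """Final score for a run: mean of the per-series combined scores."""
    return sum(per_series_scores) / len(per_series_scores)


demo_series = series_score(1.0, 0.5, 0.0)
demo_run = run_score([demo_series, 0.375])
```

How a series with no list question contributes to the average is not specified here, so the sketch expects the caller to supply all three component scores.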

For factoid questions, the response will be judged as "incorrect", "unsupported", "non-exact", or "correct". All judgments are in the opinion of the NIST assessor. The factoid-score for a series is the fraction of factoid questions in the series judged to be "correct".
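As a small illustration of this definition, the factoid-score can be computed from the list of judgments for a series. This is a sketch only; factoid_score is an illustrative name, and returning 0.0 for a series without factoid questions is an assumption.

```python
def factoid_score(judgments):
    """Fraction of factoid responses in a series judged "correct".

    judgments holds one label per factoid question, each one of
    "incorrect", "unsupported", "non-exact", or "correct".
    """
    if not judgments:
        return 0.0  # assumption: no factoid questions contributes nothing
    correct = sum(1 for label in judgments if label == "correct")
    return correct / len(judgments)


demo_factoid = factoid_score(["correct", "non-exact", "unsupported", "correct"])
```

Note that "unsupported" and "non-exact" responses count the same as "incorrect" ones under this measure.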

The response to a list question is a non-null, unordered, and unbounded set of [answer-string, docid] pairs, where each pair is called an instance. An individual instance is interpreted as for factoid questions and will be judged in the same way. The final answer set for a list question will be created from the union of the distinct, correct responses returned by all participants plus the set of answers found by the NIST assessor during question development. An individual list question will be scored by first computing instance recall (IR) and instance precision (IP) using the final answer set, and combining those scores using the F measure with recall and precision equally weighted.
That is,

 
    IR = # instances judged correct & distinct/|final answer set|
    IP = # instances judged correct & distinct/# instances returned
    F = (2*IP*IR)/(IP+IR)

The list-score for a series is the mean of the F scores of the list questions in the series.
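The list-question computation above can be sketched directly from the three counts involved. The function name list_f and the handling of zero denominators (returning 0.0) are assumptions of this sketch.

```python
def list_f(correct_distinct, returned, final_answer_set_size):
    """F measure for one list question, with IR and IP equally weighted."""
    if returned == 0 or final_answer_set_size == 0:
        return 0.0  # assumption: nothing to score
    ir = correct_distinct / final_answer_set_size   # instance recall
    ip = correct_distinct / returned                # instance precision
    if ip + ir == 0:
        return 0.0
    return (2 * ip * ir) / (ip + ir)


def list_score(f_scores):
    """List-score for a series: mean of the F scores of its list questions."""
    return sum(f_scores) / len(f_scores) if f_scores else 0.0


# 6 instances returned, 3 judged correct and distinct, 4 known answers:
# IR = 0.75, IP = 0.5, F = 0.6
demo_list_f = list_f(3, 6, 4)
```

Because instances that are correct but not distinct add to the number returned without adding to the numerator, repeating an answer lowers IP.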

The response for an "other" question is syntactically the same as for a list question: a non-null, unordered, and unbounded set of [answer-string, docid] pairs. The interpretation of this set is different, however. For each "other" question, the assessor will create a list of acceptable information nuggets about the target from the union of the returned responses and the information discovered during question development. This list will NOT contain the answers to previous questions in the target's series: systems are expected to report only new information for the "other" question. Some of the nuggets will be deemed essential, while the remaining nuggets on the list are acceptable. There may be many other facts related to the target that, while true, are not considered acceptable by the assessor (because they are not interesting, for example). These items will not be on the list at all (and thus including them in a response will be penalized). All decisions regarding acceptability are in the opinion of the assessor. Once the list of acceptable nuggets is created, the assessor will view the response for one question from one system in its entirety and mark the essential and acceptable nuggets contained in it. Each nugget that is present will be matched only once. An individual "other" question will be scored using nugget recall (NR) and an approximation to nugget precision (NP) based on length. These scores will be combined using the F measure with recall three times as important as precision. In particular,

  NR = # essential nuggets returned in response/# essential nuggets
  NP is defined using
      allowance = 100*(# essential+acceptable nuggets returned)
      length = total # non-white-space characters in answer strings
      NP = 1 if length < allowance
           else 1-[(length-allowance)/length]
  F = (10*NP*NR)/(9*NP + NR)

The other-score for a series is simply the F score of its "other" question.
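The computation for an "other" question can be sketched from the nugget counts and the answer strings. The name other_f is illustrative, and the sketch assumes that the question has at least one essential nugget and that an empty response scores 0.

```python
def other_f(essential_returned, acceptable_returned, essential_total, answer_strings):
    """F measure for one "other" question, recall weighted three times precision."""
    # Length counts non-white-space characters over all answer strings.
    length = sum(len("".join(s.split())) for s in answer_strings)
    if length == 0:
        return 0.0  # assumption: an empty response earns nothing
    nr = essential_returned / essential_total
    allowance = 100 * (essential_returned + acceptable_returned)
    if length < allowance:
        np_ = 1.0
    else:
        np_ = 1 - (length - allowance) / length
    denominator = 9 * np_ + nr
    return (10 * np_ * nr) / denominator if denominator else 0.0


# 2 of 4 essential nuggets and 1 acceptable nugget in 400 characters:
# NR = 0.5, allowance = 300, NP = 0.75
demo_other_f = other_f(2, 1, 4, ["a" * 200, "b b" * 100])
```

Acceptable nuggets do not raise recall, but each one adds 100 characters to the length allowance.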

Document lists

As a service to the track, NIST will provide the ranking of the top 1000 documents retrieved by the PRISE search engine when using the target as the query, and the full text of the top 50 documents per target (as given from that same ranking). NIST will not provide document lists for individual questions. Note that this is a service only, provided as a convenience for groups that do not wish to implement their own document retrieval system. There is no guarantee that this ranking will contain all the documents that actually answer the questions in a series, even if such documents exist. The document lists will be in the same format as previous years:

    qnum rank docid rsv
where
    qnum is the target number 
    rank is the rank at which PRISE retrieved the doc
        (most similar doc is rank 1)
    docid is the document identifier
    rsv is the "retrieval status value" or similarity
         between the query and doc where larger is better

2. Document ranking task

Test set of questions

The test set for the document ranking task will be a list of question numbers (of the form X.Y) for some subset of the questions from the main task. The size of the subset will be determined in part by how long assessing will take, but will be approximately 50 individual questions. Instead of returning an exact answer for each question, systems will return, for each question, a ranked list of <=1000 documents that contain an answer.

Participants in the main task MUST also participate in this document ranking task, though you may participate in the document ranking task without participating in the main task. If your QA system doesn't produce a ranked list of documents as an initial step, or uses multiple rounds of retrieval, or operates in some other way such that it is not clear what the ranked list of documents should be, you must still submit some list. Since the purpose of these lists is to create document pools both to get a better understanding of the number of instances of correct answers in the collection and to support research on whether some document retrieval techniques are better than others in support of QA, we want the list you produce to be as close as possible to the order in which your system considers the documents.

Submission format

A document ranking submission consists of a single file with 6 columns per line. White space is used to separate columns. The width of the columns is not important, but it is important to have exactly six columns per line with at least one space between the columns.

       1.1 Q0 ZF08-175-870  1 4238 prise1
       1.1 Q0 ZF08-306-044  2 4223 prise1
       1.1 Q0 ZF09-477-757  3 4207 prise1
       1.1 Q0 ZF08-312-422  4 4194 prise1
       1.1 Q0 ZF08-013-262  5 4189 prise1
          etc.

where:

the first column is the question number (of the form X.Y).
the second column is currently unused and should always be Q0.
the third column is the official document number of the retrieved document and is the number found in the "docno" field of the document.
the fourth column is the rank at which the document is retrieved.
the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). If you want the precise ranking you submit to be evaluated, the SCORES must reflect that ranking.
the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also run tags must contain 12 or fewer letters and numbers, with *NO* punctuation, to facilitate labeling graphs with the tags. (If you are also participating in the main task, then your run tags must contain 11 or fewer letters and numbers.)

Each question must have at least one document retrieved for it. Provided you have at least one document, you may return fewer than 1000 documents for a question, though note that the standard evaluation measures used in TREC count empty ranks as not relevant. You cannot hurt your score, and could conceivably improve it for these measures, by returning 1000 documents per question.

NIST has a routine that checks for common errors in the result files including duplicate document numbers for the same question, invalid document numbers, wrong format, and multiple tags within runs. This routine will be made available to participants to check their runs for errors prior to submitting them. Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.
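Until the NIST routine is available, a local check along the same lines can catch the most common problems. The following sketch is not the NIST routine: check_ranking is an illustrative name, it assumes the lines for each question appear in rank order, and it does not verify that document numbers exist in the collection. The second document number in the example is made up for illustration.

```python
import re
from collections import defaultdict


def check_ranking(lines, max_docs=1000, max_tag_len=12):
    """Return a list of error messages for a document ranking file."""
    errors = []
    tags = set()
    seen = defaultdict(set)   # question number -> docids already listed
    last_score = {}           # question number -> score on the previous line
    for n, line in enumerate(lines, 1):
        cols = line.split()
        if len(cols) != 6:
            errors.append(f"line {n}: expected 6 columns, found {len(cols)}")
            continue
        qid, q0, docid, rank, score, tag = cols
        if q0 != "Q0":
            errors.append(f"line {n}: second column must be Q0")
        if not re.fullmatch(r"[A-Za-z0-9]+", tag) or len(tag) > max_tag_len:
            errors.append(f"line {n}: invalid run tag {tag!r}")
        tags.add(tag)
        if docid in seen[qid]:
            errors.append(f"line {n}: duplicate document {docid} for question {qid}")
        seen[qid].add(docid)
        if len(seen[qid]) > max_docs:
            errors.append(f"line {n}: more than {max_docs} documents for question {qid}")
        try:
            value = float(score)
        except ValueError:
            errors.append(f"line {n}: score {score!r} is not a number")
            continue
        if qid in last_score and value > last_score[qid]:
            errors.append(f"line {n}: score increases within question {qid}")
        last_score[qid] = value
    if len(tags) > 1:
        errors.append("multiple run tags within one run")
    return errors


good = ["1.1 Q0 NYT19990430.0001 1 12.5 nistqa05",
        "1.1 Q0 NYT19990430.0002 2 12.5 nistqa05",
        "1.2 Q0 APW20000908.0100 1 3.0 nistqa05"]
problems = check_ranking(good)
```

Pass max_tag_len=11 when the run also accompanies a main task submission, since the main task tag is formed by appending "M".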

Groups may submit up to three runs to the document ranking task. At least one run will be judged by NIST assessors; NIST may judge more than one run per group depending upon available assessor time. During the submission process you will be asked to rank your submissions in the order that you want them judged. If you give conflicting rankings across your set of runs, NIST will arbitrarily choose the run to assess.

Scoring

For some subset of the questions, NIST will pool the document lists, and the assessors will judge each document in the pool as "contains an answer" or "does not contain an answer". We will then score the submitted runs using trec_eval and treating the contains-answer documents as the relevant documents. Note that unlike other QA evaluations, trec_eval (more specifically, MAP) rewards recall, so retrieving more documents with the same answer will earn a higher MAP score than retrieving a single document with that answer.
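The effect on MAP can be seen from the standard definition of average precision, sketched below for a single question. This is an illustration and not the trec_eval implementation: average_precision is an illustrative name, and the input is assumed to be already ordered by score.

```python
def average_precision(ranked_docids, relevant):
    """Average precision for one question.

    ranked_docids: document ids in ranked order (best first).
    relevant: set of ids judged "contains an answer".
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docid in enumerate(ranked_docids, 1):
        if docid in relevant:
            hits += 1
            precision_sum += hits / rank   # precision at this relevant document
    # Dividing by all relevant documents is what rewards recall.
    return precision_sum / len(relevant)


relevant_docs = {"d1", "d2"}
ap_one = average_precision(["d1", "x"], relevant_docs)    # finds one of two
ap_both = average_precision(["d1", "d2"], relevant_docs)  # finds both
```

MAP is the mean of this value over the judged questions, so a run that retrieves only one of two answer-bearing documents at rank 1 earns 0.5 for that question, while a run that retrieves both at ranks 1 and 2 earns 1.0.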

3. Relationship task

AQUAINT defined a "relationship" as the ability of one entity to influence another, including both the means to influence and the motivation for doing so. Eight spheres of influence were noted including financial, movement of goods, family ties, communication pathways, organizational ties, co-location, common interests, and temporal. Recognition of when support for a suspected tie is lacking and determining whether the lack is because the tie doesn't exist or is being hidden/missed is a major concern. The analyst needs sufficient information to establish confidence in any support given. The particular relationships of interest depend on the context.

Test set of topics

In the relationship task, we will use TREC-like topic statements to set a context. The topic will be specific about the type of relationship being sought with respect to the spheres of influence mentioned above. The topic will end with a question that is either a yes/no question, which is to be understood as a request for evidence supporting the answer, or a request for the evidence itself. The system response is a set of information nuggets that provides evidence for the answer. We anticipate having 25 topics in this task.

The topics and their list of evidence (information nuggets) from the AQUAINT 2004 pilot can be downloaded from the bottom of the page http://trec.nist.gov/data/qa/add_qaresources.html .

Submission format

For each topic, the submission file should contain one or more lines of the form

     topic-number run-tag doc-id evidence-string

run-tag is a string that is used as a unique identifier for your run. Please limit it to no more than 12 characters, and it may not contain embedded white space. evidence-string is the piece of evidence derived (extracted/concluded/etc.) from the given document. It may contain embedded white space but may NOT contain embedded newlines. The response for all the topics should be contained in a single file. Please include a response for all topics, even if the response is just a place-holder response like:

     5 RUNTAG NYT20000101.0001 don't know

Evidence-strings have no length restrictions, but excessive length in a response is penalized in the scoring.
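
The format rules above can be checked mechanically before submitting. The following Python is only a minimal sketch, not an official NIST checker: the function names (format_line, parse_line, missing_topics) are hypothetical, and it assumes that doc-id contains no white space, that evidence-string is non-empty, and that fields are separated by single spaces, none of which is stated explicitly in these guidelines.

```python
MAX_RUN_TAG_LEN = 12  # limit stated in the guidelines


def format_line(topic_number, run_tag, doc_id, evidence):
    """Build one submission line, raising ValueError on a format violation."""
    if not run_tag or len(run_tag) > MAX_RUN_TAG_LEN:
        raise ValueError("run-tag must be 1 to 12 characters long")
    if any(ch.isspace() for ch in run_tag):
        raise ValueError("run-tag may not contain white space")
    # Assumption of this sketch: a doc-id never contains white space.
    if not doc_id or any(ch.isspace() for ch in doc_id):
        raise ValueError("doc-id must be a single token")
    if "\n" in evidence or "\r" in evidence:
        raise ValueError("evidence-string may not contain newlines")
    # Assumption of this sketch: an empty evidence-string is not allowed.
    if not evidence.strip():
        raise ValueError("evidence-string is empty")
    return f"{topic_number} {run_tag} {doc_id} {evidence.strip()}"


def parse_line(line):
    """Split a submission line into (topic, run_tag, doc_id, evidence)."""
    # Split on the first three runs of white space only, so that the
    # evidence-string keeps its embedded white space.
    parts = line.rstrip("\r\n").split(None, 3)
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}")
    return tuple(parts)


def missing_topics(lines, expected_topics):
    """Return the expected topic numbers that have no response line."""
    seen = {parse_line(line)[0] for line in lines}
    return [t for t in expected_topics if str(t) not in seen]
```

For example, format_line(5, "RUNTAG", "NYT20000101.0001", "don't know") reproduces the place-holder line shown above, and missing_topics can be run over the whole file to confirm that every topic has at least one response.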

Some manual processing is allowed, but when you submit your results you must describe what manual processing was done (if any). Each group may submit at most two runs. All runs will be judged. Since judging will be done after the judging for the main and document ranking tasks, we will accept submissions for the relationship task until 11:59 p.m. EDT on August 31, 2005.

Scoring

The system response will be evaluated as in the AQUAINT definition pilot (see the bottom of http://trec.nist.gov/data/qa/add_qaresources.html for details). We will use F(beta=3) rather than F(beta=5) as the official score. [F(beta=3) penalizes excessive length somewhat more than does F(beta=5).]
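
To make the effect of beta concrete, here is a minimal Python sketch of a nugget-based F score. It is not taken from this page: it assumes the measure used in earlier TREC definition-question evaluations, in which recall is computed over "vital" nuggets, each matched nugget (vital or okay) earns a 100-character length allowance, and precision falls once the response exceeds that allowance. The function name, its parameters, and the 100-character allowance are assumptions here; consult the pilot description at the URL above for the authoritative definition.

```python
def nugget_f_score(vital_matched, okay_matched, vital_total,
                   response_length, beta=3.0, allowance_per_nugget=100):
    """F(beta) over nugget recall and a length-based precision.

    Assumed definition (not stated in these guidelines):
      recall    = vital nuggets matched / vital nuggets in the answer key
      allowance = allowance_per_nugget * (vital + okay nuggets matched)
      precision = 1 if length <= allowance,
                  else 1 - (length - allowance) / length
    """
    if vital_total <= 0:
        raise ValueError("vital_total must be positive")
    recall = vital_matched / vital_total
    allowance = allowance_per_nugget * (vital_matched + okay_matched)
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


# A long response: 3 of 4 vital nuggets and 1 okay nugget in 1000 characters.
f3 = nugget_f_score(3, 1, 4, 1000, beta=3)
f5 = nugget_f_score(3, 1, 4, 1000, beta=5)
print(f"F(beta=3) = {f3:.3f}, F(beta=5) = {f5:.3f}")
```

Under these assumptions, the same over-long response scores lower with beta=3 than with beta=5, because a smaller beta gives relatively more weight to the length-based precision.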

Instructions given to analysts for creating topics

The purpose of this task is to create test data to evaluate how well computer systems are able to locate certain kinds of "evidence" for relationships between various entities. Evidence for a relationship includes both the means to influence something and the motivation for doing so. For this task, we will consider only evidence that can be found in the AQUAINT corpus, which is a collection of newspaper articles from the Associated Press and New York Times (covering the time period of 1998-2000), and from the English portion of the Xinhua Newswire (covering 1996-2000).

Your task is to create mini-scenarios (called "topics" below) that have relevant information contained in the document collection. Each topic statement should set the context for a question that asks for a specific type of evidence, where the type is based on a kind of influence: financial, communication pathway, movement of goods, temporal connection, co-location, organizational ties, family ties, common interests. We hope to create approximately 30 topics total. Across the 30 topics, we would like to cover as many different kinds of influence as the corpus will allow.

The question contained within the topic statement can be either a yes/no question (in which case the evidence should support the answer), or a question that specifically asks for a list of evidence. Ideally, we would like to have some questions that can be answered in the positive (i.e., the relationship exists), some that can be answered in the negative (this is likely to be difficult to find), and some where there is not really enough evidence to conclude either way. There can only be a few questions of this last type because otherwise we won't learn much about how well the systems can do.

To create the topics, you will search the document collection using our search engine, called "WebPrise". You will get a short introduction on using WebPrise before you begin. You can issue as many queries and look at as many documents as you need to get a good idea of what the collection contains for a given area. Once you believe you have a viable candidate for a topic, you will use a text editor to type in the topic, the document ids of documents containing evidence, and a short gloss of the evidence itself.

V. Timetable

Documents available:	now
Questions and relationship topics available:	July 20, 2005
Top ranked documents for main task available:	July 20, 2005
Results for main and document ranking tasks due at NIST:	July 27, 2005
Results for relationship task due at NIST:	August 31, 2005
Evaluated results from NIST:	October 14, 2005
Conference workbook papers to NIST:	late October, 2005
TREC 2005 Conference:	November 15-18, 2005

Date created: Thursday, 22-May-05