This file summarizes the 2004 pilot evaluation of relationship questions,
performed by NIST and AQUAINT program contractors. The first part of the
summary gives an overview of the pilot, and the second part describes the
format of the data files resulting from the pilot. The data files include:

   1. a set of 50 relationship topics and an answer to each one as compiled
      by their authors ("answers")
   2. a list of concepts that should be included in the evidence for the
      answers to the questions for each topic, as determined by an assessor
      ("nuggets")
   3. one or more document ids for each nugget of evidence ("autonuggetdocs"
      and "manualnuggetdocs")


1. AQUAINT Pilot for Relationship Questions
===========================================

The purpose of the relationship questions pilot was to examine the issues
involved in evaluating how well computer systems can locate "evidence" for
certain kinds of relationships in a collection of documents. A group meeting
with analysts in early 2004 resulted in the understanding that a
"relationship" is the ability of one object to influence another. Evidence
for a relationship includes both the means to influence something and the
motivation for doing so. Relationships can involve both entities and events.
Eight types of relationships ("spheres of influence") were noted:

   * financial
   * movement of goods
   * family ties
   * communication pathways
   * organizational ties
   * co-location
   * common interests
   * temporal connection

The particular relationships of interest depend on the analyst, situation,
and purpose. A major concern is recognizing when evidence for a suspected tie
is lacking and determining whether the lack is because the tie doesn't exist
or because it is being hidden or overlooked. The analyst needs sufficient
information to establish confidence in any evidence given.

Pilot Task Description
----------------------

The AQUAINT relationship pilot used TREC-like "topic" statements to set up a
context for each question. The topics were developed by 13 military analysts
who searched the AQUAINT collection looking for appropriate topics. The
AQUAINT collection covers the time period of 1998--2000 and consists of news
stories taken from the New York Times, the AP newswire, and the English
portion of the Xinhua newswire (see LDC catalog number LDC2002T31). The
military analysts created mini-scenarios for which relevant information was
contained in the collection. From these scenarios, an analyst from the NSA
selected and developed the 50 topics that were used for the evaluation pilot.

Each topic statement set the context for a question that asked for evidence
for one of the types of relationships listed above. The question was either a
yes/no question that was to be understood as a request for evidence
supporting the answer ("[w]ill the Japanese use force to defend the
Senkakus?"), or a request for the evidence itself ("What types of disputes or
conflict between the PLA and Hong Kong residents have been reported?").
Sometimes multiple subquestions were embedded in a single topic ("Who were
the participants in this spy ring, and how are they related to each other?").

The participating systems were given the 50 relationship topics and the
AQUAINT document collection. Systems were to return one list of text snippets
per topic such that each item in the list was a piece of evidence supporting
the answer to the question in the topic. There were no limits placed on
either the length of an individual snippet or on the number of snippets in a
list, though systems knew they would be penalized for retrieving extraneous
information.

The format of a response was the same as for the AQUAINT Definition Pilot,
namely a file containing lines of the form

   topic-number run-tag doc-id evidence-string

where run-tag is a string that is used as a unique identifier for the run,
and evidence-string is the piece of evidence derived (extracted, concluded,
etc.) from the given document.
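As an illustration only, the following Python sketch shows one way a response
file in this format could be read. It assumes whitespace-separated fields
with the evidence string occupying the remainder of the line; the function
name and return structure are hypothetical, not part of the pilot
specification.

   from collections import defaultdict

   def read_responses(path):
       # Returns {topic_number: [(run_tag, doc_id, evidence_string), ...]}.
       # Assumes each line is: topic-number run-tag doc-id evidence-string
       responses = defaultdict(list)
       with open(path) as f:
           for line in f:
               line = line.strip()
               if not line:
                   continue
               topic, run_tag, doc_id, evidence = line.split(None, 3)
               responses[topic].append((run_tag, doc_id, evidence))
       return responses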
Evaluation of System Responses
------------------------------

Evaluation of the nuggets returned by the systems was done as in the AQUAINT
Definition Pilot. (See the bottom of
http://trec.nist.gov/data/qa/add_qaresources.html for details.) For each
topic, an assessor first used the topic author's answer and the responses
from all the systems to create a list of "information nuggets" representing
evidence for the answer. An information nugget was defined as a fact for
which the assessor could make a binary decision as to whether a response
contained the nugget. The assessor then decided which nuggets were vital
pieces of evidence and which were merely "okay". Finally, the assessor went
through each of the system responses and marked where each nugget appeared
in the response. If a system returned a particular nugget more than once, it
was marked only once.

Precision and recall for a response were computed over the nuggets. Recall
was computed as the ratio of the number of correct vital nuggets retrieved
to the number of vital nuggets in the assessor's list. Precision was
approximated using the length of the response: the length-based measure gave
an allowance of 100 (non-white-space) characters for each vital or okay
nugget retrieved. The precision score was set to one if the response was no
longer than this allowance. If the response was longer than the allowance,
the precision score was downgraded using the function

   1 - [(length - allowance) / length]

The final score for a response was computed using the F-measure, a function
of both recall (R) and precision (P). The general version of the F-measure is

   F = (beta^2 + 1)RP / (beta^2 P + R)

where beta is a parameter signifying the relative importance of recall and
precision. The evaluation in the pilot used a value of beta=3, indicating
that recall is three times as important as precision.
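To make the length-based scoring concrete, here is a minimal sketch of the
computation just described. It assumes the counts of matched vital and okay
nuggets and the non-white-space length of the response are already known;
the function and parameter names are illustrative and not part of the
official scoring code.

   def f_score(vital_matched, okay_matched, total_vital, response_length,
               beta=3.0):
       # Recall: fraction of the assessor's vital nuggets found in the response.
       recall = vital_matched / total_vital if total_vital else 0.0
       # Length allowance: 100 non-white-space characters per nugget retrieved.
       allowance = 100 * (vital_matched + okay_matched)
       if response_length <= allowance:
           precision = 1.0
       else:
           precision = 1.0 - (response_length - allowance) / response_length
       # F-measure with beta = 3 (recall weighted three times as heavily).
       if precision == 0.0 and recall == 0.0:
           return 0.0
       return (beta**2 + 1) * recall * precision / (beta**2 * precision + recall)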
System Results
--------------

Four groups participated in the pilot, submitting a total of six runs
(labeled A-F). The following table shows the average response length,
F-score (beta=3), recall, and precision for each run:

   Run      Avg. length   F(b=3)   Recall     Precision
   -----    -----------   ------   -------    ---------
   Run-A         527.48    0.429   0.4595     0.42616
   Run-B         929.78    0.393   0.45918    0.3037
   Run-C        3984.74    0.391   0.66036    0.10722
   Run-D         850.12    0.302   0.35482    0.23086
   Run-E         689.16    0.298   0.34448    0.26058
   Run-F         855       0.292   0.34418    0.22878

Because the scoring metric favored recall over precision, systems were able
to improve their overall rankings by returning long responses. In particular,
Run-C returned responses that were more than four times as long on average as
those of any other run, and managed to rank third overall despite having the
lowest precision of all the runs. However, not all the runs relied on long
responses to achieve a high score: the highest-scoring run (Run-A) also
returned the shortest responses. Among the six runs, the systems were able to
find over 85% of the 151 vital nuggets in the assessors' list.

Conclusion
----------

Participants in this relationship pilot generally expressed satisfaction with
the exercise, especially with the additional context provided by the topics.
However, there were several requests for document ids for all the nuggets.
NIST reverse-engineered this list after the evaluation, but it would ideally
be created by the assessors at the time they create the nuggets list.
Requiring assessors to include a document id for each nugget would improve
the quality of the nuggets; sometimes the glosses of the nuggets were so
vague (e.g., "NBC production") that it would be difficult for anyone other
than the author to understand the intended meaning of the "evidence" without
reference to a source document.

2. Data Files
=============

"answers"
---------

The "answers" file is a compilation of notes written by the analysts who
created the topics. For each topic there is the topic id, the topic itself,
the answer to the question, the document ids of documents containing evidence
for the answer, and a gloss of the evidence.

"nuggets"
---------

The nuggets that were used for assessing the runs are in the "nuggets" file.
The format of each line is:

   topic-number nugnum vital|okay evidence-string

where topic-number is the topic number, nugnum is the nugget number,
vital|okay indicates whether the nugget of evidence is "vital" or "okay", and
evidence-string is the gloss of the evidence.

"autonuggetdocs" and "manualnuggetdocs"
---------------------------------------

These two files contain document ids for the nuggets in the "nuggets" file
and were created after the assessment files were returned to the pilot
participants. The document ids either come from the systems' responses or
were generated manually based on inspection of the "nuggets" and "answers"
files. Each line in the "autonuggetdocs" and "manualnuggetdocs" files is in
the format:

   topic_id nugget_id vital|okay doc_id string

where each (topic_id, nugget_id) pair is labeled as either "vital" or "okay",
based on the "nuggets" file.

"autonuggetdocs": contains system responses for each nugget for which some
system returned a matching response. doc_id and string are the document id
and evidence string returned by the system that match this nugget_id for
this topic_id. Multiple (doc_id, string) pairs may be included, one per line,
for each (topic_id, nugget_id) pair, but duplicates are removed.

"manualnuggetdocs": if a nugget was not matched by any system's response, a
document id was found manually for the nugget using queries to the document
collection, based on the "nuggets" and "answers" files; string is the gloss
of the evidence, as given in the "nuggets" file. Only one doc_id is provided
for each (topic_id, nugget_id) pair. If no document could be found for the
nugget or if the intended meaning of the nugget string was not inferable,
then "UNDOCUMENTED" appears in the doc_id field.

(N.B. Nugget 8 of topic 4 seems to actually belong to topic 5.)
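For convenience, the following Python sketch shows one way the "nuggets" file
and the two nugget-document files could be parsed. It assumes
whitespace-separated fields with the final string occupying the remainder of
the line; the function names and the handling of "UNDOCUMENTED" entries are
assumptions, not part of the official distribution.

   def read_nuggets(path):
       # Returns {(topic_number, nugget_number): (vital_or_okay, evidence_string)}.
       # Assumes each line is: topic-number nugnum vital|okay evidence-string
       nuggets = {}
       with open(path) as f:
           for line in f:
               line = line.strip()
               if not line:
                   continue
               topic, nugnum, importance, evidence = line.split(None, 3)
               nuggets[(topic, nugnum)] = (importance, evidence)
       return nuggets

   def read_nugget_docs(path):
       # Returns {(topic_id, nugget_id): [(vital_or_okay, doc_id, string), ...]}.
       # Works for either "autonuggetdocs" or "manualnuggetdocs"; entries whose
       # doc_id is "UNDOCUMENTED" are kept and can be filtered by the caller.
       docs = {}
       with open(path) as f:
           for line in f:
               parts = line.strip().split(None, 4)
               if not parts:
                   continue
               if len(parts) == 4:      # tolerate a missing trailing string
                   parts.append("")
               topic_id, nugget_id, importance, doc_id, string = parts
               docs.setdefault((topic_id, nugget_id), []).append(
                   (importance, doc_id, string))
       return docs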