This data is the result of a pilot evaluation of definition questions performed by NIST and AQUAINT program contractors. The data include:

  1. a set of 25 definition questions ("questions")
  2. an answer to each question as compiled by the authors of the questions ("author_summaries")
  3. the responses from the contractors' systems ("Q.")
  4. subjective scores assigned to the content and organization of system responses by three independent assessors ("holistic")
  5. a list of concepts that should be included in the definition for each question, as determined independently by two different assessors (".")
  6. the judgments of the systems' responses, as determined independently by two different assessors ("sys.")

The remainder of this (long) file describes the format of the data files and summarizes the conclusions of the pilot.


AQUAINT Pilot Evaluation of Definition Questions
------------------------------------------------

The purpose of the definition questions pilot was to examine the issues involved in evaluating how well computer systems can answer definition questions. Definition questions are questions such as "Who is Colin Powell?" or "What is mold?". Definition questions occur relatively frequently in logs of web search engines (see the TREC 2001 QA track overview paper), suggesting they are an important type of question. However, evaluating systems that answer definition questions is more difficult than evaluating systems that answer the factoid questions that have been used in the TREC QA tracks, because it is not useful to judge a system response as simply right or wrong.

The task within the pilot was as follows. The systems were given 25 definition questions and the AQUAINT document collection (see LDC catalog number LDC2002T31). The AQUAINT collection consists of news stories taken from the New York Times, the AP newswire, and the English portion of the Xinhua newswire.
Systems were to return a list of text snippets per question such that each item in the list was a facet of the definition of the target. The list was to be ordered such that items appearing earlier in the list were thought to be more important than items appearing later in the list. There were no limits placed on either the length of an individual snippet or on the number of snippets in a list, though systems knew they would be penalized for retrieving extraneous information.


1. Data
-------

The questions were developed by NIST assessors who searched the AQUAINT collection looking for appropriate targets. Each question was phrased simply as "Who/what is/are...". In addition to the question itself, the assessor also created his or her own definition of the target from the documents reviewed. These definitions are generally one or two paragraphs of English prose. The questions are in the file called "questions", and the assessor-created definitions are in the file "author_summaries".

The system responses are given in the files called Q. . Within each file, the set of answers from one run are contiguous and in the same order as the submitted file. Different runs are separated by a string of asterisks (**************). The format of a line from a submission file is

    qnum run document-id answer-text

where qnum is the question number; run is the id of the run (A-H); document-id is the document from which the answer facet was drawn; and answer-text is the text snippet. One run did not contain document ids, so document-id is a string of X's for that system.


2. Evaluating System Responses
------------------------------

The systems' responses were independently judged by two assessors: the assessor who created the question, called the "author" assessor, and a second assessor, called the "other" assessor. There was no particular method for assigning questions to secondary judges.
Once they finished the question they were working on, they just took the next question from the stack. This means that some assessors judged more questions than others. Each assessor performed two rounds of assessing per question.

2.1 First round: "Holistic" evaluation
--------------------------------------

The first round of assessing was called the "holistic" evaluation. In this round, the assessor assigned two scores to the response from a system. One score was for the content of the response and the other for its organization, with each score on a scale of 0-10. The holistic evaluation was suggested by one of the AQUAINT contractors, who created the following guidelines for assigning the scores:

    Content: the system response includes easily identifiable information that answers the question adequately; penalty for misleading information (answers are misleading if they are incorrect and not obviously identifiable as such)

    Organization: information is well structured, with important information up front and no or little irrelevant material

NIST further instructed the assessors that a 10 should be reserved for the best answer they could imagine, and a 0 for a totally hopeless response; that is, they should not necessarily expect to assign either a 10 or a 0 to this particular set of responses. Finally, we reminded the assessors not to agonize over their choices, but to go with their gut reaction to the response. The assessors could display the documents associated with a response if they chose to do so, but the response was judged on its own merits.

Given that the contractor also did the holistic evaluation, there are three pairs of scores for each question. The individual scores are given in the file called "holistic".
After initial comments, the file contains lines of the form

    qnum run content1 organization1 content2 org2 content3 org3

where qnum is the question number; run is the identifier A-H indicating which run; content is one of the content scores given; and organization is one of the organization scores given. The first pair of content, organization scores is the contractor's scores, the second pair is the question author's scores, and the third pair is the other assessor's scores.

Each question was assigned a combined score using the function

    Score = 5 * Content + 0.5 * Content * Organization

and the final average score was the mean of this score over the 25 questions. This score gives content much more emphasis than organization.

One of the tests for an effective evaluation is that comparative scores be stable despite differences (if any) in judgments. To test this, the final average score was computed for each system using each of the three assessors' scores in turn, and the systems were ranked by decreasing score. This produces three system rankings, one per assessor. NIST also produced a system ranking based on random scores (i.e., generate a random number between 0 and 10 inclusive for each of the content and organization scores for each question for each system and then proceed as above) and included the constant ranking "ABCDEFGH".

The Kendall tau correlation between two rankings quantifies how different the rankings are from one another. Kendall's tau computes the distance between two rankings as the minimum number of pairwise adjacent swaps needed to turn one ranking into the other. The distance is normalized by the number of items being ranked such that two identical rankings produce a correlation of 1.0, the correlation between a ranking and its perfect inverse is -1.0, and the expected correlation of two rankings chosen at random is 0.0.
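
As a rough illustration of the comparison described above, the following Python sketch computes the combined holistic score and the Kendall tau correlation between two rankings. The function names (combined_score, kendall_tau) are hypothetical and are not part of the pilot data or of NIST's tooling. The sketch assumes both rankings contain the same systems with no ties, and it normalizes the number of discordant pairs (which equals the minimum number of adjacent swaps) by the total number of pairs, which yields the 1.0 and -1.0 endpoints described above.

```python
from itertools import combinations


def combined_score(content, organization):
    """Combined holistic score for one response (both inputs on a 0-10 scale)."""
    return 5 * content + 0.5 * content * organization


def kendall_tau(ranking_a, ranking_b):
    """Kendall tau between two rankings of the same items (no ties assumed).

    Each ranking is a sequence of system ids, best first, e.g. "FGEADBHC".
    """
    # Position of every system in the second ranking.
    position_b = {system: i for i, system in enumerate(ranking_b)}
    # Walk the first ranking and record where each system sits in the second.
    sequence = [position_b[system] for system in ranking_a]
    pairs = list(combinations(range(len(sequence)), 2))
    # A pair is discordant when the two rankings order it differently; the
    # number of discordant pairs equals the minimum number of adjacent swaps.
    discordant = sum(1 for i, j in pairs if sequence[i] > sequence[j])
    return 1 - 2 * discordant / len(pairs)


if __name__ == "__main__":
    print(kendall_tau("FGEADBHC", "FADEBGCH"))
```

Applied to the contractor ranking FGEADBHC and the author ranking FADEBGCH, this sketch gives 0.50, which agrees with the contractor & author value reported for the holistic evaluation.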
The rankings and Kendall tau correlations for the holistic evaluation of the pilot systems are as follows:

System rankings by assessor-type

    FGEADBHC (contractor)
    FADEBGCH (author)
    FAEGDBHC (other)
    CDBGEAFH (random)
    ABCDEFGH (constant)

Kendall tau correlations between all pairs of system rankings

    contractor & author:    0.50    author & other:     0.71    other & random:    -0.50
    contractor & other:     0.79    author & random:   -0.21    other & constant:   0.00
    contractor & random:   -0.28    author & constant:  0.29    random & constant:  0.36
    contractor & constant: -0.21

There is good news and bad news here. The good news is that the random and constant rankings are noticeably different from the assessor-based rankings. The bad news is that the assessor-based correlations are still low. Part of the reason for the small correlations is that there are only 8 systems being ranked, so any change in the ranking is relatively significant. Another major factor is the "default" organization score assigned when there was essentially no organization (i.e., when only one item was returned). Some assessors assigned a 10, some a 5, some a 1; most assigned a score based on the content (i.e., no content, no organization either). Also, the contractor saw a different form of the data for system G than the other two assessors saw. If system G is eliminated from the correlations, then the Kendall tau scores are:

Kendall tau correlations between system rankings minus system G

    contractor & author: 0.71
    contractor & other:  0.90
    author & other:      0.81

These correlations are much higher, but are still relatively low.

Despite the low correlation, the holistic judgments do provide some guidance as to how a more quantitative scoring metric should rank systems. The human assessors prefer system F to the others, and systems H and C are the least preferred.

2.2 Second round of assessing
-----------------------------

The goal of the second round of assessing was to support a more quantitative evaluation of the system responses.
In this round of assessing, an assessor first created a list of "information nuggets" about the target using all the system responses and the question author's definition. An information nugget was defined as a fact for which the assessor could make a binary decision as to whether a response contained the nugget. The assessor then decided which nuggets were vital, that is, nuggets that must appear in a definition for that definition to be good. Finally, the assessor went through each of the system responses and marked where each nugget appeared in the response. If a system returned a particular nugget more than once, it was marked only once.

The results of this process are given in a pair of files per assessor per question. The first file, named . (where assessor is either "author" or "other"), contains a numbered list of information nuggets produced by that assessor for that question. The numbers in the list do *NOT* indicate importance; they are simply identifiers. A * between the number and the nugget indicates the assessor believed this to be a vital nugget.

The second file, named sys., is derived from the system response files. Responses from different systems are separated by a string of asterisks. Otherwise, a non-empty line contains 6 fields: the question number, the run tag, the item number (i.e., the second answer string within a response has the number 2), the nugget number (from the list), the document id, and the piece of the answer string that the assessor marked as representing the nugget. This piece of text does *NOT* necessarily represent an "exact answer" since the assessors were not asked to do that. Also, some nuggets spanned items. In these cases, the nugget numbers are real numbers, for example, 5.1 and 5.2, meaning that those items combined produce nugget 5. Remember that the author and other assessors have different lists, so the nugget numbers in the annotated system responses always refer to the corresponding assessor's list.
Items that did not contain any nuggets are not included in the system file, meaning that there are no lines in the file for some systems for some questions.

Many list entries contain multiple concepts while others contain none. Thus, using the list entry as the unit for evaluation is not sensible. Instead, we should calculate measures in terms of the concepts themselves. Computing concept recall is straightforward given these judgments; it is the ratio of the number of correct concepts retrieved to the number of concepts in the assessor's list. But the corresponding measure of concept precision, the ratio of the number of correct concepts retrieved to the total number of concepts retrieved, is problematic since the correct value for the denominator is unknown. A trial evaluation prior to the pilot showed that assessors found enumerating *all* concepts represented in a response to be so difficult as to be unworkable. Using only concept recall as the final score is not workable either, since systems would not be rewarded for being selective: retrieving the entire document collection would get a perfect score for every question.

Borrowing from the evaluation of summarization systems [see DUC website], we can use length as a (crude) approximation to precision. A length-based measure captures the intuition that users would prefer the shorter of two definitions that contain the same concepts. The length-based measure used in the pilot gives a system an allowance of 100 (non-white-space) characters for each correct concept it retrieves. The precision score is set to one if the response is no longer than this allowance. If the response is longer than the allowance, the precision score is downgraded using the function

    1 - [(length - allowance) / length]

Remember that the assessors marked some concepts as vital and the remainder are not vital. The non-vital concepts act as a "don't care" condition.
That is, systems should be penalized for not retrieving vital concepts, and penalized for retrieving items that are not on the assessor's concept list at all, but should be neither penalized nor rewarded for retrieving a non-vital concept. To implement the don't care condition, concept recall is computed only over vital concepts, while the character allowance in the precision computation is based on both vital and non-vital concepts.

The final score for a response was computed using the F-measure, a function of both recall (R) and precision (P). The general version of the F-measure is

    F = (beta^2 + 1) * R * P / (beta^2 * P + R)

where beta is a parameter signifying the relative importance of recall and precision. The main evaluation in the pilot used a value of 5, indicating that recall is 5 times as important as precision. The value of 5 is arbitrary, but it both reflects the emphasis given to content in the first round of assessing and acknowledges the crudeness of the precision approximation.

The following table shows the scores and system rankings using F as defined above and beta=5.

F scores by assessor for system responses, beta=5

    author       other
    F 0.688      F 0.757
    A 0.606      A 0.687
    D 0.568      G 0.671
    G 0.562      D 0.669
    E 0.555      E 0.657
    B 0.467      B 0.522
    C 0.349      C 0.384
    H 0.330      H 0.365

As can be seen from the table, the rankings of systems are stable across different assessors in that the only difference in the rankings is for two runs whose scores are extremely similar (D and G). While the absolute value of the scores is different when using different assessors, the magnitude of the difference between scores is generally preserved. For example, there is a large gap between the scores for systems F and A, and a much smaller gap for systems C and H. The rankings also obey the ordering constraints suggested by the holistic evaluation.

The different systems in the pilot took different approaches to producing their definitions. System H always returned a single text snippet as a definition.
System B returned a set of complete sentences. System G tended to be relatively terse, while F and A were more verbose. The average length of a response for each system is

Average length of system response (number of non-white-space characters)

    A: 1121.2
    B: 1236.5
    C:   84.7
    D:  281.8
    E:  533.9
    F:  935.6
    G:  164.5
    H:   33.7

The differences in the systems are reflected in their relative scores when different beta values are used in the F score. For example, the following table shows the scores and system rankings when beta=2.

F scores by assessor for system responses, beta=2

    author       other
    G 0.587      G 0.688
    F 0.584      F 0.661
    D 0.550      D 0.656
    A 0.516      A 0.609
    E 0.495      E 0.576
    C 0.371      C 0.406
    H 0.348      B 0.404
    B 0.339      H 0.383

Thus, as expected, as precision gains in importance, system G rises in the rankings, system B falls quickly, and system F also sinks.


3. Conclusion
-------------

The definition pilot demonstrated that relative F scores based on concept recall and adjusted response length are stable when computed using different human assessor judgments, and reflect intuitive judgments of quality. The main measure used in the pilot strongly emphasized recall, but varying the F measure's beta parameter allows different user preferences to be accommodated as expected.

Definition questions will be included as a part of the TREC 2003 QA track. The F score based on concept recall and adjusted response length is the proposed metric to be used in the TREC track for definition questions.
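
For readers who want to experiment with the metric, the scoring procedure from Section 2.2 (recall over vital concepts, length-based precision, and the F-measure) might be sketched in Python as follows. This is only an illustration under stated assumptions, not the scoring code used in the pilot: all function and parameter names are hypothetical, and returning 0 when an assessor's list has no vital concepts, or when recall and precision are both 0, is an assumption the text above does not address.

```python
def non_whitespace_length(text):
    """Number of non-white-space characters in a response."""
    return sum(1 for ch in text if not ch.isspace())


def length_precision(response_length, concepts_retrieved, allowance_per_concept=100):
    """Length-based approximation to precision.

    concepts_retrieved counts both vital and non-vital concepts found in the
    response; each one earns an allowance of 100 non-white-space characters.
    """
    allowance = allowance_per_concept * concepts_retrieved
    if response_length <= allowance:
        return 1.0
    return 1.0 - (response_length - allowance) / response_length


def vital_recall(vital_retrieved, vital_total):
    """Recall computed over vital concepts only."""
    if vital_total == 0:
        return 0.0  # assumption: the source does not cover this case
    return vital_retrieved / vital_total


def f_measure(recall, precision, beta=5.0):
    """General F-measure; beta > 1 weights recall more heavily."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0  # assumption: score is 0 when both R and P are 0
    return (beta ** 2 + 1) * recall * precision / denominator


def score_response(response_length, vital_retrieved, nonvital_retrieved,
                   vital_total, beta=5.0):
    """F score for one response to one question under one assessor's list."""
    recall = vital_recall(vital_retrieved, vital_total)
    precision = length_precision(response_length,
                                 vital_retrieved + nonvital_retrieved)
    return f_measure(recall, precision, beta)


if __name__ == "__main__":
    # Made-up example: 500 characters, 2 of 4 vital concepts plus 1 non-vital.
    print(score_response(500, 2, 1, 4, beta=5.0))
```

In the made-up example, the allowance is 300 characters, so precision is 1 - 200/500 = 0.6 and recall is 2/4 = 0.5; the numbers are invented purely to exercise the formulas and are not drawn from the pilot data.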