These data are the result of a pilot evaluation of
definition questions performed by NIST and
AQUAINT program contractors.  The data include:
    1. a set of 25 definition questions ("questions")
    2. an answer to each question as compiled by the authors
	of the questions ("author_summaries")
    3. the responses from the contractors' systems
	("Q.<question-num>")
    4. subjective scores assigned to the content and
       organization of system responses by three independent
       assessors ("holistic")
    5. a list of concepts that should be included in the
	definition for each question, as determined independently
	by two different assessors ("<question-num>.<assessor>")
    6. the judging of the systems' responses as determined
	independently by two different assessors
	("sys<question-num>.<assessor>")
The remainder of this (long) file describes the format of the data
files and summarizes the conclusions of the pilot.


AQUAINT Pilot Evaluation of Definition Questions
------------------------------------------------


The purpose of the definition questions pilot was to
examine the issues involved in evaluating how well
computer systems can answer definition questions.
Definition questions are questions such as "Who is Colin Powell?"
or "What is mold?".   Definition questions occur relatively
frequently in logs of web search engines (see the TREC 2001
QA track overview paper), suggesting they are an important
type of question.  However, evaluating systems that answer
definition questions is more difficult than evaluating
systems that answer the factoid questions that have been
used in the TREC QA tracks because it is not useful to judge
a system response as simply right or wrong.

The task within the pilot was as follows.  The systems
were given 25 definition questions and the AQUAINT
document collection (see LDC catalog number LDC2002T31).
The AQUAINT collection consists of news stories taken from
the New York Times, the AP newswire, and the English portion
of the Xinhua newswire.  Systems were to return a list
of text snippets per question such that each item in the list
was a facet of the definition of the target.  The list was
to be ordered such that items appearing earlier in the list
were thought to be more important than items appearing
later in the list.  There were no limits placed on either
the length of an individual snippet or on the number of snippets
in a list, though systems knew they would be penalized for
retrieving extraneous information.


1. Data
-------

The questions were developed by NIST assessors who searched
the AQUAINT collection looking for appropriate targets.
Each question was phrased simply as  "Who/what is/are...".
In addition to the question itself, the assessor also created
his or her own definition of the target from the documents
reviewed.  These definitions are generally one or two
paragraphs of English prose.  The questions are in the file
called "questions", and the assessor-created definitions
are in the file "author_summaries".

The system responses are given in the files called Q.<qnum> .
Within each file, the answers from one run are contiguous
and in the same order as the submitted file.  Different runs are
separated by a string of asterisks (**************).
The format of a line from a submission file is
   qnum run document-id answer-text
where qnum is the question number;
      run is the id of the run (A-H);
      document-id is the document from which the answer facet was drawn;
  and answer-text is the text snippet. 
One run did not contain document ids, so document-id is a string
of X's for that system.
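
As an illustration, a line of a Q.<qnum> file could be read as in the
sketch below.  It assumes that fields are separated by white space and
that only the answer text may contain spaces; the file format described
above does not state the delimiter, so check this against the data.
The names Answer and parse_response_line are hypothetical.

```python
from typing import NamedTuple, Optional


class Answer(NamedTuple):
    qnum: str
    run: str
    doc_id: str
    text: str


def parse_response_line(line: str) -> Optional[Answer]:
    """Parse one line of a Q.<qnum> file; return None for separators and blanks."""
    stripped = line.strip()
    # Runs are separated by a line consisting only of asterisks.
    if not stripped or set(stripped) == {"*"}:
        return None
    # Assumption: the first three fields contain no white space, and
    # everything after them is the answer text.
    parts = stripped.split(None, 3)
    if len(parts) < 4:
        return None
    return Answer(*parts)


print(parse_response_line("1 A XXXXXXXX some answer text here"))
```

Lines that do not split into four fields are skipped rather than guessed at.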


2. Evaluating System Responses
------------------------------

The systems' responses were independently judged by two
assessors: the assessor who created the question, called the
"author" assessor, and a second assessor, called the "other"
assessor.  Secondary judges were not assigned to questions by any
particular method: once an assessor finished a question, he or she
simply took the next question from the stack.  This means that
some assessors judged more
questions than others.  Each assessor performed two rounds of
assessing per question.


2.1 First round: "Holistic" evaluation
--------------------------------------

The first round of assessing was called the "holistic" evaluation.
In this round, the assessor assigned two scores to the response
from a system.  One score was for content of the response and
the other for its organization, with each score on a scale of 0-10.
The holistic evaluation was suggested by one of the AQUAINT contractors,
who created the following guidelines for assigning the scores:
    Content: the system response includes easily identifiable
	 information that answers the question adequately;
         penalty for misleading information (answers are
	 misleading if they are incorrect and not obviously
	 identifiable as such)
    Organization: information is well structured, with important
	 information up front and no or little irrelevant material
NIST further instructed the assessors that a 10 should be reserved
for the best answer they could imagine, and a 0 for a totally
hopeless response; that is, they should not necessarily expect to assign
either a 10 or a 0 to this particular set of responses.  Finally,
we reminded the assessors not to agonize over their choices,
but to go with their gut reaction to the response.
The assessors could display the documents associated with a
response if they chose to do so, but the response was judged
on its own merits.

Given that the contractor also did the holistic evaluation, there are
three pairs of scores for each question.  The individual scores
are given in the file called "holistic".  After initial comments,
the file contains lines of the form
    qnum run content1 organization1 content2 org2 content3 org3
where
    qnum is the question number;
    run  is the identifier A-H indicating which run;
    content is one of the content scores given; and
    organization is one of the organization scores given. 
The first pair of content, organization scores is the contractor's
scores, the second pair is the question author's scores, and the
third pair is the other assessor's scores.

Each question was assigned a combined score using the function
    Score = 5 * Content + 0.5 * Content * Organization
and the final average score was the mean of this score over
the 25 topics.  This score gives content much more emphasis than
organization.
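
The combined score can be sketched as follows.  The function names are
hypothetical; the formula is the one given above.  With both scores on
a 0-10 scale the result ranges from 0 to 100.

```python
def combined_score(content: float, organization: float) -> float:
    """Score = 5 * Content + 0.5 * Content * Organization."""
    return 5 * content + 0.5 * content * organization


def mean_combined_score(pairs) -> float:
    """Mean combined score over (content, organization) pairs, one per question."""
    scores = [combined_score(c, o) for c, o in pairs]
    return sum(scores) / len(scores)


print(combined_score(10, 10))  # 100.0
print(combined_score(10, 0))   # 50.0
print(combined_score(0, 10))   # 0.0
```

The three calls show the emphasis on content: perfect content with no
organization still earns half of the maximum, while perfect organization
with no content earns nothing.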

One of the tests for an effective evaluation is that comparative
scores be stable despite differences (if any) in judgments.
To test this, the final average score was computed for each system
using each of the three assessors' scores in turn, and the systems were
ranked by decreasing score.  This produces three system
rankings since there are three assessors' scores.  NIST also produced
a system ranking based on assigning random scores (i.e., generate a random
number between 0 and 10 inclusive for each of the content and
organization scores for each question for each system and then
proceed as above), and also considered the constant ranking "ABCDEFGH".
The Kendall tau correlation between two rankings quantifies
how different the rankings are from one another.  Kendall's tau
computes the distance between two rankings as the minimum number
of pairwise adjacent swaps to turn one ranking into the other.
The distance is normalized by the number of item pairs being ranked
such that two identical rankings produce a correlation of 1.0,
the correlation between a ranking and its perfect inverse is -1.0,
and the expected correlation of two rankings chosen at random is 0.0.
The rankings and Kendall tau correlations for the holistic evaluation
of the pilot systems are as follows:

                   System rankings by assessor-type
			FGEADBHC (contractor)
			FADEBGCH (author)
			FAEGDBHC (other)
			CDBGEAFH (random)
			ABCDEFGH (constant)

    Kendall tau correlations between all pairs of system rankings
contractor & author:    0.50   author & other:     0.71   other & random:   -0.50
contractor & other:     0.79   author & random:   -0.21   other & constant:  0.00
contractor & random:   -0.28   author & constant:  0.29   random & constant: 0.36
contractor & constant: -0.21
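
The correlations among the three assessor-based rankings can be
recomputed with a short sketch such as the one below.  It assumes
rankings without ties written as strings of run ids, counts the pairs
of runs that the two rankings order differently (which equals the
minimum number of adjacent swaps), and normalizes by the total number
of pairs.  The function name kendall_tau is hypothetical.

```python
from itertools import combinations


def kendall_tau(ranking_a: str, ranking_b: str) -> float:
    """Kendall tau between two rankings of the same items (no ties)."""
    pos_a = {item: i for i, item in enumerate(ranking_a)}
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    # A pair is discordant when the two rankings put it in opposite order.
    discordant = sum(
        1
        for x, y in combinations(ranking_a, 2)
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    n_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return 1 - 2 * discordant / n_pairs


contractor, author, other = "FGEADBHC", "FADEBGCH", "FAEGDBHC"
print(round(kendall_tau(contractor, author), 2))  # 0.5
print(round(kendall_tau(contractor, other), 2))   # 0.79
print(round(kendall_tau(author, other), 2))       # 0.71
```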

There is good news and bad news here.  The good news is that the
random and constant rankings are noticeably different from the
assessor-based rankings.  The bad news is that the assessor-based
correlations are still low.  Part of the reason for the small correlations
is that there are only 8 systems being ranked so any change in the
ranking is relatively significant.  Another major part is the "default"
organizational score when there was essentially no organization
(i.e., when only one item was returned).  Some assessors assigned a 10,
some a 5, some a 1; most assigned a score based on the content
(i.e., no content, no organization either).  Also, the contractor
saw a different form of the data for system G than the other two
assessors saw.  If system G is eliminated from the correlations,
then the Kendall tau scores are:

	Kendall tau correlations between system rankings minus system G
		contractor & author:   0.71
		contractor & other:    0.90
		author & other:        0.81

These correlations are much higher, but are still relatively low.
Despite the low correlation, the holistic judgments do provide some
guidance as to how a more quantitative scoring metric should rank systems.
The human assessors prefer system F to the others, and systems H and C
are the least preferred.


2.2 Second round of assessing
-----------------------------

The goal of the second round of assessing was to support a more
quantitative evaluation of the system responses.  In this round of
assessing, an assessor first created a list of "information nuggets"
about the target using all the system responses and the question author's
definition.  An information nugget was defined as a fact for which the
assessor could make a binary decision as to whether a response contained
the nugget.  The assessor then decided which nuggets were vital, that is,
nuggets that must appear in a definition for that definition to be good.
Finally, the assessor went through each of the system responses and
marked where each nugget appeared in the response.  If a system returned
a particular nugget more than once, it was marked only once.

The results of this process are given in a pair of files per assessor
per question.  The first file, named <question-num>.<assessor>
(where assessor is either "author" or "other"), contains a numbered
list of information nuggets produced by that assessor for that question.
The numbers in the list do *NOT* indicate importance; they are simply
identifiers.  A * between the number and the nugget indicates the
assessor believed this to be a vital nugget.  The second file, named 
sys<question-num>.<assessor>, is derived from the system response files.
Responses from different systems are separated by a string of asterisks.
Otherwise, a non-empty line contains 6 fields: the question number,
the run tag, the item number (i.e., the second answer string within
a response has the number 2), the nugget number (from the list),
the document id, and the piece of the answer string that the assessor
marked as representing the nugget.  This piece of text does *NOT*
necessarily represent an "exact answer" since the assessors were not
asked to do that.  Also, some nuggets spanned items.  In these
cases, the nugget numbers are real numbers, for example, 5.1 and 5.2,
meaning that those items combined produce nugget 5.  Remember
that the author and other assessors have different lists, so the
nugget numbers in the annotated system responses always refer
to the corresponding assessor's list.  Items that did not contain
any nuggets are not included in the system file, meaning that 
there are no lines in the file for some systems for some questions.
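
When counting the nuggets credited to one response, fractional nugget
numbers need to be collapsed to the nugget they belong to.  The sketch
below is one way to do so, assuming the nugget-number field has already
been read as a string and that any part of a spanning nugget counts as
evidence for the whole nugget.  The function name distinct_nuggets is
hypothetical.

```python
def distinct_nuggets(nugget_ids):
    """Return the set of whole nugget numbers credited to one response."""
    # "5.1" and "5.2" are parts of nugget 5; a repeated nugget counts once.
    return {int(str(nid).split(".")[0]) for nid in nugget_ids}


print(sorted(distinct_nuggets(["3", "5.1", "5.2", "3", "7"])))  # [3, 5, 7]
```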

Many entries in a system's response list contain multiple concepts
(nuggets) while others contain none.  Thus, using the list entry as
the unit for evaluation is
not sensible.  Instead, we should calculate measures in terms of
the concepts themselves.  Computing concept recall is straightforward
given these judgments; it is the ratio of the number of correct
concepts retrieved to the number of concepts in the assessor's list.
But the corresponding measure of concept precision, the ratio
of the number of correct concepts retrieved to the total
number of concepts retrieved, is problematic since the correct value
for the denominator is unknown.  A trial evaluation prior to the pilot
showed that assessors found enumerating *all* concepts represented in
a response to be so difficult as to be unworkable.  Using only concept
recall as the final score is not workable either, since systems would
not be rewarded for being selective: retrieving the entire document
collection would get a perfect score for every question.

Borrowing from the evaluation of summarization systems [see DUC website],
we can use length as a (crude) approximation to precision.  A length-based
measure captures the intuition that users would prefer the shorter of
two definitions that contain the same concepts.  The length-based measure
used in the pilot gives a system an allowance of 100 (non-white-space)
characters for each correct concept it retrieves.  The precision score
is set to one if the response is no longer than this allowance.
If the response is longer than the allowance, the precision
score is downgraded using the function 1 - [(length-allowance)/length].
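
The length-based precision can be sketched as follows.  The function
names and the per_concept parameter are hypothetical; the allowance of
100 non-white-space characters per correct concept and the penalty
function are those described above.

```python
def nonwhite_length(text: str) -> int:
    """Count the non-white-space characters in a response."""
    return sum(1 for ch in text if not ch.isspace())


def length_precision(length: int, correct_concepts: int, per_concept: int = 100) -> float:
    """Precision approximated from response length and concepts retrieved."""
    allowance = per_concept * correct_concepts
    if length <= allowance:
        return 1.0
    return 1.0 - (length - allowance) / length


print(length_precision(250, 3))  # 1.0, within the 300-character allowance
print(length_precision(400, 2))  # 0.5, twice the 200-character allowance
```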

Remember that the assessors marked some concepts as vital and
the remainder are not vital.  The non-vital concepts act as a
"don't care" condition.  That is, systems should be penalized
for not retrieving vital concepts, and penalized for retrieving
items that are not on the assessor's concept list at all, but
should be neither penalized nor rewarded for retrieving a non-vital
concept.  To implement the don't care condition, concept recall is
computed only over vital concepts, while the character allowance
in the precision computation is based on both vital and non-vital concepts.

The final score for a response was computed using the F-measure,
a function of both recall (R) and precision (P).
The general version of the F-measure is
  F = (beta^2 + 1) * R * P / (beta^2 * P + R)
where beta is a parameter signifying the relative importance
of recall and precision.  The main evaluation in the pilot used
a value of 5, indicating that recall is 5 times as important
as precision.  The value of 5 is arbitrary, but it both reflects
the emphasis given to content in the first round of assessing and
acknowledges the crudeness of the precision approximation.
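
Putting the pieces together, the score of a single response can be
sketched as below.  This is an illustration under the definitions in
this section, not the scoring program used in the pilot; the function
and argument names are hypothetical.  Recall is computed over vital
concepts only, while the length allowance counts both vital and
non-vital concepts.

```python
def f_measure(recall: float, precision: float, beta: float = 5.0) -> float:
    """F = (beta^2 + 1) * R * P / (beta^2 * P + R), with F = 0 when R = P = 0."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (b2 + 1) * recall * precision / denom


def score_response(vital_found: int, nonvital_found: int, vital_total: int,
                   length: int, beta: float = 5.0) -> float:
    """Score one response from its concept counts and non-white-space length."""
    # Recall ignores non-vital concepts (the "don't care" condition).
    recall = vital_found / vital_total if vital_total else 0.0
    # The allowance rewards every concept on the assessor's list.
    allowance = 100 * (vital_found + nonvital_found)
    if length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (length - allowance) / length
    return f_measure(recall, precision, beta)


# 2 of 4 vital concepts plus 1 non-vital concept in 600 characters:
# recall = 0.5, allowance = 300, precision = 0.5.
print(round(score_response(2, 1, 4, 600), 3))  # 0.5
```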

The following table shows the scores and system rankings using
F as defined above and beta=5.

    F scores by assessor for system responses, beta=5
	 author 	        other
	F  0.688              F  0.757
	A  0.606              A  0.687
	D  0.568              G  0.671
	G  0.562              D  0.669
	E  0.555              E  0.657
	B  0.467              B  0.522
	C  0.349              C  0.384
	H  0.330              H  0.365

As can be seen from the table, the rankings of systems are stable
across different assessors in that the only difference in the rankings
is for two runs whose scores are extremely similar (D and G).
While the absolute value of the scores is different when using
different assessors, the magnitude of the difference between scores
is generally preserved.  For example, there is a large gap between
the scores for systems F and A, and a much smaller gap for systems
C and H.  The rankings also obey the ordering constraints suggested by
the holistic evaluation.

The different systems in the pilot took different approaches to producing
their definitions.  System H always returned a single text snippet
as a definition.  System B returned a set of complete sentences.
System G tended to be relatively terse, while F and A were more verbose.
The average length of a response for each system is

        Average length of system response
      (number of non-white-space characters)
 		A: 1121.2
		B: 1236.5
		C:   84.7
		D:  281.8
		E:  533.9
		F:  935.6
		G:  164.5
		H:   33.7

The differences in the systems are reflected in their relative
scores when different beta values are used in the F score.
For example, the following table shows the scores and system rankings
when beta=2.

    F scores by assessor for system responses, beta=2
	 author			  other
	G  0.587		G  0.688
	F  0.584		F  0.661
	D  0.550		D  0.656
	A  0.516		A  0.609
	E  0.495		E  0.576
	C  0.371		C  0.406
	H  0.348		B  0.404
	B  0.339		H  0.383

Thus, as expected, as precision gains in importance, system G rises
in the rankings, system B falls quickly, and system F also sinks.


3. Conclusion
-------------

The definition pilot demonstrated that relative F scores based on concept
recall and adjusted response length are stable when computed using
different human assessor judgments, and reflect intuitive judgments of
quality.  The main measure used in the pilot strongly emphasized recall,
but varying the F measure's beta parameter allows different user preferences
to be accommodated as expected.

Definition questions will be included as a part of the TREC 2003 QA track.
The F score based on concept recall and adjusted response length is the
proposed metric to be used in the TREC track for definition questions.