This file summarizes the 2004 pilot evaluation of relationship questions,
performed by NIST and AQUAINT program contractors. The first part of the
summary gives an overview of the pilot, and the second part describes the
format of the data files resulting from the pilot. The data files include:

   1. a set of 50 relationship topics and an answer to each one as compiled
      by their authors ("answers")
   2. a list of concepts that should be included in the evidence for the
      answers to the questions for each topic, as determined by an assessor
      ("nuggets")
   3. one or more document ids for each nugget of evidence ("autonuggetdocs"
      and "manualnuggetdocs")


1. AQUAINT Pilot for Relationship Questions
===========================================

The purpose of the relationship questions pilot was to examine the issues
involved in evaluating how well computer systems can locate "evidence" for
certain kinds of relationships in a collection of documents. A group meeting
with analysts in early 2004 resulted in the understanding that a
"relationship" is the ability of one object to influence another. Evidence
for a relationship includes both the means to influence something and the
motivation for doing so. Relationships can involve both entities and events.
Eight types of relationships ("spheres of influence") were noted:

   * financial
   * movement of goods
   * family ties
   * communication pathways
   * organizational ties
   * co-location
   * common interests
   * temporal connection

The particular relationships of interest depend on the analyst, situation,
and purpose. A major concern is recognizing when evidence for a suspected tie
is lacking and determining whether the lack is because the tie doesn't exist
or because it is being hidden or overlooked. The analyst needs sufficient
information to establish confidence in any evidence given.

Pilot Task Description
----------------------

The AQUAINT relationship pilot used TREC-like "topic" statements to set up a
context for each question. The topics were developed by 13 military analysts
who searched the AQUAINT collection looking for appropriate topics. The
AQUAINT collection covers the time period of 1998--2000 and consists of news
stories taken from the New York Times, the AP newswire, and the English
portion of the Xinhua newswire (see LDC catalog number LDC2002T31). The
military analysts created mini-scenarios for which relevant information was
contained in the collection. From these scenarios, an analyst from the NSA
selected and developed the 50 topics that were used for the evaluation pilot.

Each topic statement set the context for a question that asked for evidence
for one of the types of relationships listed above. The question was either a
yes/no question that was to be understood as a request for evidence
supporting the answer ("[w]ill the Japanese use force to defend the
Senkakus?"), or a request for the evidence itself ("What types of disputes or
conflict between the PLA and Hong Kong residents have been reported?").
Sometimes multiple subquestions were embedded in a single topic ("Who were
the participants in this spy ring, and how are they related to each other?").

The participating systems were given the 50 relationship topics and the
AQUAINT document collection. Systems were to return one list of text snippets
per topic such that each item in the list was a piece of evidence supporting
the answer to the question in the topic. There were no limits placed on
either the length of an individual snippet or on the number of snippets in a
list, though systems knew they would be penalized for retrieving extraneous
information.

The format of a response was the same as for the AQUAINT Definition Pilot,
namely a file containing lines of the form

   topic-number run-tag doc-id evidence-string

where run-tag is a string that is used as a unique identifier for the run,
and evidence-string is the piece of evidence derived (extracted, concluded,
etc.) from the given document.
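As an illustration only, the following Python sketch shows one way a response
file in this format could be read. It assumes whitespace-separated fields
with the evidence string occupying the remainder of the line; the function
name and return structure are hypothetical, not part of the pilot
specification.

   from collections import defaultdict

   def read_responses(path):
       # Returns {topic_number: [(run_tag, doc_id, evidence_string), ...]}.
       # Assumes each line is: topic-number run-tag doc-id evidence-string
       responses = defaultdict(list)
       with open(path) as f:
           for line in f:
               line = line.strip()
               if not line:
                   continue
               topic, run_tag, doc_id, evidence = line.split(None, 3)
               responses[topic].append((run_tag, doc_id, evidence))
       return responses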
Evaluation of System Responses
------------------------------

Evaluation of the nuggets returned by the systems was done as in the AQUAINT
Definition Pilot. (See the bottom of
http://trec.nist.gov/data/qa/add_qaresources.html for details.) For each
topic, an assessor first used the topic author's answer and the responses
from all the systems to create a list of "information nuggets" representing
evidence for the answer. An information nugget was defined as a fact for
which the assessor could make a binary decision as to whether a response
contained the nugget. The assessor then decided which nuggets were vital
pieces of evidence and which were merely "okay". Finally, the assessor went
through each of the system responses and marked where each nugget appeared
in the response. If a system returned a particular nugget more than once, it
was marked only once.

Precision and recall for a response were computed over the nuggets. Recall
was computed as the ratio of the number of correct vital nuggets retrieved
to the number of vital nuggets in the assessor's list. Precision was
approximated using the length of the response: the length-based measure gave
an allowance of 100 (non-white-space) characters for each vital or okay
nugget retrieved. The precision score was set to one if the response was no
longer than this allowance. If the response was longer than the allowance,
the precision score was downgraded using the function

   1 - [(length - allowance) / length]

The final score for a response was computed using the F-measure, a function
of both recall (R) and precision (P). The general version of the F-measure is

   F = (beta^2 + 1)RP / (beta^2 P + R)

where beta is a parameter signifying the relative importance of recall and
precision. The evaluation in the pilot used a value of beta=3, indicating
that recall is three times as important as precision.
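To make the length-based scoring concrete, here is a minimal sketch of the
computation just described. It assumes the counts of matched vital and okay
nuggets and the non-white-space length of the response are already known;
the function and parameter names are illustrative and not part of the
official scoring code.

   def f_score(vital_matched, okay_matched, total_vital, response_length,
               beta=3.0):
       # Recall: fraction of the assessor's vital nuggets found in the response.
       recall = vital_matched / total_vital if total_vital else 0.0
       # Length allowance: 100 non-white-space characters per nugget retrieved.
       allowance = 100 * (vital_matched + okay_matched)
       if response_length <= allowance:
           precision = 1.0
       else:
           precision = 1.0 - (response_length - allowance) / response_length
       # F-measure with beta = 3 (recall weighted three times as heavily).
       if precision == 0.0 and recall == 0.0:
           return 0.0
       return (beta**2 + 1) * recall * precision / (beta**2 * precision + recall)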
System Results
--------------

Four groups participated in the pilot, submitting a total of six runs
(labeled A-F). The following table shows the average response length,
F-score (beta=3), recall, and precision for each run:

   Run      Avg. length   F(b=3)   Recall     Precision
   -----    -----------   ------   -------    ---------
   Run-A         527.48    0.429   0.4595     0.42616
   Run-B         929.78    0.393   0.45918    0.3037
   Run-C        3984.74    0.391   0.66036    0.10722
   Run-D         850.12    0.302   0.35482    0.23086
   Run-E         689.16    0.298   0.34448    0.26058
   Run-F         855       0.292   0.34418    0.22878

Because the scoring metric favored recall over precision, systems were able
to improve their overall rankings by returning long responses. In particular,
Run-C returned responses that were more than four times as long on average as
those of any other run, and managed to rank third overall despite having the
lowest precision of all the runs. However, not all the runs relied on long
responses to achieve a high score: the highest-scoring run (Run-A) also
returned the shortest responses. Among the six runs, the systems were able to
find over 85% of the 151 vital nuggets in the assessors' list.

Conclusion
----------

Participants in this relationship pilot generally expressed satisfaction with
the exercise, especially with the additional context provided by the topics.
However, there were several requests for document ids for all the nuggets.
NIST reverse-engineered this list after the evaluation, but it would ideally
be created by the assessors at the time they create the nuggets list.
Requiring assessors to include a document id for each nugget would improve
the quality of the nuggets; sometimes the glosses of the nuggets were so
vague (e.g., "NBC production") that it would be difficult for anyone other
than the author to understand the intended meaning of the "evidence" without
reference to a source document.

2. Data Files
=============

"answers"
---------

The "answers" file is a compilation of notes written by the analysts who
created the topics. For each topic there is the topic id, the topic itself,
the answer to the question, the document ids of documents containing evidence
for the answer, and a gloss of the evidence.

"nuggets"
---------

The nuggets that were used for assessing the runs are in the "nuggets" file.
The format of each line is:

   topic-number nugnum vital|okay evidence-string

where topic-number is the topic number, nugnum is the nugget number,
vital|okay indicates whether the nugget of evidence is "vital" or "okay", and
evidence-string is the gloss of the evidence.

"autonuggetdocs" and "manualnuggetdocs"
---------------------------------------

These two files contain document ids for the nuggets in the "nuggets" file
and were created after the assessment files were returned to the pilot
participants. The document ids either come from the systems' responses or
were generated manually based on inspection of the "nuggets" and "answers"
files. Each line in the "autonuggetdocs" and "manualnuggetdocs" files is in
the format:

   topic_id nugget_id vital|okay doc_id string

where each (topic_id, nugget_id) pair is labeled as either "vital" or "okay",
based on the "nuggets" file.

"autonuggetdocs": contains system responses for each nugget for which some
system returned a matching response. doc_id and string are the document id
and evidence string returned by the system that match this nugget_id for
this topic_id. Multiple (doc_id, string) pairs may be included, one per line,
for each (topic_id, nugget_id) pair, but duplicates are removed.

"manualnuggetdocs": if a nugget was not matched by any system's response, a
document id was found manually for the nugget using queries to the document
collection, based on the "nuggets" and "answers" files; string is the gloss
of the evidence, as given in the "nuggets" file. Only one doc_id is provided
for each (topic_id, nugget_id) pair. If no document could be found for the
nugget or if the intended meaning of the nugget string was not inferable,
then "UNDOCUMENTED" appears in the doc_id field.

(N.B. Nugget 8 of topic 4 seems to actually belong to topic 5.)
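For convenience, the following Python sketch shows one way the "nuggets" file
and the two nugget-document files could be parsed. It assumes
whitespace-separated fields with the final string occupying the remainder of
the line; the function names and the handling of "UNDOCUMENTED" entries are
assumptions, not part of the official distribution.

   def read_nuggets(path):
       # Returns {(topic_number, nugget_number): (vital_or_okay, evidence_string)}.
       # Assumes each line is: topic-number nugnum vital|okay evidence-string
       nuggets = {}
       with open(path) as f:
           for line in f:
               line = line.strip()
               if not line:
                   continue
               topic, nugnum, importance, evidence = line.split(None, 3)
               nuggets[(topic, nugnum)] = (importance, evidence)
       return nuggets

   def read_nugget_docs(path):
       # Returns {(topic_id, nugget_id): [(vital_or_okay, doc_id, string), ...]}.
       # Works for either "autonuggetdocs" or "manualnuggetdocs"; entries whose
       # doc_id is "UNDOCUMENTED" are kept and can be filtered by the caller.
       docs = {}
       with open(path) as f:
           for line in f:
               parts = line.strip().split(None, 4)
               if not parts:
                   continue
               if len(parts) == 4:      # tolerate a missing trailing string
                   parts.append("")
               topic_id, nugget_id, importance, doc_id, string = parts
               docs.setdefault((topic_id, nugget_id), []).append(
                   (importance, doc_id, string))
       return docs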