Workshop on: Evaluation Issues

Evaluation has two distinct but complementary purposes. The first purpose of evaluation is to allow researchers and system developers to understand how their system is working, with the goal of improving its performance. The second purpose is to allow groups outside a system to understand its strengths and weaknesses and to do cross-system comparison. TREC has a need for both types of evaluation, and the workshop discussed the general issue of evaluation and specifically how to improve evaluation in TREC-2.

The evaluation measures used in TREC-1 were standard recall and precision measures for each set of system results, plus a listing for each topic showing the best, worst, and median performance on that topic. These listings were helpful to TREC groups in finding which topics were easiest (or hardest) for their systems to handle. The standard recall and precision measures allowed some cross-system comparison, although any absolute ranking of systems is impossible because of the different levels of user and/or developer effort that must be part of a full evaluation.

Several improvements were suggested for TREC-2 (all of which will be implemented).

1. More documents need to be submitted for use in evaluation. The artificially low cutoff of 200 documents meant that the recall/precision figures were accurate only to about 40% recall, since all documents retrieved after rank 200 were marked non-relevant for evaluation purposes. Although all evaluation results were equally hurt by this, a higher cutoff (500 or 1000 documents) would allow more accurate performance measurement. (A small sketch of this cutoff effect appears at the end of this section.)

2. Better methods are needed to deal with systems that use a variable thresholding method, in which a threshold is set for each topic and far fewer than 200 documents may be submitted for evaluation on some topics. This particularly applies to routing.

3. Clearer definitions of automatic, manual, and feedback methods for constructing queries are needed.

Additionally, several suggestions were made that are harder to implement (or possibly even to define), but would be helpful to researchers and system developers.

1. Provide two levels of relevance judgments, with the levels being "on topic" and "meets all criteria". The effects of the complex narrative in the topic could then be more easily separated from searching effects. However, this might make the relevance assessments more subjective and create too many relevant documents for some topics.

2. Find some method of marking relevant paragraphs or sentences. This would be particularly useful for machine learning algorithms, including relevance feedback.

Finally, workshop participants had several suggestions to increase the general "learning level" of the conference.

1. Strongly urge all participants to obtain the evaluation programs from Chris Buckley to allow more internal evaluation.

2. Strongly urge all participants to modularize their various techniques so that the effect of each technique can be separated. For example, it was suggested that all routing results using any type of training also report baseline performance without training. The TREC evaluation program could be used internally for this purpose.
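To make the cutoff effect concrete, the sketch below computes interpolated precision at the standard recall levels from a single ranked list, treating every document below the submission cutoff as non-relevant. It is a minimal illustration only, assuming binary relevance judgments and made-up document identifiers; it is not the official TREC evaluation program available from Chris Buckley.

```python
# Illustrative sketch only (not the official TREC evaluation program).
# Assumes binary relevance judgments and one ranked list per topic.

def precision_at_recall_levels(ranking, relevant, cutoff, levels=11):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0,
    treating every document ranked below `cutoff` as non-relevant."""
    judged = ranking[:cutoff]          # documents past the cutoff are ignored
    hits, points = 0, []
    for rank, doc in enumerate(judged, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # interpolated precision: best precision at any recall >= this level
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Hypothetical topic with 500 relevant documents: even a perfect system
# submitting only its top 200 documents can reach at most 200/500 = 40% recall,
# so precision at the higher recall levels is reported as zero.
perfect_run = [f"doc{i}" for i in range(500)]
relevant = set(perfect_run)
print(precision_at_recall_levels(perfect_run, relevant, cutoff=200))
# -> 1.0 at recall levels 0.0 through 0.4, 0.0 at levels 0.5 through 1.0
```

As the example suggests, raising the cutoff to 500 or 1000 submitted documents extends the portion of the recall/precision curve that can be measured accurately for topics with many relevant documents.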