Workshop on: Evaluation Issues

Evaluation has two distinct but complementary purposes. The first purpose of evaluation is to allow researchers and system developers to understand how their system is working, with the goal of improving its performance. The second purpose is to allow groups outside a system to understand its strengths and weaknesses and to do cross-system comparison. TREC has a need for both types of evaluation, and the workshop discussed the general issue of evaluation and specifically how to improve evaluation in TREC-2.

The evaluation measures used in TREC-1 were standard recall and precision measures for each set of system results, plus a listing for each topic showing the best, worst, and median performance on that topic. These listings were helpful to TREC groups in finding which topics were easiest (or hardest) for their systems to handle. The standard recall and precision measures allowed some cross-system comparison, although any absolute ranking of systems is impossible because of the different levels of user and/or developer effort that must be part of a full evaluation.

Several improvements were suggested for TREC-2 (all of which will be implemented).

1. More documents need to be submitted for use in evaluation. The artificially low cutoff of 200 documents meant that the recall/precision figures were accurate only to about 40% recall, since all documents retrieved after rank 200 were marked non-relevant for evaluation purposes. Although all evaluation results were equally hurt by this, a higher cutoff (500 or 1000 documents) would allow more accurate performance measurement. (A small sketch of this cutoff effect appears at the end of this section.)

2. Better methods are needed to deal with systems that use a variable thresholding method, in which a threshold is set for each topic and far fewer than 200 documents may be submitted for evaluation on some topics. This particularly applies to routing.

3. Clearer definitions of automatic, manual, and feedback methods for constructing queries are needed.

Additionally, several suggestions were made that are harder to implement (or possibly even to define), but would be helpful to researchers and system developers.

1. Provide two levels of relevance judgments, with the levels being "on topic" and "meets all criteria". The effects of the complex narrative in the topic could then be more easily separated from searching effects. However, this might make the relevance assessments more subjective and create too many relevant documents for some topics.

2. Find some method of marking relevant paragraphs or sentences. This would be particularly useful for machine learning algorithms, including relevance feedback.

Finally, workshop participants had several suggestions to increase the general "learning level" of the conference.

1. Strongly urge all participants to obtain the evaluation programs from Chris Buckley to allow more internal evaluation.

2. Strongly urge all participants to modularize their various techniques so that the effect of each technique can be separated. For example, it was suggested that all routing results using any type of training also report baseline performance without training. The TREC evaluation program could be used internally for this purpose.
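To make the cutoff effect concrete, the sketch below computes interpolated precision at the standard recall levels from a single ranked list, treating every document below the submission cutoff as non-relevant. It is a minimal illustration only, assuming binary relevance judgments and made-up document identifiers; it is not the official TREC evaluation program available from Chris Buckley.

```python
# Illustrative sketch only (not the official TREC evaluation program).
# Assumes binary relevance judgments and one ranked list per topic.

def precision_at_recall_levels(ranking, relevant, cutoff, levels=11):
    """Interpolated precision at recall levels 0.0, 0.1, ..., 1.0,
    treating every document ranked below `cutoff` as non-relevant."""
    judged = ranking[:cutoff]          # documents past the cutoff are ignored
    hits, points = 0, []
    for rank, doc in enumerate(judged, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # interpolated precision: best precision at any recall >= this level
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve

# Hypothetical topic with 500 relevant documents: even a perfect system
# submitting only its top 200 documents can reach at most 200/500 = 40% recall,
# so precision at the higher recall levels is reported as zero.
perfect_run = [f"doc{i}" for i in range(500)]
relevant = set(perfect_run)
print(precision_at_recall_levels(perfect_run, relevant, cutoff=200))
# -> 1.0 at recall levels 0.0 through 0.4, 0.0 at levels 0.5 through 1.0
```

As the example suggests, raising the cutoff to 500 or 1000 submitted documents extends the portion of the recall/precision curve that can be measured accurately for topics with many relevant documents.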