Design and Evaluation of the CLARIT-TREC-2 System

David A. Evans¹,² and Robert G. Lefferts²

¹Laboratory for Computational Linguistics, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213-3890
²CLARIT Corporation, Suite 200A, 319 South Craig St., Pittsburgh, Pennsylvania 15213-3726

1 Introduction

The CLARIT team used the opportunity of the TREC-2 evaluations to explore several facets of the CLARIT system. In particular, given the performance of the CLARIT system on TREC-1 tasks (Evans et al. 1993), we focused our attention on evaluating

1. fully-automatic processing of topics and potentially-relevant documents and
2. topic/query augmentation using CLARIT thesaurus-discovery techniques.

All of the results we report in this paper follow from straightforward applications of base-level CLARIT processing, utilizing essentially the same CLARIT components that were employed in the CLARIT-TREC-1 system. The general improvements we observe in CLARIT-TREC-2 processing are attributable to modifications (especially simplifications) in processing steps and in the settings of system variables.

In the following sections, we describe the CLARIT-TREC-2 system, report our official processing results, and offer a brief analysis of performance. In addition, we report on several subsequent experiments we have conducted on the TREC-2 collection that test the parameters of the CLARIT-TREC-2 system and identify sources of immediate improvements in processing.

2 CLARIT-TREC-2 System Description and Processing Method

The CLARIT-TREC-2 system reflects a re-organization of the tools and techniques employed in the CLARIT-TREC-1 system. One of our principal goals was to streamline CLARIT processing and to establish a baseline method that is amenable to parameterization and analysis. As a consequence, the flow of data in the CLARIT-TREC-2 system is simple, straightforward, and efficient; furthermore, all CLARIT processing is fully automatic.

2.1 Changes from TREC-1

The essential differences between the CLARIT-TREC-1 and TREC-2 systems are in the preparation and evaluation of queries (TREC-2 "topics") and the automation of steps designed to identify and process potentially relevant documents for use in query augmentation. The following summaries highlight these points.

* One-Pass Querying. The CLARIT-TREC-1 system employed a two-step process to retrieve documents: a first pass for partitioning ("evoking") and a second pass for final ranking ("discrimination"). This has been eliminated in the CLARIT-TREC-2 system. Querying takes place in one step over the entire collection using vector-space-retrieval methods.
* Automatic Query Creation. The CLARIT-TREC-1 system was categorized as a "manual" system, though the required manual intervention was minimal. In particular, users were expected to assign an importance coefficient (with possible values "1", "2", or "3") to the CLARIT-parsed terms in a topic statement and possibly also to add terms to or delete terms from the CLARIT-generated list. In the CLARIT-TREC-2 system, the importance coefficient is assigned automatically by simple heuristics (described below). While users are still free to modify coefficients or terms, such intervention is not required. All "CLARTA" results reported in this paper reflect processing in which queries were fully automatically prepared by the CLARIT system, without review or modification.
* Automatic Retrieval Refinement. When processing ad-hoc queries, the CLARIT-TREC-1 system required that the user evaluate a few of the top-ranked retrieved documents. User-nominated documents were processed to identify terms for use in supplementing the source query. In the CLARIT-TREC-2 system, user evaluations are not required. Initial querying is accurate enough to support the automatic processing of the highest-scoring retrieved documents without 'inspection'.
* Sub-Document Processing. The CLARIT-TREC-1 system treated all documents as whole texts; retrieval 'scores' were calculated over full documents. The CLARIT-TREC-2 system treats all documents as collections of one or more sub-documents, operationalized as variable-sized units of approximately paragraph length. Such units are used as the basis for all statistical calculations and for measuring 'similarity' to a query. A full document is assigned the score (e.g., for ranking) of the highest-scoring sub-document it contains.

2.2 Processing Method

Figure 1 offers a schematic overview of processing in the CLARIT-TREC-2 system. All topics were parsed for noun phrases. These, in turn, were either manually ("CLARTM") or automatically ("CLARTA") assigned weights (values "1", "2", or "3") for 'importance'. The terms for each topic were automatically supplemented with terms from a (pseudo-)thesaurus, automatically extracted from available known-relevant documents (in the case of routing topics) or from the top-ranked sub-documents returned in a first-pass querying of the TREC-2 collection (in the case of ad-hoc topics). All instances of retrieval took place over the applicable full set of documents, which had undergone an initial round of CLARIT processing (parsing).

The CLARIT-TREC-2 system incorporates a vector-space retrieval system that uses several CLARIT-specific techniques to improve retrieval results. The principal techniques involve the use of (1) natural-language processing to identify and normalize indexing terms, (2) fully automatic query augmentation based on CLARIT thesaurus discovery, and (3) simple text-analysis heuristics to approximate the effect of more sophisticated discourse analysis of texts. These techniques are described in greater detail in the following sections.
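
The sub-document scoring rule described in Section 2.1 (a document receives the score of its highest-scoring sub-document) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the CLARIT system: the bag-of-words term vectors, the cosine measure, the blank-line paragraph splitter, and all function names are hypothetical stand-ins for CLARIT's noun-phrase terms, its similarity measure, and its variable-sized sub-document units.

```python
import math
import re
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (a stand-in for NP-based index terms)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sub_documents(document):
    """Assumption: blank-line-separated paragraphs approximate sub-documents."""
    return [p.strip() for p in re.split(r"\n\s*\n", document) if p.strip()]


def score_document(query, document):
    """Score a document as the best score among its sub-documents."""
    q = term_vector(query)
    scores = [cosine(q, term_vector(s)) for s in split_sub_documents(document)]
    return max(scores, default=0.0)


def rank(query, documents):
    """Return (doc_id, score) pairs, highest score first."""
    scored = [(doc_id, score_document(query, text))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Under these assumptions, a long document with a single on-topic paragraph is ranked by that paragraph alone, so unrelated material elsewhere in the document does not dilute its score.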
2.2.1 Natural-Language Processing

CLARIT natural-language processing (NLP) encompasses an inflectional morphological analyzer for word recognition and normalization and a deterministic rule-based parser for phrase identification. For TREC-2 processing, only simplex noun phrases (NPs) were used. Simplex NPs are phrasal constituents that include the modifiers and head noun(s) of an NP but not the post-head prepositional phrases, relative clauses, or verb constructions. The CLARIT parser can provide a more complex linguistic analysis of texts, but such additional detail was not used in TREC-2 experiments.

[Figure 1: Schematic overview of processing in the CLARIT-TREC-2 system. Labels recoverable from the figure: Topic Source; Relevant Documents (Ad-Hoc Queries); Heuristics; Optional Manual Correction; Thesaurus Extraction; Retrieval Corpus; Training Corpus (possibly identical); Parse; Query Vector Construction; Vector-Space.]
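
The definition of a simplex NP in Section 2.2.1 can be made concrete with a toy sketch. This is not the CLARIT parser: it assumes the input has already been tokenized and tagged with a simplified, hypothetical tag set (`ADJ`, `NOUN`, `DET`, `PREP`, and so on), it treats any maximal run of adjectives and nouns that ends in a noun as a simplex NP, it drops determiners, and it performs no morphological normalization beyond lowercasing.

```python
# Tags allowed inside a simplex NP (assumed, simplified tag set).
MODIFIER_TAGS = {"ADJ", "NOUN"}


def simplex_noun_phrases(tagged_tokens):
    """Extract modifier + head-noun runs from (word, tag) pairs.

    Post-head material (prepositional phrases, relative clauses, verbs)
    ends the current phrase instead of being attached to it.
    """
    phrases = []
    current = []

    def flush():
        # Trim trailing non-nouns so that the phrase ends in a head noun.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in MODIFIER_TAGS:
            current.append((word.lower(), tag))
        else:
            flush()
    flush()
    return phrases


sentence = [
    ("the", "DET"), ("rapid", "ADJ"), ("retrieval", "NOUN"),
    ("of", "PREP"), ("relevant", "ADJ"), ("documents", "NOUN"),
    ("from", "PREP"), ("large", "ADJ"), ("text", "NOUN"),
    ("collections", "NOUN"),
]
print(simplex_noun_phrases(sentence))
```

In this example the prepositional phrases are not attached to "retrieval"; each one instead yields its own simplex NP, which mirrors the exclusion of post-head material described above.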