CLARIT TREC Design, Experiments, and Results

David A. Evans, Robert G. Lefferts, Gregory Grefenstette, Steven K. Handerson, William R. Hersh, Armar A. Archbold
Laboratory for Computational Linguistics
Department of Philosophy
Carnegie Mellon University

1 Introduction

This report presents an abbreviated description¹ of the approach and the results of the CLARIT team in completing the tasks of the "Text Retrieval Conference" (TREC) organized by the National Institute of Standards and Technology (NIST) and the Defense Advanced Research Projects Agency (DARPA) in 1992.²

1.1 A Characterization of the TREC Tasks

TREC activities required participants to 'retrieve' 200 documents for each of 100 different 'topics' from a large database of full-text documents. Each topic was given as a one-page description of an item of interest. This feature of the TREC tasks was somewhat unusual, at least compared to many traditional 'bibliographic' retrieval evaluations, in which the topic or 'query' is a minimal, often telegraphic, single-phrase statement of a 'subject' or 'an interest'.³ However, the principal distinguishing features of the TREC tasks were (1) their scale, involving a total of approximately 2 gigabytes of text, representing approximately 750,000 full-text documents of varying length, and (2) the careful attention of the organizers in evaluating the results submitted by each participating group.

More specifically, TREC tasks were designed to simulate two general types of information retrieval situations, "routing" and "ad-hoc" querying. "Routing" corresponds to situations in which a topic is possibly well documented (e.g., with examples) and the user desires to find more similar documents. In the case of TREC tasks, 50 topics were designated as "routing" topics; each was accompanied by a set of documents judged to be "relevant" to the topic.⁴ The first installment of the full set of documents, representing approximately 1.1 gigabytes of text, was available to each team for use in identifying possible other relevant documents for each routing topic.

¹A more complete and detailed description of the CLARIT-TREC activities and results is available as a technical report [Evans et al., in preparation].
²The TREC activities were organized at the end of 1991. Data was made available in the Spring of 1992. All processing results were submitted by September 1, 1992, to NIST. The "Conference" itself, a Workshop involving the approximately two dozen groups that submitted partial or full processing results, took place on November 4-6, 1992, in Rockville, MD.
³The longer statements of 'topics' in TREC were arguably more interesting as a test of systems and more representative of many contemporary information-seeking situations. See Figure 9 for a sample topic statement.
⁴The number of sample relevant documents varied greatly from topic to topic. Some topics had almost 100 sample relevants; others had only about ten.

The rules of the exercise required each group to submit 'models' for each routing topic (e.g., a set of procedures or a 'query vector'), which then were 'on record' and had to be used in the final evaluation of the routing task. That evaluation required that each group retrieve documents from the second installment of documents, approximately 0.9 gigabytes of text.

"Ad-hoc" querying corresponds to situations in which a topic is presented to a system and appropriate documents must be found; no example documents are available. In TREC, the second 50 topics were designated as "ad-hoc-query" topics.
The rules required each group to use the full 2-gigabyte database as the search space for ad-hoc queries. All results were reported as a ranked list of the 200 'top' documents in response to each topic, whether a routing topic or an ad-hoc-query topic.

1.2 Notes on CLARIT Team Participation

The CLARIT team submitted results, labeled "A" and "B",⁵ representing the top 200 documents at the end of each of two sequential steps in the processing of topics. Since the actual processing of topics was designed to give 'best' results only after both stages of processing were completed, the "A" results are known to be suboptimal; the "B" results represent the true test of the CLARIT-TREC design.

The large scale of the tasks challenged the resources that were available to the CLARIT team. Storage for the source data and topics alone required 2 gigabytes of space. The research-prototype version of the CLARIT system, which was used in the task, generates various secondary and intermediate resources in the course of processing. Such intermediate files also require temporary storage. In all, approximately 8 gigabytes of disk space was used for the process. The system-engineering work required to manage the data represented a significant effort for the team; more than 75% of the team effort was devoted to (a) re-implementing critical CLARIT processes to deal with larger volumes of data and limited space and (b) monitoring and directing the use of resources and the sequence of processes when making actual 'runs' over the data.

Data for the final tests was made available from NIST only after preliminary processing results were submitted. The CLARIT team submitted its preliminary results (the 'frozen' forms of the routing queries) on Friday, August 21, 1992. NIST express-mailed the new test data to Carnegie Mellon on the same day, but the package was misaddressed and did not arrive. A second mailing finally did arrive on Tuesday, August 25, one week before the deadline for final results. Thus, all final processing took place in seven days.

The CLARIT team utilized, variously, six machines (including a DECsystem 5820 and DECstation 5000s and 3100s) and the approximately 8 gigabytes of dedicated storage for TREC-processing tasks. Actual processing occurred in batch mode over several machines and across a network (as some storage was remote).

2 Background Description of Basic CLARIT Processing in TREC

Basic CLARIT processing is described elsewhere.⁶ A schematic representation of the 'standard' CLARIT process for document indexing is given in Figure 1. A representation of the simplified CLARIT process that was employed in the case of CLARIT-TREC document indexing is given in Figure 2.

⁵The Conference provided a special category ("Category B") for groups that intended to work only with a subset (100 megabytes) of the TREC data. This should not be confused with what we call the CLARIT "A" and "B" results: all CLARIT processing involved the full set of TREC data.
⁶Cf. [Evans 1990], [Evans et al. 1991a,b,c].
[Figure 1: 'Standard' CLARIT Indexing Overview. Schematic: text formatting and preparation; NLP (morphology, parsing, candidate NPs) using the 'core' lexicon (100,000 items), an optional sub-domain lexicon, and a 'heuristic' grammar producing 'simplex' NPs (optionally "complex" NPs or "full sentence" constituents); NP scoring via term/document statistics; matching, scoring, and filtering against a '1st-order' thesaurus (a flat list of terms with implicit compositional, hierarchical structure); output of a general set of terms ("exact", "novel", "general") and a specific set of terms.]

[Figure 2: Modified CLARIT Indexing in TREC. Schematic: text formatting and preparation; NLP (morphology, parsing, candidate NPs) using the 'core' lexicon (100,000 items) and 'heuristic' grammar producing 'simplex' NPs; term/document statistics over NPs and words; output of a general set of terms.]

2.1 Selective NLP to Nominate Information Units

In brief, the CLARIT indexing process as shown in Figure 1 involves several steps, one of which utilizes selective natural-language processing (NLP) to identify noun phrases (NPs) in texts, which are taken as the relevant information units in all further processing. Subsequent steps take advantage of several statistical measures of 'importance' to evaluate NPs as potential index terms. One special feature of CLARIT processing is the use of an automatically generated 'first-order' thesaurus for a domain to support the selection of appropriate terms. The standard CLARIT process returns three categories of index terms: (1) terms that occur in the document and exactly match terms in the thesaurus, (2) terms that are in the thesaurus and are more general than near-matching terms in the document, and (3) terms that are 'novel' to the document and not found in the thesaurus. In addition to being categorized as an exact, general, or novel index term, each term is given a numerical relevance weight deemed to reflect its relative value in characterizing the contents of the document.

2.2 'Thesaurus Discovery' to Nominate Sets of Terms for Collections

First-order thesauri are 'discovered' via another CLARIT process, distinct from indexing. The process requires a sample of documents representing a 'domain'. The sample must be moderately large (e.g., minimally 2 megabytes of text) and must be composed of documents that are more or less 'about' the topic of the domain.⁷

In general, CLARIT 'thesaurus discovery' comprises algorithms and techniques for clustering phrases in collections of documents to construct first-order thesauri that optimally 'cover' an arbitrary percentage of all the terminology in the domain represented by the document collection. 'Normal' thesaurus discovery involves (1) decomposition of candidate NPs from the documents to build a term lattice in which nodes are organized hierarchically from words to phrases based on the number of phrases subsumed by the term associated with each node and (2) selection of nodes that have high subsumption scores and that also satisfy certain structural and statistical characteristics (such as being legitimate NPs, well distributed in the corpus, and relatively uncommon in general English). Terms thus selected represent a subset of vocabulary that accurately characterizes the domain.
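To make the subsumption idea concrete, the following Python sketch (our own minimal illustration, not the CLARIT implementation; the function names, thresholds, and selection fraction are assumptions) decomposes candidate NPs into sub-phrases, scores each sub-phrase by the number of distinct phrases it subsumes, requires a minimum spread across documents, and keeps the highest-scoring terms as a flat first-order thesaurus.

    from collections import defaultdict

    def subterms(np_words):
        # All contiguous sub-phrases (including single words) of a simplex NP.
        n = len(np_words)
        return {" ".join(np_words[i:j]) for i in range(n) for j in range(i + 1, n + 1)}

    def discover_thesaurus(docs, min_docs=3, top_fraction=0.05):
        # docs: list of documents, each given as a list of candidate-NP strings.
        # Returns high-subsumption, well-distributed terms as a flat term list.
        subsumed = defaultdict(set)   # sub-phrase -> set of full NPs it subsumes
        doc_freq = defaultdict(set)   # sub-phrase -> set of documents it occurs in
        for doc_id, nps in enumerate(docs):
            for np in nps:
                for t in subterms(np.split()):
                    subsumed[t].add(np)
                    doc_freq[t].add(doc_id)
        candidates = [t for t in subsumed if len(doc_freq[t]) >= min_docs]
        candidates.sort(key=lambda t: len(subsumed[t]), reverse=True)
        keep = max(1, int(top_fraction * len(candidates)))
        return candidates[:keep]

A further filter against a general-English word list (omitted here) would stand in for the 'relatively uncommon in general English' criterion mentioned above.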
Thesaurus discovery is quite fast⁸ and typically yields a subset of terminology that represents less than 5% of all the available terms in the corpus.⁹

Since the TREC experiments involved a heterogeneous collection of documents and since it was not possible to identify specific subsets of documents in the database as 'about' one or another topic, it was not possible to discover and use relevant thesauri in TREC tasks. Thus, as shown in Figure 2, the simplified CLARIT indexing process in TREC tasks did not involve matching of terms against a first-order thesaurus and did not result in three-way-categorized index terms.

⁷An example of an appropriate sample might be 50 full-text articles involving "AIDS Research"; or 2,000 abstracts about "Silicon Engraving"; or even one's personal file of recent e-mail correspondence, provided it is sufficiently large and topically coherent.
⁸At present, using the CLARIT research system, a thesaurus can be found for a 3-megabyte corpus in less than 10 minutes on a DECstation 5000/200.
⁹In fact, the number of terms returned will vary depending on parameters the user selects when generating the thesaurus.

2.3 Vector-Space 'Similarity' Measures

The principal method used by the CLARIT system in comparing 'information objects' (e.g., in retrieval, in routing) is vector-space distance.¹⁰ The basic metric is that of 'similarity' of terms. 'Similarity' is determined by different procedures in different contexts. Partial or 'fuzzy' matching of terms is facilitated by noting whether terms share words or attested subphrases. For example, in vector-space modeling of documents, the contained words of all terms (in the document vector as well as the query vector) are broken out, giving, in effect, the possibility of matching parts of terms, though, technically, the individual words are realized as independent dimensions of the term space.¹¹

2.4 Notes on the Limited Version of CLARIT Processing in TREC

Because of the time and space limitations in the task, the CLARIT team did not utilize several features of CLARIT processing that normally produce enhanced results. One of the features, the automatic 'tokenization' or identification of proper names, would certainly have assisted processing of some topics. Another feature, the identification of equivalence classes of terms, also would have aided the task. In addition, no attempt was made to establish 'uniform-length' documents or sub-documents (e.g., by setting a maximum word count or sentence length for such units). Though CLARIT processing supports the treatment of documents as sub-document collections, that feature of CLARIT processing was not utilized in the experiments. All topic statements were treated uniformly and simply: no attempt was made to handle implicit or explicit quantification, time intervals, satisfaction conditions, etc., except as literally encoded in the topics. Though CLARIT NLP modules can produce full-sentence analyses or complex-NP analyses, neither of these features was utilized in TREC processing. All documents were processed only for simplex NPs; inevitably, some non-NP information was lost.

In indexing TREC documents, term weights were based on a general IDF-TF score¹² for topic 'domains'. In the case of multi-word terms (the norm), the full term was assigned an independent IDF-TF score, and each word in the term was broken out and assigned an independent IDF-TF score.
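As an illustration of the word 'breakout' and IDF-TF weighting just described, the following Python sketch (a simplification under our own assumptions, not the CLARIT code or its exact weighting) treats each full NP term and each of its contained words as an independent dimension and compares two such vectors with a cosine measure.

    import math
    from collections import Counter

    def idf_tf_vector(nps, doc_freq, n_docs):
        # nps: the NP terms of one document (or query).
        # doc_freq: term/word -> number of documents containing it; n_docs: corpus size.
        counts = Counter()
        for term in nps:
            counts[term] += 1
            for word in term.split():     # 'break out' the contained words
                counts[word] += 1
        vec = {}
        for unit, tf in counts.items():
            idf = math.log(n_docs / (1 + doc_freq.get(unit, 0)))
            vec[unit] = tf * max(idf, 0.0)
        return vec

    def cosine_similarity(u, v):
        dot = sum(w * v[k] for k, w in u.items() if k in v)
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

Because shared words contribute to the dot product even when full terms differ, a query term such as "mci financial health" can partially match a document term such as "financial data".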
While all CLARIT processing is designed to be fully automatic, we did not employ fully automatic processing in TREC tasks. In particular, there were two steps in the CLARIT-TREC process that required non-automatic processing: (1) initial review and weighting of the index terms automatically nominated and derived from each topic statement and (2) review of first-pass retrieved documents to identify 5-10 relevant ones for 'feedback'. The two steps involved minimal user intervention (and, in fact, required very little time and effort); however, they do qualify the CLARIT-TREC system as a manual process.

In general, we regard the CLARIT-TREC system as a minimal system for purposes of evaluation. The results of CLARIT-TREC processing are useful in helping us establish baseline performance for core but abbreviated CLARIT functions.

¹⁰Cf. [Salton & McGill 1983] for background on vector-space modeling in information retrieval applications.
¹¹Cf. [Evans et al. 1992] and [Hersh et al. 1992] for an evaluation of CLARIT vector-space 'similarity' measures.
¹²"IDF-TF" represents the standard inverse document frequency × intradocument term frequency score for terms.

3 Overview of CLARIT-TREC Processing

There were three major phases of processing for the CLARIT-TREC retrieval experiments. Initially, the entire corpus, along with the topic statements, was parsed to extract candidate NPs via CLARIT NLP. In the special case of topics, the candidate NPs were manually reviewed and evaluated to produce weighted query terms. Second, the entire corpus (in noun-phrase form) was passed through a quick, and somewhat rough, ranking procedure that was designed to nominate a large subset of documents for further analysis. This step is referred to as "partitioning". A "partitioning thesaurus", or list of weighted, representative terminology, was automatically created for each topic. In the final phase of processing, referred to as "querying", a "query vector" was produced for each topic. The query vector was used to retrieve (= rank) documents in the selected partition for the topic using a vector-space 'similarity' metric. The details of these phases of processing are presented below, along with a discussion of the different techniques used for "routing" and "ad-hoc" queries.

3.1 Design Philosophy: "Evoke" and "Discriminate"

In approaching the principal TREC task of returning 200 ranked documents for each topic, we used a two-stage processing strategy, illustrated in Figure 3. The first stage of processing was designed to identify candidate documents that seemed likely to contain information related to a topic. Of course, since the topic was represented as a set of weighted terms, this step involved scoring each document based on the set of terms. Because this step involved scoring every document in the database against every topic, it was important to design the scoring procedure so that it was not computationally expensive. In fact, it was based on summing the value and number of 'hits' between the topic's set of terms and the terms (NPs) in each document and was expected to result in an over-generated set of candidate documents. The highest-scoring documents were retained as a candidate 'partition' of the database with respect to the topic. The second stage was designed to find the subset of documents in each partition that best matched the topic. In theory, greater (= more discriminating) processing resources could be devoted to this second-stage task, as the total number of documents involved was small compared to the whole collection.
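The overall shape of this two-stage strategy can be sketched as follows (a schematic Python illustration under our own simplifying assumptions; the vectorization and similarity helpers stand in for the CLARIT components described elsewhere in this report).

    def evoke(documents, topic_terms, partition_size=2000):
        # Stage 1: cheap 'feature scoring' -- sum the weights of topic terms that
        # 'hit' each document's NP set, and keep a large, over-generated partition.
        # documents: doc_id -> set of NP strings; topic_terms: term -> weight.
        scores = {doc_id: sum(w for term, w in topic_terms.items() if term in nps)
                  for doc_id, nps in documents.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:partition_size]

    def discriminate(partition, documents, query_vector, vectorize, similarity, k=200):
        # Stage 2: spend more effort on the small partition only -- re-rank it with
        # a vector-space similarity measure and return the final top k documents.
        reranked = sorted(partition,
                          key=lambda d: similarity(query_vector, vectorize(documents[d])),
                          reverse=True)
        return reranked[:k]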
In practice, as illustrated in Figure 3, partitioning resulted in a set of 2,000 ranked documents. The top 200 documents from the partition were submitted to NIST as the CLARIT "A" set of results. Final querying or 'discrimination' among the documents in each partition yielded another, more accurately ranked set of 200 documents, which were submitted as the CLARIT "B" results.

3.2 Overview of the Task

As Figure 4 shows, different portions of the total TREC database were used for the "routing" and "ad-hoc" phases of the experiment. The routing task required 'training' of the first fifty topics on the first set of data (represented as the darkened block in Figure 4). In the second step of processing, the partitioning and query vectors that were derived from step one were used to identify, first, 2,000-document partitions in the second set of data (represented as a light block in Figure 4) and, second, the top-200 ranked documents in each partition. The ad-hoc query task involved the whole database, but the CLARIT team actually used the first set of data for a preliminary retrieval of documents (based on partitioning). A few (5-10) of the top 20-50 documents were chosen by quick manual inspection to supplement the query vector, and then a second automated round of partitioning over the total database was performed. The final top-200 ranked documents ultimately derived from these second-pass, 2,000-document partitions.

[Figure 3: Overview of CLARIT-TREC Processing. Schematic: partitioning via "feature scoring" yields a 2,000-document ranking per topic, whose top 200 documents are the CLARIT "A" results; discrimination via vector-space "similarity" ranking over the partition yields the final top 200, the CLARIT "B" results.]

[Figure 4: Overview of Processing for "Routing" vs. "Ad-Hoc" Queries. Schematic contrasting the portions of the database used in each task: 200 of 2,000 partitioned documents for routing; 5-10 of 50 reviewed documents and 200 of 2,000 partitioned documents for ad-hoc querying.]

WSJ891102-0187
McDermott International Inc. said its Babcock & Wilcox unit completed the sale of its Bailey Controls Operations to Finmeccanica S.p.A. for $295 million. Finmeccanica is an Italian state-owned holding company with interests in the mechanical engineering industry. Bailey Controls, based in Wickliffe, Ohio, makes computerized industrial controls systems. It employs 2,700 people and has annual revenue of about $370 million.
Figure 5: Sample of Data: Document After Text Formatting

4 Details of the CLARIT-TREC Experiments

Both "routing" and "ad-hoc" query experiments took advantage of basic CLARIT processing. There are several features the two experiments share. The experiments are distinct in that "routing" involved a special step of creating a partitioning thesaurus using larger sets of supplied relevant documents, whereas "ad-hoc" queries involved partitioning the document set once using only automatically derived (but manually weighted) query terms and then choosing a small set of relevant documents to expand the final query vector.

4.1 Preparing Data

Each TREC document had to be formatted for CLARIT processing. This involved making the unique text ID accessible to CLARIT as a special field and delimiting the beginning and end of each text in a file. Figure 5 gives a sample formatted document. As can be seen in the sample, the beginning and end of the record is marked by a backslash followed by "*". The unique ID is set off by a backslash followed by "#". The beginning and end of the text of the document is marked by a backslash followed by "!". Each paragraph is separated from the next by a backslash followed by "C".¹³
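Purely for illustration, the formatting step might look like the following Python sketch; the four marker characters come from the description above, but the exact record layout (ordering, spacing, and line breaks) is our assumption, not CLARIT's specification.

    def clarit_format(doc_id, paragraphs):
        # Wrap a raw TREC document in the record markers described above:
        # \* delimits the record, \# sets off the unique ID, \! delimits the
        # text, and \C separates paragraphs (hypothetical layout).
        body = " \\C ".join(paragraphs)
        return f"\\*\n\\# {doc_id}\n\\! {body} \\!\n\\*"

    record = clarit_format(
        "WSJ891102-0187",
        ["McDermott International Inc. said its Babcock & Wilcox unit completed "
         "the sale of its Bailey Controls Operations to Finmeccanica S.p.A. for "
         "$295 million.",
         "Bailey Controls, based in Wickliffe, Ohio, makes computerized "
         "industrial controls systems."])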
4.2 Processing TREC Corpora (NLP)

Figure 6 gives a schematic representation of the processing steps that occurred subsequent to data formatting. The process labeled "NLP" in the figure includes all the steps illustrated in the "NLP" portion of Figure 2: morphological analysis of words and parsing for simplex NPs. Simplex NPs were extracted for all TREC documents; words were morphologically normalized.¹⁴

¹³Though CLARIT data preparation demarcates paragraph units, the CLARIT-TREC process did not distinguish divisions of text at this level. For CLARIT-TREC purposes, all the text between the "!" marks was used as the source of information about a document. Thus, longer and shorter documents were treated uniformly as 'unit' texts.
¹⁴The manually supplied keywords attached to some TREC documents in a "keyword field" were discarded.

Step 1: Input: Document(s); Process: NLP; Output: Terms-Doc ("Parsed-Doc")
Step 2: Input: Topic(s); Process: NLP; Output: Terms-Topic
Step 3: Input: Terms-Topic; Process: Hand Filter (1. "Eliminate"; 2. "Weight" 3/2/1); Output: Weighted-Terms-Topic ("Source-Query")
Figure 6: Schematic Representation of Data Preparation

WSJ891102-0187
"mcdermott" na mcdermott "international" adj international "inc." ukw? inc. "said" vt-past say vt-pastprt say "its" gen its "babcock" na babcock "\&" *and* and "wilcox" na wilcox "unit" sn unit "completed" vt-past complete vt-pastprt complete "the" d the "sale" sn sale sn sell "of" prep of "its" gen its "bailey" sn bailey "controls" vt-pressg3 control pn control vt-pressg3 control "operations" pn operation "of" prep of "about" prep about "$370" ukw? $370 "million" quant million "\." *period* \.
Figure 7: Sample of Data: Document After Morphological Analysis

#1 WSJ891102-0187
-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
(mcdermott) 0 0 (international inc.) 0 0 (babcock) 0 0 (wilcox unit) 0 0 (sale) 0 0 (bailey control operation) 0 0 (finmeccanica s.p.a.) 0 0 ($295) 0 0 (finmeccanica) 0 0 (italian state) 0 0 (owned holding company) 0 0 (interest) 0 0 (mechanical engineering industry) 0 0 (bailey control) 0 0 (wickliffe) 0 0 (ohio) 0 0 (computerized industrial control system) 0 0 (employ) 0 0 (people) 0 0 (annual revenue) 0 0 ($370) 0 0
Figure 8: Sample of Data: Document After NP Extraction

A sample of a document after morphological analysis is given in Figure 7. A sample of the same document after simplex-NP extraction is given in Figure 8. Note that "owned holding company" and "$295" or "$370" are treated as NPs along with legitimate phrases like "computerized industrial control system". While CLARIT does have facilities to discover and eliminate inappropriate participles (such as "owned" in isolation) and can recognize nonce adjectives, such as "state-owned", such processing was not employed in the TREC tasks. Hence, the correct expression, "Italian state-owned holding company", was not found or used in this case. In addition, as noted previously, the CLARIT-TREC system did not 'tokenize' company names or dates or other 'regular-expression'-like phrases; there was no time in our schedule for such processing.

All NLP (and other) processing steps were piped through the system; intermediate files were not retained. The parsed representation of all the texts took up approximately 98% of the space occupied by the original text. Intermediate (but unretained) files generated in CLARIT processing included a file of the words in each text, in their original order, annotated with morphological categories. Other files contained the output of the parser as a list of NPs in the order in which they occurred in each text. The parsed representation of the text was retained and used at all subsequent steps of processing. Indeed, hereafter, unless otherwise specified, any reference to a document or collection of documents refers to the CLARIT representation of the text, viz., a sequence of normalized simplex NPs.¹⁵
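CLARIT's parser relies on its lexicon and heuristic grammar; as a rough stand-in (our own simplification, with a tag set borrowed loosely from Figure 7, not CLARIT's grammar), the following Python sketch chunks maximal runs of adjective and noun tags from morphologically normalized, part-of-speech-tagged words into 'simplex' NPs.

    def simplex_nps(tagged_words):
        # tagged_words: list of (normalized_word, tag) pairs from morphological
        # analysis. A 'simplex' NP is approximated as a maximal run of adjective
        # and noun tags, with no embedded clauses or prepositional phrases.
        np_tags = {"adj", "sn", "pn", "na"}      # assumed noun-phrase-internal tags
        nps, current = [], []
        for word, tag in tagged_words:
            if tag in np_tags:
                current.append(word)
            else:
                if current:
                    nps.append(" ".join(current))
                current = []
        if current:
            nps.append(" ".join(current))
        return nps

    # simplex_nps([("computerized", "adj"), ("industrial", "adj"),
    #              ("control", "sn"), ("system", "sn"), ("\\.", "*period*")])
    # -> ["computerized industrial control system"]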
4.3 Identifying Terms from Topics

All fields of topic statements, such as the one given in Figure 9, were similarly processed for NPs. Team members reviewed the NPs and assigned weights of "1", "2", or "3" to each NP according to whether the term was central or peripheral to the topic. (Some extracted NPs were discarded as irrelevant or ill-formed; the vast majority were retained.) A sample set of weighted terms for the topic in Figure 9 is given in Figure 10. The manual review and weighting of terms from the topic statement took less than 5 minutes per topic. All subsequent processing of the query was performed automatically.

4.4 Establishing Sets of 'Relevant' Documents

Given the need to 'evoke' candidate documents and to 'partition' the database into subsets that were easier to manage, we were naturally interested in identifying features in the topics that would be useful as discriminators. We had little confidence, however, that the specific terms in topics, which constitute the "source query", were either most representative of the domain of the topic (= the 'satisfaction class') or reasonably comprehensive. We thus decided to supplement the source query with additional terms.

In particular, we used the CLARIT thesaurus-discovery technique on known relevant documents to identify terminology that might be better representative of the satisfaction-class documents than the source query alone. The process produced a list of terms from the available topic-relevant documents (or from a small sample of relevant documents that we may have found) and automatically nominated the top-ranked terms (approximately 20%) to supplement the original query (as derived from the topic statement) to produce a "routing/partitioning thesaurus" for the topic.

Since the routing topics already had accompanying relevant documents, we used these as a source of additional terminology. Ad-hoc queries, on the other hand, had no associated relevant documents, so we designed a preliminary, partial 'retrieval' step that would help us find candidate relevant documents. In practice, this required a partitioning of a sample of data and a review of the returned top-ranked documents. This phase of processing is illustrated in Figure 11.

¹⁵From the point of view of the CLARIT system, the information in a document is entirely represented by the extracted noun phrases.

<top>
<head> Tipster Topic Description
<num> Number: 057
<dom> Domain: U.S. Economics
<title> Topic: MCI
<desc> Description:
Document will discuss how MCI has been doing since the Bell System breakup.
<narr> Narrative:
A relevant document will discuss the financial health of MCI Communications Corp. since the breakup of the Bell System (AT&T and the seven regional Baby Bells) in January 1984. The status indicated may not necessarily be a direct or indirect result of the breakup of the system and ensuing regulation and deregulation of Ma Bell or of the restrictions placed upon the seven Bells; it may result from any number of factors, such as advances in telecommunications technology, MCI initiative, etc. MCI's financial health may be reported directly: a broad statement about its earnings or cash flow, or a report containing financial data such as a quarterly report; or it
may be reflected by one or more of the following: credit ratings, share of customers, volume growth, cuts in capital spending, $$ figure net loss, pre-tax charge, analysts' or MCI's own forecast about how well they will be doing, or MCI's response to price cuts that AT&T makes at its own initiative or under orders from the Federal Communications Commission (FCC), such as price reductions, layoffs of employees out of a perceived need to cut costs, etc. Daily OTC trading stock market and monthly short interest reports are NOT relevant; the inventory must be longer term, at least quarterly.
<con> Concept(s):
1. MCI Communications Corp.
2. Bell System breakup
3. Federal Communications Commission, FCC
4. regulation, deregulation
5. profits, revenue, net income, net loss, write-downs
6. NOT daily OTC trading, NOT monthly short interest
<fac> Factor(s):
Time: after January 1984
</fac>
<def> Definition(s):
</top>
Figure 9: Sample of Data: Topic 57

057
2 (bell system breakup) 0 0
2 (capital spending) 0 0
2 (cash flow) 0 0
2 (credit rating) 0 0
2 (customer) 0 0
1 (ma bell) 0 0
3 (mci communication corporation) 0 0
3 (mci financial health) 0 0
1 (mci initiative) 0 0
2 (mci) 0 0
2 (net income) 0 0
2 (net loss) 0 0
1 (order) 0 0
2 (pre tax charge) 0 0
2 (price cut) 0 0
2 (price reduction) 0 0
2 (profit) 0 0
2 (quarterly report) 0 0
1 (regional baby bell) 0 0
1 (telecommunication technology) 0 0
Figure 10: Sample of Data: Hand-Weighted Term Set for Topic 57

Step 4a: Input: Parsed-Doc, Weighted-Terms-Topic; Process: Feature Scoring; Output: Scored-Doc-Topic
Step 4b: Input: Scored-Doc-Topic; Process: Ranking; Output: Top-2000 Scored-Doc(s)-Topic
Step 4c: Input: Top 50 of the 2000 Scored-Doc(s)-Topic; Process: Hand Filter (= Review Top Docs); Output: 5-10 Rel-Doc(s)-Topic (the "Relevance-Feedback" step in ad-hoc cases)
Figure 11: Schematic Representation of Processing When 'Relevant' Documents Are Not Given

As shown in Figure 11, Step 4a, the weighted, relevant terms were taken as a query vector representing a subset of positive instances of concepts in the equivalence class of the topic. In the case of ad-hoc querying, the query vector was used to identify a sample of 50 candidate documents from a subset of the corpus, which were reviewed in rank order by team members until 5-10 'true' relevant documents were identified (Step 4c). This can be regarded as a 'relevance-feedback' step in the querying process. In the case of routing, the sample of 'true' relevants provided by the TREC organizers was accepted as valid and no review was performed.

4.5 Using Relevant Documents to Create 'Partitioning Thesauri'

As indicated in Figure 12, Step 4d, the 'authoritative' set of relevant documents was processed with CLARIT 'thesaurus-discovery' modules to produce a set of terms that (arguably) bear some relation to the topic. We refer to the output of this process as a "pseudo-thesaurus". The actual routing/partitioning thesaurus was generated by CLARIT by combining the set of weighted terms for the topic with the pseudo-thesaurus, as shown in Step 5 of Figure 12. Note that partial noun phrases, derived from pseudo-thesaurus entries and attested in the documents, were also added to the routing/partitioning thesaurus with a partial score.
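The merge in Step 5 can be pictured with the following Python sketch (a minimal rendering under our own assumptions about data shapes and default weights, not the CLARIT code): hand-weighted topic terms and pseudo-thesaurus terms are pooled, and sub-phrases of whole terms are added with a reduced, partial score, echoing the whole-term flag and weights shown in Figure 13.

    def build_partitioning_thesaurus(topic_id, hand_weighted_terms, pseudo_thesaurus,
                                     default_weight=1.0, partial_weight=0.5):
        # hand_weighted_terms: term -> hand weight (1, 2, or 3) from the topic.
        # pseudo_thesaurus: terms nominated by thesaurus discovery over relevant docs.
        # Returns term -> (topic_id, whole_flag, weight), as in Figure 13.
        thesaurus = {}
        for term, weight in hand_weighted_terms.items():
            thesaurus[term] = (topic_id, 1, float(weight))
        for term in pseudo_thesaurus:
            thesaurus.setdefault(term, (topic_id, 1, default_weight))
        # Add sub-phrases of whole terms with a partial score; CLARIT restricted
        # these to sub-phrases actually attested in the documents (not checked here).
        for term in list(thesaurus):
            words = term.split()
            for i in range(len(words)):
                for j in range(i + 1, len(words) + 1):
                    sub = " ".join(words[i:j])
                    if sub != term:
                        thesaurus.setdefault(sub, (topic_id, 0, partial_weight))
        return thesaurus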
As illustrated in Figure 13 (and Figure 14), the partitioning thesaurus itself is a list of terms, where each term has an associated vector of information specifying its importance in any number of topics. In the case illustrated for Topic 57, for example, the term "bell system breakup" has the triple "<057 1 2.0>" associated with it. The "057" indicates that the term is relevant to Topic 57; the "1" indicates that the term is a full term (not an attested sub-phrase of a term); and the "2.0" gives the term's relative weight or importance (in this case, reflecting the score that was assigned by hand).

4.6 'Feature Scoring' to Partition Documents

Figure 14 gives a portion of the composite or 'super thesaurus' for all 100 topics.

Step 4d: Input: Rel-Docs-Topic; Process: Thesaurus Discovery; Output: Pseudo-Thes-Topic
Step 5: Input: Pseudo-Thes-Topic, Weighted-Terms-Topic; Process: Merge; Output: Part-Thes-Topic
Step 6: Input: Parsed-Doc, Part-Thes-Topic; Process: Feature Scoring; Output: Scored-Doc-Topic
Step 7: Input: Scored-Doc-Topic; Process: Ranking; Output: Top-2000 Scored-Doc(s)-Topic
Figure 12: Schematic Representation of Processing When 'Relevant' Documents Are Available

advance | <057 1 1.0>
at&t | <057 1 1.0>
bell system breakup | <057 1 2.0>
bell system | <057 1 1.0>
bell | <057 1 1.0>
breakup | <057 1 1.0>
broad statement | <057 1 1.0>
capital spending | <057 1 2.0>
cash flow | <057 1 2.0>
credit rating | <057 1 2.0>
customer | <057 1 2.0>
cut cost | <057 1 1.0>
cut | <057 1 1.0>
deregulation | <057 1 1.0>
direct indirect result | <057 1 1.0>
ma bell | <057 1 1.0>
mci communication corporation | <057 1 3.0>
mci financial health | <057 1 3.0>
mci initiative | <057 1 1.0>
mci | <057 1 2.0>
net income | <057 1 2.0>
net loss | <057 1 2.0>
order | <057 1 1.0>
telecommunication technology | <057 1 1.0>
united states economics | <057 1 1.0>
volume growth | <057 1 2.0>
Figure 13: Sample of Data: 1,201-Term Partitioning Thesaurus for Topic 57

advance | <057 1 1.0> <065 1 0.30> <075 1 0.30> <076 1 0.30>
american telephone | <057 1 0.30>
analyst | <054 1 0.30> <055 1 0.50> <057 1 0.50> <074 1 0.50> <080 1 0.50> <082 1 0.50> <088 1 0.50>
announcement | <057 1 0.30>
at&t cut | <057 1 0.50>
at&t price | <057 1 0.50>
at&t | <057 1 1.0>
bell system breakup | <057 1 2.0>
bell system | <057 1 1.0>
bell | <057 1 1.0>
benefit | <057 1 0.30> <060 1 0.30> <073 1 0.50> <074 1 0.50> <075 1 0.50> <088 1 0.30> <099 1 0.30>
breakup | <057 1 1.0>
broad statement | <057 1 1.0>
business customer | <057 1 0.50>
capital spending | <057 1 2.0>
jack grubman | <057 1 0.50>
late price cut | <057 1 0.30>
layoff | <057 1 1.0>
least quarterly | <057 1 2.0>
local phone company | <057 1 0.50>
local telephone company | <057 1 0.30>
long distance carrier | <057 1 0.30>
long distance telephone rate | <057 1 0.30>
ma bell | <057 1 1.0>
margin | <057 1 0.30>
market share | <057 1 0.30> <089 1 0.30>
mci communication corporation | <057 1 3.0>
mci communication | <057 1 0.50>
mci earning | <057 1 0.50>
mci executive | <057 1 0.50>
mci financial health | <057 1 3.0>
mci initiative | <057 1 1.0>
mci move | <057 1 0.50>
mci official | <057 1 0.50>
mci price | <057 1 0.50>
mci spokesman | <057 1 0.50>
mci | <057 1 2.0>
result | <057 1 1.0> <070 1 1.0> <081 1 2.0>
revenue | <057 1 1.0> <081 1 2.0> <053 1 0.30> <054 1 0.30> <093 1 0.30>
rising cost | <057 1 0.30>
telecommunication technology | <057 1 1.0>
telegraph company | <057 1 0.30>
telegraph | <057 1 0.30>
united states economics | <057 1 1.0> <072 1 1.0>
united telecommunication inc. | <057 1 0.50>
united telecommunication | <057 1 0.50>
washington based mci | <057 1 0.50>
washington based telecommunication concern | <057 1 0.50>
william e conway jr. | <057 1 0.50>
Figure 14: Sample of Data: 15,287-Term 'Super' Partitioning Thesaurus

Data from Partitioning Thesaurus:
Thes_Weight(term): Real-number weight assigned to term in the partitioning thesaurus
Thes_Whole(term): Boolean value indicating that term is a whole term in the thesaurus (1) or an attested sub-term of a whole term (0)
Data from Document Text:
Tot_Terms: Number of terms in a document
Num_Terms: Number of unique terms (or sub-terms) found in the document that match terms (or sub-terms) in the partitioning thesaurus
Term_Freq(term): Frequency of term in the document
Term_Length(term): Number of words in term
Text_Whole(term): Boolean value indicating that term is a whole term in the text (1) or an attested sub-term of a whole term (0)
Figure 15: Feature Matching Score (Partitioning) Input Data

Doc_Score = [ Sum_{i=1..Num_Terms} Term_Score(term_i) / ln(Tot_Terms + 1.72) ] x [ ( Sum_{i=1..Num_Terms} Term_Freq(term_i) )^2 / (Tot_Terms + ...) ]

Term_Score(term) =
  Raw_Score(term)        if (Thes_Whole(term) = 1) and (Text_Whole(term) = 1)
  Raw_Score(term) / 4.0  if (Thes_Whole(term) = 1) and (Text_Whole(term) = 0)
  Raw_Score(term) / 8.0  if (Thes_Whole(term) = 0) and (Text_Whole(term) = 1)
  0.0                    if (Thes_Whole(term) = 0) and (Text_Whole(term) = 0)

Raw_Score(term) = 2^(min(Term_Length(term), 3) - 1) x Thes_Weight(term) x Term_Freq(term)

Figure 16: Formula for Scoring Documents in 'Partitioning'

Doc_Score = [Doc-Length Factor] x [Hit-Opportunity Factor] x Sum( term_weight x term_freq x [Phrasal Factor] x [Term-Status Factor] )
Figure 17: Schematization of the 'Partitioning' Formula

ZF09-435-245 7.720000
WSJ870123-0031 6.470000
FR89214-0026 6.310000
WSJ870519-0094 6.130000
WSJ900912-0046 5.830000
WSJ870305-0055 5.360000
ZF07-783-164 5.310000
ZF07-189-244 5.100000
ZF07-443-642 4.980000
AP881122-0107 4.060000
WSJ911018-0122 4.060000
ZF07-971-724 4.050000
ZF07-251-245 3.930000
WSJ870421-0065 3.780000
ZF08-084-048 3.740000
ZF07-621-948 3.670000
WSJ911030-0170 3.610000
ZF09-584-807 3.570000
ZF07-294-735 3.420000
ZF09-526-239 3.390000
ZF07-789-516 3.330000
ZF07-218-520 3.300000
ZF07-800-964 3.300000
ZF07-495-528 3.280000
WSJ900629-0110 3.210000
ZF09-559-173 3.200000
ZF07-118-812 3.170000
WSJ871030-0149 3.160000
ZF07-878-828 3.120000
WSJ870309-0110 3.070000
AP880419-0280 3.050000
Figure 18: Sample of Data: First-Pass Partitioning Results for Topic 57

Each TREC document was 'scored' against the super thesaurus in a single pass (Step 6 in Figure 12): effectively, each document was scored against the routing/partitioning thesaurus for each topic in parallel. In particular, every NP in each document was matched against the NPs (terms) in the routing thesaurus; partial matches were allowed. The definitions in Figure 15 and the formula in Figure 16 (given schematically in Figure 17) were used to yield a composite score for the document based on the number of exact and partial hits as a function of document length and term 'value'. The routing/partitioning thesaurus was used to score the full database, yielding a ranking of all documents relative to all topics simultaneously. As shown in Step 7 of Figure 12, the top 2,000 documents for each topic were retained as the partition for the topic for the next stage of processing. Figure 18 gives sample results of the rankings of documents based on feature scoring for Topic 57.
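Rendered as code, the partitioning score might look like the following Python sketch (our own reading of Figures 15-17; the Term_Score cases and the Raw_Score phrasal factor follow Figure 16, but the hit-opportunity denominator is not fully recoverable, so the (Tot_Terms + 1) used below is an assumption).

    import math

    def raw_score(term, thes_weight, term_freq):
        # Raw_Score (Figure 16): the phrasal factor doubles with term length, capped at 3 words.
        phrasal = 2 ** (min(len(term.split()), 3) - 1)
        return phrasal * thes_weight * term_freq

    def term_score(term, thes_weight, thes_whole, text_whole, term_freq):
        # Term_Score cases (Figure 16): full credit only for whole-term/whole-term matches.
        r = raw_score(term, thes_weight, term_freq)
        if thes_whole and text_whole:
            return r
        if thes_whole and not text_whole:
            return r / 4.0
        if text_whole:
            return r / 8.0
        return 0.0

    def doc_score(matches, tot_terms):
        # matches: list of (term, thes_weight, thes_whole, text_whole, term_freq)
        # for the unique thesaurus terms or sub-terms found in the document.
        if not matches:
            return 0.0
        summed = sum(term_score(*m) for m in matches)
        hits = sum(m[4] for m in matches)                    # total term-frequency 'hits'
        length_factor = 1.0 / math.log(tot_terms + 1.72)     # from Figure 16
        opportunity_factor = (hits ** 2) / (tot_terms + 1)   # denominator assumed
        return summed * length_factor * opportunity_factor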
Figure 19 shows the set of 'true' relevants chosen by manual review of the top 10-50 ranked documents.

WSJ861204-0059  WSJ870123-0031  ZF08-695-706  WSJ870305-0055  WSJ870519-0094
WSJ871030-0149  ZF08-096-680  WSJ861208-0024  ZF08-318-964  ZF08-338-122
Figure 19: Sample of Data: Hand-Selected 10 Relevants for Topic 57

4.7 Final 'Querying'

Figure 20 gives the final steps in the process. There were two essential phases in querying at this point: building the final query vector and querying a partition of the database to retrieve the final set of relevant documents. Note that the query was weighted based on statistics extracted from a partition of the database. For the ad-hoc queries, the partition used for the statistics was the same as the partition actually being queried. For the routing queries, however, the final query vector was fixed before processing the new text (i.e., the second set of TREC documents). In particular, in this case, the partition used to weight the routing-query vector was extracted from the training corpus (the first set of TREC documents); this vector was then queried against a partition extracted from the new, test corpus.

Step 8: Input: Source-Query, 2000 Docs; Process: IDF x TF Indexing; Output: Indexed Docs (Words Broken Out)
Step 9: Input: Rel-Docs (possibly empty), 2000 Docs; Process: Intersect; Output: Query-Sup-Docs
Step 10: Input: Source-Query, Query-Sup-Docs; Process: Merge/Calibrate; Output: Final Query Vector
Step 11: Input: Indexed-Docs, Final Query Vector; Process: Vector-Space Ranking; Output: Top 200 Docs
Figure 20: Schematic Representation of Final Steps in Processing

The NPs and their contained words among the documents in each partition were scored for distribution and frequency; each NP/term- and word-type was given an IDF-TF score. As noted above, for routing queries, the IDF-TF score was based on statistics from the original partition of 2,000 documents from the training corpus; it was a static query vector. For the ad-hoc queries, on the other hand, the final partition of 2,000 documents was used as the source of statistics for the IDF-TF scoring. Therefore, the scores for terms in the query vector for the ad-hoc queries could vary depending on the set of documents selected in the partitioning process.

Figure 21 gives a sample of a final query. The terms in each topic's routing/partitioning thesaurus were given IDF-TF scores based on the sample; original-query terms were added, and the factors of those terms ("1", "2", or "3") were used to multiply their IDF-TF-based scores; the combined terms and their contained words thus formed an extended-query vector (the final query vector).

58.274554 (mci) 0 0
54.311172 (mci communication corporation) 0 0
27.003252 (customer) 0 0
22.711744 (price cut) 0 0
20.039764 (sprint) 0 0
19.527712 (price reduction) 0 0
19.383471 (at&t) 0 0
15.882106 (pre tax charge) 0 0
15.090805 (make) 0 0
14.682230 (industrywide price cut) 0 0
14.592388 (price) 0 0
13.906499 (communication) 0 0
13.878939 (distance) 0 0
13.527033 (industrywide) 0 0
12.290086 (capital spending) 0 0
12.258577 (response) 0 0
11.278964 (cash flow) 0 0
10.960371 (telecommunication) 0 0
10.681927 (conway) 0 0
10.480049 (william e conway jr.) 0 0
1.260289 (product) 0 0
1.134040 (york) 0 0
1.133174 (computer) 0 0
1.121569 (reported) 0 0
1.121431 (gain) 0 0
Figure 21: Sample of Data: Final Query for Topic 57
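The construction of the final query vector can be sketched as follows (a self-contained Python illustration under our own assumptions; idf_tf is a simplified stand-in for CLARIT's IDF-TF scoring over the 2,000-document partition): thesaurus terms receive IDF-TF scores, original topic terms have their scores multiplied by the hand-assigned factor (1, 2, or 3), and contained words are broken out as additional dimensions.

    import math

    def final_query_vector(thesaurus_terms, source_query, doc_freq, n_docs=2000):
        # thesaurus_terms: terms from the topic's routing/partitioning thesaurus.
        # source_query: term -> hand factor (1, 2, or 3) from the topic statement.
        # doc_freq: term/word -> document frequency within the 2,000-document partition.
        def idf_tf(unit, tf=1):
            idf = math.log(n_docs / (1 + doc_freq.get(unit, 0)))
            return tf * max(idf, 0.0)

        vector = {}
        for term in thesaurus_terms:                       # partition-based IDF-TF scores
            vector[term] = vector.get(term, 0.0) + idf_tf(term)
        for term, factor in source_query.items():          # original-query terms, boosted
            vector[term] = vector.get(term, 0.0) + factor * idf_tf(term)
        for term in list(vector):                          # break out the contained words
            for word in term.split():
                if word != term:
                    vector[word] = vector.get(word, 0.0) + idf_tf(word)
        return vector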
The 2,000 documents for each topic were then modeled in vector space (in which all terms and their contained words formed the dimensions), and the final query vector was used to identify and rank the 200 'best' documents, which constituted our results.

4.8 Summary of the Process

Figures 22 and 23 summarize the CLARIT-TREC processes described in detail in the preceding sections. As noted previously, there were only two steps in the CLARIT-TREC process that required non-automatic processing: (1) initial review and weighting of the index terms automatically nominated and derived for the topic and (2) in the case of ad-hoc queries, review of first-pass retrieved documents to identify 5-10 relevant ones for use in creating a pseudo-thesaurus for further processing.

5 Results and Evaluation

This section presents the CLARIT-TREC results in several forms, including broad overviews of the performance, the "official" results tables, and tables of data that focus on statistics that are especially relevant to the CLARIT-TREC approach. Results are presented with only abbreviated explanations.¹⁶

As noted previously, the CLARIT team submitted both intermediate results ("A") and final results ("B"). The intermediate results were generated by taking the highest-scoring 200 (out of 2,000) documents as determined by the routing/partitioning process. Since the strategy of routing/partitioning was to nominate a moderately large candidate subset of documents in which all the true relevants would be found, and since the procedure and scoring were designed to over-generate candidates, we expected to have many 'false positives' in each set of 2,000. We had no reason to expect that the relative ranking of these documents by their evoking routing/partitioning scores would be a good measure of fit to the source topic. By contrast, we expected the final steps (which utilize subset-specific term scoring and vector-space similarity measures) to induce a relative ranking of documents that would represent a good fit to the source topic.

¹⁶More detailed analysis of the results is given in [Evans et al., in preparation].

[Figure 22: CLARIT "A" (Evoking). Schematic showing: topic and corpus; hand weighting yielding the source query; thesaurus discovery over relevant documents yielding a pseudo-thesaurus; merger into the partitioning thesaurus; feature scoring (with hand review of top documents where relevant documents were not supplied) yielding the 2,000-document partition and its top 200 documents, the CLARIT "A" results.]

[Figure 23: CLARIT "B". Schematic showing: IDF-TF scoring of the source query over the 2,000 scored documents; intersection of relevant documents with the partition yielding query-supplementing documents; merger into the final query vector; vector-space ranking yielding the top 200 documents, the CLARIT "B" results.]