Knowledge-Based Searching with TOPIC® John W. Lehman, Clifford A. Reid, et al. Verity, Inc. 1550 Plymouth Street, Mountain View, CA 94043 (415) 960-7620 / jlehman@verity.com 1. OBJECTIVE OF VERITY'S TREC-2 EXPERIMENTS Verity, Inc. is the first major commercial product participant in TREC. Verity's product is TOPIC®. Verity participated in TREC-2 as a Category A site. This participation was Verity's first TREC, and we encountered many of the logistical problems other sites faced in their TREC-1 experience. Topic's search users wish to understand the search result quality to expect in their personal searches on their (large) collections. Verity also expects to obtain insights for future product improvements. Topic is a mature commercial-off-the-shelf manual text search program combining the results of human expertise with a powerful search expression language and fast search algorithms. Topic installations use manually or semi-automatically developed libraries of searches (topics), which are instances of the search expression language and which are supplied to all users. Verity begins its TREC experiments by gathering "ground truth" regarding unaided ad hoc end-user search result quality. Future experiments will incorporate predefined searches (topics) and other Topic search aids to determine their level of improvement/impact on search result quality. 2. TOPIC SEARCH APPROACH The Topic philosophy: domain knowledge, both descriptive and content-based, expressed in constructs specifically designed to discriminate between full-text materials, is the only way to consistently obtain high recall/precision on large heterogeneous collections. Search result quality may be enhanced by employing collection-specific statistics to locate additional domain-relevant terminology. Searches are repeated, and subject-matter expertise is a scarce resource.
The problem that Topic addresses is the effective use of a human's time in analyzing search results: locating the preponderance of relevant details in the fewest possible documents, and therefore the smallest possible elapsed time. 2.1 TOPIC KNOWLEDGE REPRESENTATION The Topic product employs several approaches to individual term search, organized by a rule-based, or concept-based, approach to search term aggregation. In Topic, the search focus is the topic (concept, notion, idea, or subject), and the topic is the user-specified "smart" description of all of the evidence "about" or "of" the topic as it (the evidence) would be found in text documents. 2.1.1 TOPIC INDICES The Topic product line catalogs and indexes both fielded (structured) data and full text. Topic automatically extracts structured data (such as title, author, etc.) into searchable fields, using a lexical analyzer. Fielded data is searchable separately or in combination with full text. Indexes on the full text are (for all non-stopped characters and strings): -word/string -stemmed word (morphological variant) -soundex (phonetic spelling variety) -statistically correlated terms (called the suggestion index) -typographical error index -thesaurus -wildcard (universal character/group expansion) An index on all values (choices) for fielded data is also produced. 2.1.2 TOPIC SEARCH RULES Search rules consist of relational comparisons to field values and exact or fuzzy matches on full-text search terms, aggregated by boolean and evidential reasoning operators with point-value uncertainty at the term level (each piece of evidence has a strength/uncertainty attached to its predictability of its parent concept). Topic provides search rule management functions to support the creation, repeated use, modification, sharing and display of one or more libraries of related search rules. The search rule libraries are themselves searchable, including text annotations of the rules.
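The rule/evidence organization described above can be pictured as a small tree. The following sketch is purely illustrative (the class names and fields are our invention for exposition, not Topic's internals), using a fragment of the LAW rule shown later in Figure 1:

```python
from dataclasses import dataclass, field

@dataclass
class Evidence:
    """Leaf evidence: a text expression plus the strength of its
    relationship to the parent concept (point-value uncertainty)."""
    expression: str
    weight: float = 0.5

@dataclass
class Rule:
    """A named search rule (topic). Referencing the name in a query
    inherits the entire subtree of evidence below it."""
    name: str
    operator: str                       # "accrue", "and", "or", ...
    weight: float = 1.0
    children: list = field(default_factory=list)

# A fragment of the LAW rule of Figure 1: an Accrue of weighted,
# stemmed evidence terms and one nested sub-rule.
law = Rule("law", "accrue", children=[
    Evidence("law", 0.50),
    Evidence("court", 0.50),
    Evidence("suit", 0.50),
    Rule("courtroom-roles", "accrue", 0.50, [
        Evidence("plaintiff", 0.50),
        Evidence("judge", 0.50),
        Evidence("jury", 0.50),
    ]),
])
```

Because rules are named and nestable, a library becomes a forest of such trees, and any node name can serve as a complete query.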
Search rules are interactive queries, automatic queries, and a training mechanism for the installation's domains. A search rule definition may include several thousand pieces of evidence in over one hundred levels of detail. One search rule library may contain twenty thousand rules. Search rules (topics) are named, and a reference to the name in a search expression inherits all lower levels of evidence. Any query which includes a search rule name will automatically receive the full definition of the rule in the search. The lowest level of evidence is the text expression. Search rules may be composed of other named search rules. Search rules appear as an alphabetical list of topic names, an indented outline showing the levels of rules, or a graphical "family tree" display of rules and their parents/children, including evidence combination operators and evidence "weights". Searches may be executed directly from any node (name) in the search rule family. A topic search rule graphic display example appears in Figure 1. The search rule syntax consists of an exact or fuzzy match (pattern match) capability for individual terms (case sensitive); a boolean combination (and (all), or (any), not) of terms; dual-direction, nested, grammatical (paragraph, sentence, phrase) proximity operators; a relative (fuzzy) proximity operator for two or more terms; an evidence aggregation operator (accrue) for both full-text and structured field data; and inexact match techniques as follows: 1. wildcard expressions for term expansion; single character, character group, or character class 2. soundex (first letter common) expressions for phonetic term expansion 3. source language-specific stemming (morphological variants) expressions for term expansion 4. typographical expressions for term expansion (n-character infidelity to search term) 5. multi-direction thesaurus (user-modifiable) for term expansion 6. suggestion (statistical correlation) for term expansion 7.
evidence appearing in a field value, or as the field value (contains, matches, substring, starts, ends). Each of the above inexact match techniques may be executed automatically. Negative evidence may be applied on a term-by-term basis with any operator. The structured field data types are character, number and date. Date arithmetic is provided, as well as relative date expressions such as "yesterday", "today", etc. 2.1.3 SEARCH RESULT RANKING Results of searches are relevance-ranked lists of documents, with displayed titles or other descriptive information. The numeric score, and the accompanying rank, are the result of a best-fit comparison of the full-text document and descriptor content against the search rule evidence. The ranking is subject to an optional threshold, used primarily to limit output, but the threshold may also be used to describe search recall and precision. The relevance threshold is always used in dissemination/notification. Evidence consists of terms, operators (syntax) and the numeric strength of the relationship between the evidence and its (next higher level) search rule. The evidence may be aggregated or evaluated with boolean operators. Aggregation involves giving relevance score credit for each piece of evidence found (breadth of evidence first). As each level is evaluated in a search rule (tree), potential document score modification occurs (since successive levels may be weighted evidence for their next broader concept). The scoring of an individual term may include a frequency-of-occurrence factor (a normalized concentration factor), a less powerful scoring factor than the absolute presence of the evidence in the document. A document score explanation function is included. 2.2 AGGREGATE SEARCH FUNCTIONS Searches may iterate on the results of the previous search. Any search may be named/saved along with its results manipulation criteria (sorting by fields, grouping) for later execution.
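One plausible reading of the level-by-level scoring described in 2.1.3 - boolean AND bounded by its weakest weighted conjunct, accrue giving credit for each piece of evidence found - can be sketched as a tiny recursive evaluator. This is our hypothetical interpretation for illustration only, not Verity's documented scoring formula:

```python
def score(node, doc_terms):
    """Evaluate a search rule tree against a set of document terms.
    A node is ("term", weight, word) or (operator, weight, children)."""
    op, weight, rest = node
    if op == "term":
        return weight if rest in doc_terms else 0.0
    child = [score(k, doc_terms) for k in rest]
    if op == "and":              # all evidence required
        s = min(child)
    elif op == "or":             # best single piece of evidence
        s = max(child)
    else:                        # "accrue": credit for each piece found,
        miss = 1.0               # combined here as a probabilistic sum
        for c in child:
            miss *= 1.0 - c
        s = 1.0 - miss
    return weight * s

rule = ("accrue", 1.0, [("term", 0.5, "law"),
                        ("term", 0.5, "court"),
                        ("term", 0.5, "suit")])
doc = {"the", "court", "heard", "the", "suit"}
print(round(score(rule, doc), 2))    # two of three found: 0.75
```

A document matching more pieces of evidence accrues a higher score, while a missing conjunct under "and" pulls the score to zero - consistent with the breadth-of-evidence-first behavior described above.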
Any search criteria may be interactively defined as a logical view of the collection, which then provides many alternative search universes for the user population. All Topic activities are audited. A search which supports discretionary access control may be transparently appended to any user's search. [Figure 1: TOPIC search rule and result display - a LAW topic built as an Accrue of weighted (0.50), stemmed evidence terms (law, court, suit, plaintiff, regulations, justice, lawyer, courtroom, judge, jury), shown with its relevance-ranked result list (66 of 202 documents retrieved; top scores 0.97 down to 0.88).] 2.3 USER INTERFACE TO SEARCH Every search is automatically configured into a rule. The simplest search is a list of terms, which may be entered at the keyboard, selected from displayed document content, or selected from lists of terms. This list is automatically enhanced by term expansion, expansion to existing named rules whenever the rule name appears in the search expression, and evidence aggregation. Searches involving structured fields are generally addressed by a form interface, which aggregates field and full-text content. Any list of terms, rule names, or extensions such as thesaurus/soundex may be used to initiate a search or add to a search expression.
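Of the expansion aids above, soundex is the most self-contained to illustrate. The classic (American) Soundex algorithm keys a term by its first letter plus three digits of consonant class, so phonetic spelling variants collide on the same key; Topic's own variant may differ in details:

```python
def soundex(word):
    """Classic Soundex: keep the first letter, encode the rest as
    digits by consonant class, collapse adjacent repeats, drop
    vowels, and pad/truncate to four characters."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    out = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":       # h/w do not reset the previous code
            prev = code
    return (out + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))   # both encode to R163
```

An index keyed on these codes lets a query term retrieve misspelled or variant forms without the user anticipating them.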
2.4 SEARCH RESULT ANALYSIS AIDS The Topic philosophy of minimizing the elapsed time to obtain the necessary relevant details that constitute an answer or support a decision necessitates analysis aids beyond the search composition and result list display. The Topic result list may be browsed (by page, result number, etc.). A document selected for display produces the full text with all search evidence highlighted (e.g. in reverse video or color). The display may be the native form of the document, which for most of today's collections means a marked-up format with useful user guidance in the markup itself (e.g. sections, paragraph headings, etc.). The user may choose to browse or to move directly to the first/next/previous occurrence of a search term in the document. Similarly, the user may move through the document using various document enhancements such as hypertext links, and may follow hypertext links to other documents, including graphics and other media. Previously generated annotations are available for browsing. Queries or other applications may be linked to document content. A specific search term (not necessarily part of the original search) may be used as a browsing aid to the document. 2.5 SECURITY Users may be prevented from accessing information via operating system permissions and built-in access controls, including discretionary access controls. The product processes have been certified at system high in many installations, and some sponsors have applied for MLS certifications based upon the delivered product. 2.6 DATA ARCHITECTURE / PERFORMANCE / CONFIGURATION Topic enables the logical division of a collection of documents into "partitions", which are document descriptions and indexing data about the arbitrary/intentional subset. Partition size, purpose and characteristics are under the application administrator's control. The raw documents are not "owned" by the Topic application.
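The partition scheme lends itself to incremental ranked retrieval: each partition is searched in turn and a running top-k list is re-ranked as results arrive, which is how a few-second time-to-first-result can be delivered on a large collection. A rough sketch follows (hypothetical code; `search_partition` is a stand-in scorer, not Topic's engine):

```python
import heapq

def search_partition(partition, query):
    """Stand-in for a per-partition search: yields (score, doc_id)
    pairs for one slice of the collection."""
    return [(s, doc) for doc, s in partition.items() if s > 0]

def incremental_search(partitions, query, first_k=10):
    """Yield a re-ranked snapshot of the top list as each partition
    completes, so the first meaningful results appear quickly even
    when searching the whole collection takes minutes."""
    best = []                                # min-heap of (score, doc)
    for part in partitions:
        for score, doc in search_partition(part, query):
            heapq.heappush(best, (score, doc))
            if len(best) > first_k:
                heapq.heappop(best)          # keep only the current top-k
        yield sorted(best, reverse=True)     # ranked snapshot so far

parts = [{"d1": 0.9, "d2": 0.4}, {"d3": 0.97, "d4": 0.0}]
for snapshot in incremental_search(parts, query=None, first_k=3):
    print([doc for _, doc in snapshot])
```

The first snapshot is available after one partition; later partitions can only refine the ranking, never stall the user.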
Topic will produce indices which are approximately 70% of the size of the native text (the TREC-2 index size was approximately 50%). This includes fielded, word, and subject (rule evidence) level indices. The partition data is platform-independent (i.e. the documents and their associated partitions may be moved/accessed from any Topic platform). Searches may be performed on the served desktop, on a host, or both. Normal performance on a personal computer is in the thousands of document-rule nodes per second, up to many tens of thousands of nodes per second on current workstations. The search rule low-level evidence is contained in a size/speed-optimized index (the topics index), which is essential to rapid response on complex rules. This index is automatically modified each time topic evidence is added, so the word positional information is searched only on the first use of the term. The topics index normalizes document size so that all search response times are predictable. Partitions enable incremental (ranked) results, guaranteeing few-second time-to-first-result, regardless of the size of the collection. The response characteristic which Topic optimizes is the time-to-first-meaningful-result. The rule evidence index may be centralized or distributed, and when distributed, it provides the ability to produce a ranked results list with a minimum of network access. Integration with third-party components is available from the end user interface or shared libraries. The program provides logical links between document-image, document-document, document-annotation, and document-search request. Some links may be automatically determined at indexing time (image, cross-reference). The structured field values may be entered interactively, or filled automatically from a lexical analyzer. The program provides an end-user process interface between scanning, OCR/ICR and indexing. 3.
THE TREC EXPERIMENTS 3.1 DATA PREPARATION The TREC-2 text data preparation processing was performed on a Sun SPARC 10 (UNIX 4.1.3). Cataloguing and indexing were performed at the rate of approximately 100 Mbytes per hour. This process included the automatic extraction of 10 fields from the ASCII content. Partitions were set at 8000 documents for all data. There were no processing errors. No markup language (SGML) interpreter was used during data preparation, and the optional alphabetical word list (used only for display) and typographical error index (used almost exclusively for OCR'd data) were not employed. Special indices such as correlated terms and paragraph/sentence positioning were not produced. As the fuzzy proximity operator was used in the tests, only a word position index was produced. No document was divided into logical or arbitrary sections for processing or search result enhancement, although that approach is used in virtually all non-newswire Verity installations. The purpose of logical division (a forerunner of the intelligence available in a standard markup language) is to create domain-specific logical documents, and therefore to reduce the impact of larger, multi-subject documents on results (they would appear in search results simply because of their breadth of words). 3.2 TOPIC CONSTRUCTION Verity personnel manually constructed the search rules from the subject area descriptions and the training data. No rule developer was identified or chosen as a subject matter expert, and for certain of the contributors, this was their first experience using Topic. [Search rule libraries are created by approximately 6% of Topic's user population; the remainder of Topic's users employ the topics developed by others.] On average, the TREC-2 volunteers were considered novices on the Topic product, particularly in the search rule development area.
Volunteers were not encouraged to use specific features of the product, and in at least one case, inadequate communication produced potentially inaccurate search expectations. As search rules were interactively developed, the rule evidence was automatically indexed for repeated use of the rule. The twenty volunteers each produced between 3 and 8 retrospective and routing queries. The range in time spent on individual query development and result production was from fifteen minutes to eight hours, over a several-week period. The average time to produce the TREC-2 result, obtained from interviewing the volunteers, was approximately one hour. 3.3 EXPERIMENT PERFORMANCE Typical response time performance on the searches was two seconds per 8000-document partition, or approximately two minutes to search the entire collection. A single term, indexed as rule evidence, was used to search the entire collection, and the 1.1 million document collection was searched in 21 seconds. For routing queries, the score threshold was set to zero; any document containing evidence entered the routing result list. 3.4 ANALYSIS OF OFFICIAL RESULTS The post hoc analysis of Topic's TREC-2 results generally found that the Topic system performed well. When compared with other manual systems, the scores are amongst the best. In the few cases where Topic appeared to fail, we have generally been able to identify easily correctable deficiencies that, had they been noticed during the experiment proper, would have resulted in superior performance by Topic in TREC-2. Based on our analysis, we believe that the prospects for TREC-3 look very bright. Our analysis of selected results from our TREC-2 submissions focuses mainly on the "failure cases", since these are most likely to give us insights into how to improve Topic's (and users') performance in future TREC experiments.
This also allows us to investigate whether there are any fundamental issues with using Topic to model the information need statements used in TREC. We analyzed two routing and three ad hoc topics in detail. Our summary follows. The following general observations applied to all searches: -Ad hoc searches were submitted against all three disks, which generally produced poorer quality results, as documents from disk three appeared in some search results.[1] -Field value evidence was not used, and in some domains/subject areas, domain knowledge about the sources of information would favor (rank higher) sources with the appropriate use of terminology (e.g. business sources about financial performance, or foreign datelines having a higher likelihood of describing prominent foreign persons/activity, as in topics 66 or 121). -The queries which attempted to use nomenclature with hyphens (e.g. M-1) failed to return an exact match, as the hyphen was not included as an indexed character. -The fuzzy proximity (near) operator was undocumented; only one volunteer used it, and other users expected sentence/paragraph proximity in their searches. The index did not contain sentence/paragraph positional data, and all uses of sentence or paragraph operators produced erroneous results because the search arbitrarily assigned sentence and paragraph boundaries. [1: Reprocessing the ad hoc searches against only disks 1 and 2 produced a numeric result improvement of 0-70 percent, with a few changes from under the median to over the median.] 3.4.1 ROUTING TOPICS Overall, Topic's performance on the routing topics was rather good. We count that 21 of the 50 results were at or above median, and three were actually the best score. Most of the other results were on the low side of the median. The comparison to the median is summarized in Figure 2. The exceptions were topics 66, 67, 69, 74, 90 and 91, for which the Topic search used could be said to have failed.
Several of these were straightforwardly explained. For example, in the case of topic 67 the wrong results were submitted. Our independent scoring of the correct result set would give the Topic search a below-median score. For topic 69 there was in fact only one relevant document, but, at least in our reading of the definition, this seems to be a false positive. In the case of topics 90 and 91 the Topic search definitions were, in our opinion, over-constrained. Further, in the case of topic 91 an index creation decision prevented a quite reasonable Topic definition from performing as well as it could.[2] The other two topics are of more interest. No clear pattern emerged between the types of search, although, in the routing augmentation category, the Topic performance was well above the median on 20 of 33 searches. 3.4.1.1 ROUTING TOPIC 66 A relevant document for this topic is one that identifies a type of natural language processing technology that is being developed or marketed in the United States. The original definition of the Topic is basically a conjunction (AND) of a natural language concept and a products/technology concept. Performance was very poor, viz: Relevant = 86 Rel_ret = 1 R-Precision = 0.0000 Inspection of the Topic revealed that one of the conjuncts (the products/technology concept) had a weight of 0.05, thus effectively limiting the scores that Topic could produce to an extremely narrow range. [2: This topic is about the acquisition of advanced weapons by the U.S. Army. One of the weapons systems mentioned in the information need statement is the M-1 tank. This was included in the Topic definition as the word "M-1"; but since the "-" symbol was interpreted like a space at database build time, there was no possibility of retrieving documents based on "M-1" as a word.] We changed the 0.05 to 0.5 and produced the following: Relevant = 86 Rel_ret = 44 R-Precision = 0.2442 which is a median result.
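The damage done by the 0.05 weight is easy to see with a little arithmetic, if one assumes (as the analysis above implies) that a weighted AND can score no higher than its weakest weighted conjunct. The combination function below is our reading for illustration, not a documented formula:

```python
def weighted_and(*weighted_scores):
    # Assumed semantics: each conjunct contributes weight * score,
    # and the conjunction scores no higher than its weakest member.
    return min(w * s for w, s in weighted_scores)

# Required evidence matches fully (score 1.0, weight 1.0) and all
# auxiliary evidence matches fully too -- yet with weight 0.05 the
# document can never score above 0.05:
print(weighted_and((1.0, 1.0), (0.05, 1.0)))   # 0.05
# Raising the auxiliary weight to 0.5 restores a usable range:
print(weighted_and((1.0, 1.0), (0.5, 1.0)))    # 0.5
```

With every matching document compressed into [0, 0.05], rank order within the top 1000 becomes nearly arbitrary, which is consistent with the observed failure.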
We concluded that for Topics to be effective we need to ensure a sufficient range of scores to give us the discrimination needed for the TREC scoring algorithm. 3.4.1.2 ROUTING TOPIC 74 A relevant document for this topic is one that cites an instance in which the U.S. Government propounds two conflicting or opposing policies. The routing task is complicated because this conflict may not necessarily be mentioned in the same document. In our opinion, this is a case where no amount of sophistication in Topic construction would enable Topic to do very well. The information need is simply outside the scope of a retrieval system that uses non-NLP techniques. The best one could hope for is to model a document that talks about the meta-idea of conflict (i.e., find documents that talk about the US having conflicting policies, rather than documents that reference the specific conflicting policy). This is, in fact, what was done in the original submission. The results were: Relevant = 323 Rel_ret = 18 R-Precision = 0.0464 which is, of course, rather poor. The original statement of need actually mentions three examples of conflicting policies so, as an experiment, we ran the following query: /wordtext = "tobacco" /wordtext = "pesticide" /wordtext = "infant" /wordtext = "formula" that is, just an ACCRUE of "tobacco pesticide" and "infant formula" (with the pairs joined by AND operators). This gave the following results: Relevant = 323 Rel_ret = 107 R-Precision = 0.2660 which puts the score slightly above median. We expect that most TREC-2 participant sites probably did just this, and those that did much better than median found some other specific examples of a conflicting policy and modeled these in their routing queries. [Figure 2: TOPIC relevant-retrieved vs. median, routing topics 51-100 (bar chart).] 3.4.1.3 TOPIC 67 Our analysis located weak topic formulation examples, such as query 67, illustrated in Figure 4. In this query, a set of optional, auxiliary evidence was "ANDed" with a small set of required evidence. The weight, or strength, assigned to the auxiliary evidence was .05, which means that if all auxiliary terms were located, the highest possible score for a document would be .05, severely limiting the range of scores and thus admitting random false hits in the top 1000. To make a cosmetic improvement, only the value of the auxiliary evidence node was changed, to a value of .5, as shown in Figure 5. This change alone brought the Topic relevant document count to the median. 3.4.2 AD HOC TOPICS Overall, Verity's performance on the ad hoc topics was adequate. Performance was poorer than on the routing topics, but this is to be expected since there was less time available to build the Topics and no ground truth against which to test the Topic trees. The comparison to the median is summarized in Figure 3. We count that 13 of the 50 results are at or above median. In contrast, though, there were only two outright failures here, topics 124 and 139. We did not look at topic 139, but topic 124 involves searching for documents that discuss innovative approaches to cancer therapy that do not involve any of the traditional treatments.
This is a very hard topic because nearly all mentions of the innovative treatments are in the context of discussion of traditional therapies. The approach adopted by Verity of simply looking for documents that talk about innovative treatment produces a large number of false hits (giving poor precision), and since there is an artificial cut-off at 1000 documents in the TREC experiments, this model also produces poor recall. We do not see an obvious solution to this. We picked three ad hoc topics to analyze in detail. 3.4.2.1 AD HOC TOPIC 109 A relevant document for this topic simply needs to mention one of a list of six companies given in the information need statement. A simple Topic that is the disjunction (OR) of the company names should be all that is needed here. However, the official result is: Relevant = 742 Rel_ret = 192 R-Precision = 0.2588 which is well below median. Furthermore, given the simplicity of the topic, this is surprisingly low recall. Examination of the official Topic showed that company acronyms used for three of the companies (i.e., 3M, OTC, ISI) were given equal weight to the fully spelled-out company names. A cursory review of the original hit list showed that ISI was a poor choice since it has multiple interpretations. Less important, but for the same reason, OTC is a poor choice in the Wall Street Journal corpus since it can mean "over the counter", and in the DOE corpus 3M is part of a designator for a particular particle accelerator and is also used as an abbreviation for "three meters". We modified the Topic by eliminating the ISI acronym and by giving OTC and 3M reduced weights. This produced the following: Relevant = 742 Rel_ret = 480 R-Precision = 0.5512 which would have been the best score. An interesting note here is that the original and modified Topics had perfect precision and recall for the first 100 documents. Our conclusion is that this indeed was an easy topic - the false hits produced by ISI were what impacted Topic's score.
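The repair applied to topic 109 can be sketched as a weighted disjunction in which ambiguous acronyms are down-weighted rather than dropped. The weights, terms, and OR semantics below are illustrative assumptions, not the values actually used:

```python
def weighted_or(evidence, doc_terms):
    """Assumed OR semantics: the best single weighted match wins."""
    return max((w for term, w in evidence if term in doc_terms), default=0.0)

evidence = [
    ("minnesota mining", 1.0),   # full company names keep full weight
    ("3m", 0.3),                 # ambiguous acronym: reduced weight
    ("otc", 0.3),                # "over the counter" in WSJ text
    # "isi" removed entirely: too many unrelated expansions
]
print(weighted_or(evidence, {"3m", "accelerator"}))       # acronym-only hit
print(weighted_or(evidence, {"minnesota mining", "3m"}))  # full-name hit
```

Documents matching only an ambiguous acronym still surface, but below documents naming the company outright, which is the ranking behavior the modification was after.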
[Figure 3: TOPIC relevant-retrieved vs. median, ad hoc topics (bar chart).] [Figure 4: Example of a poorly specified search (routing query #66) - required "Natural Language" evidence (weight 1.00) ANDed with an Accrue of secondary evidence at weight .05; score = 10 (lowest = 0).] [Figure 5: The poorly specified search after cosmetic repair - the secondary evidence Accrue raised to weight .50; score = 44 (median = 44).] 3.4.2.2 AD HOC TOPIC 121 A relevant document for this topic had to mention the death of a prominent U.S. citizen due to an identified form of cancer. This is an interesting topic consisting of two major components - the idea of a prominent citizen, and the idea of a specific cancer. In the official Topic, prominence was modeled using a number of words that indicate prominence (e.g., "prominent", "celebrity") together with words that indicate prominent roles (e.g., "Nobel Prize", "actor", "actress"). Cancer death was modeled by various combinations of death words (e.g., "death", "died") and cancer words (e.g., "cancer", "tumor", "leukemia"). The official score was: Relevant = 55 Rel_ret = 27 R-Precision = 0.1455 which, while not good in absolute terms, was well above the median. We observed two problems with this definition. First, it uses generic cancer terms rather than the specific cancer types required by the information need statement. So, we made all the cancer terms specific by using a list of common cancers (e.g., lung cancer, breast cancer, stomach cancer, etc.). We made no attempt to make this list exhaustive. This produced the following results: Relevant = 55 Rel_ret = 17 R-Precision = 0.2182 Thus we reduced the recall, but increased the precision.
Presumably by adding more specific cancers (or at least the ones that statistically are most common) we could have improved the recall here. The second problem is more severe, though. It appears impossible to build any kind of model that would allow us to determine, with any kind of confidence, that the person who has died is a US citizen. In our revised results list we find many prominent persons who died of a named cancer but who are not US citizens (e.g., the Venezuelan Ambassador). In addition, the notion of prominence is also hard to capture. Of course, we might argue that anyone whose obituary is on the wire service is prominent by definition! Be that as it may, we observed a number of documents that we did not retrieve because we had not included the specific prominent role indicator in our Topic. Thus we added the following role words - "author", "poet", "writer", "artist", "painter" - to the Topic and got the following results: Relevant = 55 Rel_ret = 33 R-Precision = 0.0909 Thus we improved the recall, but at the expense of the precision again. Notice that we still have not included any business or government roles, which presumably would help retrieve the relevant documents in the WSJ corpus. Our conclusion is that this is a significant challenge for Topic, and all other systems. The citizenship question often cannot be resolved by reference to the text alone, and we see no alternative but to accept the false hits. Prominence is also difficult, but could conceivably be approached by an extensive list of prominence and role words. The specific cancer seems tractable since there are only a finite number of cancers and just a small set of those are common. 3.4.2.3 AD HOC TOPIC 133 A relevant document for this topic must describe some design feature of the Hubble Space Telescope, but must not report the launch activity itself, the Hubble Constant, or Edwin Hubble.
The official Topic was essentially a simple structure of the form: Hubble Space Telescope and not launch and not Edwin Hubble. This gave the following results: Relevant = 80 Rel_ret = 29 R-Precision = 0.3625 which is surprisingly poor given the apparent simplicity of the topic. Analysis of the behavior of the negation function in Topic shows that it is too restrictive, so we eliminated the negated concepts, leaving just the phrase "Hubble Space Telescope". Using this as the query gave: Relevant = 80 Rel_ret = 78 R-Precision = 0.6000 which would have been above median and close to best. Adding as disjuncts (OR) the words "Hubble" and "HST" gave: Relevant = 80 Rel_ret = 79 R-Precision = 0.6000 that is, we retrieved one extra relevant document with no decrease in precision. We conclude that although the information need statement is careful to spell out the cases where a document will be non-relevant, the TREC corpus has few documents where these conditions apply, so a simple query performs very well. This is presumably the approach most sites took. 4. FINAL OBSERVATIONS FROM TREC-2 The TREC-2 topic descriptions, particularly the ad hoc topics, exceed the level of domain knowledge available to most users of heterogeneous document collections. Most Topic (content-based) search operational users are driven by time pressures to locate/summarize the most relevant details in the fewest possible documents. The exhaustive search result analysis implied by examining hundreds of relevant documents will not be performed in most user environments; our experience is that ten to thirty documents is the level of search result analysis performed by a user (unless significant duplication of material occurs earlier, which would reduce the number of documents actually analyzed). Ergonomically, high precision in the first (10, 20...50) documents is more likely to keep users attracted than high recall at much larger counts.
Although we have yet to perform any analysis of duplicate information in the TREC-2 results, our belief is that duplicate data is plentiful in the TREC-2 "relevant lists", and that the reading of duplicate data by the human user will cause the result analysis to be (prematurely) terminated. We are certain that, unless summarization is performed, the relevant search results on most topics are too numerous to warrant user attention. It would seem reasonable to examine, at least for selected topics, whether the first ten/best ten documents address the domain well from a domain "precision/recall" perspective. To the extent that the domain is well served in a few representative documents, the coverage in the representative documents may be a "better" answer for the user than the numerical count of the number relevant in the first 1000. We recommend adding a measurement of the coverage of the domain as the first ten/thirty/n result documents are examined. Appendix A COMPANY AND PRODUCT SUMMARY Topic is a commercial off-the-shelf software product line available from Verity, Inc. Topic search technology is a commercial adaptation of ideas extracted from the research of Tong, McCune et al. in Rule-Based Information Retrieval, which was sponsored by the U.S. Intelligence Community. Topic supports cataloguing, indexing and retrospective search of fixed collections, automatic search of newly indexed documents according to (user) predefined search rules (profiles), and dissemination/notification based upon satisfied search rules. Documents may be batched for indexing/profiling, or processed automatically as they arrive. The Verity, Inc. market presence in content-based text search/retrieval is described in the Delphi, Inc. 1992 Industry Summary. The Verity Topic product line is considered to have in excess of a ten percent share of the market in commercial-off-the-shelf content-based search/retrieval products for personal computer to minicomputer environments.
Verity was founded in April 1988. The Topic product was first licensed and installed by the U.S. Air Force in June 1987. Verity currently has over 650 installations and some 30,000 users. Many thousands of persons have received training from Verity on the Topic products. Approximately one-third of Verity's installed base uses an event-driven or batch automatic-search-notification function. Many organizations use the routing mechanism for users who are unable to compose the (appropriate) queries, but require the expert's result quality. The Topic product line supports nearly twenty varieties of the UNIX operating environment, VMS, OS/2, DOS and Macintosh. The product operates on data stored in the filesystem or in any SQL-based database management system. The product as shipped supports over twenty formats of native data (markup languages), and provides the ability to insert local/third-party markup language interpreters as required. A document in Topic is logical, and may be a file, subfile or any logical decomposition of a physical native document. The Topic end user (search) product is available in MS-Windows, Presentation Manager, X-Windows/Motif, Macintosh, and character (keyboard/terminal) interface styles. There is a 4GL-like command interpreter language for rapid application development and remote command-line interactive index/search. There is an Application Program Interface (C library) to all Topic functions for embedded applications.