A Boolean Approximation Method for Query Construction and Topic Assignment in TREC*

Paul S. Jacobs, George R. Krupka, and Lisa F. Rau
GE Research and Development Center
Schenectady, NY 12301
(518) 387-5059

Abstract

Word-based search is the simplest and most widely-available method for processing and retrieving volumes of information from free text. However, this common method of searching is generally cumbersome and inaccurate. The method described here automatically generates complex Boolean queries for word-based search, so that the same mechanism can be used with higher accuracy and efficiency. This practical approach worked, with good results, on the entire TREC corpus on both the "routing" and "ad hoc retrieval" tasks, with the official averages in the top group for both tasks.

1 Introduction

Full-text search is currently the simplest and most commonly-used method for locating information in large volumes of free text. Because users are accustomed to describing what they are looking for with specific words, and those words are often found in the texts, searching the text for selected words or word combinations is a natural and easy-to-implement method for information retrieval. However, it can be very inaccurate. In some experiments, the percentage of relevant texts retrieved has been shown to be lower than 10%, and a high percentage of material that is retrieved can be irrelevant. Also, it can be very difficult for searchers to compose "queries" that combine the words that are effective in locating relevant material without finding large quantities of irrelevant information as well.

*Special thanks to Michael Charbonneau, Mark Freeman, Geoff Gordon, John Munson, and Uri Zernik, who all participated in the design and implementation of the GE TREC system. This research was sponsored in part by the Defense Advanced Research Projects Agency. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

There are many alternatives to full-text search that can produce much higher accuracy, including statistical methods that weight matches based on relative word frequencies, automatic indexing strategies, and knowledge-based approaches that give very high accuracy for repetitive searches at the cost of a large amount of work in constructing a knowledge base. The ideal search strategy would be one with the accuracy of knowledge-based approaches, but with the simple efficiency of word searches. This is the motivation for the method described here.

The selection of this method for the TREC evaluation combined a sense of the practice of information retrieval with a particular interpretation of what the evaluation is about. For example, the strategy makes several assumptions about the task that are apparently different from those made by other sites, and which perhaps take a looser and less academic view of the experiment. These key assumptions are:

* Relevance over ranking. The focus of text interpretation is to assign to each text, with the highest possible accuracy, the set of topics or content indicators that apply to that text (i.e., particularly for routing, to treat the topic or relevance of each text individually rather than as a relative measure against other texts in the corpus).

* Technology over engineering.
The goal of the experiment is to show high-accuracy, practical results, avoiding the threatening limitations of disk size, memory management, and tractability. (The system did no pre-indexing or analysis of any portion of the corpus.)

* Text interpretation over query interpretation. As there may not be any principled structure or methodology in the sample queries, the emphasis on matching texts to an internal representation of each query, rather than on automatic query processing, pushes the limits of the routing and retrieval engine instead of the interface to the user.

These choices were made for various reasons, and are presented here not as the right way to view the task but as one software system's implicit focus. A project with a different choice of focus could, for example, produce lower accuracy but have the benefit of fully automatic query processing; another project might have higher accuracy on the top few texts in each category but lower recall on a general routing task. We view our (preliminary) results as very satisfying as a test in a high-throughput, high-accuracy, routing-style interpretation.

The sections that follow cover the motivation for the design and implementation of this method, some specific details of the experiment, and an analysis of the results.

2 Background

There are several different methods for general-purpose information retrieval from text, as mentioned, including full-text search, statistical and probabilistic techniques, and those using automatic or manual indexing. Of these, word-based full-text search is generally believed to be the least accurate, but by far the most widely practiced. The method is used in spite of its inaccuracy because it is so easy to implement, and because it gives a direct and natural way for users to describe what they are looking for.

Knowledge-based approaches - those that rely on formal descriptions of concepts rather than combinations of words - can be the most accurate for certain limited tasks, but are the most difficult to implement and apply to general information retrieval problems. While there are a variety of knowledge-based methods, they share the common framework of looking for structured information in a text, while other methods look for simple combinations of words. For example, a simple query looking for stories about takeovers by GE in a typical word-based system might be (GE AND ACQUISITION), while in a knowledge-based system it would look more like

    ($GE * $acquire * $company)

where the terms marked by $ indicate classes of words or phrases that the system can recognize; for example, $GE might include GE and General Electric but not General Electric Company of Britain, and $company would match any company named or described.

The knowledge-based approach is more accurate, in general, for two reasons. First, it allows the program scanning the texts to look for many ways of expressing the same concept, enhancing recall - the percentage of relevant documents that the program retrieves. Second, it focuses search on specific pieces of information, thus enhancing precision - the percentage of retrieved texts that are relevant. In this case, improved recall would come from spotting the variety of ways of expressing the concept of acquiring, and improved precision comes from eliminating texts that are about the acquisition of real estate, products, and so forth, as well as acquisitions by other companies of GE's products and services.
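To make the contrast concrete, the following sketch shows what recognizing such a class-based pattern involves. It is purely illustrative: it is written in Python rather than the C of the system described later, the class contents and function names are hypothetical, and the real system recognized far richer classes and lexical features.

    # Illustrative only: hypothetical class definitions and matching in
    # the spirit of ($GE * $acquire * $company). Matching here is naive
    # substring search; the actual system matched tokens and phrases.

    CLASSES = {
        "$GE":      ["general electric", "ge"],
        "$acquire": ["acquisition", "acquired", "bought", "took over"],
        "$company": ["company", "corp", "inc", "ltd"],
    }
    # Longer phrases whose presence blocks a class match at that spot.
    EXCLUDE = {"$GE": ["general electric company of britain"]}

    def class_positions(cls, text):
        """Yield offsets where a member of cls occurs, skipping offsets
        covered by one of the class's exclusion phrases."""
        blocked = []
        for ex in EXCLUDE.get(cls, []):
            start = text.find(ex)
            while start >= 0:
                blocked.append((start, start + len(ex)))
                start = text.find(ex, start + 1)
        for phrase in CLASSES[cls]:
            start = text.find(phrase)
            while start >= 0:
                if not any(lo <= start < hi for lo, hi in blocked):
                    yield start
                start = text.find(phrase, start + 1)

    def match_pattern(classes, text):
        """True if some member of each class occurs, in order, with any
        number of tokens in between (the '*' wildcards)."""
        text = text.lower()
        at = -1
        for cls in classes:
            nxt = min((p for p in class_positions(cls, text) if p > at),
                      default=None)
            if nxt is None:
                return False
            at = nxt
        return True

    print(match_pattern(["$GE", "$acquire", "$company"],
                        "GE said it bought the aerospace company"))       # True
    print(match_pattern(["$GE", "$acquire", "$company"],
                        "General Electric Company of Britain bought one")) # False

The point is that the exclusion and the left-to-right ordering are part of the match itself, which no flat word query can express directly.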
Most full-text search systems are driven by Boolean queries - terms joined together by AND, OR, and NOT, such as (GE AND ACQUISITION) above. These can be arbitrarily complex, and one can start to approximate the operation of a knowledge-based approach by combining large numbers of terms. For example, the GE query above could be expressed as:

    (AND (OR GE (AND GENERAL ELECTRIC (NOT (AND OF BRITAIN))))
         (OR ACQUIRED ACQUISITION BOUGHT (AND (OR TAKE TOOK) OVER))
         (OR COMPANY CORP INC CO LTD SA AG ...))

There are several problems with this approach. First, it is very difficult for users to formulate queries of this sort, so, in practice, Boolean queries are quite simple. Second, it is easier for a system to recognize most or all company names than it is to list all the possible words that could appear in the name of a company. Third, the amount of information captured in the query is still limited; for example, it ignores the order in which words appear as well as their adjacency or proximity. Many text search systems allow augmentations to queries that express these constraints, but this makes the queries still more difficult to construct and makes searches less efficient. Hence Boolean retrieval systems remain, in practice, awkward and inaccurate.

The experiment that this team performed in TREC hatched under the pressure of a very difficult task with very limited resources. While a number of individuals participated (including the three authors and five others), the program described here resulted from less than a week of programming over and above the existing software tools we had in hand to apply to the task. The constraints thus forced upon the effort included the realization that not much could be salvaged from the queries as distributed, that the sample relevance judgements were incomplete and the training samples too small for most statistical tests, and that indexing the corpus by any practical method could delay the project by weeks, overflow the disks, and prevent any corrections to the method. This led to a "bare bones" approach that takes advantage of two strategies: (1) treating Boolean queries as an approximation to more detailed, structured representations, and (2) using the corpus, rather than the queries, as the main source of information for formulating the topics.

3 Our Approach

The fundamental idea behind the method is to take a knowledge-based description of a query or topic, and convert it to a Boolean form that can be efficiently applied by a text-search engine. This Boolean form, furthermore, must be an approximation of the knowledge-based query, in the formal sense that the Boolean expression should match all texts that the knowledge-based query would, but perhaps can admit more texts.

There are several key advantages to this approach of generating a Boolean query from a knowledge-based description. The simplest benefit is that it makes building queries easier, because much of the work in forming complex Boolean expressions is done automatically. A second major advantage is that the Boolean queries, in approximating a knowledge-based approach, are more likely to give accurate results. Finally, because retrieval using the automatically generated Boolean query approximates the knowledge-based query, the knowledge-based system can run on the results of Boolean retrieval, thus enhancing precision without having to apply the more computationally intensive knowledge-based processing to very much text.
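The compilation step itself can be sketched in the same illustrative vein. The function below is our own hypothetical code, not the rule compiler described in Section 3.1, but it shows the essential approximation: keep which words must appear, drop ordering and proximity, so every text the pattern matches still satisfies the Boolean form.

    # A class-based pattern, given as a list of class names (wildcards
    # dropped), is weakened into an AND of ORs. The Boolean form
    # over-generates but never misses a text the pattern would match.

    CLASSES = {
        "$GE":      ["GE", "GENERAL ELECTRIC"],
        "$acquire": ["ACQUIRED", "ACQUISITION", "BOUGHT", "TOOK OVER"],
        "$company": ["COMPANY", "CORP", "INC", "LTD"],
    }

    def approximate(pattern):
        """Compile a pattern into a Boolean s-expression: one OR per
        class, ANDed together. A multi-word class member becomes an
        AND of its words."""
        conjuncts = []
        for cls in pattern:
            disjuncts = []
            for member in CLASSES[cls]:
                words = member.split()
                disjuncts.append(words[0] if len(words) == 1
                                 else ["AND"] + words)
            conjuncts.append(["OR"] + disjuncts)
        return ["AND"] + conjuncts

    print(approximate(["$GE", "$acquire", "$company"]))
    # ['AND', ['OR', 'GE', ['AND', 'GENERAL', 'ELECTRIC']],
    #         ['OR', 'ACQUIRED', 'ACQUISITION', 'BOUGHT',
    #                ['AND', 'TOOK', 'OVER']],
    #         ['OR', 'COMPANY', 'CORP', 'INC', 'LTD']]

Negative constraints such as (NOT (AND OF BRITAIN)) can be compiled in as well, but they must be applied carefully, since an over-broad NOT could reject texts that the knowledge-based query would accept, breaking the approximation property.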
One of the obvious problems to overcome in TREC was how, with a limited amount of time to formulate 100 queries and a small amount of training data (none for the ad hoc test), to make the queries practical and accurate. Our choice here was to keep the initial queries relatively simple, and to run the results of a "first pass" retrieval against the entire corpus through a statistical filter to pull out terms that would help to augment or refine the query. In addition, the matching engine would display the exact portion of each text that (correctly or incorrectly) matched the query, making it easy to correct glaring errors and refine ambiguous terms. This amounts to a peculiar sort of feedback mechanism that relies on detailed analysis of portions of the corpus instead of user input.

3.1 Detailed Method

The method brings together four key elements: (1) a language for expressing knowledge-based topics or queries, developed at GE and described in the open literature, (2) a new program to generate Boolean expressions that approximate these queries (called the rule compiler), (3) a program to match the automatically-generated expressions against text to be retrieved (called the pre-filter), and (4) a knowledge-based pattern matcher, described in the open literature [2], that takes the results of the first match and rejects texts that do not satisfy the more constrained, knowledge-based query.

Because the pattern matcher is designed as an efficient "trigger" mechanism and an aid in parsing, the knowledge-based queries are mostly simple combinations of lexical categories. The patterns largely adopt the language of regular expressions, including the following terms and operators:

* Lexical features that can be tested in a pattern:
  - token "name" (e.g. "AK-47")
  - lexical category (e.g. "adj")
  - root (e.g. "shoot")
  - conceptual category (e.g. "human")

* Logical combination of lexical feature tests: OR, AND, and NOT

* Wild cards:
  $ - 0 or 1 tokens
  * - 0 or more tokens
  + - 1 or more tokens

* Variable assignment from pattern components: =

* Grouping operators:
  <> for grouping
  [] for disjunctive grouping

* Repetition:
  * - 0 or more
  + - 1 or more

* Range:
  *N - 0 to N tokens
  +N - 1 to N tokens

In practice, certain of these features are used more than others, and most queries rely most heavily on different lexical categories, grouping, and wildcards. For example, a simple description of the query looking for texts describing sanctions against South Africa is the following:

    $sanction == [(member sanction sanctions disinvestment)
                  <Sullivan principles>
                  <punitive *2 measures>]

    $safrica == [(member Buthelezi Pretoria anti-apartheid apartheid)
                 <De Klerk>
                 <South [Africa African]>]

    ;;; rule 1
    $sanction *50 $safrica => (mark-topic 52)

    ;;; rule 2
    $safrica *20 $sanction => (mark-topic 52)

This description says that any matching text must have both an indicator of South Africa ($safrica) and one of sanctions ($sanction), and that the sanction phrase must occur within 50 words before the South Africa phrase or, if it only comes afterwards, within 20 words. A sanction phrase can be any of the simple words sanction, sanctions, or disinvestment, or any phrase including punitive measures with no more than two intervening words (like punitive economic measures). A South Africa phrase can also be either one of a group of simple words, or a phrase, like De Klerk, South Africa, or South African.
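Under our reading of the *N range wildcard, the two rules reduce to a simple proximity test over phrase positions. The sketch below (hypothetical names, Python rather than the system's C, and start offsets as a simplification of intervening-token counts) makes that test explicit.

    # What the two mark-topic rules test: the hit lists hold token
    # offsets of phrase matches, as a class matcher would produce.

    def marks_topic_52(sanction_hits, safrica_hits):
        """True if any $sanction hit starts at most 50 tokens before a
        $safrica hit (rule 1), or any $safrica hit starts at most 20
        tokens before a $sanction hit (rule 2)."""
        for s in sanction_hits:
            for a in safrica_hits:
                if 0 <= a - s <= 50:    # rule 1: $sanction *50 $safrica
                    return True
                if 0 <= s - a <= 20:    # rule 2: $safrica *20 $sanction
                    return True
        return False

    print(marks_topic_52([12], [40]))   # True: 28 tokens apart, rule 1
    print(marks_topic_52([120], [40]))  # False: sanction 80 tokens after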
These queries or topic descriptions can be quite complex, and the method has been designed to handle many queries simultaneously, so the rule compiler is designed to produce expressions that can be efficiently applied within a large set of queries. This is important because many queries can share the same simple terms or combinations of terms, and because the pre-filter must match the simplest expressions first. For the topic description given above, the output of the rule compiler will include the following tests:

    52   TERM AFRICAN
    2029 TERM MEASURES
    2134 TERM SANCTION
    2135 TERM SANCTIONS
    2136 TERM DISINVESTMENT
    2138 TERM SULLIVAN
    2139 TERM PRINCIPLES
    2141 TERM PUNITIVE
    2144 TERM BUTHELEZI
    2145 TERM PRETORIA
    2146 TERM ANTI-APARTHEID
    2147 TERM APARTHEID
    2149 TERM DE
    2150 TERM KLERK
    2152 TERM SOUTH
    2153 TERM AFRICA
    2137 OR 2134 2135 2136
    2140 AND 2138 2139
    2142 AND 2141 2029
    2143 OR 2137 2140 2142
    2148 OR 2144 2145 2146 2147
    2151 AND 2149 2150
    2154 OR 2153 52
    2155 AND 2152 2154
    2156 OR 2148 2151 2155
    2157 AND 2143 2156
    2158 AND 2156 2143
    TOPIC052 OR 2157 2158

Each line in the above data gives a unique number (or topic designator) to the test, a test identifier (either TERM for a simple word test, OR, or AND), and a list of simple terms or previous tests. For example, test 2137 depends on tests 2134, 2135, and 2136, and is true if any of those tests is true, namely, if the text includes any of the words sanction, sanctions, or disinvestment. The tests are automatically ordered so that all tests that are dependent on other tests will have higher numbers than the tests they depend on; thus all TERM tests appear first. In this case, the TERM test AFRICAN appears with a much lower number simply because it is used in many different queries.

The pre-filter, which can work either on complete documents or paragraphs, goes through every word in its input and, using a fast table look-up, sets the TERM tests to true for every word it encounters. At the end of input, either the end of the paragraph or end of each document, it runs through the table of possible tests from low numbers to high numbers and sets tests to true if their conditions are satisfied. A topic test produces a match if it has become true at the end of this process, meaning that the paragraph or document has passed the pre-filter for that query. A single paragraph, of course, can satisfy multiple queries.

The pattern matcher then loads in only those texts that have passed the pre-filter, and uses the original queries to apply more stringent tests to those texts, such as ordering, proximity, and more complex lexical tests. In typical cases, the amount of text the pattern matcher must scan is only 2-3% of the words in the original, and the pattern matcher discards 20-30% of the matches of the pre-filter. Hence, well over 95% of the text filtering is done by the pre-filter, making the whole process efficient as well as showing that the pre-filter closely approximates the knowledge-based pattern matcher.
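Before turning to results, the pre-filter's evaluation pass is worth pinning down, since it is the heart of the method's efficiency. The sketch below is our reading of the description above, with a hypothetical TESTS table holding a truncated fragment of the compiled output: TERM tests are set by table lookup as the words stream by, and a single low-to-high sweep then evaluates the composite tests, which is sound because every test depends only on lower-numbered tests.

    # Illustrative sketch of the pre-filter evaluation pass.

    TESTS = [  # (test id, kind, arguments) - a fragment of the table
        (2134, "TERM", ["SANCTION"]), (2135, "TERM", ["SANCTIONS"]),
        (2136, "TERM", ["DISINVESTMENT"]), (2144, "TERM", ["BUTHELEZI"]),
        (2137, "OR", [2134, 2135, 2136]),
        (2148, "OR", [2144]),             # truncated $safrica disjunction
        (2157, "AND", [2137, 2148]),      # truncated topic test
    ]

    def prefilter(words):
        value = {}
        term_index = {t[2][0]: t[0] for t in TESTS if t[1] == "TERM"}
        for w in words:                      # one fast lookup per word
            tid = term_index.get(w.upper())
            if tid is not None:
                value[tid] = True
        for tid, kind, args in TESTS:        # single low-to-high sweep
            if kind == "OR":
                value[tid] = any(value.get(a, False) for a in args)
            elif kind == "AND":
                value[tid] = all(value.get(a, False) for a in args)
        return value

    v = prefilter("Buthelezi urged new sanctions".split())
    print(v[2157])   # True: passes the (truncated) topic test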
4 Implementation and Results

The lexical analyzer and the matching method described above were programmed in the C programming language, compiled, and tested on a corpus of about 1.2 gigabytes of text (about 200 million words) with a set of 50 topics, as part of the government-sponsored Text REtrieval Conference (TREC). The program was then tested on an additional gigabyte of text (the "second" disk) with an additional 50 topics, and the results were submitted to the government for comparative evaluation purposes. The results, in this case, were a ranked list of up to 200 documents that matched each query.

The final test, run on two SUN workstations, took about 10 hours of elapsed time for the entire corpus. The pre-filter ran at a speed of several hundred thousand words per minute, thus covering the entire contents of each disk in a few hours. This would reduce the text to 3% of the original data, with each paragraph marked according to the queries that matched in the pre-filter.

During query refinement, a statistical filter ran through the entire contents of this marked portion of the corpus, pulling out words and phrases with a high correlation (a variation of the metric for categorization reported in [1]) to a particular topic. For every topic, then, the statistical filter listed terms not in the (Boolean) query that occurred with a disproportionate frequency in text that matched the query. This, for example, produced the terms Buthelezi and punitive measures in the South Africa query above.

Figure 1 summarizes some of the official results from TREC, including the number of times the systems achieved the best score, and were above, below, or equal to the median score. The system was very close to the top in both the ad hoc and routing tasks, and was well above median on most topics. The pre-filter, surprisingly, was better than the pattern matcher almost across the board, indicating that with this style of evaluation there seemed to be no real benefit to deeper analysis than a simple Boolean match. The logical conclusion is that a Boolean routing or retrieval engine can perform as well as much more complex and sophisticated methods, so long as enough knowledge is used in constructing the Boolean expressions.

                            AD HOC TEST                 ROUTING TEST
                      11-pt.  Rel.@     Rel./     11-pt.  Rel.@     Rel./
                      avg.    100 docs  Retr.     avg.    100 docs  Retr.
  Boolean pre-filter  .2029   47.2      .46       .2078   35.6      .34
  Pattern matcher     .1961   46.2      .46       .1851   34.6      .37
  Median for all runs .1585   39.7                .1246   28.6

                            AD HOC TEST                 ROUTING TEST
                      top    above   below          top    above   below
                      score  median  median  equal  score  median  median  equal
  Boolean pre-filter    5      28      15      2      5      31       6      7
  Pattern matcher       5      26      17      2      6      27      10      6

                      Figure 1: GE TREC test results
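As a sketch of the query-refinement step described above, the following (our own simplification; the metric actually used is the one adapted from [1]) scores each word by how much more frequent it is in the paragraphs a query matched than in the corpus at large, and proposes the most disproportionate words as candidate query terms.

    # Illustrative frequency-ratio filter for candidate query terms.

    from collections import Counter

    def candidate_terms(matched_paras, all_paras, min_count=5, top_n=10):
        matched = Counter(w.lower() for p in matched_paras
                                    for w in p.split())
        overall = Counter(w.lower() for p in all_paras
                                    for w in p.split())
        m_total = sum(matched.values())
        o_total = sum(overall.values())
        scored = []
        for word, count in matched.items():
            if count < min_count:
                continue                   # ignore rare, noisy terms
            ratio = (count / m_total) / (overall[word] / o_total)
            scored.append((ratio, word))
        return [w for _, w in sorted(scored, reverse=True)[:top_n]]

    # Terms like "buthelezi" or "punitive" would surface for the South
    # Africa topic if they are concentrated in its matched paragraphs.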
5 Limitations and Future Work

While the overall, relative results were generally strong, the system as it was implemented had some basic flaws that should be easily corrected. One characteristic of this method is that it seems to produce excellent results on the queries that produce large volumes of data, and tends to produce almost nothing on some of the narrowest queries. It also produces both high precision and high recall for the texts that match, but makes it difficult to "loosen up" to let in more texts. Since the match relies on at least some exact match between query terms and texts, there are some queries (for example, about the details of rewriteable disks) that produced no hits. By contrast, in one configuration, the system assigned over 4,700 texts to one query ("sightings of 1988 presidential candidates"), of which only 200 were included in the submitted results. This might be desirable behavior for a routing system, but in the TREC style of evaluation the system did badly on those queries where it failed to produce at least 200 documents. On the other hand, the system produced at least 6000 responses in even its strictest configuration, or at least 120 texts, on average, per query.

The problems with undergeneration (and the related problem of not doing a very good job of ranking the documents) were due to the fact that our system was designed for routing, while TREC used traditional retrieval evaluation methods, along with a 200-document cutoff, effectively counting recall on the harder topics much more heavily than overall recall. Our approach can correct for this by using a more flexible statistical method to expand the queries and by performing a more sophisticated ranking (the document ranking as reported was implemented post hoc in one line of Unix code).

More important than the problems to correct, there is an important result here to build on. Our experience has been that pattern matching can be a close approximation to natural language processing for this sort of task, so it might seem that advanced methods are much more critical for finding what to put in the queries than they are for the detailed analysis of the texts. The general framework of this approach means that, with the continued development of advanced methods for natural-language based corpus analysis, substantial performance improvements can come within the context of almost any current text retrieval system.

6 Evaluation Methods

One unusual characteristic of our method is that it assumes that each relevance judgement that the system makes is made independently of all other texts, as in a routing task where the system processes each incoming message in turn and assigns topics or actions for filing or routing that message. Certainly, this style has certain advantages - it is simple, clear, and makes parallel processing easy - and it reflects some real assumptions about the nature of the task. However, although it seems to have done very well relative to other systems, it is not consistent with the instructions for submitting results in TREC, and certainly doesn't lead to the best possible showing on some of the results.

Topic 77, about poaching techniques, is one example of the different (naive, perhaps) perspective toward evaluation that our system adopts. The query specifies:

    A relevant document will identify the type of wildlife being
    poached, the poaching technique or method of killing the wildlife
    which is used by the poacher, and the reason for the poaching
    (e.g. for a trophy, meat, or money).

This is a very specific query. Our test (bootstrapping) sample produced a good number of hits, but most of them failed to include one of the required pieces of information, usually the technique or method of killing. So, we narrowed the query. The result is that, for this query, the system produced 9 total documents, 6 of which were judged relevant. This is high precision (.67), but it doesn't help the overall results, since for this topic the precision at 200 documents is treated as .03. To us, narrowing the query seemed a good idea because the precision on this topic otherwise would have been low, but we did not realize that the documents that the system didn't retrieve were still treated as incorrect in this calculation.

On Topic 43 (1991 AI conferences), our system produced 3 documents, all of which were irrelevant. This "routing" topic was later discarded because no relevant documents were found in the corpus, but there is nothing inherently wrong with testing topics for which there is no data. In fact, the ideal routing system should produce 0 hits for such a topic, not 200 hits as dictated in TREC. Certainly ranking and routing don't go together in any real task on a gigabyte sample.

One way that future evaluations can test routing is to use a random (or otherwise fair) sample of the collections as a test, judge every document in that sample with respect to every query, and then measure each system's recall and precision on the basis of the sample. This would probably require less hand-work in judging relevance, but would require that each system produce topic assignments for every document in the collection (from which the assignments for the test sample would be extracted post hoc). This could be impossible for some systems. On the other hand, the strategy would give real numbers for both recall and precision, and would be much truer to the routing task.
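A sketch of the proposed scoring, with hypothetical names and document ids, shows how directly recall and precision would fall out of a fully judged sample.

    # Illustrative only: estimating per-topic recall and precision
    # from a fully judged random sample of the collection.

    def sample_scores(retrieved, relevant, sample):
        """Inputs are sets of document ids; relevant holds the sample
        documents judged relevant for one topic."""
        retrieved = retrieved & sample       # only sampled docs count
        relevant = relevant & sample
        hits = retrieved & relevant
        recall = len(hits) / len(relevant) if relevant else None
        precision = len(hits) / len(retrieved) if retrieved else None
        return recall, precision

    print(sample_scores({1, 2, 3}, {2, 3, 9}, set(range(1, 11))))
    # (0.666..., 0.666...)

Note that a topic with no relevant documents in the sample (like Topic 43) yields an undefined recall rather than penalizing a system for returning nothing.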
7 Utility

The main purpose of this method is as a front-end for computation-intensive natural language processing of large bodies of text. Because the pre-filter closely approximates more in-depth processing with a very fast, efficient process, it permits detailed processing of large volumes of text by discarding most of the irrelevant material and by producing a rough approximation of the more detailed processing.

The method is more broadly applicable to problems in information dissemination and retrieval. Accuracy is only one appealing characteristic of the technique, since the main innovation is that it allows for improved accuracy within the context of traditional word-based full-text search.

In addition to the programs described here, the method was tested with a statistical corpus analyzer that helps to identify candidate words and phrases to include in queries. This method helps to overcome some of the limitations of word-based methods in cases where statistical approaches clearly seem to do better. As an additional experiment, this automated corpus analysis can be used to reduce further the amount of labor involved in building queries.

8 Summary

GE's participation in TREC involved a small implementation of a simple strategy for compiling knowledge-based pattern-matcher rules into the language of Boolean expressions. A statistical corpus analyzer helped to formulate and refine queries for both the ad hoc and routing tasks, and the resulting matching engine ran on the entire 2.3 gigabytes of text. The simple Boolean retrieval engine performed very well on both tasks. These results are promising, both from the perspective of accuracy and for the simplicity with which they seem to bring knowledge-based techniques to bear within the rudimentary framework of word-based retrieval.

References

[1] Paul S. Jacobs. Joining statistics with NLP for text categorization. In Proceedings of the 3rd Conference on Applied Natural Language Processing, April 1992.

[2] Paul S. Jacobs, George R. Krupka, and Lisa F. Rau. Lexico-semantic pattern matching as a companion to parsing in text understanding. In Fourth DARPA Speech and Natural Language Workshop, pages 337-342, San Mateo, CA, February 1991. Morgan Kaufmann.