GE in TREC-2: Results of a Boolean Approximation Method for Routing and Retrieval*

Paul S. Jacobs
GE Research and Development Center
Schenectady, NY 12301
psjacobs@crd.ge.com

*This research was sponsored in part by the Advanced Research Projects Agency. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the US Government.

Abstract

This report describes a few experiments aimed at producing high accuracy routing and retrieval with a simple Boolean engine. There are several motivations for this work, including: (1) using Boolean term combinations as a filter for advanced data extraction systems, (2) improving "legacy" Boolean retrieval systems by helping to automate the generation of Boolean queries, and (3) focusing on query content, rather than retrieval or ranking, as the key to system performance. The results show very high accuracy, and significant progress, using a Boolean engine for routing based on queries that are manually generated with the help of corpus data. In addition, the results of a straightforward implementation of a fully automatic ad hoc method show some promise of being able to do good automatic query construction within the context of a Boolean system.

1 Introduction

Full-text search is currently the simplest and most commonly-used method for locating information in large volumes of free text. Because users are accustomed to describing what they are looking for with specific words, and those words are often found in the texts, searching the text for selected words or word combinations is a natural and easy-to-implement method for information retrieval. However, it can be very inaccurate. It can be especially difficult for searchers to compose "queries" that combine the words that are effective in locating relevant material without finding large quantities of irrelevant information as well. One way to cope with this difficulty, while still preserving the advantages of the full-text search engine, is to help to automate the process of generating Boolean queries. This was the focus of GE's TREC-2 effort.

GE's involvement in TREC represents a relatively low level of effort aimed at bringing together natural language text processing, data extraction, and statistical corpus analysis methods. Our project uses innovative approaches for extracting information from text, best exemplified in our results in the MUC and TIPSTER extraction evaluations [7, 3] and in operational text management systems in GE. In TREC-1, we attempted to show the benefit of natural language interpretation by using Boolean approximation to select portions of text that could be further interpreted. The main result of this was that natural language seems to have very little to offer as a precision filtering method, because routing and retrieval problems stem largely from having the wrong terms in the queries [6]. Thus, in TREC-2, we have stuck with the Boolean engine, concentrating on the use of corpus analysis to improve the queries.

Figure 1 summarizes our TREC results. Our results in TREC-2, as in TREC-1, were quite good relative to other systems. The manual routing system, which comprised over 99% of our effort, produced an 11-point average of .3308, with an average of 45 relevant documents in the top 100.
This put GE's system at the very top of the manual routing category (the system with the best 11-point average in this category was slightly higher on the 11-point average and had slightly fewer relevant documents, on average, in the top 100). The residual effort went into a fully automatic ad hoc method, which produced an 11-point average of .2183 and an average of 37 relevant documents in the top 100.

As in TREC-1, performance varied dramatically by topic. The routing system showed the best results (in terms of precision at 100 documents) on 8 of 50 topics, yet it was below median on 17 topics. This not only suggests areas for further improvement, but also shows an important difference between the Boolean approach and some of the statistical retrieval systems. The Boolean approach does much better on certain topics, but the statistical approaches have more consistent performance.

                            AD HOC TEST          ROUTING TEST
                          11-pt.   Rel. @      11-pt.   Rel. @
                           avg.   100 docs.     avg.   100 docs.
    Boolean '92           .2029     47.2       .2078     35.6
    Pattern matcher '92   .1961     46.2       .1851     34.6
    Avg. median run '92   .1585     39.7       .1246     28.6
    Boolean '93           .2183*    37*        .3308     45
    Avg. median run '93   .2620     41         .2910     41

    * = fully automatic

    Figure 1: Summary of GE Results on TREC-1 and TREC-2

While it is very hard to measure progress between TREC-1 and TREC-2 because none of the numbers are directly comparable, the difference between our .2078 routing average in TREC-1 and .3308 in TREC-2 is large enough that we are quite pleased with the rate of progress and confident that we are nowhere near the peak that can be achieved with manual, Boolean routing. In addition, the automatic ad hoc method that we tried, while showing terrible performance on some of the more convoluted topic descriptions, had a better average than our manual ad hoc system last year.

Thus, there is a great deal to be gained using corpus analysis to automate or assist in query generation, even within the context of straightforward retrieval methods. In addition to having promise for "legacy" operational systems, these results suggest that natural language methods, focused on corpus analysis and query generation, probably can help in improving the performance of many information retrieval systems.

2 Boolean Approximation

The basic approach in our system is to compile queries into Boolean tables that can be matched at high speed against a stream of input text. This approach is meant for routing, and also to be compatible with "downstream" analysis such as what we do in TIPSTER data extraction. In fact, the Boolean compiler we use is designed for handling the much more complex expressions that our system uses in data extraction. Figure 2 illustrates the approach.

We call this Boolean approximation because the Boolean expressions used in the basic matching engine are an approximation to more detailed processing of texts, in the sense that they are guaranteed to admit all text that would be admitted by more detailed processing, but will usually also admit many texts that would be rejected by more detailed constraints. This is a very general method, in that the system can be configured to apply many different stages of analysis, from "shallower" processing to "deeper" interpretation, with each stage applying stricter constraints (for example, word order, proximity, semantic constraints, and so forth). Furthermore, at each stage, the effects of filtering can be measured, generally showing a loss of recall and gain in precision.
In TREC-1 [6], we measured this tradeoff and found that the highest 11-point averages came from the first stage of filtering; in other words, the gains in precision in later stages were not enough to make up for the loss of recall on these measures.

The figure also illustrates the flow of information at development time and the sort of knowledge that applies at each stage. For example, at development time, knowledge can be mapped from the deeper levels into shallower levels. At run time, the subsequent stages of language analysis apply this knowledge in stages. For example, our system can analyze joint venture texts (in English and Japanese), looking for, among other things, information about the joint venture company by recognizing that the company is often the object of the verb establish. In the Boolean stage, this can be approximated by looking for the combination of words like establish with words like venture. In the finite state pattern matching stage, the system might look for any word with the root establish followed by the venture term (and perhaps the reverse order with the verb in a passive form). In deeper interpretation, the system applies syntactic and semantic constraints to recognize the different ways that the concept of establishing can be expressed, and ensure that the words appear in a grammatically acceptable way in the input. This more detailed analysis can be crucial in data extraction tasks.

    [Figure 2: Approximation and Natural Language Processing. Development-time knowledge base compilation (statistical and manual) feeds the run-time control flow: Boolean approximation for routing and retrieval via Boolean tables, e.g. (establish OR established) AND venture; finite state pattern matching for extraction via lexico-semantic regular expressions, e.g. (root establish) * venture; and natural language semantic interpretation via the lexicon, grammar, and knowledge bases, e.g. (c-establishing (r-effected (c-venture))).]

A critical point about this model is that even detailed knowledge about language often ends up contributing to the Boolean tables, so the Boolean expressions are considerably more complex than those that could easily be created by hand. Furthermore, the Boolean expressions are a relaxed form of more complex knowledge-based constructs. It is very difficult to predict the impact of simply relaxing constraints. For example, TREC topic 53 includes the phrase "leveraged buy-out". In a strict syntactic analysis, this phrase would have to appear exactly in that form, enforcing both proximity ("leveraged" adjacent to "buy-out") and order ("leveraged" before "buy-out"). But relaxing such constraints seldom has any severe effect on results: the Boolean expression (leveraged AND buy-out) in our system will recognize the two terms occurring anywhere in the same paragraph, which is actually likely to admit more texts about leveraged buy-outs than texts in which the words coincidentally appear together.

Even the phrase "prime rate", which matches some irrelevant texts in its Boolean form (including, for example, one or two texts about Japan that mention "prime minister" in the context of "growth rate"), also admits some additional relevant texts in which "prime" appears in the same paragraph as "rate". The well-known effects of word order and proximity on meaning, exemplified by the distinction between "blind Venetian" and "Venetian blind", do not seem to appear very frequently in real examples.
At least, these effects may be less frequent than is commonly assumed.

In TREC-2, we did not apply any constraints stricter than Booleans at run-time; in other words, we used only a Boolean retrieval engine (because in TREC-1 we found that the stricter constraints didn't help). We also had to implement a module to relax some of the "hard" Boolean constraints for topics with a very small number of relevant documents, because of the nature of the performance metrics. However, we still used the knowledge that was developed for more detailed processing, including the phrases, semantic groupings, and so forth. In addition, we added a more sophisticated ranking mechanism than we used in TREC-1, because ranking is very important in the evaluation. But the only retrieval engine in our TREC-2 system is a Boolean matcher.

2.1 The Boolean Matcher

The Boolean tables are efficiently organized so that a C program (which we now know as NLGREP) can match them against incoming texts at a rate of about 1 million words per minute. This program spends little time on anything other than marking where words in each document match terms in the table. In both routing and retrieval, the total number of terms used for a set of 50 topics is about 2000, or about 40 unique terms per topic. All other words are ignored entirely. For example, the following is a query for the topic "South African sanctions" (using the enhanced regular expression language of GE's pattern matcher [5]):

    sanction == [(member sanction sanctions disinvestment)]

    safrica == [(member Buthelezi Pretoria anti-apartheid apartheid)]

    ;; rule 1
    $sanction * $safrica => (mark-topic 52)

This description says that any matching text must have both an indicator of South Africa ($safrica) and one of sanctions ($sanction), and that the sanction phrase and South Africa phrase must appear in the same paragraph in the document.

A sanction phrase can be any of the simple words sanction, sanctions, or disinvestment, or any phrase including punitive measures with no more than two intervening words (like punitive economic measures). A South Africa phrase can also be either one of a group of simple words, or a phrase, like De Klerk, South Africa, or South African.

These queries or topic descriptions can be quite complex, and the method has been designed to handle many queries simultaneously, so the rule compiler is designed to produce expressions that can be efficiently applied within a large set of queries. This is important because many queries can share the same simple terms or combinations of terms, and because the Boolean matcher must match the simplest expressions first. For the topic description given above, the output of the rule compiler will include the following tests:

    52    TERM AFRICAN
    2029  TERM MEASURES
    2134  TERM SANCTION
    2135  TERM SANCTIONS
    2136  TERM DISINVESTMENT
    2138  TERM SULLIVAN
    2139  TERM PRINCIPLES
    2141  TERM PUNITIVE
    2144  TERM BUTHELEZI
    2145  TERM PRETORIA
    2146  TERM ANTI-APARTHEID
    2147  TERM APARTHEID
    2149  TERM DE
    2150  TERM KLERK
    2152  TERM SOUTH
    2153  TERM AFRICA
    2137  OR   2134 2135 2136
    2140  AND  2138 2139
    2142  AND  2141 2029
    2143  OR   2137 2140 2142
    2148  OR   2144 2145 2146 2147
    2151  AND  2149 2150
    2154  OR   2153 52
    2155  AND  2152 2154
    2156  OR   2148 2151 2155
    2157  AND  2143 2156
    2158  AND  2156 2143
    TOPIC052  OR 2157 2158

Each line in the above data gives a unique number (or topic designator) to the test, a test identifier (either TERM for a simple word test, OR, or AND), and a list of simple terms or previous tests.
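To make the matching procedure concrete, the following is a minimal Python sketch of how a compiled test table of this form might be applied to one paragraph. This is an illustration rather than the actual NLGREP implementation (which is a C program); the Test class, function names, and example fragment are all hypothetical.

    # Illustrative sketch of applying a compiled Boolean test table to one
    # paragraph. The ordering convention follows the paper: every compound
    # test has a higher number than the tests it depends on, so one
    # low-to-high sweep decides everything.

    from dataclasses import dataclass

    @dataclass
    class Test:
        num: int      # unique test number (e.g. 2137)
        kind: str     # "TERM", "OR", or "AND"
        args: tuple   # a single word for TERM; test numbers for OR/AND

    def match_paragraph(tests, words):
        """Return the set of test numbers that are true for this paragraph."""
        # Index TERM tests by word for the fast table look-up.
        term_index = {}
        for t in tests:
            if t.kind == "TERM":
                term_index.setdefault(t.args[0], []).append(t.num)

        true_tests = set()
        # Pass 1: mark TERM tests true for every word in the paragraph.
        for w in words:
            true_tests.update(term_index.get(w.upper(), ()))

        # Pass 2: sweep compound tests from low to high numbers; their
        # dependencies are guaranteed to have been decided already.
        for t in sorted(tests, key=lambda t: t.num):
            if t.kind == "OR" and any(a in true_tests for a in t.args):
                true_tests.add(t.num)
            elif t.kind == "AND" and all(a in true_tests for a in t.args):
                true_tests.add(t.num)
        return true_tests

    # A fragment of the Topic 52 table above:
    tests = [
        Test(2134, "TERM", ("SANCTION",)),
        Test(2135, "TERM", ("SANCTIONS",)),
        Test(2136, "TERM", ("DISINVESTMENT",)),
        Test(2147, "TERM", ("APARTHEID",)),
        Test(2137, "OR",   (2134, 2135, 2136)),
    ]
    hits = match_paragraph(tests, "Apartheid opponents urged sanctions".split())
    assert 2137 in hits and 2147 in hits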
For example, test 2137 in the compiled listing depends on tests 2134, 2135, and 2136, and is true if any of those tests is true, namely, if the text includes any of the words sanction, sanctions, or disinvestment. The tests are automatically ordered so that all tests that depend on other tests have higher numbers than the tests they depend on; thus all TERM tests appear first. In this case, the TERM test AFRICAN appears with a much lower number simply because it is used in many different queries.

The matcher, which can work either on complete documents or on paragraphs (we used paragraph matching only in TREC-2), goes through every word in its input and, using a fast table look-up, sets the TERM tests to true for every word it encounters. At the end of input, either the end of the paragraph or the end of each document, it runs through the table of possible tests from low numbers to high numbers and sets tests to true if their conditions are satisfied. A topic test produces a match if it has become true at the end of this process, meaning that the paragraph or document has passed the pre-filter for that query. A single paragraph, of course, can satisfy multiple queries.

This portion of the system was implemented in the space of a few days, and is almost entirely the same as in TREC-1. Our focus since last year has been on query construction and ranking rather than matching or retrieval.

2.2 Query construction

Our approach assumes, in general, that manual query construction is acceptable for routing. In ad hoc retrieval, query time can be of the essence, but in many routing applications, queries are developed and refined over time. The amount of time spent on query construction using a manual method in our system is comparable to the amount of time spent on the topic descriptions used for automatic query generation.

2.2.1 "Manual" queries for routing

In manual routing, our approach uses a statistical corpus analysis, developed originally for text categorization [4], to pull out terms based on their relative frequency in relevant documents for each topic. The statistic used combines the entropy-based mutual information statistic (testing the independence of each term with each topic) with a correction for low-frequency terms and for ambiguous words. Words with high weights have a high degree of association with a topic. This statistical analysis is also used in ranking. The base weighting formula is the following:

    C (log2 b)(log2 r)

where C is a constant, b is the number of times a term appears in a story assigned to a particular category, for example, and log2 r is the log of the ratio of the combined probability (i.e., of a particular word or phrase occurring in a text about a particular category) to the product of the independent probabilities, that is, the mutual information statistic. This tests the assumption that the use of the word and the category of the text are independent. When this assumption is false, the word gets a high positive or negative weight.
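As a concrete illustration, the following Python sketch computes this base weight, under the assumption that b is the count of the term's occurrences in texts assigned to the topic and r is the ratio of the joint probability to the product of the independent probabilities; the paper's corrections for low-frequency and ambiguous words are omitted, and the names and the value of C are illustrative. The Topic 51 list that follows shows the kind of output this statistic produces in practice.

    # Sketch of the base weight C (log2 b)(log2 r), with r the mutual
    # information ratio P(term, topic) / (P(term) P(topic)). The paper's
    # low-frequency and ambiguity corrections are not shown here.

    import math

    def base_weight(n_term_topic, n_term, n_topic, n_total, C=100.0):
        """Weight of a term for a topic from simple corpus counts."""
        if n_term_topic == 0:
            return 0.0                    # logs undefined; correction omitted
        b = n_term_topic                  # term occurrences in topic texts
        p_joint = n_term_topic / n_total
        p_indep = (n_term / n_total) * (n_topic / n_total)
        r = p_joint / p_indep             # independence ratio
        return C * math.log2(b) * math.log2(r)

    # A word concentrated in one topic's texts gets a large positive weight:
    print(base_weight(n_term_topic=40, n_term=50, n_topic=2000, n_total=1000000))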
For example, the following are the top words for Topic 51, "Airbus subsidies":

    A-330                        TOPIC51   1263.3
    AIRBUS                       TOPIC51   1183.1
    A-340                        TOPIC51   1178.2
    INDUSTRIE                    TOPIC51   1071.9
    A-320                        TOPIC51   1067.5
    MESSERSCHMITT-BOELKOW-BLOHM  TOPIC51    851.3
    AERONAUTICAS                 TOPIC51    843.3
    CONSTRUCCIONES               TOPIC51    807.9
    AEROSPATIALE                 TOPIC51    762.5
    MBB                          TOPIC51    722.6
    WIDE-BODY                    TOPIC51    617.7
    MD-11                        TOPIC51    613.8
    TOULOUSE                     TOPIC51    228.5
    JETLINERS                    TOPIC51    217.8
    LUFTHANSA                    TOPIC51    196.6
    MD-80                        TOPIC51    196.6

Clearly, these words all have some reason to be associated with this topic, but adding them to the appropriate group in each query (or ignoring them entirely) is a "manual" process. Our manual routing queries, therefore, are a combination of the regular expressions that were developed from the topic descriptions with terms added that were selected from the automatic training. This is, we believe, a very practical manual approach that has very good performance.

2.2.2 "Hard" vs. "Soft" Booleans

The Boolean matcher uses a "hard" Boolean approach, in that it will admit, for each query, only texts that satisfy the conditions of that query. For example, in Topic 51 above, "Airbus subsidies", the matcher will allow only texts that have both an Airbus term and a subsidy term in the same paragraph. However, this is a narrow topic, and TREC-2 allows each system to produce 1000 texts for each topic. The evaluation metrics offer no penalty for filling up the list of 1000 with texts that are likely to be irrelevant. So, in order to provide increased flexibility and consider larger numbers of texts for each topic, we used an additional engine solely for the purpose of pulling in texts for very specific queries like this one.

The system is still a hard Boolean system in that texts that satisfy the Boolean conditions will always be ranked higher than texts that do not satisfy the conditions; however, texts that do not satisfy the conditions can appear on the final ranked lists. The "soft" engine considers such texts by relaxing some of the Boolean conditions, effectively pulling in texts that have a large number of terms that match the query but do not necessarily meet all the conditions. This component of the system is more like statistical retrieval engines; however, it does not have a large impact on the overall scores, because it only affects the results at the low-precision extreme (the lowest rankings) for queries that match very few documents. In fact, for Topic 51, the hard Boolean query matches 11 texts, and the soft method pulls in an additional 989 texts. But there are only 11 texts that are judged relevant, of which 10 satisfy the hard Boolean. So enforcing the hard Boolean condition seems to work well for this topic, and the soft Boolean doesn't contribute much.

For some topics, the balance between the hard Boolean condition and the soft Boolean isn't so clear; i.e., not enforcing the hard condition would lead to better ranking. This seems to be a function of how well the topics fit with Boolean expressions in general. "Airbus subsidies" is really a Boolean topic, in that a relevant text must say something about Airbus and something about subsidies. Other topics, like "automation", are much harder to express in a Boolean form. We will cover these issues in the topic-by-topic performance discussion later in this paper.

2.2.3 "Automatic" queries for ad hoc retrieval

The "manual" query method is partly automated in the sense that the corpus-based statistical training suggests many of the terms that are used in the queries.
But it is "manual" in that the initial formulation of the queries is done manually from the TREC topic descriptions.

We have tried, as a simple experiment, to generate the Boolean term groupings and expand each term automatically from the topic descriptions. In the days before the TREC-2 ad hoc test, we tried several different ways of doing this automatic Boolean query generation, and chose the one that worked best on the sample data.

Our first attempt was to use the common methods for finding collocations and word associations in sentences, and these worked horribly for term expansion. The problem is that this approach finds more associations like "funeral" and "home" than it does "hostage" and "captive", and the latter, text-level associations are what's required to generate good queries.

The "solution" we tried was, for a sample of about 10 million words in the corpus, to choose the top 20 words based on TF.IDF weights for each document, store the frequency of association among these terms, and then weight each pair using the weighted mutual information statistic of the previous section. This was much better than using sentence-level information, although it is still a very straightforward approach. For example, the following are the top 10 terms associated with the word "hostage" (in order):

    hostages Lebanon Beirut Iran release Terry kidnappers kidnapped Jihad Anderson

While this is certainly not the optimal set of terms to use in place of "hostage", it is a good start.

The next problem in automatic query construction is when to use a combination of terms and when to use a single term. For example, the term "weather-related fatalities" is a combination of two word groups (weather and fatalities), while "Iran-contra affair" is really only one group (Iran-contra), even though it might appear that "affair" is a significant term. Again we took the direct approach, choosing to combine terms whenever there was a reasonable percentage of overlap between their associated terms. This worked surprisingly well in cases where the topic title was a good description (e.g. "welfare reform") and very badly for those with vague titles (e.g. "find innovative companies"). We tried to recover from these by including more words from the description and narrative, but then we had to start recognizing the language of these descriptions, filtering out words like "relevant", "mention", and so forth. At this crude stage, the main problem with the query generation method is using the structure of the topic descriptions.

The second major issue with automatic query generation is that it isn't nearly as good at finding good terms as the process of training from data and relevance judgements, as used in the routing experiments. The relevance judgements used for routing contain large volumes of relatively high-accuracy data, while the training used for term expansion in query generation relied on relatively small volumes of relatively noisy data. For example, the word "welfare" used in one of the ad hoc topics occurred with a high enough TF.IDF weight only 29 times in the training sample, and the most frequently associated term, "children", occurred only 6 times. In order to establish good associations between "welfare" and less frequent terms, we would need much more data.
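A rough Python sketch of the document-level association method just described follows; the TF.IDF scoring, the top-20 cutoff, and all names here are illustrative simplifications of what the paper outlines, not the actual implementation.

    # Sketch of text-level term association: take each document's top-k
    # terms by TF.IDF, count how often two terms share a top-k list, and
    # weight each pair with a weighted mutual information statistic in the
    # style of the routing weight. Tokenization and corrections are omitted.

    import math
    from collections import Counter
    from itertools import combinations

    def top_terms(doc_tokens, doc_freq, n_docs, k=20):
        """The document's top-k terms by a simple TF.IDF score."""
        tf = Counter(doc_tokens)
        score = {t: f * math.log(n_docs / (1 + doc_freq[t]))
                 for t, f in tf.items()}
        return sorted(score, key=score.get, reverse=True)[:k]

    def association_weights(docs, C=1.0):
        """Weighted-MI association score for pairs of high-TF.IDF terms."""
        n_docs = len(docs)
        doc_freq = Counter(t for doc in docs for t in set(doc))
        term_count, pair_count = Counter(), Counter()
        for doc in docs:
            top = top_terms(doc, doc_freq, n_docs)
            term_count.update(top)
            pair_count.update(combinations(sorted(top), 2))

        weights = {}
        for (a, b), n_ab in pair_count.items():
            if n_ab < 2:
                continue                 # too rare to trust (log2 1 = 0 anyway)
            r = (n_ab / n_docs) / ((term_count[a] / n_docs) *
                                   (term_count[b] / n_docs))
            weights[(a, b)] = C * math.log2(n_ab) * math.log2(r)
        return weights

Expanding a term then amounts to taking its highest-weighted partners, as in the "hostage" list above.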
The data from TREC-2 seem to suggest that low-frequency terms contribute more in term expansion than high-frequency terms, so using a "small" training sample (10 million words is only about 3% of the corpus) was a major error. We made many other mistakes in the training method, including mixing samples from the Federal Register and DOE sources with other texts that are much more likely to be relevant. This leaves a lot of room for future experiments and improvement.

The fully automatic ad hoc system certainly didn't do as well as the manual routing system, but it was still at or above median for more than half of the ad hoc topics. Considering that this method could be used within the context of most any legacy retrieval system, the result is worth noting. Furthermore, the generation of Boolean queries from natural language descriptions is an interesting, as well as practical, research problem, because many different retrieval systems can make some use of Boolean queries.

3 Ranking

In both routing and ad hoc, we used a set of word weights for ranking, acquired using the relevance judgements in the routing case and from the corpus data in the ad hoc case. In routing, the weights reflect the statistical measure of association between the term and each topic (using the weighted mutual information score given earlier). In the ad hoc case, the weight is a function of the frequency of the term in the topic description, the inverse collection frequency, and an additional factor to weight certain components of the topic descriptions (such as the title and description) more heavily than others. We combined the weighted frequency of these terms with an overall count of the number of topic hits per document, normalizing for document length, to produce a score for each document. This was the result of trying many different approaches on the test data, so it was definitely a good method for our system.

However, in comparing our results with those of other systems, our precision curve across various recall points is not nearly as good as that of a system that does really good ranking. In routing, we are not sure that ranking is important, but it is certainly important in getting good results in TREC. So, we are inclined to try to combine our retrieval method with alternative ranking methods to see, for example, whether more terms are really necessary in order to get better ranking results.

The separation of retrieval and ranking seems to be a valuable tool both for experimental research and for identifying different techniques for applications. It is clearly a problem with both TREC-1 and TREC-2 that the routing task requires a comparison of documents across a large collection, when most routing applications deal with a stream of documents individually or in small groups.

4 Analysis of Results

The results raise a number of important issues, especially: why Boolean approximation works as well as it does, particularly why it works for routing; where statistical weighting could help more; what sort of topics this approach does well on (and which topics it does badly on); and other obvious areas for improvement.

One of the most important sources of information about the advantages and disadvantages of each approach comes from comparing the performance of different systems on different topics.
Unfortunately, this is also a very difficult task, because, while it is easy to tell which systems did well on which topic, it is often hard to generalize from that evidence why the approach worked or why it didn't.

As we have mentioned, the Boolean approach is very erratic with respect to performance by topic, as compared with other systems, particularly the statistical methods that emphasize weighting. For example, our manual routing system, which was clearly one of the best systems, had the top performance (in precision at 100 documents) on 8 topics, but was below median on 17 topics (out of 50). In the 11-point averages, that system was below median on 22 topics (more than 40% of the time), although it outperformed most of the systems on average. By contrast, one of the Cornell systems [1] was above median on every topic! This suggests that our approach degrades less gracefully than other approaches, and that it is important to explore Boolean methods as an adjunct to other methods that work in the cases where the Boolean approach seems to fail. Conversely, our system had top or near-top scores on a significant number of topics; it is important to know how to take advantage of this within the context of weighting systems.

There seem to be several different explanations for variation on the topics. First, there are topics, as we have discussed, that are particularly well suited to Boolean methods (and others that are not well suited at all). Second, there are cases where the training method seems to work particularly well. Third, there are cases where the manual approach might work well because there are terms in the topic description that are particularly misleading. Finally, there are many reasons why our approach can fail, particularly on topics with very small numbers of relevant documents and in cases where the topics are very vaguely specified.

One of the topics where GE had the best results was Topic 53, "leveraged buy-outs". The topic description specified that relevant documents had to describe an LBO above $100 million in value, and give the terms of the buy-out. Apparently, the $100 million figure is not important, because most of the LBO's that are reported are major buy-outs. However, the terms (the specification of the dollar amount) are required. Many articles about LBO's do not report dollar amounts. This is similar to the "Airbus subsidies" topic, where many articles that talk about Airbus do not mention subsidies, and they are not relevant. The advantage here seems to be that the hard Boolean outperforms the weighting approaches because weighting, without Booleans, is likely to give an article with many LBO words, but no dollar figures, a high weight, just as it could give a high weight to an article about Airbus that doesn't mention subsidies.

The effect of training seems to help in the "leveraged buy-out" case as well. The training picked up many names of companies involved in buy-outs, like "Safeway" and "Dart", and these were included in the queries. This perhaps helped to separate articles about specific buy-outs from articles about buy-outs in general. A similar effect came about on Topic 92, "international military equipment sales", where the training pulled in the names of many of the weapons typically sold on the international market.

Topic 86, "bank failures", was another topic where the GE system outperformed all others on both the 11-point average and precision at 100 documents.
This result is hard to explain, but the one conspicuous fact about the topic is that our query does not include the word "bank". It does include the names of many prominent banks, so it may be, like the LBO case, that good performance on this topic depends mainly on distinguishing specific references to failures from general discussions about bank failures, for example, the S&L crisis.

On the topics that have very few relevant documents, our approach often failed because, in the absence of training data, it tended to undergenerate; thus very few texts (sometimes none) would match the Boolean query, and other systems with good weighting would pull in more relevant documents. In these cases, a system that finds one relevant document scores much better than a system that finds zero, so the penalty for undergeneration is very high.

The second class of topics where we seem to go wrong is those that are vaguely specified. For example, Topic 74, "policy conflict", is a very hard topic, where the description does not include very much information. Texts rarely mention policy conflict, and, when they do, they are rarely relevant. On the other hand, texts about tobacco policies and health are likely to be relevant. This is probably a case where there is no point in having a Boolean query, and probably systems with the best training and weighting methods do best.

Another important point about the TREC-2 results is that the best automatic systems did significantly better than the best manual systems in the routing task, probably because of the success of the training methods used. This raises an important question with respect to the relative contributions of time spent on query construction versus time spent assembling training data. The volume of training data used in TREC-2, with hundreds of thousands of relevance judgements, is not realistic for most routing scenarios. Thus, while we should continue to rely on good training methods, we should be careful to separate out the effects of training and to develop manual routing methods that work with smaller amounts of training data.

Given that Boolean and manual methods seem to do best on certain topics, and other approaches that emphasize weighting do better on others, it makes sense to combine the best from different approaches. However, this raises the issue of how different results can be incorporated into our model without losing the advantages of the Boolean method, particularly the compatibility with so many existing systems.

5 Future Goals

Many of our results from TREC-2 suggest areas where the method of generating Boolean queries, especially using corpus data, can be substantially improved. There is a great deal of room for future progress, so we believe that this approach will continue to be viable with respect to other routing and retrieval methods.

The main area for research is in continuing to explore new corpus analysis methods. Our corpus analyzer weights single terms. But the Boolean queries depend on combinations of terms, not only in the case of phrases but also to control the effect of ambiguous words. In the context of a given query, a single term can often be roughly comparable to the Boolean AND of two or more other terms. Up to this point, we have not quite been able to automate the process of discovering these relationships in the corpus. This is important for both routing and ad hoc retrieval, but especially for routing.
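The paper leaves this discovery step open; purely as an illustration of one direction it could take, the sketch below ranks candidate term pairs by how closely the paragraphs matching the AND of the pair coincide with the paragraphs matching a single target term, using Jaccard overlap. This is a hypothetical approach under assumed data structures, not a method from the paper.

    # Hypothetical sketch: find term pairs whose Boolean AND behaves like a
    # single target term, scored by Jaccard overlap of matching paragraphs.

    from itertools import combinations

    def and_substitutes(target, candidates, paragraphs, top_n=10):
        """Rank (a, b) pairs by how closely (a AND b) mimics the target."""
        # Paragraphs are token sets; record where each term occurs.
        occurs = {t: {i for i, p in enumerate(paragraphs) if t in p}
                  for t in set(candidates) | {target}}
        tgt = occurs[target]

        scored = []
        for a, b in combinations(candidates, 2):
            both = occurs[a] & occurs[b]    # paragraphs satisfying (a AND b)
            union = both | tgt
            if union:
                scored.append(((a, b), len(both & tgt) / len(union)))
        return sorted(scored, key=lambda x: -x[1])[:top_n]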
Both the routing and ad hoc systems can benefit from the use of new ranking methods, and possibly from exploring hybrid approaches that take advantage of the Boolean method on topics that are well suited to Boolean expression and degrade more gracefully to traditional weighting methods on other topics. In general, the combination of methods is something that merits new experiments.

The routing system could benefit from new training methods. Because the Cornell system did especially well using a training method that produced large numbers of terms and used both positive and negative information, it is possible that this general approach could help our system as well.

The ad hoc system suffers mostly from difficulties in handling the topic descriptions; our method of deriving Boolean expressions from the topic descriptions is still extremely crude, and there are many topics for which the approach produced almost useless queries. The performance of the automatic ad hoc system was even more erratic than that of the manual routing system, but it is very likely that many of the problems can be solved with a lot more work on processing the topic descriptions. Although we are loath to direct research at issues that are particular to the formulation of the TREC topics, this work may be necessary to determine the real power of automatically generating Boolean queries.

Finally, we are interested in exploring many ways that our corpus analysis and query generation component can be combined with other systems. Because we have focused our attention on query content rather than ranking or retrieval models, we believe that our results could quite likely be used within many other retrieval systems. It is natural to look for such synergy. We have had some preliminary collaboration with the UMass team to try to use our queries within the INQUERY system [2], but we still have a long way to go. We will continue to explore such collaborative efforts and to concentrate our own efforts on corpus analysis and building queries.

6 Summary

GE's participation in TREC involved the implementation of a number of strategies for creating Boolean queries from the topic descriptions. A statistical corpus analyzer helped both to refine queries for the routing task and to generate them automatically for the ad hoc task. The simple Boolean retrieval engine performed well, especially in routing. As before, there is tremendous variation in the topic-by-topic results, suggesting that a great deal more research is needed to find how to get the best results in different routing and retrieval scenarios. We are encouraged by the progress of our system, as well as of the overall field, in these experiments, and are hopeful that in the coming years we will learn how to combine our promising results in Boolean approximation and corpus analysis with the more mature ranking and retrieval models of some of the other systems.

References

[1] C. Buckley, J. Allan, and G. Salton. Automatic routing and ad-hoc retrieval using SMART: TREC-2. In Donna Harman, editor, Proceedings of the Second Text Retrieval Conference (TREC-2), Gaithersburg, MD, 1993.

[2] W. Bruce Croft. The University of Massachusetts TIPSTER project. In Donna Harman, editor, Proceedings of the Second Text Retrieval Conference (TREC-2), Gaithersburg, MD, 1993.

[3] P. Jacobs, G. Krupka, L. Rau, M. Mauldin, T. Mitamura, T. Kitani, I. Sider, and L. Childs. The TIPSTER/SHOGUN project.
In Proceedings of the TIPSTER Phase I Final Meeting, September 1993.

[4] Paul S. Jacobs. Using statistical methods to improve knowledge-based news categorization. IEEE Expert, 8(2):13-24, April 1993.

[5] Paul S. Jacobs, George R. Krupka, and Lisa F. Rau. Lexico-semantic pattern matching as a companion to parsing in text understanding. In Fourth DARPA Speech and Natural Language Workshop, pages 337-342, San Mateo, CA, February 1991. Morgan Kaufmann.

[6] George R. Krupka, Paul S. Jacobs, and Lisa F. Rau. A Boolean approximation method for query construction and topic assignment in TREC. In Second Annual Symposium on Document Analysis and Information Retrieval, May 1993.

[7] Lisa F. Rau, George R. Krupka, and Paul S. Jacobs. GE NLToolset: MUC-4 test results and analysis. In Proceedings of the Fourth Message Understanding Conference (MUC-4), San Mateo, CA, June 1992. Morgan Kaufmann Publishers.