TREC-II Routing Experiments with the TRW/Paracel Fast Data Finder

Matt Mettler
TRW Systems Development Division
Redondo Beach, CA

Fritz Nordby
Paracel Inc.
Pasadena, CA

1.0 Introduction

For TREC-II, we were interested in experimenting with improved methods of constructing queries for the Fast Data Finder (FDF) text search coprocessor. We learned from TREC-I that while the pattern matching ability of the FDF can sometimes be put to significant advantage (we had the high score on 8 of the 50 routing topics in TREC-I), this was not sufficient overall to overcome the weaknesses traditionally associated with the boolean approach to text retrieval. Many of the TREC topics are too abstract and ambiguous to respond well to a boolean query formulation. Our goal for this year, therefore, was to apply the FDF hardware to a more statistical or soft boolean retrieval approach while not giving up our ability to make use of specific features or patterns in the text when they are obviously important.

We experimented with two different schemes. In the first scheme, we used subquery proximity to rank hit documents. We developed the subqueries manually, then determined the optimum proximity values by test runs on the training data. The most effective values were then used in the official routing queries. The second scheme was an FDF adaptation of the traditional Information Retrieval (IR) term weighting approach. In addition to single word terms, we also included two and three word phrases, and FDF subqueries designed to detect special features in the text.

While in the terminology of TREC both are examples of manual query formulation with feedback, we believe these techniques can be evolved to create queries automatically from samples of relevant text and also to incorporate user knowledge of specific text features of interest when it exists. We also continue to believe that the use of a hardware accelerator such as the Fast Data Finder enables the implementation of high-performance routing or dissemination applications at a far lower cost than can be achieved with conventional general purpose processors.

2.0 The FDF Text Retrieval Approach

The Fast Data Finder is a hardware device that performs high-speed pattern matching on a stream of 8-bit data. It consists of an array of identical programmable text processing cells connected in series to form a pipeline processor. The cells are implemented using a custom VLSI chip designed and patented by TRW. In the latest implementation, each chip contains 24 processor cells, and a typical system will have 3,600 cells. Each cell can match a single character of query or perform all or part of a logical operation. The processors are interconnected with an 8-bit data path and an approximately 20-bit control path.

To perform a search, a microcode program is first downloaded into the pipeline to direct each processor. The database is then streamed through the pipeline. The data bytes clock through each processor in turn until the whole database has passed through all processors. As the data clocks through, the processors alter the state of the control lines depending on their program and the data stream values. When the pipeline's processor cells detect that a series of database characters matches the desired pattern, a hit is indicated and passed by external circuitry back to the memory of the host processor and to the user.
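To make the pipeline idea concrete, the following is a minimal software model (ours, in Python; it illustrates only the systolic matching concept, not the FDF's actual microcode, control lines, or PSL features): each cell holds one pattern character, and a match-in-progress flag advances one cell per data byte.

    # Minimal conceptual model of a systolic matching pipeline (illustration
    # only; the real FDF cells are programmable and support far more).
    def pipeline_hits(pattern, stream):
        n = len(pattern)
        carry = [True] + [False] * n    # carry[i]: first i pattern chars matched
        for pos, byte in enumerate(stream):
            nxt = [True] + [False] * n
            for i in range(n):
                # Cell i fires when its predecessor matched and it sees its char.
                nxt[i + 1] = carry[i] and (byte == pattern[i])
            carry = nxt
            if carry[n]:                # last cell fired: report a hit
                yield pos - n + 1       # start offset of the match

    print(list(pipeline_hits("air", "airbus air")))   # -> [0, 7]

The point of the hardware is that all cells evaluate in parallel on every byte, so throughput is constant regardless of how many patterns are loaded; the sequential loop above is only a functional stand-in.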
The FDF pipeline runs at a constant speed as it performs character comparisons and logical operations, regardless of query complexity. The queries or patterns are specified in the FDF's Pattern Specification Language (PSL). The hardware directly supports all the features of the PSL query language without the need for software post-processing. The processors in the pipeline may all be used to evaluate a single large query or may be assigned to evaluate numerous smaller queries. The number of pipeline cells a query needs is proportional to the size of the query. PSL provides numerous search functions, which may be nested in any combination, including:

* Boolean logic, including negative conditions
* Proximity on any arbitrary pattern
* Wildcards and "don't cares" anywhere in the word
* Character alternation
* Term counting, thresholds, and sets
* Error tolerance (fuzzy matching)
* Term weighting
* Numeric ranges

The Fast Data Finder was originally designed and developed at TRW. In 1992, TRW licensed the FDF technology to Paracel Inc., which now sells a commercial product called the FDF-3.

3.0 Proximity Query Generation

Our first set of experiments revolved around the use of subquery proximity to rank hit documents. We began with a simple observation: topics are often a conjunction of ideas or concepts. For example, Topic 51, Airbus Subsidies, is a conjunction of the idea "Airbus" (a particular aircraft manufacturer and European consortium) and the idea "subsidy" (in particular, subsidies from the nations belonging to that consortium). Other articles about Airbus Industrie (new planes, fly-by-wire in the A320, accident reports, financial health reports, etc.) are not relevant to this topic; nor are articles which describe subsidies not directed toward Airbus.

Traditional IR term weighting techniques do not give any explicit benefit to articles which conjoin ideas. Articles which include terms relevant to each of the component sub-topics will receive high scores; but so will articles which include many terms relevant to only one sub-topic. Recent efforts implicitly include conjunctions through the use of phrases as terms in otherwise traditional statistical methods. An alternative is the use of boolean operators. This has the desired effect -- an AND of terms forces a conjunction -- but the use of booleans in IR has been viewed with some skepticism and disfavor. Boolean operators often find a conjunction of terms where none truly exists (for example, Airbus and subsidies might be mentioned in two separate and unrelated portions of an article); or, if made sufficiently restrictive to eliminate spurious matches, boolean-based searches often miss relevant articles.

We have followed an approach which incorporates both ideas. Rather than focus on specific phrases, we search for terms in proximity to one another. The terms in the query are chosen to represent each of the constituent sub-topics, just as in a boolean search. The specificity of the query is adjusted by varying the required proximity of the terms. Thus, for Airbus subsidies we might search for terms representing "Airbus" in a range of proximities to terms representing "subsidies."

This approach allows conjunctions to be graded. A small proximity restriction (say, 3 words) yields results similar to a keyphrase search, indicating that the two concepts are indeed associated in the article and that the article is relevant to the topic. A large proximity restriction (1 article) is analogous to a simple boolean keyword search and retrieves articles in which the concept terms may be only loosely associated. Intermediate proximities (1 sentence, 1 paragraph, etc.) indicate intermediate degrees of association and intermediate recall/precision trade-offs. A sketch of this grading idea appears below.
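As an illustration of graded conjunctions, the following sketch (ours, in Python; the term lists, window sizes, and score values are invented for the example and are not our official Topic 51 query) scores a document by the tightest window in which the two concepts co-occur:

    # Proximity-graded scoring sketch: tighter co-occurrence windows imply
    # stronger association between the two concepts, so they score higher.
    AIRBUS  = {"airbus", "a320", "consortium"}     # concept 1 terms (examples)
    SUBSIDY = {"subsidy", "subsidies", "loan"}     # concept 2 terms (examples)
    WINDOWS = [(3, 4), (20, 3), (100, 2), (None, 1)]   # (word window, score)

    def proximity_score(text):
        words = text.lower().split()
        pos1 = [i for i, w in enumerate(words) if w in AIRBUS]
        pos2 = [i for i, w in enumerate(words) if w in SUBSIDY]
        if not pos1 or not pos2:
            return 0                                   # no conjunction at all
        gap = min(abs(i - j) for i in pos1 for j in pos2)
        for window, score in WINDOWS:
            if window is None or gap <= window:
                return score               # tightest satisfied window wins
        return 0

Here the unbounded window plays the role of the 1-article proximity; adding sentence and paragraph windows would fill in the intermediate grades.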
It is also possible to use multiple proximities in a single query with this method, or to use proximities and occurrence frequencies together, to form multi-dimensional arrays of query parameters. For example, for Topic 62, Military Coups D'etat, the number of conjunctions was traded off against the proximity of the conjunction to form a two-dimensional query set.

For the initial experiment, lists of synonyms representative of each idea in a topic were manually built, and one- or two-dimensional query sets were built from these lists. These queries were then run against the training database, and after some feedback, the query sets were finalized. Each finalized query set was run against the training database to determine a ranking of the queries based solely on selectivity. Table I shows a sample proximity query, and Table III shows our TREC-II results. The numbers of relevant documents retrieved by the proximity method queries are labeled TRW1.

4.0 Statistical Query Generation

Our second set of experiments revolved around the use of term weighting. Our basic approach was to follow the well-researched path of generating term weights proportional to the occurrence of words in a sample of relevant text and inversely proportional to the occurrence of words in the database as a whole. Since we were doing the routing topics, we gathered statistics on Volume II and hoped that they would be valid for Volume III. For sample relevant documents we used the NIST TREC-I relevant documents from Volume II. We did not use the topic narratives or descriptions. In addition to single word terms, we also considered two and three word phrases. We used the FDF itself to scan the training corpus and determine the phrase frequencies of interest.

To adapt this standard approach to the FDF, we needed to make three algorithmic modifications. First, we needed to adapt to the limitations imposed by the FDF hardware. While extremely effective for pattern matching, the FDF is not a general purpose computer. While the FDF processor cells can perform basic addition and subtraction, the datapath available to accumulate an aggregate score for a document is limited to 8 or 9 bits. Thus we had to restrict the term weights (and the range of their sums) to integer values between 0 and 255 or 0 and 511. This had the effect of truncating most topics' query lists at 10-20 terms (words, phrases, or special features). We also excluded terms from our queries that did not appear in at least 30% of the relevant sample documents. A sketch of this weighting procedure appears below.
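The following sketch (our Python reconstruction; the exact integer quantization rule used in the FDF microcode is an assumption here) shows the shape of the procedure: drop terms below the 30% sample floor, weight the rest by sample frequency over database frequency, and scale to small integers so that document scores fit the 8- or 9-bit accumulator:

    # Sketch of the term-weighting procedure described above.  doc_count and
    # db_count map each term (word, phrase, or special feature) to the number
    # of relevant-sample / database documents containing it.
    def build_psl_weights(doc_count, db_count, sample_size, max_sum=255):
        weights = {}
        for term, dc in doc_count.items():
            if dc < 0.30 * sample_size:        # 30% relevant-sample floor
                continue
            weights[term] = dc / db_count[term]    # Table II's Weight column
        # Quantize so a document matching every term still fits the datapath.
        scale = max_sum / sum(weights.values())
        return {t: max(1, int(w * scale))      # integer PslWeight coefficients
                for t, w in weights.items()}

Second, we were striving not to give up the strengths of the FDF's pattern matching capabilities to pinpoint special features in the text which have a large impact on document relevance. We manually reviewed the topics and prepared special feature subqueries in an attempt to increase the precision for particular topics. For Topic 59, Weather Related Fatalities, we manually prepared a special feature subquery to detect phrases detailing a numeric value of people killed. We determined the frequency of each special feature, both in the sample relevant documents and in the training database as a whole, and simply added these into the word list as if they were regular single word terms.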
In some instances our manually prepared subqueries jumped to the top of the list of statistically relevant terms for a topic; in others they did not.

Third, we observed that some topics had particular words, phrases, or special features that were present in almost all relevant documents. We converted terms that occurred in more than 90% of the relevant documents to boolean ANDs in our queries. This was intended to improve precision for topics like Topic 62, Military Coups D'etat. The topic narrative specifically stated that the country involved must be named. One of our special feature subqueries was a list of known foreign country names. While of no statistical significance as a term, this subquery did hit on almost every sample document. We thus ANDed it into the query as a required boolean term, as sketched below.
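A minimal sketch of this promotion rule (ours, in Python; the 90% threshold is from the text, while the function itself is hypothetical):

    # Terms present in more than 90% of the relevant sample become required
    # boolean ANDs; everything else stays in the weighted term list.
    def split_required(doc_count, sample_size, threshold=0.90):
        required = [t for t, dc in doc_count.items()
                    if dc > threshold * sample_size]
        weighted = {t: dc for t, dc in doc_count.items()
                    if t not in required}
        return required, weighted    # required terms are ANDed into the query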
Table II shows a sample statistical query. DocCount is the number of documents in the sample that included the term, phrase, or special feature. DbCount is the number of documents in our training database that included the term. Weight is DocCount divided by DbCount. PslWeight is the integer coefficient based on the Weight. The relevant documents retrieved by the statistically generated queries are labeled TRW2 in Table III.

TABLE I - Sample Proximity Query

Query template for Topic 75, Automation

/* Concept 1: automation */
define x_auto
    'automat[e|ion|ed|es|ing]'
end

/* Concept 2: change in economics */
define x_up_dn
    '[increas|decreas][e|es|ed|ing]' |
    'rais[e|es|ed|ing]' |
    'ris[e|es|ing]' |
    'drop[|s|ped]' |
    'dip[|s|ping]' |
    'expan[d|ds|ded|sion[|s]]' |
    '...[e|es|ed|ing]' |
    'lower[|s|ed|ing]' |
    'fall[s|ing]' |
    'hike[|s|d]' |
    'cut[|s|ting]' |
    'reduc[e|es|ed|tion[|s]]'
end

define x_econ
    'cost[|s]' |
    'payroll*' |
    {3 words -> 'work' and 'force'}
end

/*
 * Query template used for Topic 75
 */
(prox1) -> @x_auto and (prox2) -> @x_up_dn and @x_econ

TABLE II - Sample Statistical Query

Term weighting table for Topic 93, NRA Political Backing
Doc sample size is 150

Term               DocCount    DbCount      Weight   PslWeight
nra                    87.0      130.0  0.66923077          62
national_rifle_a      145.0      255.0  0.56862745          53
assault_weapons        51.0      161.0  0.31677019          30
semiautomatic          68.0      332.0  0.20481928          19
gun_control            73.0      387.0  0.18863049          18
handguns               51.0      319.0  0.15987461          15
rifle                 146.0     1336.0  0.10928144          10
handgun                52.0      550.0  0.09454545           9
firearms               63.0      811.0  0.07768187           7
rifles                 46.0     1318.0  0.03490137           3
gun                   126.0     3866.0  0.03259183           3
guns                   96.0     3020.0  0.03178808           3
law_enforcement        51.0     2509.0  0.02032682           2
assault                67.0     3491.0  0.01919221           2
ban                    91.0     5510.0  0.01651543           2
enforcement            54.0     4825.0  0.01119171           1
weapons                99.0     8993.0  0.01100856           1
association           148.0    16780.0  0.00882002           1
legislation            64.0     7984.0  0.00801603           1
laws                   45.0     7662.0  0.00587314           1

TABLE III - TRW/Paracel TREC-II Routing Results

                 -- Relevant Retr @100 --   TRW1   TRW2
Qry  NIST Rel.    Best    Median    Worst    Rel    Rel   Description
 51      11         11       10        0      10     10   Airbus Subsidies
 52     454        100       97       31      99     99   S. Africa Sanctions
 53     154         84       50        1      52     55   Leveraged Buyouts
 54     124         65       57        0      21     47   Sat. Launch Contracts
 55     320        100       87        0      54     96   Insider Trading
 56     395         96       88        3      89     96   Prime Rate Moves
 57     319         91       73        0      61     91   MCI
 58      76         62       39        4      49     28   Rail Strikes
 59     574         97       80        1      95     95   Weather Fatalities
 60      18          9        4        0       2      3   Merit-Pay vs. Seniority
 61      67         60       24        2       7     24   Israel & Iran-Contra
 62     426         92       75        9      82     81   Military Coups D'etat
 63      74         40       24        0      24     29   Machine Translation
 64     282         68       53        1      29     57   Hostage-Taking
 65     214         55       29        0       2     46   Info Retrieval Systems
 66      86         40       19        0      20     23   NLP
 67     365         64       52        0      10     61   Civil Disturbances
 68      76         61       38        0      58     55   Health Hazards
 69       1          1        0        0       0      0   Reviving SALT II Treaty
 70      34         32       28        0      31     32   Surrogate Motherhood
 71     300         73       35        1      21     32   Border Incursions
 72      91         43       31        0      17     30   Demographic Shifts/U.S.
 73     355         74       49        0      23     74   Demographic Shifts/World
 74     323         60       32        5       6     35   Conflicting Policies
 75     372         59       28        1      59     31   Automation
 76     163         34       24        0      11     34   Original Intent
 77      85         49       36        0      36     40   Poaching
 78      83         66       58        0      40     57   Greenpeace
 79     341         85       75        0      81     77   FRG Party Positions
 80     143         54        8        0      54      8   Candidate Platforms
 81       4          4        2        0       1      2   PTL Fallout
 82     203         89       65        0      30     87   Genetic Engineering
 83     235         64       31        2      54     25   Protect the Atmosphere
 84     101         27       16        0      22     14   Alternative Energy
 85     670         88       69       38      59     84   Official Corruption
 86      40         27       16        0      19     12   Bank Failures
 87     151         75       42        2      42     36   S&L Prosecutions
 88      32         29       21        0      11     26   Crude Oil Price Trends
 89      17          8        2        0       3      7   OPEC Investments
 90      75         39       10        0      39      9   Oil & Gas Reserves
 91       9          6        3        0       5      5   Acq of Advanced Weapons
 92      27         17        6        0       7      9   Military Equip Sales
 93      94         65       61        9      61     62   NRA
 94     300         76       47        0      57     62   Computer Crime
 95     359         77       45        1      63     40   Computer Crime Detection
 96     310         86       42        2      23     49   Computer Medical Diag
 97     319         59       21        0      20     41   Fiber Optics Appl
 98     722         96       67        0      83     84   Fiber Optics Manuf
 99     291         99       92       10      91     89   Iran-Contra
100     204         94       84        0      52     89   Controlling High Tech

5.0 Analysis of Results

The results from our two TREC runs (Table III) are summarized below.

Run    High   Above Med   Median   Below Med   Low   11-pt avg
TRW1     3        19         6         22       0      0.2525
TRW2     5        27         5         13       0      0.3459

The proximity queries (TRW1) scored at or above the median on 28 topics (including three topics which achieved the best score) and below the median for 22 topics. While not bad, our proximity queries did not perform as well as we had hoped. Where the proximity queries did poorly, we attribute this primarily to poor term selection. One such case was Topic 65, Information Retrieval Systems. We made two errors: first, due to an oversight, one of the manually entered query terms was overly broad; second, the query author considered database systems to be "information retrieval systems." We feel that this is a fault of the query formulation, not of the assessments. If we had checked the training assessments, we would have eliminated these terms from our query. Any system which relies solely on the topic statement will run afoul of this problem. Systems which make use of user-supplied relevance information will achieve better performance. Our results again demonstrate the inherent problems with basically boolean query formulations. Our efforts at using proximity to soften the boolean were not sufficient to overcome this weakness.

The statistical queries (TRW2) did much better, scoring at or above the median on 37 of the topics. The 11-point average and total relevant documents retrieved figures were excellent and are close to those of the best academic groups. The adaptations required to run the statistical queries on the FDF hardware evidently did not hurt performance. We again observed that the dominant factor in achieving good performance is proper term selection. The details of the term weight calculations did not seem to make much difference except to influence which terms were selected.
We tried a number of different schemes for generating term weights, including using various statistical parameters, log-weighted coefficients, and converting terms present in all sample documents to boolean ANDs in the query. For a given set of terms, we did not find much difference in performance between these schemes. The scheme used for TRW2 was one of the simpler ones we tried.

We were also interested in evaluating the use of phrases and special features as additional terms in our statistical queries. They seemed to help, but not dramatically. This was a disappointment. Looking at the results topic by topic, however, we observed a lot of variation. For some topics, the addition of a key phrase or special feature helped a great deal. This indicates that the use of phrases and special features has promise for improving performance, but that we have not yet learned how and when to employ them. For example, our term weighting scheme this year did not account for term interdependence. Particularly when we start mixing single word terms with phrases and special features that contain those same terms, it would seem the algorithm could be improved by explicitly accounting for this redundancy.

6.0 Conclusions and Future Plans

Overall, our results for TREC-II are encouraging. We were successful in adapting the FDF hardware to running a soft statistical information retrieval algorithm. Not only did the FDF run all the final searches with no significant post-processing, we also found it a useful tool for experimenting with different proximity windows and for deriving database frequencies for phrases and special features.

We plan to continue our work and to participate in TREC-III next year. The lessons we learned from TREC-I have proven very valuable, and so too should the lessons of TREC-II. For TREC-III, we will continue to examine the range of available query formulation, execution, and evaluation models, with an eye to adapting those models for use with the FDF search hardware. We are looking at methods to use the raw horsepower of the FDF to expand existing techniques in ways which had previously been considered too computationally expensive.

We are also struck by the observation that different query generation techniques are effective on different topics. A system that could execute a range of methods might be able to match the query generation approach to the topic type. We plan to consider possible schemes for characterizing the topic in advance so that the system might be able to use the best method.