Text Retrieval with the TRW Fast Data Finder

Matt Mettler
TRW Systems Development Division
One Space Park R2/2194
Redondo Beach, CA 90278
(310) 814-4925
mat@wilbur.coyote.trw.com

TRW has been building high performance text processing and retrieval systems for a number of years. Most of these systems have involved the application of the TRW Fast Data Finder (FDF) text search hardware and have been designed to meet the requirements of specific government customers. Our goal for the TREC conference has been to consider and experiment with the FDF as a tool for more general purpose information retrieval, and to determine the FDF's strengths and weaknesses compared to conventional information retrieval techniques.

1.0 Introduction

Our experience with the TREC conference has left us encouraged about the ability of a text scanning approach to be competitive with more involved information retrieval techniques. The inherent limitations of the FDF hardware do not prevent competitive precision and recall for general information retrieval applications when the user topics are properly understood and the topic queries are properly tuned to the dataset.

2.0 FDF Text Retrieval Approach

The Fast Data Finder is a hardware device that performs high-speed pattern matching on a stream of 8-bit data. It consists of an array of identical programmable text processing cells connected in series to form a pipeline processor. The cells are implemented using a custom VLSI chip designed and patented by TRW. In the latest implementation, each chip contains 24 processor cells, and a typical system will have 3,600 cells. Each cell can match a single character of a query or perform all or part of a logical operation. The processors are interconnected with an 8-bit data path and an approximately 20-bit control path.

To perform a search, a microcode program is first downloaded into the pipeline to direct each processor. The database is then streamed through the pipeline. The data bytes clock through each processor in turn until the whole database has passed through all processors. As the data is clocking through, the processors alter the state of the control lines depending on their program and the data stream values. When the pipeline's processor cells detect that a series of database characters matching the desired pattern has passed by, a hit is indicated and passed by external circuitry back to the memory of the host processor and to the user.

The FDF pipeline runs at a constant speed as it performs character comparisons and logical operations, regardless of query complexity. The system we used for the TREC conference searched at 10 MB/sec.

The queries, or patterns, are specified in the FDF's Pattern Specification Language (PSL). The hardware directly supports all the features of the PSL query language without the need for software post-processing. The processors in the pipeline may all be used to evaluate a single large query or may be assigned to evaluate numerous smaller queries. The number of pipeline cells a query needs is proportional to the size of the query.
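To make the streaming model concrete, the following is a minimal software sketch of the pipeline idea in Python. It is our illustration only: the function name, the one-character-per-cell simplification, and the sample data are ours, and the real FDF implements this in microcoded VLSI hardware rather than software.

# A software sketch of the FDF's systolic matching idea: each cell
# holds one pattern character; an "active" signal propagates down the
# pipeline one cell per input byte, so a hit emerges only after every
# cell has matched in sequence. Cell count grows with pattern size,
# while throughput stays one byte per clock regardless of cell count.

def pipeline_search(pattern: str, stream: bytes):
    """Yield end offsets in `stream` where `pattern` matches."""
    cells = pattern.encode()            # one pipeline cell per pattern byte
    active = [False] * len(cells)       # control lines between cells
    for pos, byte in enumerate(stream): # the database clocks through
        # Each cell compares the current byte and passes its result on;
        # iterate from the last cell so signals move one stage per clock.
        for i in range(len(cells) - 1, -1, -1):
            upstream = active[i - 1] if i > 0 else True
            active[i] = upstream and (byte == cells[i])
        if active[-1]:                  # the last cell raised the hit line
            yield pos + 1               # report offset just past the match

if __name__ == "__main__":
    data = b"the fast data finder scans data streams"
    print(list(pipeline_search("data", data)))  # -> [13, 31]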
PSL provides numerous search functions, which may be nested in any combination, including:

* Boolean logic, including negative conditions
* Proximity on any arbitrary pattern
* Wildcards and "don't cares"
* Character alternation
* Term counting, thresholds, and sets
* Error tolerance (fuzzy matching)
* Term weighting
* Numeric ranges

2.1 Advantages and Disadvantages of Hardware Scanning

There are four principal advantages to using a hardware scanning approach for information retrieval. First, the FDF can perform pattern matching functions much faster and more cost-effectively than a general purpose CPU. This benefit comes in part from the parallelism of the FDF architecture. Second, a hardware scanner like the FDF can begin processing the data immediately upon its receipt. There is no need to wait for the data to be preprocessed or indexed before it can be searched. This is especially important for dissemination (routing) applications. Third, no extra disk space is needed to store inverted index data or other vector data beyond the text itself. Finally, the system's response time in evaluating a query is independent of the query's complexity and thus easily predictable. The FDF can perform fuzzy pattern matching on a term like "krasnoyarsk" with three missing, incorrect, or extra characters as easily as performing an exact string match.

There are two disadvantages to the FDF scanning approach. First, it is moderately expensive to buy the hardware and adapt application programs to work with it. The approach is therefore not cost effective for low-end applications or for systems that don't do significant amounts of text processing. Second, since the search is not complete until the entire database has been scanned, the time to complete a single simple query will be greater than with indexing methods.

3.0 Query Generation Approach

Our objective for the TREC conference was to see if we could utilize the pattern matching power of the FDF to achieve superior recall and precision. This in turn revolved around how well we could construct the queries for the FDF to execute. We were able to identify at least three possible approaches that could be used in preparing queries for execution by the Fast Data Finder:

* (i) Parse the topic narratives, extract key terms and phrases, expand the terms where possible, and generate queries to find documents with the same combinations of terms.

* (ii) Take a sample of relevant documents, extract common keywords and phrases, especially those that occur multiple times, and generate queries to find documents with at least some of the same phrases and keywords within a sliding window of text about the size of a paragraph.

* (iii) Construct the initial queries manually and refine them iteratively.

We elected to try both methods (ii) and (iii). To supply the relevant documents for the statistical trials, we used the sample relevance judgments supplied by NIST in late May and early June.

3.1 Automatic Query Generation

Our plan was to take sample documents for a particular topic, merge them together, and build a PSL query that would find similar documents. Using the single document WSJ870320-0062 as a seed, the query would be something like:

{30 words -> 5+ ('cola'; 'coca'; 'coca cola'; 'bottling'; 'enterprises';
                 'cola bottling'; 'cola enterprises'; 'coca cola enterprises';
                 'coca cola bottling'; 'atlanta')}

This query finds a document which contains a 30-word sliding window with 5 or more of the specified terms or phrases. The term list is determined by removing stopwords and counting the number of occurrences of each term, 2-word phrase, and 3-word phrase in the seed document. The 10 terms/phrases with the highest counts are selected. The "30 words" and "5 or more" values were selected arbitrarily, and we had planned to run a series of trials to determine the optimal values.
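This procedure lends itself to a compact software model. The sketch below is our own Python approximation of the two steps just described (the stopword list, seed text, and matching details are stand-ins for illustration, not the actual processing of WSJ870320-0062):

# Sketch of automatic query construction: select the top-counted
# unigrams/bigrams/trigrams from a seed document, then test whether a
# document has a sliding word window containing enough of those terms.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "its", "said", "will"}

def top_terms(seed: str, k: int = 10) -> list[str]:
    """Top-k 1/2/3-word grams by count, stopwords removed first."""
    words = [w for w in re.findall(r"[a-z0-9]+", seed.lower())
             if w not in STOPWORDS]
    grams = Counter()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return [g for g, _ in grams.most_common(k)]

def window_match(doc: str, terms, window: int = 30, threshold: int = 5) -> bool:
    """Model of {window words -> threshold+ (terms)}: does some
    window-word span of `doc` contain >= threshold of the terms?"""
    words = re.findall(r"[a-z0-9]+", doc.lower())
    for i in range(max(1, len(words) - window + 1)):
        span = " ".join(words[i:i + window])
        if sum(t in span for t in terms) >= threshold:
            return True
    return False

seed = ("Coca Cola Enterprises, the Atlanta bottling unit of Coca Cola, "
        "said Coca Cola Enterprises will expand its cola bottling lines.")
terms = top_terms(seed)
print(terms)
print(window_match(seed, terms))  # a seed matches its own query -> True

This is only an approximation of the PSL semantics; for example, substring tests slightly overcount (a phrase can match across the seam of two words), which the hardware pattern matcher would not do.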
The initial experiments with this method of query construction were not encouraging. We ran into three difficulties.

* The May/June NIST sample relevance judgments seemed incomplete and inaccurate and were not giving us the statistical base we'd hoped for.

* This method assumes that the whole document is all on one subject. Longer seed documents were contributing terms that had little to do with the topic. Some method to segment the documents and indicate the interesting section is required.

* This method wasn't capturing the subtlety of the topics. The query shown above does an excellent job of finding documents about Coca Cola Enterprises or bottling units in Atlanta, but completely misses the part about antitrust violations because that is only mentioned once in the article.

3.2 Manual Query Generation

The poor results from our initial trials with statistical query generation led us to fall back on a purely manual (with feedback) approach. We extracted key concepts from the topic description and added additional terms from outside knowledge or by observing them in database documents. In building our multiple queries to provide a coarse grain ranking, we favored documents where the subqueries matched in lead sentences or paragraphs. We mostly ignored the May/June NIST sample judgments.

Refinement of the queries was done manually by executing them, reviewing the results, and modifying the queries. The easier topics required only a few iterations, while on some of the more difficult topics we iterated several dozen times. We stopped working on a topic when it seemed that the results were converging to a practical limit for our approach, i.e., when adding additional synonym keywords or altering the query structure wasn't producing more reasonable results.

4.0 Results and Analysis

Table I shows our results for the TREC routing queries. Since our system doesn't really rank the retrieved documents, we think the Table I presentation is more representative of our performance than the 11-point averages. The first and last columns are the topic number and description. The second column, "# Rel", is the number of relevant documents as judged by NIST in the Volume II corpus. The next three columns give an indication of how the field did on the topic. "TRW Rel" is the number of relevant documents we submitted out of the "TRW Submit" we sent in for each topic.

TABLE I - TRW TREC Results for the Routing Topics

             Relevant Retr. in 200     TRW   TRW
Qry  # Rel   Best   Median   Worst     Rel   Submit   Description
01    216     62      30       0        49    200     Pending Antitrust
02    384     72      43       1        72    200     Foreign Acquisitions
03    431    167      84       8       161    200     US-Japan Joint Vent.
04     48     33      18       2        10     49     Debt Rescheduling
05    150    116      38      10        80    141     Japanese Dumping
06    137     78      45      15        44    200     Debt Relief
07    169     87      63       1        63    200     US Budget Deficit
08    159     43      18       3        28     72     Economic Projection
09    638    117      87       8        91    200     Candidate Sightings
10    233    153     110      15       153    188     AIDS Treatments
11    196     89      52       7        35    128     Space Program
12    262    103      54       4        75    200     Water Pollution
13    112    111      46       5       111    113     Mitsubishi Heavy
14    203     85      48       0        46     56     Drug Approval
15    624    114      80      17        89    200     CEOs
16     88     44      24       1         1     13     Mkt Agrochemicals
17    303    154      81       0        33     58     Agrochemical Cntls
18    147     61      31       2        41    147     Japan Stock Trends
19    985    161     102      74        85    140     Global Stock Trends
20    403    178     124       5       161    178     Patent Infringement
21     47     44      35       0        12     29     Superconductors
22    466    162     120      14        32     45     Counternarcotics
23    100     74      41       5        37     54     Legal Problems
24    345    113      59      11        11     21     Medical Technology
25     71     34      14       0        15     36     Chernobyl Effects
26    313    122      49       1        47    122     Multimedia Stds
27    232    109      91      10        80    200     AI in Business
28    332     89      47      14        89    200     ATT in Comp/Comm
29    142     79      13       0        79    200     Foreign Acq ATT Tech
30    269     92      48      17        57     88     OS/2 Problems
31    156     66      31       0        36    200     OS/2 Advantages
32    119     52      15       0         6     12     Outsourcing
33    462    147      71      17        71    200     Doc Mngt Capable
34    303    129     104       6       107    200     ISDN Entities
35    270    139     113       0        98    200     Postscript Alts
36    158    110      50       0        10     11     Optical Disk Tech
37    409    189     158      19       158    200     SAA Components
38    810    169     120      37        98    156     Mini/Main Roles
39    501    184     117      24       142    156     Client-Server Plans
40    800    150     121      16        87    200     IS Impact on Orgs
41    144     34      11       0        31    200     Comp/Comm Upgrade
42    696    131      92      10        92    112     End User Computing
43    125      -       -       -         -      -     AI Conferences
44    241    105      35       1       105    200     Layoffs at Companies
45    304    103      71       0       103    200     CASE Succeed/Fail
46     51     40      31       9        30    200     Virus Outbreaks
47    237     80      35       2        80    200     Contract > $1 mil
48    189     48      28       2        17     40     Purch Comm Equip
49    139     65      56       4        44    131     Who's in Supercomp
50     26     12       1       0         4      5     Virtual Reality

Our scores are summarized as follows:

High   Above Med   Median   Below Med   Low
  8        15         4         19        2

Unlike most of the TREC participants, we did not submit the full 200 allowed documents for each topic. This turned out to be a major blunder, because the TREC scoring procedure did not reward this self restraint. Many of our queries were too restrictive, achieving high precision at the expense of recall, and of a good score for the conference. This problem comes about because of the binary nature of the FDF's evaluation of a query against a document. To operate properly against a routing data stream, it is necessary to execute several queries for each topic, with each successive query aiming for higher recall. When our queries were "tuned" properly, the results were quite good.
Considering only those queries where we made the full submission, the distribution is well above the median:

High   Above Med   Median   Below Med   Low
  6         8         3          5        0

Our analysis shows that we did well on topics where the ability to find phrases, acronyms, numbers, and alphanumerics was important. We had the high score on topics 28 and 29, both of which involved finding references to AT&T. Since we retain and scan the full data stream, we didn't have to worry about an indexing parser splitting "AT&T" into "AT" and "T" and then throwing them both away. Our PSL subquery to find AT&T was:

define ATT
"[AT\&[|amp\;]T|AT and T|American Telephone [and|\&] Telegraph]"
end

The "[|amp\;]" notation means to allow an optional "amp;", as was present in the Ziff database.

We had the high score on topic 10, "AIDS Treatments". This may be due to our ability to easily find phrases like "acquired immune deficiency syndrome" or "AIDS related complex" in close proximity to drug names like "THA", "5-fluorouracil", or "AZT".

We had the high score on topic 13, finding documents about Mitsubishi Heavy Industries. Our query, which found 111 of the 112 documents NIST judged relevant, was simply to find the two-word phrase "Mitsubishi Heavy". Apparently the other TREC participants had trouble either finding phrases or determining the need to find phrases during query generation.
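For readers more familiar with conventional tools, the ATT definition above maps loosely onto a regular expression. The following is our sketch of an approximate equivalent, not the PSL semantics themselves:

# Rough regex rendering of the PSL "ATT" alternation (our approximation).
import re

ATT = re.compile(
    r"AT&(?:amp;)?T"                             # AT&T, plus AT&amp;T as in Ziff
    r"|AT and T"                                 # spelled-out conjunction
    r"|American Telephone (?:and|&) Telegraph"   # full company name
)

for line in ["AT&amp;T posted earnings",
             "American Telephone and Telegraph said",
             "that AT&T unit"]:
    print(bool(ATT.search(line)))  # -> True, True, True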
The following sections discuss in detail topic 47, where we achieved the high score, and topic 36, where we achieved a low score.

4.1 Example of Good Performance - Topic 47

Topic 47 was to find documents discussing new contracts for computer systems in excess of $1 million. We found 80 good documents out of 200 submitted, the high score for this topic. We believe we did well on this topic because we were able to look for various numeric representations of $1 million in close proximity to keywords for new contracts and computer systems. To be relevant, a document must identify the selection of a source for the development or delivery of information systems products or services valued at more than $1 million.

The PSL query for this topic used three subqueries: one each for the "selection of a source", the "information systems products or services", and the "more than $1 million". The PSL definitions were:

define award
{3 words -> "[sign|award]*" and "contract"}
end

define computer
"[computer|communic|network|phone|telecomm|mainframe|Starlan|PBX|cyber|IBM 3090|X\-MP|Y\-MP|SCS\-40|information]"
end

define million
{1 word -> "[million|billion]" and "dollar"}
or "\$[ |][|[0-9]][|[0-9]][0-9][|\.][|[0-9]][|[0-9]][|[0-9]] [million|billion]"
or "\$[ |][|[0-9]][|[0-9]][0-9][|\,][0-9][0-9][0-9][|\,][0-9][0-9][0-9]"
end

The "award" definition requires the root words "sign" or "award" to be within 3 words of "contract" in the text. This word count includes stop words, acronyms, or any other alphanumerics that were in the original text. This definition will find phrases like:

a contract was awarded
AT&T signed a new contract
Bellcore was awarded three new contracts

The "computer" definition looks for any of the root terms shown. Note that looking for alphanumerics such as "X-MP" or "IBM 3090", which may include multi-character white space, is no problem. The "million" subquery uses proximity and an alphanumeric sequence pattern and will find items like the following:

a million dollar contract
a $2.3 million system
a $ 12 billion program
a $2000000 machine
a $ 2,000,000 machine

Note that the phrase "a 2,000,000 dollar award" would not be found by this definition. This was an oversight.

The winning query was then simply:

{50 words -> award and computer and million}

This finds documents which contain a 50-word sliding window in which all three subqueries match. Note how the "award" subquery, which uses a 3-word sliding window, can be nested inside a query using a 50-word sliding window.

4.2 Example of Bad Performance - Topic 36

Topic 36 was to find documents discussing how rewritable optical disks work. To be relevant, a document must describe how rewritable optical disk technology works at length and in significant and comprehensive technical detail. This topic was particularly challenging because the topic narrative describes attributes the documents must have rather than specific concepts or keywords.

We started by defining a subquery to find documents mentioning rewritable optical disks:

define optical_disk
{10 words -> "rewrit" and "optical [disk|drive|technolog]"}
end

To find documents that describe the technology "at length", we wrote a subquery to find places where there were at least 5,000 characters between the optical_disk match and the end-of-text marker:

define LONG
{5000 characters -> optical_disk and no TEXTEND}
end

To find documents that contained "significant and comprehensive technical detail", we manually extracted a list of keywords (Table II) and required the documents to have at least 10 or more of these terms present.
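The "N or more technical terms" requirement is easy to model in conventional software. Below is a minimal Python sketch (our illustration; the term list is a small sample in the spirit of Table II, and the real entries are PSL patterns rather than plain strings):

# Minimal model of the technical-term threshold: count how many
# distinct terms from the list occur at least once in a document.
def tech_term_count(text: str, terms: list[str]) -> int:
    """Number of distinct terms that appear in `text` (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for t in terms if t.lower() in lowered)

terms = ["amorphous", "Kerr effect", "Curie temperature", "gadolinium",
         "magneto-optical", "laser", "substrate", "thin film"]
doc = ("The magneto-optical drive heats a spot on the substrate past "
       "its Curie temperature with a laser...")
print(tech_term_count(doc, terms))  # -> 4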
The tightest query (intended for high precision) was:

{1 document -> optical_disk and LONG and 30+}

The loosest (intended for high recall) was:

{1 document -> optical_disk and 10+}

TABLE II - Subterms used for Topic 36

"amorphous"; "[ISO|CCITT]"; "Kerr effect"; "SCSI"; "bias"; "binary";
"capacity"; "chemical"; "states"; "cycles"; "density"; "spatial";
"High Sierra"; "dye[ |\-]polymer"; "Curie temperature"; "gadolinium";
"lanthanide"; "birefringence"; "emerging technolog"; "erasable";
"fatigue"; "field"; "[frequency|Mhz]"; "inductance";
"[jukebox|autochanger]"; "laser"; "crystalline"; "operation";
"phase[ |\-]change"; "phenomenon"; "polarit"; "polarized"; "principle";
"reflect"; "[sector|track|cylinder]"; "[silver|gold]"; "Oersted";
"surface reflectance"; "thin film"; "terbium"; "magnetization";
"substrate"; "speed"; "transfer"; "transluscent"; "Winchester";
"[mega[ |]byte|MB]"; "[giga[ |]byte|GB]"; "magneto[ |\-]optical";
"media"; "magnet"

4.3 Failure Analysis - Topic 36

Unfortunately, even our high recall query retrieved only 11 documents in the Volume II corpus, of which 10 were judged relevant. (The 11th discussed WORM technology and only mentioned "rewritable optical drives" in passing.) Upon examination of the NIST judgments, we made several observations about the relevant documents. First, we missed the keywords "erasable" as a synonym for "rewritable" and "video" as a synonym for "optical". Second, the assessor accepted articles about "optical recorders" and "optical image processing" systems. To pick up these corrections we would change the optical_disk subquery to read as follows:

define optical_disk
{10 words -> "[rewrit|erasable]" and "[video|optical]" and "[disk|drive|technolog|recorder|image processing]"}
end

We then threw out the length restriction and reran the query, requiring differing numbers of the technical terms to be present. The results from these runs are shown in Table III. This table shows two things. First, for this topic, the number of technical terms is an excellent "knob" for adjusting precision and recall. Second, the assessor was making a loose interpretation of "comprehensive technical detail". If we'd completely ignored this part of the query, we would have had 135 good documents out of 262. Turning in only 200, we'd expect to have around 103 relevant, which would have been near the high score.

Table III - Topic 36 Results as a Function of the Number of Technical Terms

Num Tech   Rel   Docs
 Terms     Ret   Ret     Prec   Recall
   10       26     31    0.84    0.17
    9       34     42    0.81    0.22
    8       41     59    0.69    0.26
    7       50     77    0.65    0.32
    6       58    100    0.58    0.37
    5       77    139    0.55    0.50
    4       96    176    0.55    0.62
    3      111    210    0.53    0.71
    2      120    222    0.54    0.77
    1      131    240    0.55    0.84
    0      135    262    0.52    0.87
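For reference, the Prec and Recall columns in Table III follow the standard definitions. The sketch below reproduces the 10-term row; the relevant-document base of roughly 155 is our inference from the recall column, not a number stated in the text:

# Standard precision/recall; 155 is our inferred relevant-document
# base for Table III (hypothetical reconstruction, not from the paper).
def precision(rel_ret: int, docs_ret: int) -> float:
    return rel_ret / docs_ret

def recall(rel_ret: int, total_rel: int) -> float:
    return rel_ret / total_rel

print(round(precision(26, 31), 2), round(recall(26, 155), 2))  # 0.84 0.17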
5.0 Future Plans

During 1993 we hope to continue researching and evaluating better methods for query construction. Our objectives will be to:

* Design and test a method of sequencing the execution of FDF queries to ensure that 200 documents will be retrieved for each topic,

* Develop methods and algorithms to semi-automate manual query construction,

* Use the extensive relevance judgments from TREC-1 to test techniques for generating FDF queries from statistical analysis of the relevant documents for each topic, and

* Examine the feasibility of using the FDF's term weighting capability to allow it to act as a back-end processor for other text retrieval techniques.

6.0 Acknowledgments

The FDF system is the result of extensive development by many people over the last 8 years. My role has been that of a reporter on the basic system's capabilities and the manner in which they might be applied to a TREC-like problem.