Text Retrieval with the TRW Fast Data Finder

Matt Mettler
TRW Systems Development Division
One Space Park R2/2194
Redondo Beach, CA 90278
(310) 814-4925
mat@wilbur.coyote.trw.com

TRW has been building high performance text processing and retrieval systems for a number of years. Most of these systems have involved the application of the TRW Fast Data Finder (FDF) text search hardware and have been designed to meet the requirements of specific government customers. Our goal for the TREC conference has been to consider and experiment with the FDF as a tool for more general purpose information retrieval, and to determine the FDF's strengths and weaknesses compared to conventional information retrieval techniques.

1.0 Introduction

Our experience with the TREC conference has left us encouraged about the ability of a text scanning approach to be competitive with more involved information retrieval techniques. The inherent limitations of the FDF hardware do not prevent competitive precision and recall for general information retrieval applications when the user topics are properly understood and the topic queries are properly tuned to the dataset.

2.0 FDF Text Retrieval Approach

The Fast Data Finder is a hardware device that performs high-speed pattern matching on a stream of 8-bit data. It consists of an array of identical programmable text processing cells connected in series to form a pipeline processor. The cells are implemented using a custom VLSI chip designed and patented by TRW. In the latest implementation, each chip contains 24 processor cells, and a typical system will have 3,600 cells. Each cell can match a single character of a query or perform all or part of a logical operation. The processors are interconnected with an 8-bit data path and an approximately 20-bit control path.

To perform a search, a microcode program is first downloaded into the pipeline to direct each processor. The database is then streamed through the pipeline. The data bytes clock through each processor in turn until the whole database has passed through all processors. As the data is clocking through, the processors alter the state of the control lines depending on their program and the data stream values. When the pipeline's processor cells detect that a series of database characters matching the desired pattern has passed by, a hit is indicated and passed by external circuitry back to the memory of the host processor and to the user.

The FDF pipeline runs at a constant speed as it performs character comparisons and logical operations, regardless of query complexity. The system we used for the TREC conference searched at 10 MB/sec.

The queries, or patterns, are specified in the FDF's Pattern Specification Language (PSL). The hardware directly supports all the features of the PSL query language without the need for software post-processing. The processors in the pipeline may all be used to evaluate a single large query or may be assigned to evaluate numerous smaller queries. The number of pipeline cells a query needs is proportional to the size of the query.
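To make the streaming model concrete, the following is a minimal software sketch of the pipeline idea in Python. It is our illustration only: the function name, the one-character-per-cell simplification, and the sample data are ours, and the real FDF implements this in microcoded VLSI hardware rather than software.

# A software sketch of the FDF's systolic matching idea: each cell
# holds one pattern character; an "active" signal propagates down the
# pipeline one cell per input byte, so a hit emerges only after every
# cell has matched in sequence. Cell count grows with pattern size,
# while throughput stays one byte per clock regardless of cell count.

def pipeline_search(pattern: str, stream: bytes):
    """Yield end offsets in `stream` where `pattern` matches."""
    cells = pattern.encode()            # one pipeline cell per pattern byte
    active = [False] * len(cells)       # control lines between cells
    for pos, byte in enumerate(stream): # the database clocks through
        # Each cell compares the current byte and passes its result on;
        # iterate from the last cell so signals move one stage per clock.
        for i in range(len(cells) - 1, -1, -1):
            upstream = active[i - 1] if i > 0 else True
            active[i] = upstream and (byte == cells[i])
        if active[-1]:                  # the last cell raised the hit line
            yield pos + 1               # report offset just past the match

if __name__ == "__main__":
    data = b"the fast data finder scans data streams"
    print(list(pipeline_search("data", data)))  # -> [13, 31]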
PSL provides numerous search functions, which may be nested in any combination, including:

* Boolean logic, including negative conditions
* Proximity on any arbitrary pattern
* Wildcards and "don't cares"
* Character alternation
* Term counting, thresholds, and sets
* Error tolerance (fuzzy matching)
* Term weighting
* Numeric ranges

2.1 Advantages and Disadvantages of Hardware Scanning

There are four principal advantages to using a hardware scanning approach for information retrieval. First, the FDF can perform pattern matching functions much faster and more cost-effectively than a general purpose CPU. This benefit comes in part from the parallelism of the FDF architecture. Second, a hardware scanner like the FDF can begin processing the data immediately upon its receipt. There is no need to wait for the data to be preprocessed or indexed before it can be searched. This is especially important for dissemination (routing) applications. Third, no extra disk space is needed to store inverted index data or other vector data beyond the text itself. Finally, the system's response time in evaluating a query is independent of the query's complexity and thus easily predictable. The FDF can perform fuzzy pattern matching on a term like "krasnoyarsk" with three missing, incorrect, or extra characters as easily as performing an exact string match.

There are two disadvantages to the FDF scanning approach. First, it is moderately expensive to buy the hardware and adapt application programs to work with it. The approach is therefore not cost effective for low-end applications or for systems that don't do significant amounts of text processing. Second, since the search is not complete until the entire database has been scanned, the time to complete a single simple query will be greater than with indexing methods.

3.0 Query Generation Approach

Our objective for the TREC conference was to see if we could utilize the pattern matching power of the FDF to achieve superior recall and precision. This in turn revolved around how well we could construct the queries for the FDF to execute. We were able to identify at least three possible approaches that could be used in preparing queries for execution by the Fast Data Finder:

* (i) Parse the topic narratives, extract key terms and phrases, expand the terms where possible, and generate queries to find documents with the same combinations of terms.

* (ii) Take a sample of relevant documents, extract common keywords and phrases, especially those that occur multiple times, and generate queries to find documents with at least some of the same phrases and keywords within a sliding window of text about the size of a paragraph.

* (iii) Construct the initial queries manually and refine them iteratively.

We elected to try both methods (ii) and (iii). To supply the relevant documents for the statistical trials, we used the sample relevance judgments supplied by NIST in late May and early June.

3.1 Automatic Query Generation

Our plan was to take sample documents for a particular topic, merge them together, and build a PSL query that would find similar documents. Using the single document WSJ870320-0062 as a seed, the query would be something like:

{30 words -> 5+ ('cola'; 'coca'; 'coca cola'; 'bottling'; 'enterprises';
                 'cola bottling'; 'cola enterprises'; 'coca cola enterprises';
                 'coca cola bottling'; 'atlanta')}

This query finds a document which contains a 30-word sliding window with 5 or more of the specified terms or phrases. The term list is determined by removing stopwords and counting the number of occurrences of each term, 2-word phrase, and 3-word phrase in the seed document. The 10 terms/phrases with the highest counts are selected. The "30 words" and "5 or more" values were selected arbitrarily, and we had planned to run a series of trials to determine the optimal values.
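This procedure lends itself to a compact software model. The sketch below is our own Python approximation of the two steps just described (the stopword list, seed text, and matching details are stand-ins for illustration, not the actual processing of WSJ870320-0062):

# Sketch of automatic query construction: select the top-counted
# unigrams/bigrams/trigrams from a seed document, then test whether a
# document has a sliding word window containing enough of those terms.
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "its", "said", "will"}

def top_terms(seed: str, k: int = 10) -> list[str]:
    """Top-k 1/2/3-word grams by count, stopwords removed first."""
    words = [w for w in re.findall(r"[a-z0-9]+", seed.lower())
             if w not in STOPWORDS]
    grams = Counter()
    for n in (1, 2, 3):
        for i in range(len(words) - n + 1):
            grams[" ".join(words[i:i + n])] += 1
    return [g for g, _ in grams.most_common(k)]

def window_match(doc: str, terms, window: int = 30, threshold: int = 5) -> bool:
    """Model of {window words -> threshold+ (terms)}: does some
    window-word span of `doc` contain >= threshold of the terms?"""
    words = re.findall(r"[a-z0-9]+", doc.lower())
    for i in range(max(1, len(words) - window + 1)):
        span = " ".join(words[i:i + window])
        if sum(t in span for t in terms) >= threshold:
            return True
    return False

seed = ("Coca Cola Enterprises, the Atlanta bottling unit of Coca Cola, "
        "said Coca Cola Enterprises will expand its cola bottling lines.")
terms = top_terms(seed)
print(terms)
print(window_match(seed, terms))  # a seed matches its own query -> True

This is only an approximation of the PSL semantics; for example, substring tests slightly overcount (a phrase can match across the seam of two words), which the hardware pattern matcher would not do.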
The initial experiments with this method of query construction were not encouraging. We ran into three difficulties.

* The May/June NIST sample relevance judgments seemed incomplete and inaccurate and were not giving us the statistical base we'd hoped for.

* This method assumes that the whole document is all on one subject. Longer seed documents were contributing terms that had little to do with the topic. Some method to segment the documents and indicate the interesting section is required.

* This method wasn't capturing the subtlety of the topics. The query shown above does an excellent job of finding documents about Coca Cola Enterprises or bottling units in Atlanta, but completely misses the part about antitrust violations because that is only mentioned once in the article.

3.2 Manual Query Generation

The poor results from our initial trials with statistical query generation led us to fall back on a purely manual (with feedback) approach. We extracted key concepts from the topic description and added additional terms from outside knowledge or by observing them in database documents. In building our multiple queries to provide a coarse grain ranking, we favored documents where the subqueries matched in lead sentences or paragraphs. We mostly ignored the May/June NIST sample judgments.

Refinement of the queries was done manually by executing them, reviewing the results, and modifying the queries. The easier topics required only a few iterations, while on some of the more difficult topics we iterated several dozen times. We stopped working on a topic when it seemed that the results were converging to a practical limit for our approach, i.e., when adding additional synonym keywords or altering the query structure wasn't producing more reasonable results.

4.0 Results and Analysis

Table I shows our results for the TREC routing queries. Since our system doesn't really rank the retrieved documents, we think the Table I presentation is more representative of our performance than the 11-point averages. The first and last columns are the topic number and description. The second column, "# Rel", is the number of relevant documents as judged by NIST in the Volume II corpus. The next three columns give an indication of how the field did on the topic. "TRW Rel" is the number of relevant documents we submitted out of the "TRW Submit" we sent in for each topic.

TABLE I - TRW TREC Results for the Routing Topics

             Relevant Retr. in 200     TRW   TRW
Qry  # Rel   Best   Median   Worst     Rel   Submit   Description
01    216     62      30       0        49    200     Pending Antitrust
02    384     72      43       1        72    200     Foreign Acquisitions
03    431    167      84       8       161    200     US-Japan Joint Vent.
04     48     33      18       2        10     49     Debt Rescheduling
05    150    116      38      10        80    141     Japanese Dumping
06    137     78      45      15        44    200     Debt Relief
07    169     87      63       1        63    200     US Budget Deficit
08    159     43      18       3        28     72     Economic Projection
09    638    117      87       8        91    200     Candidate Sightings
10    233    153     110      15       153    188     AIDS Treatments
11    196     89      52       7        35    128     Space Program
12    262    103      54       4        75    200     Water Pollution
13    112    111      46       5       111    113     Mitsubishi Heavy
14    203     85      48       0        46     56     Drug Approval
15    624    114      80      17        89    200     CEOs
16     88     44      24       1         1     13     Mkt Agrochemicals
17    303    154      81       0        33     58     Agrochemical Cntls
18    147     61      31       2        41    147     Japan Stock Trends
19    985    161     102      74        85    140     Global Stock Trends
20    403    178     124       5       161    178     Patent Infringement
21     47     44      35       0        12     29     Superconductors
22    466    162     120      14        32     45     Counternarcotics
23    100     74      41       5        37     54     Legal Problems
24    345    113      59      11        11     21     Medical Technology
25     71     34      14       0        15     36     Chernobyl Effects
26    313    122      49       1        47    122     Multimedia Stds
27    232    109      91      10        80    200     AI in Business
28    332     89      47      14        89    200     ATT in Comp/Comm
29    142     79      13       0        79    200     Foreign Acq ATT Tech
30    269     92      48      17        57     88     OS/2 Problems
31    156     66      31       0        36    200     OS/2 Advantages
32    119     52      15       0         6     12     Outsourcing
33    462    147      71      17        71    200     Doc Mngt Capable
34    303    129     104       6       107    200     ISDN Entities
35    270    139     113       0        98    200     Postscript Alts
36    158    110      50       0        10     11     Optical Disk Tech
37    409    189     158      19       158    200     SAA Components
38    810    169     120      37        98    156     Mini/Main Roles
39    501    184     117      24       142    156     Client-Server Plans
40    800    150     121      16        87    200     IS Impact on Orgs
41    144     34      11       0        31    200     Comp/Comm Upgrade
42    696    131      92      10        92    112     End User Computing
43    125      -       -       -         -      -     AI Conferences
44    241    105      35       1       105    200     Layoffs at Companies
45    304    103      71       0       103    200     CASE Succeed/Fail
46     51     40      31       9        30    200     Virus Outbreaks
47    237     80      35       2        80    200     Contract > $1 mil
48    189     48      28       2        17     40     Purch Comm Equip
49    139     65      56       4        44    131     Who's in Supercomp
50     26     12       1       0         4      5     Virtual Reality

Our scores are summarized as follows:

High   Above Med   Median   Below Med   Low
  8        15         4         19        2

Unlike most of the TREC participants, we did not submit the full 200 allowed documents for each topic. This turned out to be a major blunder, because the TREC scoring procedure did not reward this self restraint. Many of our queries were too restrictive, achieving high precision at the expense of recall, and of a good score for the conference. This problem comes about because of the binary nature of the FDF's evaluation of a query against a document. To operate properly against a routing data stream, it is necessary to execute several queries for each topic, with each successive query aiming for higher recall. When our queries were "tuned" properly, the results were quite good.
Considering only those queries where we made the full submission, the distribution is well above the median:

High   Above Med   Median   Below Med   Low
  6         8         3          5        0

Our analysis shows that we did well on topics where the ability to find phrases, acronyms, numbers, and alphanumerics was important. We had the high score on topics 28 and 29, both of which involved finding references to AT&T. Since we retain and scan the full data stream, we didn't have to worry about an indexing parser splitting "AT&T" into "AT" and "T" and then throwing them both away. Our PSL subquery to find AT&T was:

define ATT
"[AT\&[|amp\;]T|AT and T|American Telephone [and|\&] Telegraph]"
end

The "[|amp\;]" notation means to allow an optional "amp;", as was present in the Ziff database.

We had the high score on topic 10, "AIDS Treatments". This may be due to our ability to easily find phrases like "acquired immune deficiency syndrome" or "AIDS related complex" in close proximity to drug names like "THA", "5-fluorouracil", or "AZT".

We had the high score on topic 13, finding documents about Mitsubishi Heavy Industries. Our query, which found 111 of the 112 documents NIST judged relevant, was simply to find the two-word phrase "Mitsubishi Heavy". Apparently the other TREC participants had trouble either finding phrases or determining the need to find phrases during query generation.
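For readers more familiar with conventional tools, the ATT definition above maps loosely onto a regular expression. The following is our sketch of an approximate equivalent, not the PSL semantics themselves:

# Rough regex rendering of the PSL "ATT" alternation (our approximation).
import re

ATT = re.compile(
    r"AT&(?:amp;)?T"                             # AT&T, plus AT&amp;T as in Ziff
    r"|AT and T"                                 # spelled-out conjunction
    r"|American Telephone (?:and|&) Telegraph"   # full company name
)

for line in ["AT&amp;T posted earnings",
             "American Telephone and Telegraph said",
             "that AT&T unit"]:
    print(bool(ATT.search(line)))  # -> True, True, True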
The following sections discuss in detail topic 47, where we achieved the high score, and topic 36, where we achieved a low score.

4.1 Example of Good Performance - Topic 47

Topic 47 was to find documents discussing new contracts for computer systems in excess of $1 million. We found 80 good documents out of 200 submitted, the high score for this topic. We believe we did well on this topic because we were able to look for various numeric representations of $1 million in close proximity to keywords for new contracts and computer systems. To be relevant, a document must identify the selection of a source for the development or delivery of information systems products or services valued at more than $1 million.

The PSL query for this topic used three subqueries: one each for the "selection of a source", the "information systems products or services", and the "more than $1 million". The PSL definitions were:

define award
{3 words -> "[sign|award]*" and "contract"}
end

define computer
"[computer|communic|network|phone|telecomm|mainframe|Starlan|PBX|cyber|IBM 3090|X\-MP|Y\-MP|SCS\-40|information]"
end

define million
{1 word -> "[million|billion]" and "dollar"}
or "\$[ |][|[0-9]][|[0-9]][0-9][|\.][|[0-9]][|[0-9]][|[0-9]] [million|billion]"
or "\$[ |][|[0-9]][|[0-9]][0-9][|\,][0-9][0-9][0-9][|\,][0-9][0-9][0-9]"
end

The "award" definition requires the root words "sign" or "award" to be within 3 words of "contract" in the text. This word count includes stop words, acronyms, or any other alphanumerics that were in the original text. This definition will find phrases like:

a contract was awarded
AT&T signed a new contract
Bellcore was awarded three new contracts

The "computer" definition looks for any of the root terms shown. Note that looking for alphanumerics such as "X-MP" or "IBM 3090", which may include multi-character white space, is no problem. The "million" subquery uses proximity and an alphanumeric sequence pattern and will find items like the following:

a million dollar contract
a $2.3 million system
a $ 12 billion program
a $2000000 machine
a $ 2,000,000 machine

Note that the phrase "a 2,000,000 dollar award" would not be found by this definition. This was an oversight.

The winning query was then simply:

{50 words -> award and computer and million}

This finds documents which contain a 50-word sliding window in which all three subqueries match. Note how the "award" subquery, which uses a 3-word sliding window, can be nested inside a query using a 50-word sliding window.

4.2 Example of Bad Performance - Topic 36

Topic 36 was to find documents discussing how rewritable optical disks work. To be relevant, a document must describe how rewritable optical disk technology works at length and in significant and comprehensive technical detail. This topic was particularly challenging because the topic narrative describes attributes the documents must have rather than specific concepts or keywords.

We started by defining a subquery to find documents mentioning rewritable optical disks:

define optical_disk
{10 words -> "rewrit" and "optical [disk|drive|technolog]"}
end

To find documents that describe the technology "at length", we wrote a subquery to find places where there were at least 5,000 characters between the optical_disk match and the end-of-text marker:

define LONG
{5000 characters -> optical_disk and no TEXTEND}
end

To find documents that contained "significant and comprehensive technical detail", we manually extracted a list of keywords (Table II) and required the documents to have at least 10 or more of these terms present.
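The "N or more technical terms" requirement is easy to model in conventional software. Below is a minimal Python sketch (our illustration; the term list is a small sample in the spirit of Table II, and the real entries are PSL patterns rather than plain strings):

# Minimal model of the technical-term threshold: count how many
# distinct terms from the list occur at least once in a document.
def tech_term_count(text: str, terms: list[str]) -> int:
    """Number of distinct terms that appear in `text` (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for t in terms if t.lower() in lowered)

terms = ["amorphous", "Kerr effect", "Curie temperature", "gadolinium",
         "magneto-optical", "laser", "substrate", "thin film"]
doc = ("The magneto-optical drive heats a spot on the substrate past "
       "its Curie temperature with a laser...")
print(tech_term_count(doc, terms))  # -> 4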
The tightest query (intended for high precision) was:

{1 document -> optical_disk and LONG and 30+}

The loosest (intended for high recall) was:

{1 document -> optical_disk and 10+}

TABLE II - Subterms used for Topic 36

"amorphous"; "[ISO|CCITT]"; "Kerr effect"; "SCSI"; "bias"; "binary";
"capacity"; "chemical"; "states"; "cycles"; "density"; "spatial";
"High Sierra"; "dye[ |\-]polymer"; "Curie temperature"; "gadolinium";
"lanthanide"; "birefringence"; "emerging technolog"; "erasable";
"fatigue"; "field"; "[frequency|Mhz]"; "inductance";
"[jukebox|autochanger]"; "laser"; "crystalline"; "operation";
"phase[ |\-]change"; "phenomenon"; "polarit"; "polarized"; "principle";
"reflect"; "[sector|track|cylinder]"; "[silver|gold]"; "Oersted";
"surface reflectance"; "thin film"; "terbium"; "magnetization";
"substrate"; "speed"; "transfer"; "transluscent"; "Winchester";
"[mega[ |]byte|MB]"; "[giga[ |]byte|GB]"; "magneto[ |\-]optical";
"media"; "magnet"

4.3 Failure Analysis - Topic 36

Unfortunately, even our high recall query retrieved only 11 documents in the Volume II corpus, of which 10 were judged relevant. (The 11th discussed WORM technology and only mentioned "rewritable optical drives" in passing.) Upon examination of the NIST judgments, we made several observations about the relevant documents. First, we missed the keywords "erasable" as a synonym for "rewritable" and "video" as a synonym for "optical". Second, the assessor accepted articles about "optical recorders" and "optical image processing" systems. To pick up these corrections we would change the optical_disk subquery to read as follows:

define optical_disk
{10 words -> "[rewrit|erasable]" and "[video|optical]" and "[disk|drive|technolog|recorder|image processing]"}
end

We then threw out the length restriction and reran the query, requiring differing numbers of the technical terms to be present. The results from these runs are shown in Table III. This table shows two things. First, for this topic, the number of technical terms is an excellent "knob" for adjusting precision and recall. Second, the assessor was making a loose interpretation of "comprehensive technical detail". If we'd completely ignored this part of the query, we would have had 135 good documents out of 262. Turning in only 200, we'd expect to have around 103 relevant, which would have been near the high score.

Table III - Topic 36 Results as a Function of the Number of Technical Terms

Num Tech   Rel   Docs
 Terms     Ret   Ret     Prec   Recall
   10       26     31    0.84    0.17
    9       34     42    0.81    0.22
    8       41     59    0.69    0.26
    7       50     77    0.65    0.32
    6       58    100    0.58    0.37
    5       77    139    0.55    0.50
    4       96    176    0.55    0.62
    3      111    210    0.53    0.71
    2      120    222    0.54    0.77
    1      131    240    0.55    0.84
    0      135    262    0.52    0.87
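For reference, the Prec and Recall columns in Table III follow the standard definitions. The sketch below reproduces the 10-term row; the relevant-document base of roughly 155 is our inference from the recall column, not a number stated in the text:

# Standard precision/recall; 155 is our inferred relevant-document
# base for Table III (hypothetical reconstruction, not from the paper).
def precision(rel_ret: int, docs_ret: int) -> float:
    return rel_ret / docs_ret

def recall(rel_ret: int, total_rel: int) -> float:
    return rel_ret / total_rel

print(round(precision(26, 31), 2), round(recall(26, 155), 2))  # 0.84 0.17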
5.0 Future Plans

During 1993 we hope to continue researching and evaluating better methods for query construction. Our objectives will be to:

* Design and test a method of sequencing the execution of FDF queries to ensure that 200 documents will be retrieved for each topic,

* Develop methods and algorithms to semi-automate manual query construction,

* Use the extensive relevance judgments from TREC-1 to test techniques for generating FDF queries from statistical analysis of the relevant documents for each topic, and

* Examine the feasibility of using the FDF's term weighting capability to allow it to act as a back-end processor for other text retrieval techniques.

6.0 Acknowledgments

The FDF system is the result of extensive development by many people over the last 8 years. My role has been that of a reporter on the basic system's capabilities and the manner in which they might be applied to a TREC-like problem.