A Boolean Approximation Method for Query Construction and Topic Assignment in TREC*

Paul S. Jacobs, George R. Krupka, and Lisa F. Rau
GE Research and Development Center
Schenectady, NY 12301
(518) 387-5059

Abstract

Word-based search is the simplest and most widely-available method for processing and retrieving volumes of information from free text. However, this common method of searching is generally cumbersome and inaccurate. The method described here automatically generates complex Boolean queries for word-based search, so that the same mechanism can be used with higher accuracy and efficiency. This practical approach worked, with good results, on the entire TREC corpus on both the "routing" and "ad hoc retrieval" tasks, with the official averages in the top group for both tasks.

1 Introduction

Full-text search is currently the simplest and most commonly-used method for locating information in large volumes of free text. Because users are accustomed to describing what they are looking for with specific words, and those words are often found in the texts, searching the text for selected words or word combinations is a natural and easy-to-implement method for information retrieval. However, it can be very inaccurate. In some experiments, the percentage of relevant texts retrieved has been shown to be lower than 10%, and a high percentage of material that is retrieved can be irrelevant. Also, it can be very difficult for searchers to compose "queries" that combine the words that are effective in locating relevant material without finding large quantities of irrelevant information as well.

*Special thanks to Michael Charbonneau, Mark Freeman, Geoff Gordon, John Munson, and Uri Zernik, who all participated in the design and implementation of the GE TREC system. This research was sponsored in part by the Defense Advanced Research Projects Agency. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

There are many alternatives to full-text search that can produce much higher accuracy, including statistical methods that weight matches based on relative word frequencies, automatic indexing strategies, and knowledge-based approaches that give very high accuracy for repetitive searches at the cost of a large amount of work in constructing a knowledge base. The ideal search strategy would be one with the accuracy of knowledge-based approaches, but with the simple efficiency of word searches. This is the motivation for the method described here.

The selection of this method for the TREC evaluation combined a sense of the practice of information retrieval with a particular interpretation of what the evaluation is about. For example, the strategy makes several assumptions about the task that are apparently different from those made by other sites, and which perhaps take a looser and less academic view of the experiment. These key assumptions are:

* Relevance over ranking. The focus of text interpretation is to assign to each text, with the highest possible accuracy, the set of topics or content indicators that apply to that text (i.e., particularly for routing, to treat the topic or relevance of each text individually rather than as a relative measure against other texts in the corpus).

* Technology over engineering.
The goal of the experiment is to show high-accuracy, practical results, avoiding the threatening limitations of disk size, memory management, and tractability. (The system did no pre-indexing or analysis of any portion of the corpus.)

* Text interpretation over query interpretation. As there may not be any principled structure or methodology in the sample queries, the emphasis on matching texts to an internal representation of each query, rather than on automatic query processing, pushes the limits of the routing and retrieval engine instead of the interface to the user.

These choices were made for various reasons, and are presented here not as the right way to view the task but as one software system's implicit focus. A project with a different choice of focus could, for example, produce lower accuracy but have the benefit of fully automatic query processing; another project might have higher accuracy on the top few texts in each category but lower recall on a general routing task. We view our (preliminary) results as very satisfying as a test in a high-throughput, high-accuracy, routing-style interpretation.

The sections that follow cover the motivation for the design and implementation of this method, some specific details of the experiment, and an analysis of the results.

2 Background

There are several different methods for general-purpose information retrieval from text, as mentioned, including full-text search, statistical and probabilistic techniques, and those using automatic or manual indexing. Of these, word-based full-text search is generally believed to be the least accurate, but by far the most widely practiced. The method is used in spite of its inaccuracy because it is so easy to implement, and because it gives a direct and natural way for users to describe what they are looking for.

Knowledge-based approaches - those that rely on formal descriptions of concepts rather than combinations of words - can be the most accurate for certain limited tasks, but are the most difficult to implement and apply to general information retrieval problems. While there are a variety of knowledge-based methods, they share the common framework of looking for structured information in a text, while other methods look for simple combinations of words. For example, a simple query looking for stories about takeovers by GE in a typical word-based system might be (GE AND ACQUISITION), while in a knowledge-based system it would look more like

    ($GE * $acquire * $company)

where the terms marked by $ indicate classes of words or phrases that the system can recognize; for example, $GE might include GE and General Electric but not General Electric Company of Britain, and $company would match any company named or described.

The knowledge-based approach is more accurate, in general, for two reasons. First, it allows the program scanning the texts to look for many ways of expressing the same concept, enhancing recall - the percentage of relevant documents that the program retrieves. Second, it focuses search on specific pieces of information, thus enhancing precision - the percentage of retrieved texts that are relevant. In this case, improved recall would come from spotting the variety of ways of expressing the concept of acquiring, and improved precision comes from eliminating texts that are about the acquisition of real estate, products, and so forth, as well as acquisitions by other companies of GE's products and services.
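To make the contrast concrete, the following sketch shows what recognizing such a class-based pattern involves. It is purely illustrative: it is written in Python rather than the C of the system described later, the class contents and function names are hypothetical, and the real system recognized far richer classes and lexical features.

    # Illustrative only: hypothetical class definitions and matching in
    # the spirit of ($GE * $acquire * $company). Matching here is naive
    # substring search; the actual system matched tokens and phrases.

    CLASSES = {
        "$GE":      ["general electric", "ge"],
        "$acquire": ["acquisition", "acquired", "bought", "took over"],
        "$company": ["company", "corp", "inc", "ltd"],
    }
    # Longer phrases whose presence blocks a class match at that spot.
    EXCLUDE = {"$GE": ["general electric company of britain"]}

    def class_positions(cls, text):
        """Yield offsets where a member of cls occurs, skipping offsets
        covered by one of the class's exclusion phrases."""
        blocked = []
        for ex in EXCLUDE.get(cls, []):
            start = text.find(ex)
            while start >= 0:
                blocked.append((start, start + len(ex)))
                start = text.find(ex, start + 1)
        for phrase in CLASSES[cls]:
            start = text.find(phrase)
            while start >= 0:
                if not any(lo <= start < hi for lo, hi in blocked):
                    yield start
                start = text.find(phrase, start + 1)

    def match_pattern(classes, text):
        """True if some member of each class occurs, in order, with any
        number of tokens in between (the '*' wildcards)."""
        text = text.lower()
        at = -1
        for cls in classes:
            nxt = min((p for p in class_positions(cls, text) if p > at),
                      default=None)
            if nxt is None:
                return False
            at = nxt
        return True

    print(match_pattern(["$GE", "$acquire", "$company"],
                        "GE said it bought the aerospace company"))       # True
    print(match_pattern(["$GE", "$acquire", "$company"],
                        "General Electric Company of Britain bought one")) # False

The point is that the exclusion and the left-to-right ordering are part of the match itself, which no flat word query can express directly.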
Most full-text search systems are driven by Boolean queries - terms joined together by AND, OR, and NOT, such as (GE AND ACQUISITION) above. These can be arbitrarily complex, and one can start to approximate the operation of a knowledge-based approach by combining large numbers of terms. For example, the GE query above could be expressed as:

    (AND (OR GE (AND GENERAL ELECTRIC (NOT (AND OF BRITAIN))))
         (OR ACQUIRED ACQUISITION BOUGHT (AND (OR TAKE TOOK) OVER))
         (OR COMPANY CORP INC CO LTD SA AG ...))

There are several problems with this approach. First, it is very difficult for users to formulate queries of this sort, so, in practice, Boolean queries are quite simple. Second, it is easier for a system to recognize most or all company names than it is to list all the possible words that could appear in the name of a company. Third, the amount of information captured in the query is still limited; for example, it ignores the order in which words appear as well as their adjacency or proximity. Many text search systems allow augmentations to queries that express these constraints, but this makes the queries still more difficult to construct and makes searches less efficient. Hence Boolean retrieval systems remain, in practice, awkward and inaccurate.

The experiment that this team performed in TREC hatched under the pressure of a very difficult task with very limited resources. While a number of individuals participated (including the three authors and five others), the program described here resulted from less than a week of programming over and above the existing software tools we had in hand to apply to the task. The constraints thus forced upon the effort included the realization that not much could be salvaged from the queries as distributed, that the sample relevance judgements were incomplete and the training samples too small for most statistical tests, and that indexing the corpus by any practical method could delay the project by weeks, overflow the disks, and prevent any corrections to the method. This led to a "bare bones" approach that takes advantage of two strategies: (1) treating Boolean queries as an approximation to more detailed, structured representations, and (2) using the corpus, rather than the queries, as the main source of information for formulating the topics.

3 Our Approach

The fundamental idea behind the method is to take a knowledge-based description of a query or topic, and convert it to a Boolean form that can be efficiently applied by a text-search engine. This Boolean form, furthermore, must be an approximation of the knowledge-based query, in the formal sense that the Boolean expression should match all texts that the knowledge-based query would, but perhaps can admit more texts.

There are several key advantages to this approach of generating a Boolean query from a knowledge-based description. The simplest benefit is that it makes building queries easier, because much of the work in forming complex Boolean expressions is done automatically. A second major advantage is that the Boolean queries, in approximating a knowledge-based approach, are more likely to give accurate results. Finally, because retrieval using the automatically generated Boolean query approximates the knowledge-based query, the knowledge-based system can run on the results of Boolean retrieval, thus enhancing precision without having to apply the more computationally intensive knowledge-based processing to very much text.
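The compilation step itself can be sketched in the same illustrative vein. The function below is our own hypothetical code, not the rule compiler described in Section 3.1, but it shows the essential approximation: keep which words must appear, drop ordering and proximity, so every text the pattern matches still satisfies the Boolean form.

    # A class-based pattern, given as a list of class names (wildcards
    # dropped), is weakened into an AND of ORs. The Boolean form
    # over-generates but never misses a text the pattern would match.

    CLASSES = {
        "$GE":      ["GE", "GENERAL ELECTRIC"],
        "$acquire": ["ACQUIRED", "ACQUISITION", "BOUGHT", "TOOK OVER"],
        "$company": ["COMPANY", "CORP", "INC", "LTD"],
    }

    def approximate(pattern):
        """Compile a pattern into a Boolean s-expression: one OR per
        class, ANDed together. A multi-word class member becomes an
        AND of its words."""
        conjuncts = []
        for cls in pattern:
            disjuncts = []
            for member in CLASSES[cls]:
                words = member.split()
                disjuncts.append(words[0] if len(words) == 1
                                 else ["AND"] + words)
            conjuncts.append(["OR"] + disjuncts)
        return ["AND"] + conjuncts

    print(approximate(["$GE", "$acquire", "$company"]))
    # ['AND', ['OR', 'GE', ['AND', 'GENERAL', 'ELECTRIC']],
    #         ['OR', 'ACQUIRED', 'ACQUISITION', 'BOUGHT',
    #                ['AND', 'TOOK', 'OVER']],
    #         ['OR', 'COMPANY', 'CORP', 'INC', 'LTD']]

Negative constraints such as (NOT (AND OF BRITAIN)) can be compiled in as well, but they must be applied carefully, since an over-broad NOT could reject texts that the knowledge-based query would accept, breaking the approximation property.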
One of the obvious problems to overcome in TREC was how, with a limited amount of time to formulate 100 queries and a small amount of training data (none for the ad hoc test), to make the queries practical and accurate. Our choice here was to keep the initial queries relatively simple, and to run the results of a "first pass" retrieval against the entire corpus through a statistical filter to pull out terms that would help to augment or refine the query. In addition, the matching engine would display the exact portion of each text that (correctly or incorrectly) matched the query, making it easy to correct glaring errors and refine ambiguous terms. This amounts to a peculiar sort of feedback mechanism that relies on detailed analysis of portions of the corpus instead of user input.

3.1 Detailed Method

The method brings together four key elements: (1) a language for expressing knowledge-based topics or queries, developed at GE and described in the open literature, (2) a new program to generate Boolean expressions that approximate these queries (called the rule compiler), (3) a program to match the automatically-generated expressions against text to be retrieved (called the pre-filter), and (4) a knowledge-based pattern matcher, described in the open literature [2], that takes the results of the first match and rejects texts that do not satisfy the more constrained, knowledge-based query.

Because the pattern matcher is designed as an efficient "trigger" mechanism and an aid in parsing, the knowledge-based queries are mostly simple combinations of lexical categories. The patterns largely adopt the language of regular expressions, including the following terms and operators:

* Lexical features that can be tested in a pattern:
  - token "name" (e.g. "AK-47")
  - lexical category (e.g. "adj")
  - root (e.g. "shoot")
  - conceptual category (e.g. "human")

* Logical combination of lexical feature tests: OR, AND, and NOT

* Wild cards:
  $ - 0 or 1 tokens
  * - 0 or more tokens
  + - 1 or more tokens

* Variable assignment from pattern components: =

* Grouping operators:
  <> for grouping
  [] for disjunctive grouping

* Repetition:
  * - 0 or more
  + - 1 or more

* Range:
  *N - 0 to N tokens
  +N - 1 to N tokens

In practice, certain of these features are used more than others, and most queries rely most heavily on different lexical categories, grouping, and wildcards. For example, a simple description of the query looking for texts describing sanctions against South Africa is the following:

    $sanction == [(member sanction sanctions disinvestment)
                  <Sullivan principles>
                  <punitive *2 measures>]

    $safrica == [(member Buthelezi Pretoria anti-apartheid apartheid)
                 <De Klerk>
                 <South [Africa African]>]

    ;;; rule 1
    $sanction *50 $safrica => (mark-topic 52)

    ;;; rule 2
    $safrica *20 $sanction => (mark-topic 52)

This description says that any matching text must have both an indicator of South Africa ($safrica) and one of sanctions ($sanction), and that the sanction phrase must occur within 50 words before the South Africa phrase or, if it only comes afterwards, within 20 words. A sanction phrase can be any of the simple words sanction, sanctions, or disinvestment, or any phrase including punitive measures with no more than two intervening words (like punitive economic measures). A South Africa phrase can also be either one of a group of simple words, or a phrase, like De Klerk, South Africa, or South African.
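Under our reading of the *N range wildcard, the two rules reduce to a simple proximity test over phrase positions. The sketch below (hypothetical names, Python rather than the system's C, and start offsets as a simplification of intervening-token counts) makes that test explicit.

    # What the two mark-topic rules test: the hit lists hold token
    # offsets of phrase matches, as a class matcher would produce.

    def marks_topic_52(sanction_hits, safrica_hits):
        """True if any $sanction hit starts at most 50 tokens before a
        $safrica hit (rule 1), or any $safrica hit starts at most 20
        tokens before a $sanction hit (rule 2)."""
        for s in sanction_hits:
            for a in safrica_hits:
                if 0 <= a - s <= 50:    # rule 1: $sanction *50 $safrica
                    return True
                if 0 <= s - a <= 20:    # rule 2: $safrica *20 $sanction
                    return True
        return False

    print(marks_topic_52([12], [40]))   # True: 28 tokens apart, rule 1
    print(marks_topic_52([120], [40]))  # False: sanction 80 tokens after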
These queries or topic descriptions can be quite complex, and the method has been designed to handle many queries simultaneously, so the rule compiler is designed to produce expressions that can be efficiently applied within a large set of queries. This is important because many queries can share the same simple terms or combinations of terms, and because the pre-filter must match the simplest expressions first. For the topic description given above, the output of the rule compiler will include the following tests:

    52   TERM AFRICAN
    2029 TERM MEASURES
    2134 TERM SANCTION
    2135 TERM SANCTIONS
    2136 TERM DISINVESTMENT
    2138 TERM SULLIVAN
    2139 TERM PRINCIPLES
    2141 TERM PUNITIVE
    2144 TERM BUTHELEZI
    2145 TERM PRETORIA
    2146 TERM ANTI-APARTHEID
    2147 TERM APARTHEID
    2149 TERM DE
    2150 TERM KLERK
    2152 TERM SOUTH
    2153 TERM AFRICA
    2137 OR 2134 2135 2136
    2140 AND 2138 2139
    2142 AND 2141 2029
    2143 OR 2137 2140 2142
    2148 OR 2144 2145 2146 2147
    2151 AND 2149 2150
    2154 OR 2153 52
    2155 AND 2152 2154
    2156 OR 2148 2151 2155
    2157 AND 2143 2156
    2158 AND 2156 2143
    TOPIC052 OR 2157 2158

Each line in the above data gives a unique number (or topic designator) to the test, a test identifier (either TERM for a simple word test, OR, or AND), and a list of simple terms or previous tests. For example, test 2137 depends on tests 2134, 2135, and 2136, and is true if any of those tests is true, namely, if the text includes any of the words sanction, sanctions, or disinvestment. The tests are automatically ordered so that all tests that are dependent on other tests will have higher numbers than the tests they depend on; thus all TERM tests appear first. In this case, the TERM test AFRICAN appears with a much lower number simply because it is used in many different queries.

The pre-filter, which can work either on complete documents or paragraphs, goes through every word in its input and, using a fast table look-up, sets the TERM tests to true for every word it encounters. At the end of input, either the end of the paragraph or end of each document, it runs through the table of possible tests from low numbers to high numbers and sets tests to true if their conditions are satisfied. A topic test produces a match if it has become true at the end of this process, meaning that the paragraph or document has passed the pre-filter for that query. A single paragraph, of course, can satisfy multiple queries.

The pattern matcher then loads in only those texts that have passed the pre-filter, and uses the original queries to apply more stringent tests to those texts, such as ordering, proximity, and more complex lexical tests. In typical cases, the amount of text the pattern matcher must scan is only 2-3% of the words in the original, and the pattern matcher discards 20-30% of the matches of the pre-filter. Hence, well over 95% of the text filtering is done by the pre-filter, making the whole process efficient as well as showing that the pre-filter closely approximates the knowledge-based pattern matcher.
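Before turning to results, the pre-filter's evaluation pass is worth pinning down, since it is the heart of the method's efficiency. The sketch below is our reading of the description above, with a hypothetical TESTS table holding a truncated fragment of the compiled output: TERM tests are set by table lookup as the words stream by, and a single low-to-high sweep then evaluates the composite tests, which is sound because every test depends only on lower-numbered tests.

    # Illustrative sketch of the pre-filter evaluation pass.

    TESTS = [  # (test id, kind, arguments) - a fragment of the table
        (2134, "TERM", ["SANCTION"]), (2135, "TERM", ["SANCTIONS"]),
        (2136, "TERM", ["DISINVESTMENT"]), (2144, "TERM", ["BUTHELEZI"]),
        (2137, "OR", [2134, 2135, 2136]),
        (2148, "OR", [2144]),             # truncated $safrica disjunction
        (2157, "AND", [2137, 2148]),      # truncated topic test
    ]

    def prefilter(words):
        value = {}
        term_index = {t[2][0]: t[0] for t in TESTS if t[1] == "TERM"}
        for w in words:                      # one fast lookup per word
            tid = term_index.get(w.upper())
            if tid is not None:
                value[tid] = True
        for tid, kind, args in TESTS:        # single low-to-high sweep
            if kind == "OR":
                value[tid] = any(value.get(a, False) for a in args)
            elif kind == "AND":
                value[tid] = all(value.get(a, False) for a in args)
        return value

    v = prefilter("Buthelezi urged new sanctions".split())
    print(v[2157])   # True: passes the (truncated) topic test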
4 Implementation and Results

The lexical analyzer and the matching method described above were programmed in the C programming language, compiled, and tested on a corpus of about 1.2 gigabytes of text (about 200 million words) with a set of 50 topics, as part of the government-sponsored Text REtrieval Conference (TREC). The program was then tested on an additional gigabyte of text (the "second" disk) with an additional 50 topics, and the results were submitted to the government for comparative evaluation purposes. The results, in this case, were a ranked list of up to 200 documents that matched each query.

The final test, run on two SUN workstations, took about 10 hours of elapsed time for the entire corpus. The pre-filter ran at a speed of several hundred thousand words per minute, thus covering the entire contents of each disk in a few hours. This would reduce the text to 3% of the original data, with each paragraph marked according to the queries that matched in the pre-filter.

During query refinement, a statistical filter ran through the entire contents of this marked portion of the corpus, pulling out words and phrases with a high correlation (a variation of the metric for categorization reported in [1]) to a particular topic. For every topic, then, the statistical filter listed terms not in the (Boolean) query that occurred with a disproportionate frequency in text that matched the query. This, for example, produced the terms Buthelezi and punitive measures in the South Africa query above.

Figure 1 summarizes some of the official results from TREC, including the number of times the systems achieved the best score, and were above, below, or equal to the median score. The system was very close to the top in both the ad hoc and routing tasks, and was well above median on most topics. The pre-filter, surprisingly, was better than the pattern matcher almost across the board, indicating that with this style of evaluation there seemed to be no real benefit to deeper analysis than a simple Boolean match. The logical conclusion is that a Boolean routing or retrieval engine can perform as well as much more complex and sophisticated methods, so long as enough knowledge is used in constructing the Boolean expressions.

                            AD HOC TEST                 ROUTING TEST
                      11-pt.  Rel.@     Rel./     11-pt.  Rel.@     Rel./
                      avg.    100 docs  Retr.     avg.    100 docs  Retr.
  Boolean pre-filter  .2029   47.2      .46       .2078   35.6      .34
  Pattern matcher     .1961   46.2      .46       .1851   34.6      .37
  Median for all runs .1585   39.7                .1246   28.6

                            AD HOC TEST                 ROUTING TEST
                      top    above   below          top    above   below
                      score  median  median  equal  score  median  median  equal
  Boolean pre-filter    5      28      15      2      5      31       6      7
  Pattern matcher       5      26      17      2      6      27      10      6

                      Figure 1: GE TREC test results
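As a sketch of the query-refinement step described above, the following (our own simplification; the metric actually used is the one adapted from [1]) scores each word by how much more frequent it is in the paragraphs a query matched than in the corpus at large, and proposes the most disproportionate words as candidate query terms.

    # Illustrative frequency-ratio filter for candidate query terms.

    from collections import Counter

    def candidate_terms(matched_paras, all_paras, min_count=5, top_n=10):
        matched = Counter(w.lower() for p in matched_paras
                                    for w in p.split())
        overall = Counter(w.lower() for p in all_paras
                                    for w in p.split())
        m_total = sum(matched.values())
        o_total = sum(overall.values())
        scored = []
        for word, count in matched.items():
            if count < min_count:
                continue                   # ignore rare, noisy terms
            ratio = (count / m_total) / (overall[word] / o_total)
            scored.append((ratio, word))
        return [w for _, w in sorted(scored, reverse=True)[:top_n]]

    # Terms like "buthelezi" or "punitive" would surface for the South
    # Africa topic if they are concentrated in its matched paragraphs.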
5 Limitations and Future Work

While the overall, relative results were generally strong, the system as it was implemented had some basic flaws that should be easily corrected. One characteristic of this method is that it seems to produce excellent results on the queries that produce large volumes of data, and tends to produce almost nothing on some of the narrowest queries. It also produces both high precision and high recall for the texts that match, but makes it difficult to "loosen up" to let in more texts. Since the match relies on at least some exact match between query terms and texts, there are some queries (for example, about the details of rewriteable disks) that produced no hits. By contrast, in one configuration, the system assigned over 4,700 texts to one query ("sightings of 1988 presidential candidates"), of which only 200 were included in the submitted results. This might be desirable behavior for a routing system, but in the TREC style of evaluation the system did badly on those queries where it failed to produce at least 200 documents. On the other hand, the system produced at least 6000 responses in even its strictest configuration, or at least 120 texts, on average, per query.

The problems with undergeneration (and the related problem of not doing a very good job of ranking the documents) were due to the fact that our system was designed for routing, while TREC used traditional retrieval evaluation methods, along with a 200-document cutoff, effectively counting recall on the harder topics much more heavily than overall recall. Our approach can correct for this by using a more flexible statistical method to expand the queries and by performing a more sophisticated ranking (the document ranking as reported was implemented post hoc in one line of Unix code).

More important than the problems to correct, there is an important result here to build on. Our experience has been that pattern matching can be a close approximation to natural language processing for this sort of task, so it might seem that advanced methods are much more critical for finding what to put in the queries than they are for the detailed analysis of the texts. The general framework of this approach means that, with the continued development of advanced methods for natural-language based corpus analysis, substantial performance improvements can come within the context of almost any current text retrieval system.

6 Evaluation Methods

One unusual characteristic of our method is that it assumes that each relevance judgement that the system makes is made independently of all other texts, as in a routing task where the system processes each incoming message in turn and assigns topics or actions for filing or routing that message. Certainly, this style has certain advantages - it is simple, clear, and makes parallel processing easy - and it reflects some real assumptions about the nature of the task. However, although it seems to have done very well relative to other systems, it is not consistent with the instructions for submitting results in TREC, and certainly doesn't lead to the best possible showing on some of the results.

Topic 77, about poaching techniques, is one example of the different (naive, perhaps) perspective toward evaluation that our system adopts. The query specifies:

    A relevant document will identify the type of wildlife being
    poached, the poaching technique or method of killing the wildlife
    which is used by the poacher, and the reason for the poaching
    (e.g. for a trophy, meat, or money).

This is a very specific query. Our test (bootstrapping) sample produced a good number of hits, but most of them failed to include one of the required pieces of information, usually the technique or method of killing. So, we narrowed the query. The result is that, for this query, the system produced 9 total documents, 6 of which were judged relevant. This is high precision (.67), but it doesn't help the overall results, since for this topic the precision at 200 documents is treated as .03. To us, narrowing the query seemed a good idea because the precision on this topic otherwise would have been low, but we did not realize that the documents that the system didn't retrieve were still treated as incorrect in this calculation.

On Topic 43 (1991 AI conferences), our system produced 3 documents, all of which were irrelevant. This "routing" topic was later discarded because no relevant documents were found in the corpus, but there is nothing inherently wrong with testing topics for which there is no data. In fact, the ideal routing system should produce 0 hits for such a topic, not 200 hits as dictated in TREC. Certainly ranking and routing don't go together in any real task on a gigabyte sample.

One way that future evaluations can test routing is to use a random (or otherwise fair) sample of the collections as a test, judge every document in that sample with respect to every query, and then measure each system's recall and precision on the basis of the sample. This would probably require less hand-work in judging relevance, but would require that each system produce topic assignments for every document in the collection (from which the assignments for the test sample would be extracted post hoc). This could be impossible for some systems. On the other hand, the strategy would give real numbers for both recall and precision, and would be much truer to the routing task.
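A sketch of the proposed scoring, with hypothetical names and document ids, shows how directly recall and precision would fall out of a fully judged sample.

    # Illustrative only: estimating per-topic recall and precision
    # from a fully judged random sample of the collection.

    def sample_scores(retrieved, relevant, sample):
        """Inputs are sets of document ids; relevant holds the sample
        documents judged relevant for one topic."""
        retrieved = retrieved & sample       # only sampled docs count
        relevant = relevant & sample
        hits = retrieved & relevant
        recall = len(hits) / len(relevant) if relevant else None
        precision = len(hits) / len(retrieved) if retrieved else None
        return recall, precision

    print(sample_scores({1, 2, 3}, {2, 3, 9}, set(range(1, 11))))
    # (0.666..., 0.666...)

Note that a topic with no relevant documents in the sample (like Topic 43) yields an undefined recall rather than penalizing a system for returning nothing.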
7 Utility

The main purpose of this method is as a front-end for computation-intensive natural language processing of large bodies of text. Because the pre-filter closely approximates more in-depth processing with a very fast, efficient process, it permits detailed processing of large volumes of text by discarding most of the irrelevant material and by producing a rough approximation of the more detailed processing.

The method is more broadly applicable to problems in information dissemination and retrieval. Accuracy is only one appealing characteristic of the technique, since the main innovation is that it allows for improved accuracy within the context of traditional word-based full-text search.

In addition to the programs described here, the method was tested with a statistical corpus analyzer that helps to identify candidate words and phrases to include in queries. This method helps to overcome some of the limitations of word-based methods in cases where statistical approaches clearly seem to do better. As an additional experiment, this automated corpus analysis can be used to reduce further the amount of labor involved in building queries.

8 Summary

GE's participation in TREC involved a small implementation of a simple strategy for compiling knowledge-based pattern-matcher rules into the language of Boolean expressions. A statistical corpus analyzer helped to formulate and refine queries for both the ad hoc and routing tasks, and the resulting matching engine ran on the entire 2.3 gigabytes of text. The simple Boolean retrieval engine performed very well on both tasks. These results are promising, both from the perspective of accuracy and for the simplicity with which they seem to bring knowledge-based techniques to bear within the rudimentary framework of word-based retrieval.

References

[1] Paul S. Jacobs. Joining statistics with NLP for text categorization. In Proceedings of the 3rd Conference on Applied Natural Language Processing, April 1992.

[2] Paul S. Jacobs, George R. Krupka, and Lisa F. Rau. Lexico-semantic pattern matching as a companion to parsing in text understanding. In Fourth DARPA Speech and Natural Language Workshop, pages 337-342, San Mateo, CA, February 1991. Morgan Kaufmann.