HNC's MatchPlus System

Stephen I. Gallant*, William R. Caid†, Joel Carleton†, Robert Hecht-Nielsen†, Kent Pu Qing†, David Sudbeck†
HNC Inc.

*124 Mt Auburn St, Suite 200, Cambridge, MA 02138
†5501 Oberlin Drive, San Diego, CA 92121

1 Introduction

HNC is developing a neural network related approach to document retrieval called MatchPlus (patents pending). Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.

To understand our notion of sensitivity to similarity of use, consider the four words `car', `automobile', `driving', and `hippopotamus'. `Car' and `automobile' are synonyms and very often occur together in documents; `car' and `driving' are related words (but not synonyms) that sometimes occur together in documents; and `car' and `hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand-entry of synonyms. In particular we want a query on `car' to prefer a document containing `drive' to one containing `hippopotamus', and we want the system itself to be able to figure this out from the corpus.

The implementation of MatchPlus is motivated by neural networks and designed to interface with neural network learning algorithms. High-dimensional (~300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space. This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback.

Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.

The following sections give a brief overview of our implementation and look at some preliminary test results. For a previous description of the approach and comments on complexity considerations see [1]; a longer journal article is in preparation.

2 The Context Vector Approach

One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high-dimensional (~300) vectors called context vectors. By representing all objects in the same high-dimensional space we can easily:

1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.

2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.

3. Compute the distance of a query Q to any document. Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) is the one whose context vector V_d gives the highest dot product with the query context vector V_Q:

   {d | V_d · V_Q is maximized over d in D}

   (Proof: ||V_d - V_Q||^2 = ||V_d||^2 + ||V_Q||^2 - 2(V_d · V_Q) = const - 2(V_d · V_Q).)

4. Find the closest document to a given document d by treating V_d as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector

   V_Q' = V_Q + a V_d

   where a is some suitable positive number (e.g., 3). (See also [6].) A small code sketch of these vector operations follows.
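The paper gives no code, but the five operations above reduce to simple vector arithmetic. The following Python sketch is ours, not HNC's implementation: it assumes ~300-dimensional context vectors, unit-normalized document vectors, and a relevance-feedback constant a; the function names are invented for this illustration.

```python
import numpy as np

def normalize(v):
    """Scale a context vector to unit length."""
    return v / np.linalg.norm(v)

def document_vector(stem_vectors, weights=None):
    """Item 1: weighted sum of the stem context vectors in a document, normalized."""
    weights = weights if weights is not None else [1.0] * len(stem_vectors)
    return normalize(sum(w * v for w, v in zip(weights, stem_vectors)))

def closest_documents(query_vec, doc_vecs, k=10):
    """Item 3: with unit-length document vectors, the nearest document in
    Euclidean distance is the one with the largest dot product."""
    scores = doc_vecs @ query_vec          # one dot product per document
    return np.argsort(-scores)[:k]

def relevance_feedback(query_vec, relevant_doc_vec, a=3.0):
    """Item 5: form the new query vector V_Q' = V_Q + a * V_d."""
    return normalize(query_vec + a * relevant_doc_vec)
```

With unit-length document vectors, ranking by dot product and ranking by Euclidean distance give the same order, which is the point of the short proof in item 3.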
Note that search with V_Q' takes the same amount of time as search with V_Q.

2.1 Context Vector Representations

Context vector representations (or feature space representations) have a long history in cognitive science. Work by Waltz & Pollack [7] had an especially strong influence on the work reported here. They described a neural network model for word sense disambiguation and developed context vector representations (which they termed micro-feature representations). See Gallant [2] for more background on context vector representations and word sense disambiguation.

We use context vector representations for document retrieval, with most of the representation being learned from an unlabeled corpus. A main constraint for all of this work is to keep computation and storage reasonable, even for very large corpora.

To define context vectors, we begin by specifying a set of n features that are useful for differentiating among terms and contexts. These may be chosen in an ad hoc manner using "common sense", or they may consist of the high-frequency terms in a given corpus after removal of stopwords (a, the, and, ...). Figure 1 gives some typical examples that might be suitable for general news stories. Experiments suggest that the precise selection of features is not critical to system performance.

    human, machine, art, science, play, walk, lie-down, motion, speak, research,
    fun, sad, exciting, friend, family, baby, country, cold, hard, soft, sharp,
    light, big, small, red, white, blue, yellow, animal, insect, plant, tree,
    flower, fruit, fragrant, stink, past, future, high, low, wood, paper, metal,
    building, house, work, early, late, day, afternoon, morning, sunny, cloudy,
    hot, cold, humid, smart, dumb, truck, write, type, cook, eat, politics,
    entertainment, yell, boring, hot, heavy, black, mammal, bush, present,
    plastic, factory, night, rain, bright, bike, spicy

Figure 1: Some typical features. For convenience, some features that apply to `star' have been moved to the top of the list.

For any word stem k we now define its context vector, V_k, to be an n-dimensional vector, and interpret each component j of V_k as follows:

* {strongly} positive if word k is {strongly} associated with feature j
* 0 if word k is not associated with feature j
* {strongly} negative if word k {strongly} contradicts feature j.

As an example, V_astronomer might be

   <+2 +1 +1 -1 -1 0 +2 0 0 0 0 0 +1 +1 +1 +2 +1 -1 +1 -1>

using the features in Figure 1. Note that the interpretation of components of context vectors is exactly the same as the interpretation of weights in neural networks. In addition to the "word features" there are other "learned features" that primarily serve to increase the dimensionality of context vectors.

Features are deliberately chosen so that they overlap. This makes context vectors less dependent upon any individual feature. Context vector representations give built-in sensitivity to similarity of meaning between terms. For example, it is likely that the context vector for `car' would be very close to the context vector for `auto', somewhat close to the context vector for `driving', and less close to the context vector for `hippopotamus', for any "reasonable" set of features and for any "reasonable" person entering the context vectors. For more on this point see the plausibility argument in Waltz & Pollack [7]. Note that overlapping features provide a distributed representation [3], and therefore help insulate against a small number of questionable entries for context vectors.
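To make the similarity-of-use property concrete, here is a toy illustration in Python. The feature set and the hand-entered component values are invented for this example (they are not the Figure 1 features or values from the paper); the point is only that reasonable entries make `car' score highest against `auto', lower against `driving', and lowest against `hippopotamus'.

```python
import numpy as np

# Hypothetical hand-entered context vectors over a tiny invented feature set.
features = ["human", "machine", "motion", "animal", "big", "metal"]

stem_cv = {
    "car":          np.array([ 0,  2,  2,  0,  1,  2], dtype=float),
    "auto":         np.array([ 0,  2,  2,  0,  1,  2], dtype=float),
    "driving":      np.array([ 1,  1,  2,  0,  0,  1], dtype=float),
    "hippopotamus": np.array([ 0, -1,  1,  2,  2, -1], dtype=float),
}

def cosine(a, b):
    """Cosine similarity between two context vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'car' vs 'auto' ~ 1.0, vs 'driving' ~ 0.8, vs 'hippopotamus' ~ 0.0.
for w in ["auto", "driving", "hippopotamus"]:
    print(w, round(cosine(stem_cv["car"], stem_cv[w]), 2))
```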
2.2 Bootstrap Learning

Bootstrapping is a machine learning technique that begins with hand-entered context vectors for a small set of core stems, and then uses an unlabeled training corpus to create context vectors for all remaining stems. The basic idea is to define the context vector for a new stem by making it similar to the context vectors of its neighbor stems.

Note that bootstrapping takes into account local word positioning when assigning the context vector representation for stems. Moreover, it is nearly invariant with respect to document divisions within the training corpus. This contrasts with those methods where stem representations are determined solely by those documents in which the stem lies.

We have also developed a fully-automated method for bootstrapping that requires no initial hand entry. This capability is very useful for specialized domains such as the tests on traditional IR corpora presented in the next section. We are currently researching whether the fully-automated method can perform as well as standard bootstrapping.

2.3 Context Vectors for Documents

Once we have generated context vectors for stems it is easy to compute the context vector for a document. We simply take a weighted sum of the context vectors for all stems appearing in the document (stopwords are discarded) and then normalize the sum. See Figure 2. This procedure applies to documents in the training corpus as well as to new documents. When adding up stem context vectors we can use term frequency weights similar to conventional IR systems.

Figure 2: Generating the context vector for a document. (Pipeline: document -> stoplist -> context vectors for individual words -> log(count(w)) term weighting and inverse frequency weighting -> vector sum -> normalize, V <- V/||V||.)

2.4 Context Vectors for Queries; Relevance Feedback

Query context vectors are formed similarly to document context vectors. For each stem in the query we can apply a user-specified weight (default 1.0). Then we sum the corresponding context vectors and normalize the result.

Note that it is easy to implement relevance feedback. The user can specify documents (with weights), and the document context vectors are merely added in with the context vectors from the other terms. We can also find documents close to a given document by using the document context vector as a query context vector.

2.5 Retrieval

The basic retrieval operation is simple; we find the document context vector closest to the query context vector and return it. There are several important points to note.

1. As many documents as desired may be retrieved, and the distances from the query context vector give some measure of retrieval quality.

2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector V_Q:

   max_d {V_d · V_Q}

3. It is easy to combine keyword match with context vectors. We first use the match as a filter and return matching documents in order of closeness to the query context vector. If all matching documents have been retrieved, MatchPlus can revert to context vectors for finding the closest remaining documents.

4. MatchPlus requires only about 300 multiplications and additions to search a document. Moreover, it is easy to decompose the search over a corpus of documents with either parallel hardware or, less expensively, several networked conventional machines (or chips). Each machine can search a subset of the document context vectors and return the closest distances and document numbers in its subset; the closest from among the distances returned by all the processors then determines the documents chosen for retrieval. (A sketch of this decomposition appears below.)

We are also investigating a cluster tree pruning procedure that finds nearest neighbor document context vectors without having to compute dot products for all document context vectors. This data organization affects retrieval speed, but does not change the order in which documents are retrieved.
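As a rough illustration of the decomposition in point 4, the Python sketch below partitions the normalized document context vectors across workers, scores each partition against the query with dot products, and merges the per-partition results. The partition sizes, worker count, and function names are our own choices for illustration, not details from the paper.

```python
import numpy as np

def search_partition(doc_vecs, doc_ids, query_vec, k):
    """Score one partition and return its top-k (score, doc_id) pairs."""
    scores = doc_vecs @ query_vec            # ~300 multiply-adds per document
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), doc_ids[i]) for i in top]

def distributed_search(partitions, query_vec, k=10):
    """Merge per-partition results into a single global top-k list."""
    candidates = []
    for doc_vecs, doc_ids in partitions:     # in practice, one partition per machine
        candidates.extend(search_partition(doc_vecs, doc_ids, query_vec, k))
    candidates.sort(key=lambda pair: -pair[0])
    return candidates[:k]

# Example: 4 partitions of 250 random unit vectors each, 300 dimensions.
rng = np.random.default_rng(0)
dim, per_part = 300, 250
partitions = []
for p in range(4):
    vecs = rng.normal(size=(per_part, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    ids = list(range(p * per_part, (p + 1) * per_part))
    partitions.append((vecs, ids))
query = rng.normal(size=dim)
query /= np.linalg.norm(query)
print(distributed_search(partitions, query, k=5))
```

Because each partition returns its own best k matches, the merge step only compares a handful of candidates, so the global ranking is identical to scanning all document vectors on one machine.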
3 Preliminary Results

Our system is very `young'. It has been able to handle large corpora (1,000,000+ documents) only since July 1992. Nevertheless we have some promising results.

In Figure 3, we see that MatchPlus gives comparable performance to Salton's SMART system [5] on small, traditional IR test corpora when corresponding term weighting schemes are used. Salton reports significantly improved performance (10-50%) with other term weighting methods, and we are in the process of running the corresponding tests with MatchPlus.

These experiments used fully automated bootstrapping with no hand entry of context vectors. Experiments on other corpora with a hand-entered set of core stems show 3% to 15% improvement, with larger improvement on smaller corpora. Bootstrapping for the tests in Figure 3 was on the target corpus only, with maximum size being 3200 documents.

               CISI     CACM     MED
   MatchPlus   .1749      -      .5013
   SMART       .1410    .2535    .5062

   Notes:
   - MatchPlus: $Match(3) filter.
   - SMART figures from Salton [5].
   - Comparisons use classical idf term weighting for both systems.

Figure 3: MatchPlus results are comparable to SMART system results on traditional IR corpora using a corresponding term weighting method.

We have found a significant advantage to bootstrapping with larger corpora, as shown in Figure 4. We also plan experiments where bootstrapping begins with stem context vectors generated from a larger corpus.

   Bootstrap corpus size (# docs)   50,000   200,000   500,000
   Performance                       36.5     39.0      39.9
   Improvement                         -        7%        9%

   Notes:
   - Performance is the average number of relevant documents for 200 retrievals, using the Tipster corpus.

Figure 4: Improvement with size of bootstrap corpus.

Experiments with the Tipster/TREC corpus are in progress.

4 Concluding Comments

The key feature of MatchPlus is its uniform representation of all objects by context vectors. This makes possible a large number of interesting experiments for the next year, such as:

* use of neural network learning algorithms to perform fast automated query modification based upon user feedback
* word sense disambiguation as described in [2]
* cluster tree speedup for retrieval
* different term weighting methods for document and query context vector creation, as well as for bootstrapping.

Although young, we believe MatchPlus has already shown encouraging results, and we look forward to growing it.
References

[1] Gallant, S. I. Context Vector Representations for Document Retrieval. AAAI-91 Natural Language Text Retrieval Workshop, Anaheim, CA, July 15, 1991.

[2] Gallant, S. I. A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks. Neural Computation, Vol. 3, No. 3, 1991, pp. 293-309.

[3] Hinton, G. E. Distributed Representations. Technical Report CMU-CS-84-157, Carnegie-Mellon University, Department of Computer Science. Revised version in Rumelhart, D. E. & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1. MIT Press, 1986.

[4] Salton, G. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[5] Salton, G. & Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, Vol. 24, No. 5, 1988, pp. 513-523.

[6] Salton, G. & Buckley, C. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

[7] Waltz, D. L. & Pollack, J. B. Massively Parallel Parsing: A Strongly Interactive Model of Natural Language Interpretation. Cognitive Science, 9, 51-74, 1985.