Feedback and Mixing Experiments with MatchPlus

Stephen I. Gallant*, William R. Caid†, Joel Carleton†, Todd W. Gutschow†, Robert Hecht-Nielsen†, Kent Pu Qing†, David Sudbeck†

HNC Inc.

* 124 Mt. Auburn St., Suite 200, Cambridge, MA 02138
† 5501 Oberlin Drive, San Diego, CA 92121
¹ Patents pending.

Abstract

We briefly review the MatchPlus system and describe recent developments with learning word representations, experiments with relevance feedback using neural network learning algorithms, and methods for combining different output lists.

1 Introduction

HNC is developing a neural network related approach to document retrieval called MatchPlus¹. Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.

To understand our notion of sensitivity to similarity of use, consider the four words 'car', 'automobile', 'driving', and 'hippopotamus'. 'Car' and 'automobile' are synonyms and very often occur together in documents; 'car' and 'driving' are related words (but not synonyms) that sometimes occur together in documents; and 'car' and 'hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand entry of synonyms. In particular we want a query on 'car' to prefer a document containing 'drive' to one containing 'hippopotamus', and we want the system itself to be able to figure this out from the corpus.

The implementation of MatchPlus is motivated by neural networks and designed to interface with neural network learning algorithms. High-dimensional (~300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space. This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback, as described below.

Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.

The following sections give a brief overview of our implementation and look at some recent improvements and experiments. For a previous description of the approach and comments on complexity considerations see [1]; a longer journal article is in preparation.

2 The Context Vector Approach

One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high-dimensional (~300) vectors called context vectors. By representing all objects in the same high-dimensional space we can easily:

1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.

2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.

3. Compute the distance of a query Q to any document. Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) is the one whose context vector $V_d$ gives the highest dot product with the query context vector $V_Q$:

   $\hat{d} = \arg\max_{d \in D} V_d \cdot V_Q$

   (proof: $\|V_d - V_Q\|^2 = \|V_d\|^2 + \|V_Q\|^2 - 2(V_d \cdot V_Q) = \mathrm{const} - 2(V_d \cdot V_Q)$).

4. Find the closest document to a given document d by treating $V_d$ as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector $V_Q' = V_Q + \alpha V_d$, where $\alpha$ is some suitable positive number (e.g., 3). (See also [8].) Note that search with $V_Q'$ takes the same amount of time as search with $V_Q$.
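These five operations reduce to a few lines of vector arithmetic. The sketch below is a minimal illustration, not the actual MatchPlus implementation; NumPy, the `stem_vectors` lookup table, and the uniform default weights are our own assumptions. It shows document/query vector formation, dot-product retrieval over normalized vectors, and the feedback update from item 5.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def object_vector(stem_vectors, stems, weights=None):
    """Document or query context vector: the normalized (weighted)
    sum of the context vectors of its stems (items 1 and 2)."""
    if weights is None:
        weights = [1.0] * len(stems)
    total = sum(w * stem_vectors[s] for s, w in zip(stems, weights))
    return normalize(total)

def retrieve(doc_matrix, v_query, k=10):
    """Item 3: with normalized document vectors (rows of doc_matrix),
    nearest in Euclidean distance = highest dot product, so ranking
    is a single matrix-vector product."""
    scores = doc_matrix @ v_query
    order = np.argsort(-scores)[:k]
    return order, scores[order]

def feedback(v_query, v_doc, alpha=3.0):
    """Item 5: V_Q' = V_Q + alpha * V_d; the new query is searched
    exactly like the old one, at the same cost."""
    return normalize(v_query + alpha * v_doc)
```

Item 4 comes for free: pass a row of doc_matrix to retrieve() as the query vector.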
2.1 Context Vector Representations

Context vector representations (or feature space representations) have a long history in cognitive science. Work by Waltz & Pollack [10] had an especially strong influence on the work reported here. They described a neural network model for word sense disambiguation and developed context vector representations (which they termed micro-feature representations). See Gallant [2] for more background on context vector representations and word sense disambiguation.

We use context vector representations for document retrieval, with all of the representation being learned from an unlabeled corpus. A main constraint for all of this work is to keep computation and storage reasonable, even for very large corpora.

2.2 Bootstrap Learning

Bootstrapping is a machine learning technique that begins with vectors having randomly generated positive and negative components, and then uses an unlabeled training corpus to modify the vectors so that similarly used terms have similar representations. Previously we had used partially hand-entered components as described in [2], but we have dispensed with all hand entry in current implementations.

Although there are important proprietary details, the basic idea for bootstrapping is to make a stem's vector more like its neighbors' by adding a fraction of their vectors to the vector of the stem in question. We make use of a key property of high-dimensional vectors: the ability to be 'similar to' a multitude of vectors. This is the same property that allows the vector sum that represents a document to be similar to the individual term vector summands. (Similarity between normalized vectors is measured by their inner product.)

Note that bootstrapping takes into account local word positioning when assigning the context vector representation for stems. Moreover, it is nearly invariant with respect to document divisions within the training corpus. This contrasts with those methods where stem representations are determined solely by the documents in which the stem lies.
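Since the exact bootstrapping algorithm is proprietary, the following sketch only illustrates the stated idea under assumptions of ours: a fixed symmetric text window (one simple way to capture local word positioning) and a single step-size parameter eta. Each stem's vector is nudged toward its neighbors' vectors and renormalized.

```python
import numpy as np

def bootstrap_pass(stem_vectors, corpus_stems, window=3, eta=0.1):
    """One pass of the bootstrapping idea of Section 2.2 (a sketch,
    not HNC's algorithm).

    stem_vectors: dict mapping stem -> context vector, initialized
                  with random positive/negative components
    corpus_stems: the training corpus as one sequence of stems; a
                  small local window barely notices document
                  divisions, consistent with near-invariance to them
    """
    for i, stem in enumerate(corpus_stems):
        lo = max(0, i - window)
        hi = min(len(corpus_stems), i + window + 1)
        neighbors = [corpus_stems[j] for j in range(lo, hi) if j != i]
        if not neighbors:
            continue
        # Make the stem's vector more like its neighbors by adding
        # a fraction of their (averaged) vectors, then renormalize.
        delta = sum(stem_vectors[n] for n in neighbors) / len(neighbors)
        v = stem_vectors[stem] + eta * delta
        stem_vectors[stem] = v / np.linalg.norm(v)
```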
2.3 Context Vectors for Documents

Once we have generated context vectors for stems, it is easy to compute the context vector for a document: we simply take a weighted sum of the context vectors for all stems appearing in the document (stopwords are discarded) and then normalize the sum. This procedure applies to documents in the training corpus as well as to new documents. When adding up stem context vectors, we can use term frequency weights similar to those of conventional IR systems.

2.4 Context Vectors for Queries; Relevance Feedback

Query context vectors are formed similarly to document context vectors. For each stem in the query we can apply a user-specified weight (default 1.0). Then we sum the corresponding context vectors and normalize the result.

Note that it is easy to implement traditional relevance feedback: the user can specify documents (with weights), and the document context vectors are simply added in with the context vectors from the other terms. We can also find documents close to a given document by using the document context vector as a query context vector.

2.5 Retrieval

The basic retrieval operation is simple; we find the document context vector closest to the query context vector and return it. There are several important points to note.

1. As many documents as desired may be retrieved, and the distances from the query context vector give some measure of retrieval quality.

2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector: $\hat{d} = \arg\max_{d} V_d \cdot V_Q$.

3. It is easy to combine keyword match with context vectors. We first use the match as a filter for documents and return matching documents in order of closeness to the query vector. If all matching documents have been retrieved, MatchPlus can revert to context vectors for finding the closest remaining documents.

4. MatchPlus requires only about 300 multiplications and additions to search a document. Moreover, it is easy to decompose the search over a corpus of documents with either parallel hardware or, less expensively, several networked conventional machines (or chips). Each machine can search a subset of the document context vectors and return the closest distances and document numbers in its subset. The closest among the distances returned by all the processors then determine the documents chosen for retrieval.

We also plan to investigate a cluster tree pruning procedure that finds nearest-neighbor document context vectors without having to compute dot products for all document context vectors. This data organization affects retrieval speed, but does not change the order in which documents are retrieved.

3 Experiments and Runs Submitted

The four runs (two ad hoc, two routing) submitted to TREC-2 are described in the following sections.

3.1 Totally Automated

This run used the entire topic as a query, with a weight of 2 applied to the topic section. We also employed a $Match filter that put documents having at least 4 of the concept terms before all other documents. Context vectors determined the actual order of the documents (as well as which documents were in the 1000-document submission list). Note that these ad hoc retrieval runs were totally automated from topic to retrieval list.
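To make the filter-then-rank behavior concrete (point 3 of Section 2.5 and the $Match filter above), here is a sketch; the per-document term sets, the function name, and the NumPy representation are illustrative assumptions of ours, while the threshold of 4 concept terms and the 1000-document cutoff come from the run just described.

```python
import numpy as np

def match_filter_rank(doc_matrix, doc_terms, concept_terms, v_query,
                      min_match=4, n_return=1000):
    """Rank documents containing at least `min_match` concept terms
    ahead of all others; within each group, order by dot product
    with the query context vector."""
    concepts = set(concept_terms)
    scores = doc_matrix @ v_query
    passed = np.array([len(concepts & set(terms)) >= min_match
                       for terms in doc_terms])
    # Matching documents first, each group sorted by score.
    matched = sorted(np.flatnonzero(passed), key=lambda d: -scores[d])
    rest = sorted(np.flatnonzero(~passed), key=lambda d: -scores[d])
    return (matched + rest)[:n_return]
```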
3.2 Relevance Feedback From the First 20 Retrievals

Here we took the top 20 documents from the previous section, read them to estimate relevance, and then modified the initial query for the remaining 980 retrievals. To change a query, we summed and normalized all relevant document context vectors from the first 20 retrievals, and then added this vector to 0.7 times the original query vector. (The 0.7 factor was obtained from experiments with a different corpus.)

3.3 Routing Using Neural Network Learning and Output Mixing

Both routing runs used two types of neural network learning.

3.3.1 Stem Weight Learning

The Stem Weight Learning approach uses neural network learning to compute weights for the terms in a query; there is one neural network input for every query term (for the routing experiments, the terms in the concept section of the topic were used). Every judged document for a query provides a training example, as follows. Let $V^n$ be the context vector for judged document n. If query term i has context vector $V_i$, then the i-th network input for training example n is given by $V^n \cdot V_i$, the inner product of $V^n$ and $V_i$. For training example n, the desired network output is +1 or -1, according to the relevance of document n.

A fast single-cell learning algorithm, the pocket algorithm with ratchet [3, 4], generated the weights. (More complex algorithms did not produce better results.) These weights were then used as term weights for normal query processing.

Note that Wong, Yao, et al. [9] previously used a similar approach, applying a variant of perceptron learning to learn weights within the SMART vector space model.

3.3.2 Full Context Vector Learning

This approach is similar to the previous Stem Weight approach, except that we compute an entire query context vector rather than weights for stems. Here we use document context vectors directly as inputs to the network; the j-th network input for training example n is given by $V^n_j$, the j-th component of the document context vector. The network weights produced by learning are directly interpreted as a query context vector.

For most topics, this approach was not as good as the previous approach, for two apparent reasons. First, the Stem Weight approach makes use of the original query terms, a valuable piece of user input. Second, the Stem Weight approach uses fewer (5-50) trainable parameters, possibly avoiding a tendency of the Full Context Vector approach (~300 parameters) to overfit the data.

3.3.3 Routing Run #1: Best Candidate

Our first routing run took the best candidate from 4 sources: the two neural network approaches, a fully automated query run (as with the first ad hoc submission), and a top-20 feedback run (as with the second ad hoc submission). For each topic, the best method was estimated from retrievals on a separate test corpus (not the corpus used for submissions). The winning method's list of 1000 retrievals was then selected as the retrievals for that topic.

3.3.4 Routing Run #2: Mix and Match

Our second routing run mixed retrievals from the same 4 sources, using a 'quality estimate' consisting of 11-point recall/precision scores determined from a run on a separate test corpus. Each document in each of the four approaches was given points proportional to the run's quality estimate and inversely proportional to the document's position number on the output list. Documents appearing on more than one list received points for each appearance. Mix and Match worked better than the previous Best Candidate approach.
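As an illustration, the sketch below assumes the simplest scoring consistent with this description: each appearance of a document on a source list earns quality/rank points, where quality is that run's 11-point score on the separate test corpus. The exact point formula used in the run may differ.

```python
from collections import defaultdict

def mix_and_match(run_lists, run_qualities, n_return=1000):
    """Combine ranked retrieval lists from several source runs.

    run_lists:     one ranked list of document ids per source run
    run_qualities: 11-point recall/precision score of each source
                   run, measured on a separate test corpus

    Points are proportional to the run's quality estimate, inversely
    proportional to the document's (1-based) rank, and accumulate
    across lists, so multiple appearances all count.
    """
    points = defaultdict(float)
    for ranked_docs, quality in zip(run_lists, run_qualities):
        for rank, doc in enumerate(ranked_docs, start=1):
            points[doc] += quality / rank
    return sorted(points, key=lambda d: -points[d])[:n_return]
```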
4 Comments

We are generally pleased with our one-year-old system's performance. (For final figures, see the appendix of this proceedings.)

In examining the data, one interesting aspect is that MatchPlus does better when measured by 11-point averages than by the number of relevant documents retrieved. This means that a comparatively higher percentage of the documents among early retrievals were judged relevant. It also raises the question of whether comparatively fewer total documents were scored by the readers, perhaps due to the MatchPlus system's different approach; computing the median number of documents scored per topic would resolve this question, as well as provide an indication of 'novelty' for TREC participants.

We are now running initial experiments with word sense disambiguation using context vectors and clustering. It will be interesting to see whether word sense disambiguation can further improve retrieval performance.

References

[1] Gallant, S. I. Context Vector Representations for Document Retrieval. AAAI-91 Natural Language Text Retrieval Workshop, Anaheim, CA, July 15, 1991.

[2] Gallant, S. I. A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks. Neural Computation, Vol. 3, No. 3, 1991, pp. 293-309.

[3] Gallant, S. I. Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, Vol. 1, No. 2, 1990.

[4] Gallant, S. I. Neural Network Learning and Expert Systems. MIT Press, 1993.

[5] Hinton, G. E. Distributed Representations. Technical Report CMU-CS-84-157, Carnegie-Mellon University, Department of Computer Science. Revised version in Rumelhart, D. E. & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, 1986.

[6] Salton, G. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[7] Salton, G. & Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, Vol. 24, No. 5, 1988, pp. 513-523.

[8] Salton, G. & Buckley, C. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

[9] Wong, S. K. M., Yao, Y. Y., Salton, G. & Buckley, C. Evaluation of an Adaptive Linear Model. Journal of the American Society for Information Science, 42(10):723-730, 1991.

[10] Waltz, D. L. & Pollack, J. B. Massively Parallel Parsing: A Strongly Interactive Model of Natural Language Interpretation. Cognitive Science, 9, 51-74, 1985.