Feedback and Mixing Experiments with MatchPlus

Stephen I. Gallant*, William R. Caid†, Joel Carleton†, Todd W. Gutschow†, Robert Hecht-Nielsen†, Kent Pu Qing†, David Sudbeck†

HNC Inc.

* 124 Mt. Auburn St., Suite 200, Cambridge, MA 02138
† 5501 Oberlin Drive, San Diego, CA 92121
¹ Patents pending.

Abstract

We briefly review the MatchPlus system and describe recent developments with learning word representations, experiments with relevance feedback using neural network learning algorithms, and methods for combining different output lists.

1 Introduction

HNC is developing a neural network related approach to document retrieval called MatchPlus¹. Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.

To understand our notion of sensitivity to similarity of use, consider the four words 'car', 'automobile', 'driving', and 'hippopotamus'. 'Car' and 'automobile' are synonyms and very often occur together in documents; 'car' and 'driving' are related words (but not synonyms) that sometimes occur together in documents; and 'car' and 'hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand entry of synonyms. In particular we want a query on 'car' to prefer a document containing 'drive' to one containing 'hippopotamus', and we want the system itself to be able to figure this out from the corpus.

The implementation of MatchPlus is motivated by neural networks and designed to interface with neural network learning algorithms. High-dimensional (~300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space. This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback, as described below.

Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.

The following sections give a brief overview of our implementation and look at some recent improvements and experiments. For a previous description of the approach and comments on complexity considerations see [1]; a longer journal article is in preparation.

2 The Context Vector Approach

One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high-dimensional (~300) vectors called context vectors. By representing all objects in the same high-dimensional space we can easily:

1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.

2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.

3. Compute the distance of a query Q to any document. Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) is the one whose context vector $V_d$ gives the highest dot product with the query context vector $V_Q$:

   $\hat{d} = \arg\max_{d \in D} V_d \cdot V_Q$

   (proof: $\|V_d - V_Q\|^2 = \|V_d\|^2 + \|V_Q\|^2 - 2(V_d \cdot V_Q) = \mathrm{const} - 2(V_d \cdot V_Q)$).

4. Find the closest document to a given document d by treating $V_d$ as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector $V_Q' = V_Q + \alpha V_d$, where $\alpha$ is some suitable positive number (e.g., 3). (See also [8].) Note that search with $V_Q'$ takes the same amount of time as search with $V_Q$.
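These five operations reduce to a few lines of vector arithmetic. The sketch below is a minimal illustration, not the actual MatchPlus implementation; NumPy, the `stem_vectors` lookup table, and the uniform default weights are our own assumptions. It shows document/query vector formation, dot-product retrieval over normalized vectors, and the feedback update from item 5.

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

def object_vector(stem_vectors, stems, weights=None):
    """Document or query context vector: the normalized (weighted)
    sum of the context vectors of its stems (items 1 and 2)."""
    if weights is None:
        weights = [1.0] * len(stems)
    total = sum(w * stem_vectors[s] for s, w in zip(stems, weights))
    return normalize(total)

def retrieve(doc_matrix, v_query, k=10):
    """Item 3: with normalized document vectors (rows of doc_matrix),
    nearest in Euclidean distance = highest dot product, so ranking
    is a single matrix-vector product."""
    scores = doc_matrix @ v_query
    order = np.argsort(-scores)[:k]
    return order, scores[order]

def feedback(v_query, v_doc, alpha=3.0):
    """Item 5: V_Q' = V_Q + alpha * V_d; the new query is searched
    exactly like the old one, at the same cost."""
    return normalize(v_query + alpha * v_doc)
```

Item 4 comes for free: pass a row of doc_matrix to retrieve() as the query vector.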
2.1 Context Vector Representations

Context vector representations (or feature space representations) have a long history in cognitive science. Work by Waltz & Pollack [10] had an especially strong influence on the work reported here. They described a neural network model for word sense disambiguation and developed context vector representations (which they termed micro-feature representations). See Gallant [2] for more background on context vector representations and word sense disambiguation.

We use context vector representations for document retrieval, with all of the representation being learned from an unlabeled corpus. A main constraint for all of this work is to keep computation and storage reasonable, even for very large corpora.

2.2 Bootstrap Learning

Bootstrapping is a machine learning technique that begins with vectors having randomly generated positive and negative components, and then uses an unlabeled training corpus to modify the vectors so that similarly used terms have similar representations. Previously we had used partially hand-entered components as described in [2], but we have dispensed with all hand entry in current implementations.

Although there are important proprietary details, the basic idea for bootstrapping is to make a stem's vector more like its neighbors' by adding a fraction of their vectors to the vector of the stem in question. We make use of a key property of high-dimensional vectors: the ability to be 'similar to' a multitude of vectors. This is the same property that allows the vector sum that represents a document to be similar to the individual term vector summands. (Similarity between normalized vectors is measured by their inner product.)

Note that bootstrapping takes into account local word positioning when assigning the context vector representation for stems. Moreover, it is nearly invariant with respect to document divisions within the training corpus. This contrasts with those methods where stem representations are determined solely by the documents in which the stem lies.
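Since the exact bootstrapping algorithm is proprietary, the following sketch only illustrates the stated idea under assumptions of ours: a fixed symmetric text window (one simple way to capture local word positioning) and a single step-size parameter eta. Each stem's vector is nudged toward its neighbors' vectors and renormalized.

```python
import numpy as np

def bootstrap_pass(stem_vectors, corpus_stems, window=3, eta=0.1):
    """One pass of the bootstrapping idea of Section 2.2 (a sketch,
    not HNC's algorithm).

    stem_vectors: dict mapping stem -> context vector, initialized
                  with random positive/negative components
    corpus_stems: the training corpus as one sequence of stems; a
                  small local window barely notices document
                  divisions, consistent with near-invariance to them
    """
    for i, stem in enumerate(corpus_stems):
        lo = max(0, i - window)
        hi = min(len(corpus_stems), i + window + 1)
        neighbors = [corpus_stems[j] for j in range(lo, hi) if j != i]
        if not neighbors:
            continue
        # Make the stem's vector more like its neighbors by adding
        # a fraction of their (averaged) vectors, then renormalize.
        delta = sum(stem_vectors[n] for n in neighbors) / len(neighbors)
        v = stem_vectors[stem] + eta * delta
        stem_vectors[stem] = v / np.linalg.norm(v)
```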
2.3 Context Vectors for Documents

Once we have generated context vectors for stems, it is easy to compute the context vector for a document: we simply take a weighted sum of the context vectors for all stems appearing in the document (stopwords are discarded) and then normalize the sum. This procedure applies to documents in the training corpus as well as to new documents. When adding up stem context vectors, we can use term frequency weights similar to those of conventional IR systems.

2.4 Context Vectors for Queries; Relevance Feedback

Query context vectors are formed similarly to document context vectors. For each stem in the query we can apply a user-specified weight (default 1.0). Then we sum the corresponding context vectors and normalize the result.

Note that it is easy to implement traditional relevance feedback: the user can specify documents (with weights), and the document context vectors are simply added in with the context vectors from the other terms. We can also find documents close to a given document by using the document context vector as a query context vector.

2.5 Retrieval

The basic retrieval operation is simple; we find the document context vector closest to the query context vector and return it. There are several important points to note.

1. As many documents as desired may be retrieved, and the distances from the query context vector give some measure of retrieval quality.

2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector: $\hat{d} = \arg\max_{d} V_d \cdot V_Q$.

3. It is easy to combine keyword match with context vectors. We first use the match as a filter for documents and return matching documents in order of closeness to the query vector. If all matching documents have been retrieved, MatchPlus can revert to context vectors for finding the closest remaining documents.

4. MatchPlus requires only about 300 multiplications and additions to search a document. Moreover, it is easy to decompose the search over a corpus of documents with either parallel hardware or, less expensively, several networked conventional machines (or chips). Each machine can search a subset of the document context vectors and return the closest distances and document numbers in its subset. The closest among the distances returned by all the processors then determine the documents chosen for retrieval.

We also plan to investigate a cluster tree pruning procedure that finds nearest-neighbor document context vectors without having to compute dot products for all document context vectors. This data organization affects retrieval speed, but does not change the order in which documents are retrieved.

3 Experiments and Runs Submitted

The four runs (two ad hoc, two routing) submitted to TREC-2 are described in the following sections.

3.1 Totally Automated

This run used the entire topic as a query, with a weight of 2 applied to the topic section. We also employed a $Match filter that put documents having at least 4 of the concept terms before all other documents. Context vectors determined the actual order of the documents (as well as which documents were in the 1000-document submission list). Note that these ad hoc retrieval runs were totally automated from topic to retrieval list.
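To make the filter-then-rank behavior concrete (point 3 of Section 2.5 and the $Match filter above), here is a sketch; the per-document term sets, the function name, and the NumPy representation are illustrative assumptions of ours, while the threshold of 4 concept terms and the 1000-document cutoff come from the run just described.

```python
import numpy as np

def match_filter_rank(doc_matrix, doc_terms, concept_terms, v_query,
                      min_match=4, n_return=1000):
    """Rank documents containing at least `min_match` concept terms
    ahead of all others; within each group, order by dot product
    with the query context vector."""
    concepts = set(concept_terms)
    scores = doc_matrix @ v_query
    passed = np.array([len(concepts & set(terms)) >= min_match
                       for terms in doc_terms])
    # Matching documents first, each group sorted by score.
    matched = sorted(np.flatnonzero(passed), key=lambda d: -scores[d])
    rest = sorted(np.flatnonzero(~passed), key=lambda d: -scores[d])
    return (matched + rest)[:n_return]
```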
3.2 Relevance Feedback From the First 20 Retrievals

Here we took the top 20 documents from the previous section, read them to estimate relevance, and then modified the initial query for the remaining 980 retrievals. To change a query, we summed and normalized all relevant document context vectors from the first 20 retrievals, and then added this vector to 0.7 times the original query vector. (The 0.7 factor was obtained from experiments with a different corpus.)

3.3 Routing Using Neural Network Learning and Output Mixing

Both routing runs used two types of neural network learning.

3.3.1 Stem Weight Learning

The Stem Weight Learning approach uses neural network learning to compute weights for the terms in a query; there is one neural network input for every query term (for the routing experiments, the terms in the concept section of the topic were used). Every judged document for a query provides a training example, as follows. Let $V^n$ be the context vector for judged document n. If query term i has context vector $V_i$, then the i-th network input for training example n is given by $V^n \cdot V_i$, the inner product of $V^n$ and $V_i$. For training example n, the desired network output is +1 or -1, according to the relevance of document n.

A fast single-cell learning algorithm, the pocket algorithm with ratchet [3, 4], generated the weights. (More complex algorithms did not produce better results.) These weights were then used as term weights for normal query processing.

Note that Wong, Yao, et al. [9] previously used a similar approach, applying a variant of perceptron learning to learn weights within the SMART vector space model.

3.3.2 Full Context Vector Learning

This approach is similar to the previous Stem Weight approach, except that we compute an entire query context vector rather than weights for stems. Here we use document context vectors directly as inputs to the network; the j-th network input for training example n is given by $V^n_j$, the j-th component of the document context vector. The network weights produced by learning are directly interpreted as a query context vector.

For most topics, this approach was not as good as the previous approach, for two apparent reasons. First, the Stem Weight approach makes use of the original query terms, a valuable piece of user input. Second, the Stem Weight approach uses fewer (5-50) trainable parameters, possibly avoiding a tendency of the Full Context Vector approach (~300 parameters) to overfit the data.

3.3.3 Routing Run #1: Best Candidate

Our first routing run took the best candidate from 4 sources: the two neural network approaches, a fully automated query run (as with the first ad hoc submission), and a top-20 feedback run (as with the second ad hoc submission). For each topic, the best method was estimated from retrievals on a separate test corpus (not the corpus used for submissions). The winning method's list of 1000 retrievals was then selected as the retrievals for that topic.

3.3.4 Routing Run #2: Mix and Match

Our second routing run mixed retrievals from the same 4 sources, using a 'quality estimate' consisting of 11-point recall/precision scores determined from a run on a separate test corpus. Each document in each of the four approaches was given points proportional to the run's quality estimate and inversely proportional to the document's position number on the output list. Documents appearing on more than one list received points for each appearance. Mix and Match worked better than the previous Best Candidate approach.
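As an illustration, the sketch below assumes the simplest scoring consistent with this description: each appearance of a document on a source list earns quality/rank points, where quality is that run's 11-point score on the separate test corpus. The exact point formula used in the run may differ.

```python
from collections import defaultdict

def mix_and_match(run_lists, run_qualities, n_return=1000):
    """Combine ranked retrieval lists from several source runs.

    run_lists:     one ranked list of document ids per source run
    run_qualities: 11-point recall/precision score of each source
                   run, measured on a separate test corpus

    Points are proportional to the run's quality estimate, inversely
    proportional to the document's (1-based) rank, and accumulate
    across lists, so multiple appearances all count.
    """
    points = defaultdict(float)
    for ranked_docs, quality in zip(run_lists, run_qualities):
        for rank, doc in enumerate(ranked_docs, start=1):
            points[doc] += quality / rank
    return sorted(points, key=lambda d: -points[d])[:n_return]
```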
4 Comments

We are generally pleased with our one-year-old system's performance. (For final figures, see the appendix of this proceedings.)

In examining the data, one interesting aspect is that MatchPlus does better when measured by 11-point averages than by the number of relevant documents retrieved. This means that a comparatively higher percentage of the documents among early retrievals were judged relevant. It also raises the question of whether comparatively fewer total documents were scored by the readers, perhaps due to the MatchPlus system's different approach; computing the median number of documents scored per topic would resolve this question, as well as provide an indication of 'novelty' for TREC participants.

We are now running initial experiments with word sense disambiguation using context vectors and clustering. It will be interesting to see whether word sense disambiguation can further improve retrieval performance.

References

[1] Gallant, S. I. Context Vector Representations for Document Retrieval. AAAI-91 Natural Language Text Retrieval Workshop, Anaheim, CA, July 15, 1991.

[2] Gallant, S. I. A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks. Neural Computation, Vol. 3, No. 3, 1991, pp. 293-309.

[3] Gallant, S. I. Perceptron-Based Learning Algorithms. IEEE Transactions on Neural Networks, Vol. 1, No. 2, 1990.

[4] Gallant, S. I. Neural Network Learning and Expert Systems. MIT Press, 1993.

[5] Hinton, G. E. Distributed Representations. Technical Report CMU-CS-84-157, Carnegie-Mellon University, Department of Computer Science. Revised version in Rumelhart, D. E. & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, 1986.

[6] Salton, G. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[7] Salton, G. & Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, Vol. 24, No. 5, 1988, pp. 513-523.

[8] Salton, G. & Buckley, C. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

[9] Wong, S. K. M., Yao, Y. Y., Salton, G. & Buckley, C. Evaluation of an Adaptive Linear Model. Journal of the American Society for Information Science, 42(10):723-730, 1991.

[10] Waltz, D. L. & Pollack, J. B. Massively Parallel Parsing: A Strongly Interactive Model of Natural Language Interpretation. Cognitive Science, 9, 51-74, 1985.