HNC's MatchPlus System

Stephen I. Gallant*, William R. Caid†, Joel Carleton†, Robert Hecht-Nielsen†, Kent Pu Qing†, David Sudbeck†
HNC Inc.

*124 Mt Auburn St, Suite 200, Cambridge, MA 02138
†5501 Oberlin Drive, San Diego, CA 92121

1 Introduction

HNC is developing a neural network related approach to document retrieval called MatchPlus (patents pending). Goals of this approach include high precision/recall performance, ease of use, incorporation of machine learning algorithms, and sensitivity to similarity of use.

To understand our notion of sensitivity to similarity of use, consider the four words `car', `automobile', `driving', and `hippopotamus'. `Car' and `automobile' are synonyms and very often occur together in documents; `car' and `driving' are related words (but not synonyms) that sometimes occur together in documents; and `car' and `hippopotamus' are essentially unrelated words that seldom occur within the same document. We want the system to be sensitive to such similarity of use, much like a built-in thesaurus, yet without the drawbacks of a thesaurus, such as domain dependence or the need for hand-entry of synonyms. In particular we want a query on `car' to prefer a document containing `drive' to one containing `hippopotamus', and we want the system itself to be able to figure this out from the corpus.

The implementation of MatchPlus is motivated by neural networks and designed to interface with neural network learning algorithms. High-dimensional (~300) vectors, called context vectors, represent word stems, documents, and queries in the same vector space. This representation permits one type of neural network learning algorithm to generate stem context vectors that are sensitive to similarity of use, and a more standard neural network algorithm to perform routing and automatic query modification based upon user feedback.

Queries can take the form of terms, full documents, parts of documents, and/or conventional Boolean expressions. Optional weights may also be included.

The following sections give a brief overview of our implementation and look at some preliminary test results. For a previous description of the approach and comments on complexity considerations see [1]; a longer journal article is in preparation.

2 The Context Vector Approach

One of the most important aspects of MatchPlus is its representation of words (stems), documents, and queries by high-dimensional (~300) vectors called context vectors. By representing all objects in the same high-dimensional space we can easily:

1. Form a document context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the document.

2. Form a query context vector as the (weighted) vector sum of the context vectors for those words (stems) contained in the query.

3. Compute the distance of a query Q to any document. Moreover, if document context vectors are normalized, the closest document d (in Euclidean distance) is the one whose context vector V_d gives the highest dot product with the query context vector V_Q:

   {d | V_d · V_Q is maximized over d in D}

   (Proof: ||V_d - V_Q||^2 = ||V_d||^2 + ||V_Q||^2 - 2(V_d · V_Q) = const - 2(V_d · V_Q).)

4. Find the closest document to a given document d by treating V_d as a query vector.

5. Perform relevance feedback. If d is a relevant document for query Q, form a new query vector

   V_Q' = V_Q + a V_d

   where a is some suitable positive number (e.g., 3). (See also [6].) A small code sketch of these vector operations follows.
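The paper gives no code, but the five operations above reduce to simple vector arithmetic. The following Python sketch is ours, not HNC's implementation: it assumes ~300-dimensional context vectors, unit-normalized document vectors, and a relevance-feedback constant a; the function names are invented for this illustration.

```python
import numpy as np

def normalize(v):
    """Scale a context vector to unit length."""
    return v / np.linalg.norm(v)

def document_vector(stem_vectors, weights=None):
    """Item 1: weighted sum of the stem context vectors in a document, normalized."""
    weights = weights if weights is not None else [1.0] * len(stem_vectors)
    return normalize(sum(w * v for w, v in zip(weights, stem_vectors)))

def closest_documents(query_vec, doc_vecs, k=10):
    """Item 3: with unit-length document vectors, the nearest document in
    Euclidean distance is the one with the largest dot product."""
    scores = doc_vecs @ query_vec          # one dot product per document
    return np.argsort(-scores)[:k]

def relevance_feedback(query_vec, relevant_doc_vec, a=3.0):
    """Item 5: form the new query vector V_Q' = V_Q + a * V_d."""
    return normalize(query_vec + a * relevant_doc_vec)
```

With unit-length document vectors, ranking by dot product and ranking by Euclidean distance give the same order, which is the point of the short proof in item 3.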
Note that search with V_Q' takes the same amount of time as search with V_Q.

2.1 Context Vector Representations

Context vector representations (or feature space representations) have a long history in cognitive science. Work by Waltz & Pollack [7] had an especially strong influence on the work reported here. They described a neural network model for word sense disambiguation and developed context vector representations (which they termed micro-feature representations). See Gallant [2] for more background on context vector representations and word sense disambiguation.

We use context vector representations for document retrieval, with most of the representation being learned from an unlabeled corpus. A main constraint for all of this work is to keep computation and storage reasonable, even for very large corpora.

To define context vectors, we begin by specifying a set of n features that are useful for differentiating among terms and contexts. These may be chosen in an ad hoc manner using "common sense", or they may consist of the high-frequency terms in a given corpus after removal of stopwords (a, the, and, ...). Figure 1 gives some typical examples that might be suitable for general news stories. Experiments suggest that the precise selection of features is not critical to system performance.

    human, machine, art, science, play, walk, lie-down, motion, speak, research,
    fun, sad, exciting, friend, family, baby, country, cold, hard, soft, sharp,
    light, big, small, red, white, blue, yellow, animal, insect, plant, tree,
    flower, fruit, fragrant, stink, past, future, high, low, wood, paper, metal,
    building, house, work, early, late, day, afternoon, morning, sunny, cloudy,
    hot, cold, humid, smart, dumb, truck, write, type, cook, eat, politics,
    entertainment, yell, boring, hot, heavy, black, mammal, bush, present,
    plastic, factory, night, rain, bright, bike, spicy

Figure 1: Some typical features. For convenience, some features that apply to `star' have been moved to the top of the list.

For any word stem k we now define its context vector, V_k, to be an n-dimensional vector, and interpret each component j of V_k as follows:

* {strongly} positive if word k is {strongly} associated with feature j
* 0 if word k is not associated with feature j
* {strongly} negative if word k {strongly} contradicts feature j.

As an example, V_astronomer might be

   <+2 +1 +1 -1 -1 0 +2 0 0 0 0 0 +1 +1 +1 +2 +1 -1 +1 -1>

using the features in Figure 1. Note that the interpretation of components of context vectors is exactly the same as the interpretation of weights in neural networks. In addition to the "word features" there are other "learned features" that primarily serve to increase the dimensionality of context vectors.

Features are deliberately chosen so that they overlap. This makes context vectors less dependent upon any individual feature. Context vector representations give built-in sensitivity to similarity of meaning between terms. For example, it is likely that the context vector for `car' would be very close to the context vector for `auto', somewhat close to the context vector for `driving', and less close to the context vector for `hippopotamus', for any "reasonable" set of features and for any "reasonable" person entering the context vectors. For more on this point see the plausibility argument in Waltz & Pollack [7]. Note that overlapping features provide a distributed representation [3], and therefore help insulate against a small number of questionable entries for context vectors.
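To make the similarity-of-use property concrete, here is a toy illustration in Python. The feature set and the hand-entered component values are invented for this example (they are not the Figure 1 features or values from the paper); the point is only that reasonable entries make `car' score highest against `auto', lower against `driving', and lowest against `hippopotamus'.

```python
import numpy as np

# Hypothetical hand-entered context vectors over a tiny invented feature set.
features = ["human", "machine", "motion", "animal", "big", "metal"]

stem_cv = {
    "car":          np.array([ 0,  2,  2,  0,  1,  2], dtype=float),
    "auto":         np.array([ 0,  2,  2,  0,  1,  2], dtype=float),
    "driving":      np.array([ 1,  1,  2,  0,  0,  1], dtype=float),
    "hippopotamus": np.array([ 0, -1,  1,  2,  2, -1], dtype=float),
}

def cosine(a, b):
    """Cosine similarity between two context vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'car' vs 'auto' ~ 1.0, vs 'driving' ~ 0.8, vs 'hippopotamus' ~ 0.0.
for w in ["auto", "driving", "hippopotamus"]:
    print(w, round(cosine(stem_cv["car"], stem_cv[w]), 2))
```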
2.2 Bootstrap Learning

Bootstrapping is a machine learning technique that begins with hand-entered context vectors for a small set of core stems, and then uses an unlabeled training corpus to create context vectors for all remaining stems. The basic idea is to define the context vector for a new stem by making it similar to the context vectors of its neighbor stems.

Note that bootstrapping takes into account local word positioning when assigning the context vector representation for stems. Moreover, it is nearly invariant with respect to document divisions within the training corpus. This contrasts with those methods where stem representations are determined solely by those documents in which the stem lies.

We have also developed a fully-automated method for bootstrapping that requires no initial hand entry. This capability is very useful for specialized domains such as the tests on traditional IR corpora presented in the next section. We are currently researching whether the fully-automated method can perform as well as standard bootstrapping.

2.3 Context Vectors for Documents

Once we have generated context vectors for stems it is easy to compute the context vector for a document. We simply take a weighted sum of the context vectors for all stems appearing in the document (stopwords are discarded) and then normalize the sum. See Figure 2. This procedure applies to documents in the training corpus as well as to new documents. When adding up stem context vectors we can use term frequency weights similar to conventional IR systems.

Figure 2: Generating the context vector for a document. (Pipeline: document -> stoplist -> context vectors for individual words -> log(count(w)) term weighting and inverse frequency weighting -> vector sum -> normalize, V <- V/||V||.)

2.4 Context Vectors for Queries; Relevance Feedback

Query context vectors are formed similarly to document context vectors. For each stem in the query we can apply a user-specified weight (default 1.0). Then we sum the corresponding context vectors and normalize the result.

Note that it is easy to implement relevance feedback. The user can specify documents (with weights), and the document context vectors are merely added in with the context vectors from the other terms. We can also find documents close to a given document by using the document context vector as a query context vector.

2.5 Retrieval

The basic retrieval operation is simple; we find the document context vector closest to the query context vector and return it. There are several important points to note.

1. As many documents as desired may be retrieved, and the distances from the query context vector give some measure of retrieval quality.

2. Because document context vectors are normalized, we may simply find the document d that maximizes the dot product with the query context vector V_Q:

   max_d {V_d · V_Q}

3. It is easy to combine keyword match with context vectors. We first use the match as a filter and return matching documents in order of closeness to the query context vector. If all matching documents have been retrieved, MatchPlus can revert to context vectors for finding the closest remaining documents.

4. MatchPlus requires only about 300 multiplications and additions to search a document. Moreover, it is easy to decompose the search over a corpus of documents with either parallel hardware or, less expensively, several networked conventional machines (or chips). Each machine can search a subset of the document context vectors and return the closest distances and document numbers in its subset; the closest from among the distances returned by all the processors then determines the documents chosen for retrieval. (A sketch of this decomposition appears below.)

We are also investigating a cluster tree pruning procedure that finds nearest neighbor document context vectors without having to compute dot products for all document context vectors. This data organization affects retrieval speed, but does not change the order in which documents are retrieved.
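As a rough illustration of the decomposition in point 4, the Python sketch below partitions the normalized document context vectors across workers, scores each partition against the query with dot products, and merges the per-partition results. The partition sizes, worker count, and function names are our own choices for illustration, not details from the paper.

```python
import numpy as np

def search_partition(doc_vecs, doc_ids, query_vec, k):
    """Score one partition and return its top-k (score, doc_id) pairs."""
    scores = doc_vecs @ query_vec            # ~300 multiply-adds per document
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), doc_ids[i]) for i in top]

def distributed_search(partitions, query_vec, k=10):
    """Merge per-partition results into a single global top-k list."""
    candidates = []
    for doc_vecs, doc_ids in partitions:     # in practice, one partition per machine
        candidates.extend(search_partition(doc_vecs, doc_ids, query_vec, k))
    candidates.sort(key=lambda pair: -pair[0])
    return candidates[:k]

# Example: 4 partitions of 250 random unit vectors each, 300 dimensions.
rng = np.random.default_rng(0)
dim, per_part = 300, 250
partitions = []
for p in range(4):
    vecs = rng.normal(size=(per_part, dim))
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    ids = list(range(p * per_part, (p + 1) * per_part))
    partitions.append((vecs, ids))
query = rng.normal(size=dim)
query /= np.linalg.norm(query)
print(distributed_search(partitions, query, k=5))
```

Because each partition returns its own best k matches, the merge step only compares a handful of candidates, so the global ranking is identical to scanning all document vectors on one machine.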
3 Preliminary Results

Our system is very `young'. It has been able to handle large corpora (1,000,000+ documents) only since July 1992. Nevertheless we have some promising results.

In Figure 3, we see that MatchPlus gives comparable performance to Salton's SMART system [5] on small, traditional IR test corpora when corresponding term weighting schemes are used. Salton reports significantly improved performance (10-50%) with other term weighting methods, and we are in the process of running the corresponding tests with MatchPlus.

These experiments used fully automated bootstrapping with no hand entry of context vectors. Experiments on other corpora with a hand-entered set of core stems show 3% to 15% improvement, with larger improvement on smaller corpora. Bootstrapping for the tests in Figure 3 was on the target corpus only, with maximum size being 3200 documents.

               CISI     CACM     MED
   MatchPlus   .1749      -      .5013
   SMART       .1410    .2535    .5062

   Notes:
   - MatchPlus: $Match(3) filter.
   - SMART figures from Salton [5].
   - Comparisons use classical idf term weighting for both systems.

Figure 3: MatchPlus results are comparable to SMART system results on traditional IR corpora using a corresponding term weighting method.

We have found a significant advantage to bootstrapping with larger corpora, as shown in Figure 4. We also plan experiments where bootstrapping begins with stem context vectors generated from a larger corpus.

   Bootstrap corpus size (# docs)   50,000   200,000   500,000
   Performance                       36.5     39.0      39.9
   Improvement                         -        7%        9%

   Notes:
   - Performance is the average number of relevant documents for 200 retrievals, using the Tipster corpus.

Figure 4: Improvement with size of bootstrap corpus.

Experiments with the Tipster/TREC corpus are in progress.

4 Concluding Comments

The key feature of MatchPlus is its uniform representation of all objects by context vectors. This makes possible a large number of interesting experiments for the next year, such as:

* use of neural network learning algorithms to perform fast automated query modification based upon user feedback
* word sense disambiguation as described in [2]
* cluster tree speedup for retrieval
* different term weighting methods for document and query context vector creation, as well as for bootstrapping.

Although young, we believe MatchPlus has already shown encouraging results, and we look forward to growing it.
References

[1] Gallant, S. I. Context Vector Representations for Document Retrieval. AAAI-91 Natural Language Text Retrieval Workshop, Anaheim, CA, July 15, 1991.

[2] Gallant, S. I. A Practical Approach for Representing Context and for Performing Word Sense Disambiguation Using Neural Networks. Neural Computation, Vol. 3, No. 3, 1991, pp. 293-309.

[3] Hinton, G. E. Distributed Representations. Technical Report CMU-CS-84-157, Carnegie-Mellon University, Department of Computer Science. Revised version in Rumelhart, D. E. & McClelland, J. L. (Eds.), Parallel Distributed Processing: Explorations in the Microstructures of Cognition, Vol. 1. MIT Press, 1986.

[4] Salton, G. The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall, 1971.

[5] Salton, G. & Buckley, C. Term-Weighting Approaches in Automatic Text Retrieval. Information Processing & Management, Vol. 24, No. 5, 1988, pp. 513-523.

[6] Salton, G. & Buckley, C. Improving Retrieval Performance by Relevance Feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

[7] Waltz, D. L. & Pollack, J. B. Massively Parallel Parsing: A Strongly Interactive Model of Natural Language Interpretation. Cognitive Science, 9, 51-74, 1985.