Vector Expansion in a Large Collection

Ellen M. Voorhees and Yuan-Wang Hou
Siemens Corporate Research, Inc.
755 College Road East
Princeton, New Jersey 08540

Abstract

This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections. The retrieval results indicate that the particular expansion technique used here improves the performance of some queries, but degrades the performance of other queries. The overall effectiveness of the method is competitive with other systems. The variability of the method is attributable to two main factors: the choice of concepts that are expanded and the confounding effects expansion has on concept weights. Addressing these problems will require both a better method for determining the important concepts of a text and a better method for determining the correct sense of an ambiguous word.

1 Introduction

In many retrieval systems the similarity between two texts is a function of the number of word stems that appear in both texts. While these systems are often efficient and robust, their effectiveness is depressed by the presence of homographs (words that are spelled the same but mean different things) and synonyms (different words that mean the same thing) in the texts. Homographs depress precision by causing false matches. Synonyms depress recall by causing conceptual matches to be missed. That is, if a query and a document are about the same topic, but use different words to express the idea, the document will not be retrieved in response to the query. We are investigating how concept spaces, data structures that define semantic relationships among ideas, can be used to mitigate the effects of synonymy and homography in retrieval systems designed to satisfy large-scale information needs.
We impose two constraints on our research with the goal of making the resulting methods more applicable to retrieving documents from large corpora. First, we want to keep human intervention in the indexing and retrieval processes at a minimum; therefore, we use strictly automatic procedures. Second, even automatic procedures need to be relatively efficient. We believe this efficiency requirement precludes the use of deep analyses of document content for the foreseeable future, and we restrict ourselves to statistical processing of the text and concept space.

There are effectiveness concerns when dealing with large corpora as well as efficiency concerns. Large corpora usually imply a diverse vocabulary, and thus the synonym and homograph problems are exacerbated. In this paper we investigate vector expansion as a solution to the synonymy problem for the large TREC collection. As the name "vector expansion" implies, we are working within the vector space model of information retrieval [5]: both documents and topics are represented as weighted vectors and the similarity between two texts is computed as the inner product of their respective vectors. The vectors are expanded by terms related to original text words in our concept space. In particular, since we are using the WordNet lexical database as our concept space, vectors are expanded by adding selected synonyms of original text words. Although using thesauri to expand vectors has been done before (see, for example, [7], [2], [4]), it has always been done on small collections. We are interested in investigating whether comparatively simple statistical expansion techniques are viable for large collections. The results so far indicate that our expansion technique can improve the performance of some queries, but is equally likely to degrade the performance of others. The sources of this variability are described in detail below.
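For illustration, the vector space similarity computation just described can be sketched in a few lines of modern Python. This is a hypothetical sketch, not the SMART implementation; the concept names and weights are invented.

```python
# Minimal sketch of inner-product similarity between sparse weighted
# vectors, stored here as dictionaries mapping concept -> weight.

def inner_product(query, doc):
    """Sum of weight products over the concepts the two vectors share."""
    if len(doc) < len(query):          # iterate over the shorter vector
        query, doc = doc, query
    return sum(w * doc[c] for c, w in query.items() if c in doc)

query = {"surrog": 0.7, "mother": 0.5, "court": 0.3}
doc = {"mother": 0.6, "child": 0.4, "court": 0.2}
print(inner_product(query, doc))  # 0.5*0.6 + 0.3*0.2, about 0.36
```

Only concepts present in both vectors contribute, which is why synonym mismatches (the recall problem above) produce a similarity of zero even for topically related texts.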
A description of WordNet and the expansion algorithm is given first to provide the appropriate context.

2 WordNet

WordNet is a manually-constructed lexical system developed by George Miller and his colleagues at the Cognitive Science Laboratory at Princeton University [3]. Originating from a project whose goal was to produce a dictionary that could be searched conceptually instead of only alphabetically, WordNet evolved into a system that reflects current psycholinguistic theories about how humans organize their lexical memories. The basic object in WordNet is a set of strict synonyms, called a synset. By definition, each synset a word appears in is a different sense of that word. There are three main divisions in WordNet, one each for nouns, verbs, and adjectives. Within a division, synsets are organized by the lexical relationships defined on them. For nouns, the only division used in this study, the lexical relationships include antonymy, hypernymy/hyponymy (the IS-A relation), and three different meronym/holonym (PART-OF) relations. The IS-A relation is the dominant relationship, and organizes the synsets into a set of approximately ten hierarchies.¹ Examples of synsets that are the heads of hierarchies are {entity, thing}, {psychological_feature}, {abstraction}, and {possession}. The developers of WordNet specifically avoided including specialized vocabularies within WordNet; the coverage of "standard" English is quite good. The April, 1992 version of WordNet (the version used in this study) contains 35,155 synonym sets and 67,293 senses in the noun division. The majority of synonym sets are quite small (one or two members), but the more common nouns (i.e., those nouns that actually get used in documents and topics) tend to belong to the larger synsets. Example synsets from the noun division are shown in Figure 1. The lexical relationships that the synsets participate in, especially their parents in the IS-A hierarchy, differentiate among the senses.
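The synset-and-sense organization just described can be made concrete with a small sketch. The data below is a toy stand-in mirroring the `opinion' entry of Figure 1, not WordNet's actual storage format or API.

```python
# Toy sketch of the WordNet noun structures: a synset is a set of strict
# synonyms, and the number of synsets containing a word is its number of
# senses.

SYNSETS = [
    {"opinion", "ruling"},
    {"opinion"},
    {"opinion", "sentiment", "persuasion", "view"},
    {"judgment", "judgement", "opinion"},
    {"opinion", "view"},
]

def senses(word):
    """Each synset a word appears in is a distinct sense of that word."""
    return [s for s in SYNSETS if word in s]

print(len(senses("opinion")))  # 5
print(len(senses("view")))     # 2
```

In the real system the synsets are further linked by the IS-A and PART-OF relations described above; only synonymy is used in this paper's expansion procedure.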
We developed our own routine to access the WordNet information that differs somewhat from the access code distributed with WordNet. In our version, the access routine takes a word (a string of characters), converts it to lower case, and checks if the converted string occurs in the noun portion of WordNet. If the string is found, the routine returns either the number of synsets in which the string appears, the fact that the string is a known irregular morphological variant of a member of a synset (e.g., `women' is an inflection of `woman'), or both (e.g., `media' is both a member of {media, mass_media} and an inflection of `medium'). If the string is not found, several simple (regular) morphological variants of the word are tried. If none are found, the routine reports the string as not found. Otherwise, the routine returns the base form. A consequence of this simple strategy is that regular plural forms that are members of their own synsets do not return the synsets of the base word. For example, `arms' returns the synsets {coat_of_arms, arms, blazon, blazonry} and {weaponry, arms, implements_of_war}, but not the four synsets for `arm'.

¹The actual structure is not quite a hierarchy since a few synsets have more than one parent.

3 Vector Expansion Procedure

For the retrieval results reported in this paper, both document and query vectors were expanded using synonyms of original text words. The particular expansion method we used is one of the most effective vector expansion methods among a wide variety of expansion schemes we tried on smaller collections. However, the TREC collection is much more diverse than those collections and some other scheme may be more effective on it. We intend to test some of those methods on the TREC collection in the near future. We use the SMART retrieval system developed at Cornell as the basis for our retrieval system [1]. The SMART system is designed to facilitate information retrieval research by making it easy to substitute
surrogate (3 senses)
    {deputy, surrogate}
    {surrogate}
    {alternate, proxy, stand-in, substitute, surrogate, replacement}
opinion (5 senses)
    {opinion, ruling}
    {opinion}
    {opinion, sentiment, persuasion, view}
    {judgment, judgement, opinion}
    {opinion, view}
motherhood (1 sense)
    {motherhood, maternity}
decision (4 senses)
    {decision}
    {judgment, judgement, decision, judicial_decision}
    {decision, deciding, decision_making}
    {decision, firmness}
court (4 senses)
    {court, courtyard}
    {court, tennis_court}
    {court, courtroom}
    {court, tribunal}

Figure 1: Example Synonym Sets

customized pieces of the program without needing to modify other components. For this work, we changed only part of the indexing module of the standard SMART system. Indexing a piece of text proceeds as follows:

1. The text is broken into tokens by the standard SMART tokenizer.

2. Each token is passed in turn to a parser. The parser eliminates tokens designated as numbers, white space, or punctuation; the remaining tokens are assumed to be "words".

3. Each word is looked up in the standard SMART stop word list and is eliminated if it is found there. If the word is not a stop word, it is stemmed (using the SMART triestem stemming algorithm), assigned a concept number, and added to the list of concepts that will form the vector.

4. A word that is not a stop word is also looked up in the noun portion of WordNet before it is stemmed. If the word is in WordNet, the set of synonyms from all the synsets the word is a member of is produced. The elements of this set are also stemmed and assigned concept numbers. Instead of the concepts being inserted into the vector list, however, they are inserted into a different relative list. Relatives that come from original text words that have only one sense in WordNet (appear in exactly one synset) are flagged as such when they are entered into the list.

5.
After all the words from a text have been processed, relatives that are flagged as being from a single-sense text word and relatives that have been added to the relative list at least twice are added to the vector list. The requirement that a relative from a multiple-sense text word appear in the list at least twice is a poor-man's attempt at sense disambiguation. The idea is that if two original text terms agree on a relative, the relative is probably related to the correct sense of those text terms.

6. To produce the final weighted vector, the term frequency of each of the concepts in the vector list produced in step five is computed. Concepts that were added as relatives have their term frequency weight multiplied by .8 to emphasize the original terms. The term frequency weight of each concept is then multiplied by an inverse document frequency factor, and those weights are further normalized by the square root of the sum of the squares of the weights (cosine normalization). This weighting scheme is the "tfc" weighting described by Salton and Buckley in [6].

As an example, take the text court opinions and decisions on surrogate motherhood (a paraphrase of topic 70). The vector produced for this text using the synsets shown in Figure 1 would contain the stems of court, opinion, decision, surrogate, motherhood (original text words), maternity (synonym of `motherhood', which has only one sense), judgment, and judgement (synonyms of both `decision' and `opinion').

Both documents and topic statements were indexed by the procedure described above; no special manual processing of the topic statements was performed. The manually assigned keywords associated with some documents were not used in the indexing. For the topic statements only the Concepts, Description, Factors, Narrative, Nationality, and Title sections were indexed. Among those sections, no distinctions were made regarding what section a term appeared in.
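The expansion and weighting steps above can be condensed into a short sketch. This is illustrative Python rather than the SMART code; stemming and stop-word removal are assumed to have already run, the synset data is a toy excerpt of Figure 1, and a flat IDF is assumed for simplicity.

```python
import math
from collections import Counter

# Toy synset data standing in for the WordNet noun index (cf. Figure 1).
SYNSETS = {
    "motherhood": [{"motherhood", "maternity"}],                     # one sense
    "opinion": [{"opinion"}, {"judgment", "judgement", "opinion"}],
    "decision": [{"decision"}, {"judgment", "judgement", "decision"}],
}

def expand(words):
    """Steps 4-5: collect relatives; keep those from single-sense words
    or those on which at least two text words agree."""
    relatives = Counter()
    single_sense = set()
    for w in words:
        for synset in SYNSETS.get(w, []):
            for rel in synset - {w}:
                relatives[rel] += 1
                if len(SYNSETS[w]) == 1:
                    single_sense.add(rel)
    return {r for r, n in relatives.items() if r in single_sense or n >= 2}

def tfc_weights(text_tf, added, idf):
    """Step 6: tf (times 0.8 for added relatives) times idf, then
    cosine normalization."""
    raw = {c: tf * (0.8 if c in added else 1.0) * idf.get(c, 1.0)
           for c, tf in text_tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {c: w / norm for c, w in raw.items()}

words = ["court", "opinion", "decision", "surrogate", "motherhood"]
added = expand(words)
tf = Counter(words) + Counter(added)      # each added relative gets tf 1
weights = tfc_weights(tf, added, idf={})  # flat idf for illustration
print(sorted(added))  # ['judgement', 'judgment', 'maternity']
```

On the topic 70 paraphrase, the sketch adds exactly the relatives the text describes: `maternity' from single-sense `motherhood', and `judgment'/`judgement' because `opinion' and `decision' agree on them.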
The decision to expand both documents and query vectors, as opposed to only query vectors, is based on several factors. First, the WordNet synsets contain collocations such as `judicial decision', but the tokenizer used recognizes only single words. For the collocations to participate in matches, both documents and queries need to be expanded. Second, documents are frequently longer than topic statements. Since we require agreement on a relative before it is added to the vector, the longer documents provide more opportunities for a concept to be added to the vector. Third, in the experiments on smaller collections, expanding both documents and queries was consistently more effective than expanding only queries (although usually less effective than expanding neither). Document expansion has its costs, however, even excluding the obvious additional expense at indexing time. Longer vectors increase storage costs and processing time at retrieval as well. Table 1 gives a histogram of the percentage increase in vector length as compared to an unexpanded collection for the TREC documents.

4 Experimental Results

We performed one retrieval run on the entire TREC database, retrieving documents for the 50 ad hoc queries. The official evaluation table for this run is given in Table 2. Using a Sun IPX with 64 megabytes of RAM, it took approximately 42 hours of processing to produce the inverted index of the document collection. The resulting inverted index takes 947 megabytes of disk storage. It took approximately one CPU second on average to index a topic statement and produce a query vector. The average retrieval time per query was 15 CPU seconds. An analysis of the retrieval results shows that the expanded collection is more effective than a corresponding unexpanded collection for some queries.² However, the effectiveness of the expansion procedure is very variable, and the performance of other queries was degraded by the expansion.
This variability is attributable to two main factors: the process of selecting which new concepts to add, and the confounding effects expansion has on concept weights.

²Evaluation results for the unexpanded collection were made available through the courtesy of the SMART group at Cornell University.

% Increase    # Docs
0-10           15656
10-20          52568
20-30         116766
30-40         162862
40-50         163235
50-60         123943
60-70          64029
70-80          24564
80-90           8872
90-100          3333
> 100           6926
> 200            207

Mean increase: 42%

Table 1: Histogram of Percentage Increase in Document Vector Length

Expansion affects both the inverse document frequency (IDF) and the term frequency (TF) components of the concept weights. A concept that is frequently added to documents is downweighted by its IDF factor relative to its weight in an unexpanded collection. Such a concept is often a very general concept and the downweighting is likely to be beneficial. Similarly, a concept that is only occasionally added to documents, and occurs infrequently in the collection otherwise, is emphasized by its IDF component. This may or may not be beneficial, depending on the quality of the term. The aggregate TF component of a concept can be relatively larger in an expanded collection if the concept has many synonyms. This effect is common because the same word will frequently cause the same synonyms to be added in both the document and query vectors. Unfortunately, this effect is usually detrimental because the words occurring in large synsets are common words that contain little content. For example, if either `couple' and `pair' or the Roman numeral `II' appears in a text, then the entire synset {two, 2, II, twain, couple, pair, twosome, duo, duet, brace, span, yoke, couplet, distich, dyad, duad, deuce, doubleton, craps, snake_eyes} is added. The effects of the changes in weights are illustrated by the performance of topics 95 and 70, the texts of which are given in Figures 2 and 3.
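The IDF downweighting effect can be illustrated numerically. The figures below are toy numbers, not values from the TREC run, and the common idf = log(N / df) form is an assumption made for the sketch.

```python
import math

# Toy illustration of the IDF effect: expansion adds a concept to many
# documents, raising its document frequency and shrinking its idf.
N = 1000                 # documents in the collection
df_unexpanded = 10       # documents containing `surrogate' originally
df_expanded = 200        # documents containing it after expansion

idf_before = math.log(N / df_unexpanded)
idf_after = math.log(N / df_expanded)
print(round(idf_before, 2), round(idf_after, 2))  # 4.61 1.61
```

A concept whose document frequency grows twentyfold through expansion loses most of its discriminating weight, which is exactly the mechanism blamed below for the degradation of topic 70.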
Portions of the corresponding query vectors for both expanded and unexpanded collections are given in Figure 4. Topic 95 retrieved 28 relevant documents in the expanded collection; in the corresponding unexpanded collection, 17 relevant documents were retrieved. The improvement is due to increasing the weights of central themes of the topic, both by adding additional concepts (outlaw, constabl) and emphasizing existing concepts (law, sleuth). On the other hand, topic 70 retrieved only 32 relevant documents in the top 200 in the expanded collection while in the unexpanded collection 41 relevant documents were retrieved. The degradation is due to the downweighting of `surrogate', which was added to many documents and thus has a smaller IDF weight in the expanded collection, and the increased weight for `mother' (compounded by the addition of `matern'), resulting in a marked preference for documents that contain mother, whether or not they also contain surrogate.

The major difficulty of the expansion process is controlling which original terms get expanded and which terms they are expanded by. In our algorithm, any word can be expanded if it occurs only once in WordNet or if there is another word that has a common synonym. Although the agreement criterion is imposed to prevent synonyms of the wrong senses of words from being added, it is not sufficient for the task. Furthermore, to save processing time we do not tag a word with its part of speech prior to looking it up in WordNet, so many words that are used as verbs and adjectives in the text are nonetheless found in the noun division of WordNet (frequently in only one sense!) and add spurious relatives. The consequence of these factors is that in addition to the concepts that are added for marginally useful words, concepts that have no bearing on the content of the text may also be added to its vector.
Top ranked evaluation
    Run number:   siems1
    Num_queries:  50
    Total number of documents over all queries
        Retrieved: 9992
        Relevant:  16400
        Rel_ret:   3393
    Recall - Precision Averages:
        at 0.00    0.7524
        at 0.10    0.4670
        at 0.20    0.3242
        at 0.30    0.1973
        at 0.40    0.1179
        at 0.50    0.0672
        at 0.60    0.0241
        at 0.70    0.0000
        at 0.80    0.0000
        at 0.90    0.0000
        at 1.00    0.0000
    Average precision for all points
        11-pt Avg: 0.1773
    Recall:
        at 5 docs:    0.0131
        at 15 docs:   0.0344
        at 30 docs:   0.0650
        at 100 docs:  0.1687
        at 200 docs:  0.2564
    Precision:
        At 5 docs:    0.5200
        At 15 docs:   0.4827
        At 30 docs:   0.4687
        At 100 docs:  0.4100
        At 200 docs:  0.3393

Table 2: Official Evaluation Results

Computer-aided Crime Detection

Document must describe a computer application to crime solving. To be relevant, a document must describe either an actual or a theoretical computer application to detective work, by the police or by another law enforcement organization. A relevant document could include techniques such as profiling criminals and their methods of operation, identifying finger prints, spotting anomalies, etc.

1. police, detective, sleuth, enforcement agency
2. clue, records, fingerprints, methods

Figure 2: Text of Topic 95

Surrogate Motherhood

Document will report judicial proceedings and opinions on contracts for surrogate motherhood. A relevant document will report legal opinions, judgments, and decisions regarding surrogate motherhood and the custody of any children which result from surrogate motherhood. To be relevant, a document must identify the case, state the issues which are being decided and report at least one ethical or legal question which arises from the case.

1. surrogate, mothers, motherhood
2.
judge, lawyer, court, lawsuit, custody, hearing, opinion, finding

Figure 3: Text of Topic 70

Query Vectors for Topic 95
Concept         Expanded Weight    Unexpanded Weight
enforc          .20626             .25682
fingerprint     .18793             .23399
print           .16088             .11148
sleuth          .40100             .27738
law             .10375             .08331
felon           .08165
crook           .08233
constabl        .13792
espial          .25840
catch           .16667
outlaw          .08366

Query Vectors for Topic 70
Concept          Expanded Weight    Unexpanded Weight
decid            .15901             .08546
proceed          .11605             .06181
mother           .50365             .48892
case             .14090             .08597
surrog           .56104             .69887
judg             .24603             .12226
custod           .14946             .20529
matern           .10880
judicial_decid   .18264
legal_act        .10257

Figure 4: Query Vectors for Sample Topics

As an example of these effects, consider document FR89512-0147, President Bush's 1989 Mother's Day Proclamation. This document was retrieved in response to topic 70 because it mentioned `mother' or `motherhood' 16 times. Figure 5 contains an excerpt of the document and a sampling of the concepts that were added. (Words that are in boldface in the excerpt are words that caused additional concepts to be added.) Approximately 60 of the 250 concepts in the vector were added by the expansion process. About half of the added concepts are the result of a wrong sense or a wrong part of speech being used in support of its addition, and another 20 of the added concepts are correct, but unimportant. Unfortunately, the ratio of unimportant and mistaken additions to reasonable additions exhibited by document FR89512-0147 is not unusual. WordNet - and English - are rich enough that it is likely for two words in a text to be synonyms of (different senses of) a third word. Using additional lexical relations compounds this problem: the experiments we conducted on smaller collections show a marked degradation in effectiveness if any of the other relations represented in WordNet are used in addition to synonymy to expand a concept.
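The failure mode just described can be reproduced with a small sketch. The data below is a toy stand-in, not actual WordNet content: `ticker' is an informal synonym of both `watch' (the timepiece) and `heart', in different synsets, so the agreement criterion adds it even when neither word carries that sense in the text.

```python
from collections import Counter

# Toy synsets: 'ticker' belongs to a sense of 'watch' and to a sense of
# 'heart', mirroring the mistaken "watched & heart -> ticker" addition
# shown in Figure 5.
SYNSETS = {
    "watch": [{"watch", "ticker"}],
    "heart": [{"heart", "ticker"}],
}

def agreed_relatives(words):
    """Relatives on which at least two different text words agree."""
    counts = Counter(rel
                     for w in words
                     for synset in SYNSETS.get(w, [])
                     for rel in synset - {w})
    return {r for r, n in counts.items() if n >= 2}

# Both words "agree" on ticker, so it would be added to the vector even
# though the text uses neither word in its ticker sense.
print(agreed_relatives(["watch", "heart"]))  # {'ticker'}
```

The two text words are synonyms of different senses of the same third word, so agreement is evidence of nothing; this is exactly why the criterion is "not sufficient for the task."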
5 Conclusion

We have demonstrated a fully automatic, statistical expansion technique that is capable of improving the effectiveness of some queries relative to a corresponding unexpanded collection for a large, full-text collection. The overall effectiveness of the technique is competitive with other retrieval methods. However, the technique is hampered by its unpredictability, which has at least three sources:

* errors in selecting the correct sense, and therefore the correct relatives, of a text word,
* no determination of the relative importance of a word to the text before deciding to expand it, and
* the complex interaction between expansion and term weighting.

Since we believe the disambiguation of word senses to be the most fundamental of these three problems, and also useful in its own right, our current research lies in this direction.

References

[1] Chris Buckley. Implementation of the SMART information retrieval system. Technical Report 85-686, Computer Science Department, Cornell University, Ithaca, New York, May 1985.

[2] Edward A. Fox. Lexical relations: Enhancing effectiveness of information retrieval systems. SIGIR Newsletter, 15(3), 1981.

[3] George Miller. Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 1990.

[4] G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 143-180. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

[5] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, November 1975.

[6] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24:513-523, 1988.

[7] Yih-Chen Wang, James Vandendorpe, and Martha Evens. Relational thesauri in information retrieval.
Journal of the American Society for Information Science, 36(1):15-27, January 1985.

Excerpt of document:

A mother's love, while demonstrated daily in acts of tenderness and generosity, is always a source of wonder. Who can fathom the quiet thoughts of one who keeps in her heart a constant vigil over the child she has carried in her womb, rocked in her arms, and watched grow, with eyes full of worry, joy, and pride? Her devotion never fails to fill us with gratitude and awe. Today, we honor all those women who, by virtue of giving birth, or through adoption or marriage, are mothers. Each of us should let our mother know that she is ever close in our hearts, and that her many gifts to us are cherished and remembered, not only on Mother's Day, but throughout the year. In recognition of the contributions of all mothers to their families and to the Nation, the Congress, by a joint resolution approved May 8, 1914 (38 Stat. 770), has designated the second Sunday in May each year as Mother's Day and requested the President to call for its appropriate observance. IN WITNESS WHEREOF, I have hereunto set my hand this tenth day of May, in the year of our Lord nineteen hundred and eighty-nine, and of the Independence of the United States of America the two hundred and thirteenth.

Sample added concepts:

Reasonable additions
    womb → uterus
    women → woman, adult_female
    birth & families → parentage
Unimportant additions
    eyes → eye, eyeball, oculus, optic, peeper, organ_of_sight
    hundred → 100, C, century, one_C
    Sunday → sabbath, lord's_day, day_of_rest
Mistaken additions
    acts → acts_of_the_apostles
    daily → gazette
    set & families → category, class, type
    watched & heart → ticker
    source & witness → informant
    source & birth → beginning

Figure 5: Concepts Added by the Expansion Procedure