Vector Expansion in a Large Collection

Ellen M. Voorhees and Yuan-Wang Hou
Siemens Corporate Research, Inc.
755 College Road East
Princeton, New Jersey 08540

Abstract

This paper investigates whether a completely automatic, statistical expansion technique that uses a general-purpose thesaurus as a source of related concepts is viable for large collections. The retrieval results indicate that the particular expansion technique used here improves the performance of some queries, but degrades the performance of other queries. The overall effectiveness of the method is competitive with other systems. The variability of the method is attributable to two main factors: the choice of concepts that are expanded and the confounding effects expansion has on concept weights. Addressing these problems will require both a better method for determining the important concepts of a text and a better method for determining the correct sense of an ambiguous word.

1 Introduction

In many retrieval systems the similarity between two texts is a function of the number of word stems that appear in both texts. While these systems are often efficient and robust, their effectiveness is depressed by the presence of homographs (words that are spelled the same but mean different things) and synonyms (different words that mean the same thing) in the texts. Homographs depress precision by causing false matches. Synonyms depress recall by causing conceptual matches to be missed. That is, if a query and a document are about the same topic, but use different words to express the idea, the document will not be retrieved in response to the query. We are investigating how concept spaces, data structures that define semantic relationships among ideas, can be used to mitigate the effects of synonymy and homography in retrieval systems designed to satisfy large-scale information needs.
We impose two constraints on our research with the goal of making the resulting methods more applicable to retrieving documents from large corpora. First, we want to keep human intervention in the indexing and retrieval processes at a minimum; therefore, we use strictly automatic procedures. Second, even automatic procedures need to be relatively efficient. We believe this efficiency requirement precludes the use of deep analyses of document content for the foreseeable future, and we restrict ourselves to statistical processing of the text and concept space.

There are effectiveness concerns when dealing with large corpora as well as efficiency concerns. Large corpora usually imply a diverse vocabulary, and thus the synonym and homograph problems are exacerbated. In this paper we investigate vector expansion as a solution to the synonymy problem for the large TREC collection. As the name "vector expansion" implies, we are working within the vector space model of information retrieval [5]: both documents and topics are represented as weighted vectors and the similarity between two texts is computed as the inner product of their respective vectors. The vectors are expanded by terms related to original text words in our concept space. In particular, since we are using the WordNet lexical database as our concept space, vectors are expanded by adding selected synonyms of original text words. Although using thesauri to expand vectors has been done before (see, for example, [7], [2], [4]), it has always been done on small collections. We are interested in investigating whether comparatively simple statistical expansion techniques are viable for large collections. The results so far indicate that our expansion technique can improve the performance of some queries, but is equally likely to degrade the performance of others. The sources of this variability are described in detail below.
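For illustration, the vector space similarity computation just described can be sketched in a few lines of modern Python. This is a hypothetical sketch, not the SMART implementation; the concept names and weights are invented.

```python
# Minimal sketch of inner-product similarity between sparse weighted
# vectors, stored here as dictionaries mapping concept -> weight.

def inner_product(query, doc):
    """Sum of weight products over the concepts the two vectors share."""
    if len(doc) < len(query):          # iterate over the shorter vector
        query, doc = doc, query
    return sum(w * doc[c] for c, w in query.items() if c in doc)

query = {"surrog": 0.7, "mother": 0.5, "court": 0.3}
doc = {"mother": 0.6, "child": 0.4, "court": 0.2}
print(inner_product(query, doc))  # 0.5*0.6 + 0.3*0.2, about 0.36
```

Only concepts present in both vectors contribute, which is why synonym mismatches (the recall problem above) produce a similarity of zero even for topically related texts.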
A description of WordNet and the expansion algorithm is given first to provide the appropriate context.

2 WordNet

WordNet is a manually-constructed lexical system developed by George Miller and his colleagues at the Cognitive Science Laboratory at Princeton University [3]. Originating from a project whose goal was to produce a dictionary that could be searched conceptually instead of only alphabetically, WordNet evolved into a system that reflects current psycholinguistic theories about how humans organize their lexical memories. The basic object in WordNet is a set of strict synonyms, called a synset. By definition, each synset a word appears in is a different sense of that word. There are three main divisions in WordNet, one each for nouns, verbs, and adjectives. Within a division, synsets are organized by the lexical relationships defined on them. For nouns, the only division used in this study, the lexical relationships include antonymy, hypernymy/hyponymy (the IS-A relation), and three different meronym/holonym (PART-OF) relations. The IS-A relation is the dominant relationship, and organizes the synsets into a set of approximately ten hierarchies.¹ Examples of synsets that are the heads of hierarchies are {entity, thing}, {psychological_feature}, {abstraction}, and {possession}. The developers of WordNet specifically avoided including specialized vocabularies within WordNet; the coverage of "standard" English is quite good. The April, 1992 version of WordNet (the version used in this study) contains 35,155 synonym sets and 67,293 senses in the noun division. The majority of synonym sets are quite small (one or two members), but the more common nouns (i.e., those nouns that actually get used in documents and topics) tend to belong to the larger synsets. Example synsets from the noun division are shown in Figure 1. The lexical relationships that the synsets participate in, especially their parents in the IS-A hierarchy, differentiate among the senses.
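The synset-and-sense organization just described can be made concrete with a small sketch. The data below is a toy stand-in mirroring the `opinion' entry of Figure 1, not WordNet's actual storage format or API.

```python
# Toy sketch of the WordNet noun structures: a synset is a set of strict
# synonyms, and the number of synsets containing a word is its number of
# senses.

SYNSETS = [
    {"opinion", "ruling"},
    {"opinion"},
    {"opinion", "sentiment", "persuasion", "view"},
    {"judgment", "judgement", "opinion"},
    {"opinion", "view"},
]

def senses(word):
    """Each synset a word appears in is a distinct sense of that word."""
    return [s for s in SYNSETS if word in s]

print(len(senses("opinion")))  # 5
print(len(senses("view")))     # 2
```

In the real system the synsets are further linked by the IS-A and PART-OF relations described above; only synonymy is used in this paper's expansion procedure.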
We developed our own routine to access the WordNet information that differs somewhat from the access code distributed with WordNet. In our version, the access routine takes a word (a string of characters), converts it to lower case, and checks if the converted string occurs in the noun portion of WordNet. If the string is found, the routine returns either the number of synsets in which the string appears, the fact that the string is a known irregular morphological variant of a member of a synset (e.g., `women' is an inflection of `woman'), or both (e.g., `media' is both a member of {media, mass_media} and an inflection of `medium'). If the string is not found, several simple (regular) morphological variants of the word are tried. If none are found, the routine reports the string as not found. Otherwise, the routine returns the base form. A consequence of this simple strategy is that regular plural forms that are members of their own synsets do not return the synsets of the base word. For example, `arms' returns the synsets {coat_of_arms, arms, blazon, blazonry} and {weaponry, arms, implements_of_war}, but not the four synsets for `arm'.

¹The actual structure is not quite a hierarchy since a few synsets have more than one parent.

3 Vector Expansion Procedure

For the retrieval results reported in this paper, both document and query vectors were expanded using synonyms of original text words. The particular expansion method we used is one of the most effective vector expansion methods among a wide variety of expansion schemes we tried on smaller collections. However, the TREC collection is much more diverse than those collections and some other scheme may be more effective on it. We intend to test some of those methods on the TREC collection in the near future. We use the SMART retrieval system developed at Cornell as the basis for our retrieval system [1]. The SMART system is designed to facilitate information retrieval research by making it easy to substitute
surrogate (3 senses)
    {deputy, surrogate}
    {surrogate}
    {alternate, proxy, stand-in, substitute, surrogate, replacement}
opinion (5 senses)
    {opinion, ruling}
    {opinion}
    {opinion, sentiment, persuasion, view}
    {judgment, judgement, opinion}
    {opinion, view}
motherhood (1 sense)
    {motherhood, maternity}
decision (4 senses)
    {decision}
    {judgment, judgement, decision, judicial_decision}
    {decision, deciding, decision_making}
    {decision, firmness}
court (4 senses)
    {court, courtyard}
    {court, tennis_court}
    {court, courtroom}
    {court, tribunal}

Figure 1: Example Synonym Sets

customized pieces of the program without needing to modify other components. For this work, we changed only part of the indexing module of the standard SMART system. Indexing a piece of text proceeds as follows:

1. The text is broken into tokens by the standard SMART tokenizer.

2. Each token is passed in turn to a parser. The parser eliminates tokens designated as numbers, white space, or punctuation; the remaining tokens are assumed to be "words".

3. Each word is looked up in the standard SMART stop word list and is eliminated if it is found there. If the word is not a stop word, it is stemmed (using the SMART triestem stemming algorithm), assigned a concept number, and added to the list of concepts that will form the vector.

4. A word that is not a stop word is also looked up in the noun portion of WordNet before it is stemmed. If the word is in WordNet, the set of synonyms from all the synsets the word is a member of is produced. The elements of this set are also stemmed and assigned concept numbers. Instead of the concepts being inserted into the vector list, however, they are inserted into a different relative list. Relatives that come from original text words that have only one sense in WordNet (appear in exactly one synset) are flagged as such when they are entered into the list.

5.
After all the words from a text have been processed, relatives that are flagged as being from a single-sense text word and relatives that have been added to the relative list at least twice are added to the vector list. The requirement that a relative from a multiple-sense text word appear in the list at least twice is a poor-man's attempt at sense disambiguation. The idea is that if two original text terms agree on a relative, the relative is probably related to the correct sense of those text terms.

6. To produce the final weighted vector, the term frequency of each of the concepts in the vector list produced in step five is computed. Concepts that were added as relatives have their term frequency weight multiplied by .8 to emphasize the original terms. The term frequency weight of each concept is then multiplied by an inverse document frequency factor, and those weights are further normalized by the square root of the sum of the squares of the weights (cosine normalization). This weighting scheme is the "tfc" weighting described by Salton and Buckley in [6].

As an example, take the text court opinions and decisions on surrogate motherhood (a paraphrase of topic 70). The vector produced for this text using the synsets shown in Figure 1 would contain the stems of court, opinion, decision, surrogate, motherhood (original text words), maternity (synonym of `motherhood', which has only one sense), judgment, and judgement (synonyms of both `decision' and `opinion').

Both documents and topic statements were indexed by the procedure described above; no special manual processing of the topic statements was performed. The manually assigned keywords associated with some documents were not used in the indexing. For the topic statements only the Concepts, Description, Factors, Narrative, Nationality, and Title sections were indexed. Among those sections, no distinctions were made regarding what section a term appeared in.
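The expansion and weighting steps above can be condensed into a short sketch. This is illustrative Python rather than the SMART code; stemming and stop-word removal are assumed to have already run, the synset data is a toy excerpt of Figure 1, and a flat IDF is assumed for simplicity.

```python
import math
from collections import Counter

# Toy synset data standing in for the WordNet noun index (cf. Figure 1).
SYNSETS = {
    "motherhood": [{"motherhood", "maternity"}],                     # one sense
    "opinion": [{"opinion"}, {"judgment", "judgement", "opinion"}],
    "decision": [{"decision"}, {"judgment", "judgement", "decision"}],
}

def expand(words):
    """Steps 4-5: collect relatives; keep those from single-sense words
    or those on which at least two text words agree."""
    relatives = Counter()
    single_sense = set()
    for w in words:
        for synset in SYNSETS.get(w, []):
            for rel in synset - {w}:
                relatives[rel] += 1
                if len(SYNSETS[w]) == 1:
                    single_sense.add(rel)
    return {r for r, n in relatives.items() if r in single_sense or n >= 2}

def tfc_weights(text_tf, added, idf):
    """Step 6: tf (times 0.8 for added relatives) times idf, then
    cosine normalization."""
    raw = {c: tf * (0.8 if c in added else 1.0) * idf.get(c, 1.0)
           for c, tf in text_tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {c: w / norm for c, w in raw.items()}

words = ["court", "opinion", "decision", "surrogate", "motherhood"]
added = expand(words)
tf = Counter(words) + Counter(added)      # each added relative gets tf 1
weights = tfc_weights(tf, added, idf={})  # flat idf for illustration
print(sorted(added))  # ['judgement', 'judgment', 'maternity']
```

On the topic 70 paraphrase, the sketch adds exactly the relatives the text describes: `maternity' from single-sense `motherhood', and `judgment'/`judgement' because `opinion' and `decision' agree on them.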
The decision to expand both documents and query vectors, as opposed to only query vectors, is based on several factors. First, the WordNet synsets contain collocations such as `judicial decision', but the tokenizer used recognizes only single words. For the collocations to participate in matches, both documents and queries need to be expanded. Second, documents are frequently longer than topic statements. Since we require agreement on a relative before it is added to the vector, the longer documents provide more opportunities for a concept to be added to the vector. Third, in the experiments on smaller collections, expanding both documents and queries was consistently more effective than expanding only queries (although usually less effective than expanding neither). Document expansion has its costs, however, even excluding the obvious additional expense at indexing time. Longer vectors increase storage costs and processing time at retrieval as well. Table 1 gives a histogram of the percentage increase in vector length as compared to an unexpanded collection for the TREC documents.

4 Experimental Results

We performed one retrieval run on the entire TREC database, retrieving documents for the 50 ad hoc queries. The official evaluation table for this run is given in Table 2. Using a Sun IPX with 64 megabytes of RAM, it took approximately 42 hours of processing to produce the inverted index of the document collection. The resulting inverted index takes 947 megabytes of disk storage. It took approximately one CPU second on average to index a topic statement and produce a query vector. The average retrieval time per query was 15 CPU seconds. An analysis of the retrieval results shows that the expanded collection is more effective than a corresponding unexpanded collection for some queries.² However, the effectiveness of the expansion procedure is very variable, and the performance of other queries was degraded by the expansion.
This variability is attributable to two main factors: the process of selecting which new concepts to add, and the confounding effects expansion has on concept weights.

²Evaluation results for the unexpanded collection were made available through the courtesy of the SMART group at Cornell University.

% Increase    # Docs
0-10           15656
10-20          52568
20-30         116766
30-40         162862
40-50         163235
50-60         123943
60-70          64029
70-80          24564
80-90           8872
90-100          3333
> 100           6926
> 200            207

Mean increase: 42%

Table 1: Histogram of Percentage Increase in Document Vector Length

Expansion affects both the inverse document frequency (IDF) and the term frequency (TF) components of the concept weights. A concept that is frequently added to documents is downweighted by its IDF factor relative to its weight in an unexpanded collection. Such a concept is often a very general concept and the downweighting is likely to be beneficial. Similarly, a concept that is only occasionally added to documents, and occurs infrequently in the collection otherwise, is emphasized by its IDF component. This may or may not be beneficial, depending on the quality of the term. The aggregate TF component of a concept can be relatively larger in an expanded collection if the concept has many synonyms. This effect is common because the same word will frequently cause the same synonyms to be added in both the document and query vectors. Unfortunately, this effect is usually detrimental because the words occurring in large synsets are common words that contain little content. For example, if either `couple' and `pair' or the Roman numeral `II' appears in a text, then the entire synset {two, 2, II, twain, couple, pair, twosome, duo, duet, brace, span, yoke, couplet, distich, dyad, duad, deuce, doubleton, craps, snake_eyes} is added. The effects of the changes in weights are illustrated by the performance of topics 95 and 70, the texts of which are given in Figures 2 and 3.
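The IDF downweighting effect can be illustrated numerically. The figures below are toy numbers, not values from the TREC run, and the common idf = log(N / df) form is an assumption made for the sketch.

```python
import math

# Toy illustration of the IDF effect: expansion adds a concept to many
# documents, raising its document frequency and shrinking its idf.
N = 1000                 # documents in the collection
df_unexpanded = 10       # documents containing `surrogate' originally
df_expanded = 200        # documents containing it after expansion

idf_before = math.log(N / df_unexpanded)
idf_after = math.log(N / df_expanded)
print(round(idf_before, 2), round(idf_after, 2))  # 4.61 1.61
```

A concept whose document frequency grows twentyfold through expansion loses most of its discriminating weight, which is exactly the mechanism blamed below for the degradation of topic 70.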
Portions of the corresponding query vectors for both expanded and unexpanded collections are given in Figure 4. Topic 95 retrieved 28 relevant documents in the expanded collection; in the corresponding unexpanded collection, 17 relevant documents were retrieved. The improvement is due to increasing the weights of central themes of the topic, both by adding additional concepts (outlaw, constabl) and emphasizing existing concepts (law, sleuth). On the other hand, topic 70 retrieved only 32 relevant documents in the top 200 in the expanded collection while in the unexpanded collection 41 relevant documents were retrieved. The degradation is due to the downweighting of `surrogate', which was added to many documents and thus has a smaller IDF weight in the expanded collection, and the increased weight for `mother' (compounded by the addition of `matern'), resulting in a marked preference for documents that contain mother, whether or not they also contain surrogate.

The major difficulty of the expansion process is controlling which original terms get expanded and which terms they are expanded by. In our algorithm, any word can be expanded if it occurs only once in WordNet or if there is another word that has a common synonym. Although the agreement criterion is imposed to prevent synonyms of the wrong senses of words from being added, it is not sufficient for the task. Furthermore, to save processing time we do not tag a word with its part of speech prior to looking it up in WordNet, so many words that are used as verbs and adjectives in the text are nonetheless found in the noun division of WordNet (frequently in only one sense!) and add spurious relatives. The consequence of these factors is that in addition to the concepts that are added for marginally useful words, concepts that have no bearing on the content of the text may also be added to its vector.
Top ranked evaluation
    Run number:   siems1
    Num_queries:  50
    Total number of documents over all queries
        Retrieved: 9992
        Relevant:  16400
        Rel_ret:   3393
    Recall - Precision Averages:
        at 0.00    0.7524
        at 0.10    0.4670
        at 0.20    0.3242
        at 0.30    0.1973
        at 0.40    0.1179
        at 0.50    0.0672
        at 0.60    0.0241
        at 0.70    0.0000
        at 0.80    0.0000
        at 0.90    0.0000
        at 1.00    0.0000
    Average precision for all points
        11-pt Avg: 0.1773
    Recall:
        at 5 docs:    0.0131
        at 15 docs:   0.0344
        at 30 docs:   0.0650
        at 100 docs:  0.1687
        at 200 docs:  0.2564
    Precision:
        At 5 docs:    0.5200
        At 15 docs:   0.4827
        At 30 docs:   0.4687
        At 100 docs:  0.4100
        At 200 docs:  0.3393

Table 2: Official Evaluation Results

Computer-aided Crime Detection

Document must describe a computer application to crime solving. To be relevant, a document must describe either an actual or a theoretical computer application to detective work, by the police or by another law enforcement organization. A relevant document could include techniques such as profiling criminals and their methods of operation, identifying finger prints, spotting anomalies, etc.

1. police, detective, sleuth, enforcement agency
2. clue, records, fingerprints, methods

Figure 2: Text of Topic 95

Surrogate Motherhood

Document will report judicial proceedings and opinions on contracts for surrogate motherhood. A relevant document will report legal opinions, judgments, and decisions regarding surrogate motherhood and the custody of any children which result from surrogate motherhood. To be relevant, a document must identify the case, state the issues which are being decided and report at least one ethical or legal question which arises from the case.

1. surrogate, mothers, motherhood
2.
judge, lawyer, court, lawsuit, custody, hearing, opinion, finding

Figure 3: Text of Topic 70

Query Vectors for Topic 95
Concept         Expanded Weight    Unexpanded Weight
enforc          .20626             .25682
fingerprint     .18793             .23399
print           .16088             .11148
sleuth          .40100             .27738
law             .10375             .08331
felon           .08165
crook           .08233
constabl        .13792
espial          .25840
catch           .16667
outlaw          .08366

Query Vectors for Topic 70
Concept          Expanded Weight    Unexpanded Weight
decid            .15901             .08546
proceed          .11605             .06181
mother           .50365             .48892
case             .14090             .08597
surrog           .56104             .69887
judg             .24603             .12226
custod           .14946             .20529
matern           .10880
judicial_decid   .18264
legal_act        .10257

Figure 4: Query Vectors for Sample Topics

As an example of these effects, consider document FR89512-0147, President Bush's 1989 Mother's Day Proclamation. This document was retrieved in response to topic 70 because it mentioned `mother' or `motherhood' 16 times. Figure 5 contains an excerpt of the document and a sampling of the concepts that were added. (Words that are in boldface in the excerpt are words that caused additional concepts to be added.) Approximately 60 of the 250 concepts in the vector were added by the expansion process. About half of the added concepts are the result of a wrong sense or a wrong part of speech being used in support of its addition, and another 20 of the added concepts are correct, but unimportant. Unfortunately, the ratio of unimportant and mistaken additions to reasonable additions exhibited by document FR89512-0147 is not unusual. WordNet - and English - are rich enough that it is likely for two words in a text to be synonyms of (different senses of) a third word. Using additional lexical relations compounds this problem: the experiments we conducted on smaller collections show a marked degradation in effectiveness if any of the other relations represented in WordNet are used in addition to synonymy to expand a concept.
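The failure mode just described can be reproduced with a small sketch. The data below is a toy stand-in, not actual WordNet content: `ticker' is an informal synonym of both `watch' (the timepiece) and `heart', in different synsets, so the agreement criterion adds it even when neither word carries that sense in the text.

```python
from collections import Counter

# Toy synsets: 'ticker' belongs to a sense of 'watch' and to a sense of
# 'heart', mirroring the mistaken "watched & heart -> ticker" addition
# shown in Figure 5.
SYNSETS = {
    "watch": [{"watch", "ticker"}],
    "heart": [{"heart", "ticker"}],
}

def agreed_relatives(words):
    """Relatives on which at least two different text words agree."""
    counts = Counter(rel
                     for w in words
                     for synset in SYNSETS.get(w, [])
                     for rel in synset - {w})
    return {r for r, n in counts.items() if n >= 2}

# Both words "agree" on ticker, so it would be added to the vector even
# though the text uses neither word in its ticker sense.
print(agreed_relatives(["watch", "heart"]))  # {'ticker'}
```

The two text words are synonyms of different senses of the same third word, so agreement is evidence of nothing; this is exactly why the criterion is "not sufficient for the task."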
5 Conclusion

We have demonstrated a fully automatic, statistical expansion technique that is capable of improving the effectiveness of some queries relative to a corresponding unexpanded collection for a large, full-text collection. The overall effectiveness of the technique is competitive with other retrieval methods. However, the technique is hampered by its unpredictability, which has at least three sources:

* errors in selecting the correct sense, and therefore the correct relatives, of a text word,
* no determination of the relative importance of a word to the text before deciding to expand it, and
* the complex interaction between expansion and term weighting.

Since we believe the disambiguation of word senses to be the most fundamental of these three problems, and also useful in its own right, our current research lies in this direction.

References

[1] Chris Buckley. Implementation of the SMART information retrieval system. Technical Report 85-686, Computer Science Department, Cornell University, Ithaca, New York, May 1985.

[2] Edward A. Fox. Lexical relations: Enhancing effectiveness of information retrieval systems. SIGIR Newsletter, 15(3), 1981.

[3] George Miller. Special Issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4), 1990.

[4] G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. In Gerard Salton, editor, The SMART Retrieval System: Experiments in Automatic Document Processing, pages 143-180. Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1971.

[5] G. Salton, A. Wong, and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM, 18(11):613-620, November 1975.

[6] Gerard Salton and Chris Buckley. Term weighting approaches in automatic text retrieval. Information Processing and Management, 24:513-523, 1988.

[7] Yih-Chen Wang, James Vandendorpe, and Martha Evens. Relational thesauri in information retrieval.
Journal of the American Society for Information Science, 36(1):15-27, January 1985.

Excerpt of document:

A mother's love, while demonstrated daily in acts of tenderness and generosity, is always a source of wonder. Who can fathom the quiet thoughts of one who keeps in her heart a constant vigil over the child she has carried in her womb, rocked in her arms, and watched grow, with eyes full of worry, joy, and pride? Her devotion never fails to fill us with gratitude and awe. Today, we honor all those women who, by virtue of giving birth, or through adoption or marriage, are mothers. Each of us should let our mother know that she is ever close in our hearts, and that her many gifts to us are cherished and remembered, not only on Mother's Day, but throughout the year. In recognition of the contributions of all mothers to their families and to the Nation, the Congress, by a joint resolution approved May 8, 1914 (38 Stat. 770), has designated the second Sunday in May each year as Mother's Day and requested the President to call for its appropriate observance. IN WITNESS WHEREOF, I have hereunto set my hand this tenth day of May, in the year of our Lord nineteen hundred and eighty-nine, and of the Independence of the United States of America the two hundred and thirteenth.

Sample added concepts:

Reasonable additions
    womb → uterus
    women → woman, adult_female
    birth & families → parentage
Unimportant additions
    eyes → eye, eyeball, oculus, optic, peeper, organ_of_sight
    hundred → 100, C, century, one_C
    Sunday → sabbath, lord's_day, day_of_rest
Mistaken additions
    acts → acts_of_the_apostles
    daily → gazette
    set & families → category, class, type
    watched & heart → ticker
    source & witness → informant
    source & birth → beginning

Figure 5: Concepts Added by the Expansion Procedure