LSI meets TREC: A Status Report

Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960-1910
std@bellcore.com

1. Overview of Latent Semantic Indexing

Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., Salton & McGill, 1983) in which the dependencies between terms and between documents, in addition to the associations between terms and documents, are explicitly taken into account. This is done by simultaneously modeling all the associations of terms and documents. We assume that there is some underlying or "latent" structure in the pattern of word usage across documents, and use statistical techniques to estimate this latent structure. A description of terms, documents and user queries based on the underlying "latent semantic" structure (rather than surface-level word choice) is used for representing and retrieving information.

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). A large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and queries directly as sets of independent words, LSI represents them as continuous values on each of the k orthogonal indexing dimensions. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. The SVD technique can capture such structure better than simple term-term or document-document correlations and clusters. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. LSI is a completely automatic method. (The Appendix provides a brief overview of the mathematics underlying the LSI/SVD method. Deerwester et al., 1990, and Furnas et al., 1988, present additional mathematical details and examples.)

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a vector representing the location of each term and document in the k-dimensional LSI representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query.

The LSI method has been applied to many of the standard IR collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods and other variants in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991). LSI has also been applied in experiments on relevance feedback (Dumais and Schmitt, 1991), and in filtering applications (Foltz and Dumais, 1992).
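To make the geometric picture concrete, the following toy example builds a tiny term-document count matrix, truncates its SVD to k = 2 dimensions, and compares term vectors by cosine. The data and variable names are purely illustrative; this is a minimal sketch, not the TREC-scale software described in the rest of this paper.

    # Toy illustration of the reduced-dimension LSI representation.  Terms that
    # occur in similar contexts end up with similar k-dimensional vectors.
    import numpy as np

    terms = ["car", "automobile", "flower", "petal"]
    # Raw term-document counts (rows = terms, columns = 5 documents); made-up data.
    X = np.array([[2., 1., 0., 0., 0.],
                  [0., 1., 1., 0., 0.],
                  [0., 0., 0., 2., 1.],
                  [0., 0., 0., 1., 2.]])

    k = 2                                              # retained dimensions
    T, s, Dt = np.linalg.svd(X, full_matrices=False)   # X = T diag(s) Dt
    term_vecs = T[:, :k] * s[:k]                       # k-dim term vectors
    doc_vecs = Dt[:k, :].T * s[:k]                     # k-dim document vectors

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "car" and "automobile" share contexts, so their reduced vectors are close;
    # "car" and "flower" do not, so theirs are nearly orthogonal.
    print(cosine(term_vecs[0], term_vecs[1]), cosine(term_vecs[0], term_vecs[2]))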
The TREC conference was an opportunity for us to "scale up" our tools, and to explore the LSI dimension-reduction ideas using a very rich corpus of word usage. This large collection of standard documents and relevance judgements should be a valuable IR resource and an important step in the systematic development of more effective retrieval systems.

2. Application of LSI to the TREC collection

2.1 Overview

We used existing LSI/SVD software for analyzing the training and test collections, and for query processing and retrieval. For pragmatic reasons, we divided the TREC collection into 9 subcollections - AP1, DOE1, FR1, WSJ1, ZIFF1, AP2, FR2, WSJ2, ZIFF2. Queries were passed against the appropriate subcollections, and the returns recombined to arrive at a single ranked output. There were three main stages involved in processing documents and constructing the relevant data structures. All steps were completely automatic and involved no human intervention. The resulting reduced-dimension representations were used for matching and retrieval.

1. Pre-processing and indexing (extracting terms, calculating term weights, etc.)
2. Computing the SVD (number of dimensions ranged from 235-310)
3. Adding new documents and/or terms

2.1.1 Pre-processing and indexing

We did minimal pre-processing on the raw text of the TREC documents. Some markups (any text within <> delimiters) were removed, and all hand-indexed entries were removed from the WSJ and ZIFF collections. Upper case characters were translated into lower case, punctuation was removed, and white spaces were used to delimit terms. A minimum term length of 2 was used. All terms occurring in more than one document, and not on a stop list of 439 words, were used to generate a term-document matrix. We did not use: stemming, phrases, syntactic or semantic parsing, word sense disambiguation, heuristic association, spelling checking or correction, proper noun identification, complex tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing.

The entries in the term-document matrix were then transformed using a log(tf_td + 1) x (1 - entropy_t) weighting. The global weight assigned to each term was 1 - entropy (or noise), where

    entropy_t = - Σ_d (p_td log p_td) / log(ndocs),

ndocs is the number of documents in the collection, t is the index for terms, d is the index for documents, p_td = tf_td / gf_t, tf_td is the frequency of term t in document d, and gf_t is the global frequency of occurrence of term t in the collection. (For simplicity, we refer to this as the log*entropy term weight.)
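As a concrete restatement of this weighting, the sketch below computes the log*entropy transform for a small term-by-document count matrix. It is a minimal sketch with illustrative names, not the C/awk/shell tools actually used for the TREC runs.

    # Minimal sketch of the log*entropy weighting described above.  `tf` is a
    # term-by-document matrix of raw frequencies; ndocs must be at least 2.
    import numpy as np

    def log_entropy(tf):
        tf = np.asarray(tf, dtype=float)
        ndocs = tf.shape[1]
        gf = tf.sum(axis=1, keepdims=True)                      # global frequency gf_t
        p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
        safe_p = np.where(p > 0, p, 1.0)                        # so p*log(p) -> 0 when p = 0
        entropy = -(p * np.log(safe_p)).sum(axis=1, keepdims=True) / np.log(ndocs)
        return np.log(tf + 1.0) * (1.0 - entropy)               # log(tf+1) * (1 - entropy)

With this weight, a term confined to a single document receives a global weight of 1, while a term spread evenly across all documents receives a weight near 0.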
The transformed term-document matrix was used as input to the SVD. The results of the SVD analysis, a k-dimensional real vector for each term and each document and k singular values, were stored in a database. The terms and their log*entropy weights were also stored in a database. Each of the 9 subcollections was processed separately. Because of software constraints, the initial indexing and SVD analysis were done on a random subset of 20,000-57,000 documents. The remaining documents were added into the resulting data structure as described below. (We have recently completed an SVD analysis of the complete 226,000 document by 90,000 term DOE collection, but this was not used in the experiments reported below.) Table 1 summarizes the number of terms and documents in the samples used for scaling as well as the total number of terms and documents in the databases.

    collection   sampled docs   sampled terms   added docs   total docs   total terms   ndim   database (meg)
    DOE1              50000          42221         176087       226087        42221      250        262
    WSJ1              49555          70019          49556        99111        70019      250        169
    AP1               42465          78167          42465        84930        78167      250        163
    ZIFF1             37590          60565          37590        75180        60565      250        135
    FR1               26207          54713              0        26207        54713      250         80
    WSJ2              50000          76080          24520        74520        76080      235        141
    AP2               50000          82997          29923        79923        82997      235        153
    ZIFF2             56920          72197              0        56920        72197      235        121
    FR2               20108          48728              0        20108        48728      235         64
    totals           382845         585687         360141       742986       585687

    Table 1. Summary of the 9 subcollections

NOTES
1. In the union of the 9 subcollections, there were 585687 word tokens and 200785 word types.
2. In general, database size will be approximately (ndocs + nterms) * ndim * 4 bytes.
3. The total combined database size was 1288 meg (750000 docs and 585000 terms). If a single combined database had been used, the total database size would have been smaller because many terms are now represented in more than one database. There were only about 200,000 unique terms, so a combined database would have been about 930 meg. The fact that the number of terms does not grow linearly with the number of documents will help make large SVD calculations possible.

2.1.2 SVD

The SVD program takes the log*entropy transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix. The result of the SVD analysis is a reduced-dimension vector for each term and each document, and a vector of the singular values. For TREC, we computed a separate SVD for each of the 9 subcollections. The number of dimensions, k, ranged from 235 to 310.

2.1.3 Adding new documents and terms

As noted above, the initial indexing and SVD were typically performed on a random sample of documents from each subcollection. The documents not included in the sample were "folded in" to the database. These documents were located at the weighted vector sum of their constituent terms. That is, the vector for a new document was computed using the term vectors for all terms in the document. These term vectors were combined using the appropriate term weights, and the singular values were used to differentially weight each dimension. (Details are given in Deerwester et al., 1990, p. 399.) For documents that are actually present in the term-document matrix, the derived vector corresponds exactly to the document vector given by the SVD. New terms can be added in an analogous fashion. The vector for a new term is computed using the document vectors of all documents in which the term appears. For the TREC experiments, only new documents, not terms, were added. The sizes of the complete databases (including all documents which were added) are summarized in Table 1.

When adding documents and terms in this manner, we assume that the derived "semantic space" is fixed and that new items can be fit into it. In general, this is not the same space that one would obtain if a new SVD were calculated using both the original and new documents. In previous experiments, we found that sampling and scaling 50% of the documents, and "folding in" the remaining documents, resulted in performance that was indistinguishable from that observed when all documents were scaled.
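The folding-in computation itself is small. The sketch below shows the form it takes, following the description in Deerwester et al. (1990); the variable names are illustrative, and the singular values are assumed here to be stored as a length-k vector.

    # Sketch of "folding in" a document that was not part of the SVD sample.
    # Tk: term-by-k matrix of term vectors; sk: the k singular values;
    # weighted_tf: the log*entropy-weighted term frequencies of the new document,
    # one entry per vocabulary term.  Illustrative names, not the TREC code.
    import numpy as np

    def fold_in_document(weighted_tf, Tk, sk):
        # Weighted vector sum of the document's term vectors, with each dimension
        # rescaled by the corresponding singular value: d_hat = x' Tk inv(Sk).
        return (weighted_tf @ Tk) / sk

    # A new term would be folded in symmetrically, from the vectors of the
    # documents in which it occurs; only new documents were added for TREC.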
2.2 Timing data

For the TREC experiments, all the pre-processing and retrieval was done on a Sparc2 with 384 meg of RAM. The SVD analyses were run on a DEC 5000 with approximately 380 meg of RAM. Table 2 provides a summary of times (in minutes) to process documents and create the necessary data structures. It is important to note that these costs are incurred only once at the beginning. Subsequent query processing does not require any new SVD calculations or database updates.

2.2.1 Pre-processing and indexing

This includes the time for sampling documents (if necessary), processing the raw ascii text, creating the raw term-document matrix, calculating the log*entropy term weights (which requires two passes), and transforming the matrix entries using the log*entropy weights. A combination of C code, awk, and shell scripts was used. The time required for this depends on the amount of raw text, the number of terms, and the number of documents.

2.2.2 SVD

The SVD program computes the best k-dimensional approximation to the transformed term-document matrix. We used a sparse, iterative single-vector Lanczos SVD code (Berry, 1992). The code is written in ANSI Fortran-77 using double precision arithmetic, and is available from Netlib. The number of singular values (dimensions) calculated for the TREC subcollections ranged from 235 to 310. As it turned out, we used only 235 or 250 dimensions for retrieval, so fewer dimensions could have been computed. Thus some of the reported SVD times are higher than necessary - in particular, the SVD times for AP1, ZIFF1, and FR1 would be approximately 20% lower if only 250 dimensions had been computed.

2.2.3 Adding new documents

The time required to add new documents includes the time to pre-process and index the text of the new documents as well as the time to compute the new document vectors.

2.2.4 I/O translation

Because several existing tools were patched together for the TREC experiments, there was some additional I/O translation involved. This will be removed soon.

    collection   index    SVD   adding new docs    I/O   TOTAL (mins)
    DOE1            49   1219               591    194           2053
    WSJ1           241   1474               404    174           2293
    AP1            271   1644               455    214           2584
    ZIFF1          241   1359               352    156           2108
    FR1            241    939                 0    133           1313
    WSJ2           427   1382               461    220           2490
    AP2            338   1210               273    218           2039
    ZIFF2          260   1452                 0    208           1920
    FR2            187    486                 0    105            778

    Table 2. Summary of LSI/SVD times (in minutes)

2.3 Retrieval

2.3.1 Query processing

Queries were automatically processed in the same way as documents. For queries derived from the topic statement, we began with the full text of each topic (all topic fields), and stripped out the SGML field identifiers. For feedback queries, we used the full text of relevant documents. We did not use: stemming, phrases, syntactic or semantic parsing, word sense disambiguation, heuristic association, spelling checking or correction, proper noun identification, complex tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing. Note also that we did not use any Boolean connectors or proximity operators in query formulation. The implicit connectives, as in ordinary vector methods, fall somewhere between ORs and ANDs, but with an additional kind of "fuzziness" introduced by the dimension-reduced association matrix representation of terms and documents.

2.3.2 Adhoc queries

The topic statements were automatically processed as described above to generate a list of query terms and their frequencies. This histogram of query terms was used to form a "query vector". A query vector was the weighted vector sum of its constituent term vectors. A separate query vector was created for matching against each of the 9 databases (DOE1, WSJ1, AP1, FR1, ZIFF1, WSJ2, AP2, FR2, ZIFF2). For each subcollection, all query terms occurring in that database and their term weights were used. For example, a DOE1 query vector was created using the term weights and term vectors from the DOE1 database, and it was compared against the 226k documents in DOE1. This procedure was repeated for the remaining 8 collections, resulting in a total of 746k similarities between each adhoc query and the 746k documents in the full TREC collection. (Note that we always started with the same query terms. However, these terms usually had somewhat different weights in the different collections, and some terms were not present in some subcollections.) We used 235 dimensions and a cosine similarity measure for all collections.
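The sketch below shows the shape of this adhoc matching against a single subcollection. The dictionaries, array names, and cosine routine are illustrative assumptions, not the actual TREC software.

    # Sketch of adhoc matching against one subcollection.  term_index maps a term
    # to its row in Tk; term_weight holds that subcollection's (1 - entropy) global
    # weights; sk holds the singular values; doc_vecs holds the k-dimensional
    # document vectors.  All names are illustrative.
    import numpy as np
    from collections import Counter

    def query_vector(query_terms, term_index, term_weight, Tk, sk):
        counts = Counter(t for t in query_terms if t in term_index)
        x = np.zeros(Tk.shape[0])                       # weighted term histogram
        for term, tf in counts.items():
            x[term_index[term]] = np.log(tf + 1.0) * term_weight[term]
        return (x @ Tk) / sk                            # same folding-in as documents

    def rank_by_cosine(q, doc_vecs):
        sims = (doc_vecs @ q) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        order = np.argsort(-sims)
        return order, sims[order]                       # best-first indices, cosines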
We submitted results from two sets of adhoc queries. The two sets of adhoc results differed only in how the information from the 9 subcollections was combined to arrive at a single ranking. In one case, we combined the information from the 9 databases by simply taking the raw cosines from the different collections and ranking them from largest to smallest. We call these the adhoc_topic cosine results. In the other case, we normalized the cosines within each subcollection before combining. That is, within each subcollection, we transformed the cosines to z-scores so that they had a mean of 0 and a standard deviation of 1. We then combined across collections using these normalized z-scores rather than the raw cosines, again ranking from largest to smallest. We call these the adhoc_topic normalized_cosine results. This method of normalizing scores offers somewhat more flexibility in combining information from many subcollections. It means, for example, that different numbers of dimensions or similarity measures could be used in the different subcollections, but combined on the basis of a comparable score. In the TREC experiments, all comparisons used the same number of dimensions, so this normalization equated real differences in the data rather than statistical artifacts of the analysis.
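A minimal sketch of the two combination strategies follows; `results` maps a subcollection name to (document id, cosine) pairs, and all names are illustrative rather than taken from the actual system.

    # Sketch of combining ranked results from the 9 subcollections: either raw
    # cosines or within-subcollection z-scores.  Illustrative names only.
    import numpy as np

    def merge(results, normalize=False):
        pooled = []
        for coll, pairs in results.items():
            scores = np.array([c for _, c in pairs], dtype=float)
            if normalize:                       # z-score: mean 0, standard deviation 1
                scores = (scores - scores.mean()) / scores.std()
            pooled.extend((coll, doc, s) for (doc, _), s in zip(pairs, scores))
        return sorted(pooled, key=lambda r: r[2], reverse=True)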
2.3.3 Routing queries

For the routing queries, we created a filter or profile for each of the 50 training topics. We submitted results from two sets of routing queries. In one case, the filter was based on just the topic statements - i.e., we treated the routing queries as if they were adhoc queries. The filter was located at the vector sum of the terms in the topic. We call these the routing_topic cosine results. In the other case, we used feedback about relevant documents from the training set. The filter in this case was derived by taking the vector sum of all relevant documents. We call these the routing_reldocs cosine results. This was an atypical variant of relevance feedback in that we replaced (rather than combined) the original topic with relevant documents. These two extremes provide baselines against which to compare methods for combining information from the original query and feedback about relevant documents.

In both cases, the filter was a single vector. New documents were matched against the filter vector and ranked in decreasing order of similarity. We have previously conducted experiments which suggest that performance can be improved if the filter is represented as several separate vectors. We did not use this method for the TREC results we submitted, but would like to do so in subsequent experiments. (See also Kane-Esrig et al., 1991, or Foltz and Dumais, 1992, for a discussion of multi-point interest profiles in LSI.) For each topic, a separate filter vector was created in 4 training databases (WSJ1, AP1, FR1, ZIFF1). Thus each of the 50 filters was represented in 4 different databases.

New documents (those from WSJ2, AP2, FR2, ZIFF2) were automatically pre-processed and indexed as described above. New documents were "folded in" to the comparable training database and compared to the filter vector in that database. For example, a new document from WSJ2 was added to the WSJ1 database and compared to the 50 filters in that database. It is important to note that only term vectors and term weights from the training subcollections were used in indexing and comparing these new documents to the routing filters. We used 250 dimensions and a cosine similarity measure for all routing comparisons. Results from the different training subcollections were combined using raw cosine values.

2.3.4 Matching/retrieval times

Both queries (or filters) and documents were represented as reduced-dimension vectors. The cosine between each query and every document was computed, and documents were ranked in decreasing order of similarity to the query. The cosines were computed using 235 dimensions for the adhoc queries and 250 dimensions for the routing queries. Although there are many fewer dimensions than in standard vector retrieval, the entries are almost all non-zero, so inverted indices are not useful. This means that each query must be compared to every document. For 250-dimensional vectors, about 50,000 cosines can be computed per minute on a Sparc2. This time includes both comparison time and ranking time, and assumes that all document vectors are pre-loaded into memory. For the adhoc queries, the time to compare a query to the 742,986 documents was about 12 minutes if all comparisons were sequential. It is straightforward to split this matching across several machines or to use parallel hardware. Preliminary experiments using a 16,000 PE MasPar showed that 50,000 cosines could be computed and sorted in 1 second. For the routing queries, the time to compare a new document to the filters (50 filters in each of 4 databases) was about 0.2 sec on a Sparc2 even when the term and filter vectors were read from disk.

3. Improving performance

3.1 Time

The LSI/SVD system was built as a research prototype to investigate many different information retrieval and interface issues. Retrieval efficiency was not a central concern, because we first wanted to assess whether the method worked before worrying about efficiency, and because the initial applications of LSI involved much smaller databases of a few thousand documents. Almost no effort went into re-designing the tools to work efficiently for the large TREC databases. There are some obvious ways to decrease indexing, SVD, and retrieval time, and we discuss some of them below.

3.1.1 Pre-processing, SVD, and database creation

About 10% of the time is spent in unnecessary I/O translation (e.g., transposing large matrices), and this will be eliminated soon. About 60%-70% of the time is spent in the SVD. SVD algorithms get faster all the time. The sparse, iterative algorithm we now use is about 100 times faster than the method we used initially. There are the usual speed-memory tradeoffs in the SVD algorithms, so time can probably be decreased somewhat by using a different algorithm and more memory. Parallel algorithms will help a little, but probably only by a factor of 2. Finally, all calculations are now done in double precision, so both time and memory could be decreased by using single precision.
Preliminary experiments with smaller IR test collections suggest that this decrease in precision will not lead to numerical problems for the SVD algorithm. It is important to note that the pre-processing and SVD analyses are one-time-only costs for relatively stable domains. We have also found that at least as many new items as were originally scaled can be added to an existing SVD database without redoing the SVD scaling or seriously diminishing retrieval effectiveness.

3.1.2 Query construction and retrieval

Some calculations (e.g., scalings of various sorts) were done on the fly but could easily be precomputed. In addition, all calculations were done in floating point, and could be sped up using integer arithmetic. By dropping all very low vector values, sparse vectors could be constructed to take advantage of efficient methods currently used by vector systems. Query vectors were compared to every document. Document clustering could be used to reduce the number of comparisons. Query matching can also be improved tremendously by simply using more than one machine or parallel hardware. Parallel query matching produces much larger gains than any of the modifications discussed above. Using a 16,000 PE MasPar, with no attempt to optimize the data storage or sorting, we decreased the time required to match a 250-dimensional query vector against all document vectors and sort by a factor of 60 to 100.
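Two of these ideas - precomputing the per-document scalings so that matching reduces to a single matrix-vector product, and zeroing near-zero entries to obtain sparse vectors - are sketched below. The threshold value and the names are assumptions made for illustration, not measurements from these experiments.

    # Sketch of two possible speedups: precompute unit-length document vectors so
    # each query needs one matrix-vector product, and drop near-zero entries so
    # sparse machinery could be reused.  Threshold and names are assumptions.
    import numpy as np

    def precompute_unit_docs(doc_vecs):
        norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        return doc_vecs / norms                     # done once, reused per query

    def match(unit_doc_vecs, q):
        return np.argsort(-(unit_doc_vecs @ (q / np.linalg.norm(q))))

    def sparsify(vectors, threshold=0.01):
        v = np.array(vectors, dtype=float)
        v[np.abs(v) < threshold] = 0.0              # candidate for sparse storage
        return v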
3.2 Accuracy

This section on accuracy is divided into two main parts - one examining the results of the two adhoc and two routing runs we submitted, and the other looking in detail at some failures of the LSI system. Compared to other systems, LSI performance was average on the adhoc topics and somewhat below average on the routing topics (though see the section on misses for a discussion of sizable improvements). Because there were so many differences between systems (tokenization, query construction, representation, matching, amount of human effort, etc.) it is difficult to isolate performance differences to specific, theoretically interesting components. For this reason, we focus on our own experimental results and failure analyses as a first step in understanding and improving performance.

3.2.1 LSI experiments

3.2.1.1 Adhoc - normalization experiment. We submitted results from two sets of adhoc queries. The two sets of adhoc query results differed only in how the similarities (cosines) from the 9 subcollections were combined to arrive at a single ranking. In one case, adhoc_topic cosine, we simply used the raw cosines from the different collections. In the other case, adhoc_topic normalized_cosine, we normalized the cosines within each subcollection before combining. The differences in accuracy between these two methods were not very large. The raw cosine method of combining was about 15% better overall than the normalized cosine method (2786 vs. 2469 relevant articles, and .1274 vs. .1100 11-pt precision). In general, some form of normalization is needed to correct for measurement artifacts (e.g., lower mean cosine in higher dimensions) when combining scores from many different subcollections. Since we used the same number of dimensions for comparisons in each subcollection, this correction was unnecessary in the present experiments. Normalization appeared to have some undesirable consequences for a few topics. Because normalization subtracts the mean cosine and divides by the standard deviation, the same raw cosine will have a higher normalized score if it comes from a subcollection with a low mean cosine and low variance - but these are precisely the subcollections that we probably want to avoid!

3.2.1.2 Routing - feedback experiment. For the routing queries, we created two filters or queries for each of the 50 training topics. In one case, routing_topic cosine, the routing query was based on just terms in the topic statements, as if it had been an adhoc query. In the other case, routing_reldocs cosine, we used feedback about relevant documents from the training set and located the filter at the vector sum of the relevant documents. Our intent is to use these two runs as baselines against which alternative methods for combining the original query and relevant documents can be compared. Somewhat surprisingly, the query using just the topic terms was about 25% more accurate than the feedback query (2234 vs. 1837 relevant articles; .1235 vs. .0972 11-pt precision). We suspect that part of the problem was attributable to the small number and inaccuracy of relevance judgements in the initial training set. This had substantial impact on performance for some topics because our feedback queries were based only on the relevant articles (and ignored the original topic description). For Topic 050, for example, there was only one relevant article, and it did not appear to us to be relevant to the topic - the topic was about "potential military interest in virtual reality", and the so-called relevant article was about "Denmark's crisis on nuclear power threatening its membership in NATO". Not surprisingly, using only this article as a query, no relevant articles about virtual reality were returned. Now that we have a larger number of hopefully more accurate relevance judgements, we will repeat this basic comparison. We will then use these two baseline runs to explore: a) combining the relevant documents and the original topic; b) selecting only some relevant documents and/or discriminating terms; and c) representing the query vector as several points of interest rather than a single average.

3.2.2 Failure analyses

In order to better understand retrieval performance we examined two kinds of retrieval failures: false alarms and misses. False alarms are documents that LSI ranks highly but that are judged to be irrelevant. Misses are relevant documents that are not in the top 200 returned by LSI. The observations presented below are based on preliminary analyses of some topics on which LSI performed poorly. Although we suggest methods for improving performance, most have not been tested systematically on the entire TREC collection, although we plan to do so.

3.2.2.1 False alarms. The most common reason for false alarms (accounting for approximately 50% of those we examined) was lack of specificity. These highly ranked but irrelevant articles were generally about the topic of interest but did not meet some of the restrictions described in the topic. Many topics required the kind of detailed processing or fact-finding that the LSI system was not designed to address. Precision of LSI matching can be increased by many of the standard techniques - proper noun identification, use of syntactic or statistically-derived phrases, or a two-pass approach involving a standard initial global matching followed by a more detailed analysis of the top few thousand documents.
Salton and Buckley (SMART's global and local matching), Evans (CLARIT's evoke and discriminate strategy), Nelson (ConQuest's global match followed by the use of locality of information), and Jacobs, Krupka and Rau (GE's pre-filter followed by a variety of more stringent tests) all used two-pass approaches to good advantage in the TREC tests. We intend to try some of these methods for TREC-2, and will focus on general-purpose, completely automatic methods that do not have to be modified for each new domain or query restriction.

Another common cause of false alarms appears to be inappropriate query pre-processing. The use of negation is the best example of this problem. About 20% of the TREC topics contained explicit negations. LSI included negated words in the query along with all the other words. Topic 094, about computer-aided crime, also stated that articles that simply mentioned the spread of a computer virus or worm were NOT relevant. The first 20 documents that LSI returned were all about computer viruses! Another example of inappropriate query processing involved the use of logical connectives. LSI does not handle Boolean combinations of words, and sometimes returned articles covering only a subset of ANDed topics.

Finally, it is not at all clear why about 20% of the false alarms were returned by LSI. Since LSI uses a statistically-derived "semantic" space and not surface-level word overlap for matching queries to documents, it is sometimes difficult to understand why a particular document was returned. One advantage of the LSI method is that documents can match queries even when they have no words in common; but this can also produce some spurious hits. Topic 066, about natural language processing technology, returned several articles about chip processing technologies and high technology products in general. Another reason for false alarms could be inappropriate word sense disambiguation. LSI queries were located at the weighted vector sum of the query words, so words were "disambiguated" to some extent by the other query words. Similarly, the initial SVD analysis used the context of other words in articles to determine the location for each word in the LSI space. However, since each word has only one location, a word with several distinct senses can end up "in the middle of nowhere". A related possibility concerns long articles. Lengthy articles which talk about many distinct subtopics were averaged into a single document vector, and this can sometimes produce spurious matches. Breaking larger documents into smaller subsections and matching on these might help.

3.2.2.2 Misses. For this analysis we examined a random subset of relevant articles that were not in the top 200 returned by LSI. Many of the relevant articles were fairly highly ranked by LSI, but there were also some notable failures that would be seen only by the most persistent readers. So far, we have not systematically distinguished between misses that "almost made it" and those that were much further down the list. About 40% of the misses we examined represent articles that were primarily about a different topic than the query, but contained a small section that was relevant to the query. Because documents are located at the average of their terms in LSI space, they will generally be near their dominant theme, and this is a desirable feature of the LSI representation. Some kind of local matching should help in identifying less central themes in documents.
Another 40% of the misses appear to be the result of inappropriate selection of subcollections. Recall that we analyzed 9 subcollections separately and combined the similarities later to arrive at a single ranked list. The different subcollections sometimes had different densities of documents on some topics. This is most evident when considering computer-related topics. For general collections like AP or WSJ, relatively few of the articles were about computers, and we suspect that few of the 235-250 dimensions in the LSI semantic space were devoted to distinguishing among such documents. Thus, the similarities of these documents to queries about computers were relatively high and undifferentiated. For the ZIFF collections, on the other hand, most of the LSI dimensions were used to represent differences among computer concepts. Similarities of the top few hundred articles to computer queries were lower on average, but much finer distinctions among subtopics were possible. One consequence of this was that, when combining across collections, few articles from the ZIFF subcollections were included for queries about computer-related topics! Different term weights in the different subcollections also contributed to this problem. Inappropriate subcollection selection accounts for many of LSI's failures on computer-related topics.

Consider, for example, Topic 037 about IBM's SAA standards. LSI performs very poorly compared to other systems on this topic, returning the fewest relevant articles (19) and having the lowest 11-pt precision (.0088). Summing over all systems for this topic, 68% of the total number of returned articles (554/812) and 99% of the relevant articles (444/449) were from ZIFF2. For LSI, however, only 17 of the top 200 articles (8%) were from ZIFF2; all 17 of these articles were relevant. When a comparable proportion of LSI documents were selected from ZIFF2, the number relevant increased to 175 and the 11-pt precision increased to .3532. This performance places LSI slightly above the median for Topic 037.

We performed the same analysis for all topics in which more than 33% of the total returned articles were from ZIFF. There were 26 such topics (19 adhoc topics from 026-050, and 7 routing topics). For these topics, the mean percent of ZIFF articles chosen by all systems was 59%, compared with 9% for LSI. When comparable proportions of ZIFF articles were selected for LSI, the average number of relevant documents increased from 26 to 58, and the 11-pt precision increased from .0554 to .1111. A total of 700 new relevant documents were found for the 19 adhoc topics, and 125 new relevant documents were found for the 7 routing topics. Performance improvements were observed for 23 of the 26 topics - two of the topics in which there were no improvements were about AT&T products, where poor pre-processing omitted AT&T from the LSI query (see below).

While these results are encouraging, the problem of how to select appropriate subcollections is not solved. For the routing topics, we could use training data to set some a priori mixture of articles from various subcollections. This strategy is not, however, generally applicable for adhoc queries. We will examine more appropriate ways of combining across subcollections to take distributional effects like this into account. Alternatively, we could use randomly selected documents (rather than topically organized ones) to create the subcollections. Finally, we could use a single large combined scaling in which there would be no need to combine across subcollections.
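The exact re-selection procedure used in the analysis above is not spelled out, so the following is only one simple way a per-subcollection quota could be imposed on a merged ranking; the function and its interface are illustrative assumptions.

    # One simple way to impose a target proportion of documents from a given
    # subcollection when forming a fixed-size return set.  Illustrative sketch,
    # not the procedure actually used in the analysis above.
    def select_with_quota(ranked, target_coll, proportion, n=200):
        # ranked: merged list of (collection, doc_id, score), best first.
        quota = int(round(proportion * n))
        from_target = [r for r in ranked if r[0] == target_coll][:quota]
        rest = [r for r in ranked if r[0] != target_coll][:n - len(from_target)]
        return sorted(from_target + rest, key=lambda r: r[2], reverse=True)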
Finally, some misses were attributable to poor text (and query) pre-processing and tokenization. Since we do not keep single letters or use a database of company and location names, several important acronyms like U.S. and AT&T disappeared completely from both articles and queries! Not surprisingly, this resulted in many missed documents. We noticed that many of the top performing automatic systems used SMART's pre-processing, and we hope to do so as well for TREC-2. This will allow us to better understand the usefulness of LSI per se, without many of the additional confounds introduced by different indexing.

3.3 Open experimental issues

The results of the failure analyses suggest several directions to pursue for TREC-2, including: improving pre-processing and tokenization; exploring some precision-enhancing methods; and developing methods for more effectively combining across subcollections. We also hope to explore three additional issues.

3.3.1 Separate vs. combined scaling

We used 9 separate subscalings for the TREC experiments. Our decision to use the 9 subcollections was largely a pragmatic one initially. We would like to create a single large scaling and compare the results with those obtained using the subcollections.

3.3.2 Centroid query vs. many separate points of interest

A single vector was used to represent each query. In some cases the vector was the average of terms in the topic statement, and in other cases the vector was the average of previously identified relevant documents. A single query vector can be inappropriate if interests are multifaceted and these facets are not near each other in the LSI space. In the case of the routing queries, for example, we could match new documents against each of the previously identified relevant documents separately rather than against their average. We have developed techniques that allow us to match using a controllable compromise between averaged and separate vectors (Kane-Esrig et al., 1991). We did not use this method for TREC, but would like to do so for TREC-2.
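A much-simplified sketch of the "separate points of interest" idea follows: each relevant document is kept as its own point and a new document is scored by its best cosine to any point. This is an illustrative variant only, not the relevance density method of Kane-Esrig et al. (1991), and the names are assumptions.

    # Simplified sketch of a multi-point interest profile: keep each relevant
    # document vector as a separate point and score a new document by its maximum
    # cosine to any point.  Illustrative only; not the relevance density method.
    import numpy as np

    def multipoint_score(new_doc_vec, interest_points):
        d = new_doc_vec / np.linalg.norm(new_doc_vec)
        P = np.asarray(interest_points, dtype=float)
        P = P / np.linalg.norm(P, axis=1, keepdims=True)
        return float((P @ d).max())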
3.3.3 Interactive interfaces

All LSI evaluations were conducted using a non-interactive system in essentially batch mode. It is well known that one can have the same underlying retrieval and matching engine, but achieve very different retrieval success using different interfaces. We would like to examine the performance of real users with interactive interfaces. A number of interface features could be used to help users make faster (and perhaps more accurate) relevance judgements, or to help them explicitly reformulate queries. (See Dumais and Schmitt, 1991, for some preliminary results on query reformulation and relevance feedback.) Another interesting possibility involves returning something richer than a rank-ordered list of documents to users. For example, a clustering and graphical display of the top-k documents might be quite useful. We have done some preliminary experiments using clustered return sets, and would like to extend them to the TREC-2 collection. The general idea is to provide people with useful interactive tools that let them make good use of their knowledge and skills, rather than attempting to build all the smarts into the database representation or matching components.

4. Onward to TREC-2

We were quite pleased that we were able to use many of the existing LSI/SVD tools on the TREC collection. The most important finding in this regard was that the large, sparse SVD problems could be computed without numerical or convergence problems. The computations were not fast, but a day of CPU on existing workstations (for each subcollection) is certainly feasible, especially given that these calculations are only done once at the beginning to create the database. We have already computed some larger SVDs (100k x 200k) and would like to take advantage of this for TREC-2.

In terms of accuracy, LSI performance was reasonable. There are some obvious ways to improve the initial pre-processing and indexing, and we will do so. We would like to use indexing methods that are as similar as possible to other automatic vector methods, so that we can examine the contribution of LSI per se. We also now have many of the basic tools in place and should be able to conduct more experiments comparing various indexing and query matching ideas using the same underlying LSI engine.

LSI was designed as a method to increase recall, especially for the short queries that users typically generate. For the TREC application we would like to explore some precision-enhancing tools as well. Some of these will probably consist of more refined matching algorithms, but we also hope to move in the direction of interactive interfaces. We see this as an effective way of combining human and machine intelligence.

5. References

[1] Berry, M. W. Large scale singular value computations. International Journal of Supercomputer Applications, 1992, 6(1), 13-49.
[2] Cullum, J. K. and Willoughby, R. A. Lanczos Algorithms for Large Symmetric Eigenvalue Computations - Vol. 1: Theory (Chapter 5: Real rectangular matrices). Birkhauser, Boston, 1985.
[3] Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6), 391-407.
[4] Dumais, S. T. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 1991, 23(2), 229-236.
[5] Dumais, S. T. and Schmitt, D. G. Iterative searching in an online database. In Proceedings of the Human Factors Society 35th Annual Meeting, 1991, 398-402.
[6] Foltz, P. W. and Dumais, S. T. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, Dec. 1992, 35(12), 51-60.
[7] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of SIGIR, 1988, 465-480.
[8] Kane-Esrig, Y., Streeter, L., Dumais, S. T., Keese, W. and Casella, G. The relevance density method for multi-topic queries in information retrieval. In Proceedings of the 23rd Symposium on the Interface, E. Keramidas (Ed.), 1991, 407-410.
[9] Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

Appendix

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). We take a large term-document matrix and decompose it into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination.
More formally, any rectangular matrix X, for example a t x d matrix of terms and documents, can be decomposed into the product of three other matrices:

       X     =    T0     S0     D0'
    (t x d)    (t x r) (r x r) (r x d)

such that T0 and D0 have orthonormal columns, S0 is diagonal, and r is the rank of X. This is the so-called singular value decomposition of X. If only the k largest singular values of S0 are kept along with their corresponding columns in the T0 and D0 matrices, and the rest deleted (yielding matrices S, T and D), the resulting matrix X̂ is the unique matrix of rank k that is closest in the least squares sense to X:

       X     ≈     X̂    =     T      S      D'
    (t x d)     (t x d)    (t x k) (k x k) (k x d)

The idea is that the X̂ matrix, by containing only the first k independent linear components of X, captures the major associational structure in the matrix and throws out noise. It is this reduced model, usually with k ≈ 100, that we use to approximate the term-to-document association data in X. Since the number of dimensions in the reduced model (k) is much smaller than the number of unique terms (t), minor differences in terminology are ignored. In this reduced model, the closeness of documents is determined by the overall pattern of term usage, so documents can be near each other regardless of the precise words that are used to describe them, and their description depends on a kind of consensus of their term meanings, thus dampening the effects of polysemy. In particular, this means that documents which share no words with a user's query may still be near it if that is consistent with the major patterns of word usage. We use the term "semantic" indexing to describe our method because the reduced SVD representation captures the major associative relationships between terms and documents.

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a k-dimensional vector representing the location of each term and document in the k-dimensional representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query. We make no attempt to interpret the underlying dimensions or factors, nor to rotate them to some intuitively meaningful orientation. The analysis does not require us to be able to describe the factors verbally, but merely to be able to represent terms, documents and queries in a way that escapes the unreliability, ambiguity and redundancy of individual terms as descriptors.

Choosing the appropriate number of dimensions for the LSI representation is an open research question. Ideally, we want a value of k that is large enough to fit all the real structure in the data, but small enough so that we do not also fit the sampling error or unimportant details. If too many dimensions are used, the method begins to approximate standard vector methods and loses its power to represent the similarity between words. If too few dimensions are used, there is not enough discrimination among similar words and documents. We typically find that performance improves as k increases for a while, and then decreases (Dumais, 1991).
That LSI typically works well with a relatively small (compared to the number of unique terms) number of dimensions shows that these dimensions are, in fact, capturing a major portion of the meaningful structure.
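As a small numerical check of the appendix identities (purely illustrative; the matrix here is random, not TREC data), the sketch below verifies that X = T0 S0 D0' holds exactly and that the truncated T S D' is the best rank-k approximation in the least-squares sense, with squared error equal to the sum of the discarded squared singular values.

    # Numerical check of the appendix identities on a small random matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((12, 8))                        # a small t x d matrix

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    assert np.allclose(X, T0 @ np.diag(s0) @ D0t)  # X = T0 S0 D0' exactly

    k = 3
    X_hat = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]
    # Least-squares optimality: the squared Frobenius error of the rank-k
    # truncation equals the sum of the squared discarded singular values.
    err = np.linalg.norm(X - X_hat, "fro") ** 2
    assert np.isclose(err, np.sum(s0[k:] ** 2))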