LSI meets TREC: A Status Report

Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960-1910
std@bellcore.com

1. Overview of Latent Semantic Indexing

Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., Salton & McGill, 1983) in which the dependencies between terms and between documents, in addition to the associations between terms and documents, are explicitly taken into account. This is done by simultaneously modeling all the associations of terms and documents. We assume that there is some underlying or "latent" structure in the pattern of word usage across documents, and use statistical techniques to estimate this latent structure. A description of terms, documents and user queries based on the underlying "latent semantic" structure (rather than surface-level word choice) is used for representing and retrieving information.

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). A large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and queries directly as sets of independent words, LSI represents them as continuous values on each of the k orthogonal indexing dimensions. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. The SVD technique can capture such structure better than simple term-term or document-document correlations and clusters. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. LSI is a completely automatic method. (The Appendix provides a brief overview of the mathematics underlying the LSI/SVD method. Deerwester et al., 1990, and Furnas et al., 1988, present additional mathematical details and examples.)

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a vector representing the location of each term and document in the k-dimensional LSI representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query.

The LSI method has been applied to many of the standard IR collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods and other variants in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991). LSI has also been applied in experiments on relevance feedback (Dumais and Schmitt, 1991), and in filtering applications (Foltz and Dumais, 1992).
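To make the geometric picture concrete, the following toy example builds a tiny term-document count matrix, truncates its SVD to k = 2 dimensions, and compares term vectors by cosine. The data and variable names are purely illustrative; this is a minimal sketch, not the TREC-scale software described in the rest of this paper.

    # Toy illustration of the reduced-dimension LSI representation.  Terms that
    # occur in similar contexts end up with similar k-dimensional vectors.
    import numpy as np

    terms = ["car", "automobile", "flower", "petal"]
    # Raw term-document counts (rows = terms, columns = 5 documents); made-up data.
    X = np.array([[2., 1., 0., 0., 0.],
                  [0., 1., 1., 0., 0.],
                  [0., 0., 0., 2., 1.],
                  [0., 0., 0., 1., 2.]])

    k = 2                                              # retained dimensions
    T, s, Dt = np.linalg.svd(X, full_matrices=False)   # X = T diag(s) Dt
    term_vecs = T[:, :k] * s[:k]                       # k-dim term vectors
    doc_vecs = Dt[:k, :].T * s[:k]                     # k-dim document vectors

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # "car" and "automobile" share contexts, so their reduced vectors are close;
    # "car" and "flower" do not, so theirs are nearly orthogonal.
    print(cosine(term_vecs[0], term_vecs[1]), cosine(term_vecs[0], term_vecs[2]))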
The TREC conference was an opportunity for us to "scale up" our tools, and to explore the LSI dimension-reduction ideas using a very rich corpus of word usage. This large collection of standard documents and relevance judgements should be a valuable IR resource and an important step in the systematic development of more effective retrieval systems.

2. Application of LSI to the TREC collection

2.1 Overview

We used existing LSI/SVD software for analyzing the training and test collections, and for query processing and retrieval. For pragmatic reasons, we divided the TREC collection into 9 subcollections - AP1, DOE1, FR1, WSJ1, ZIFF1, AP2, FR2, WSJ2, ZIFF2. Queries were passed against the appropriate subcollections, and the returns recombined to arrive at a single ranked output. There were three main stages involved in processing documents and constructing the relevant data structures. All steps were completely automatic and involved no human intervention. The resulting reduced-dimension representations were used for matching and retrieval.

1. Pre-processing and indexing (extracting terms, calculating term weights, etc.)
2. Computing the SVD (number of dimensions ranged from 235-310)
3. Adding new documents and/or terms

2.1.1 Pre-processing and indexing

We did minimal pre-processing on the raw text of the TREC documents. Some markups (any text within <> delimiters) were removed, and all hand-indexed entries were removed from the WSJ and ZIFF collections. Upper case characters were translated into lower case, punctuation was removed, and white spaces were used to delimit terms. A minimum term length of 2 was used. All terms occurring in more than one document, and not on a stop list of 439 words, were used to generate a term-document matrix. We did not use: stemming, phrases, syntactic or semantic parsing, word sense disambiguation, heuristic association, spelling checking or correction, proper noun identification, complex tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing.

The entries in the term-document matrix were then transformed using a log(tf_td + 1) x (1 - entropy_t) weighting. The global weight assigned to each term was 1 - entropy (or noise), where

    entropy_t = - Σ_d (p_td log p_td) / log(ndocs),

ndocs is the number of documents in the collection, t is the index for terms, d is the index for documents, p_td = tf_td / gf_t, tf_td is the frequency of term t in document d, and gf_t is the global frequency of occurrence of term t in the collection. (For simplicity, we refer to this as the log*entropy term weight.)
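As a concrete restatement of this weighting, the sketch below computes the log*entropy transform for a small term-by-document count matrix. It is a minimal sketch with illustrative names, not the C/awk/shell tools actually used for the TREC runs.

    # Minimal sketch of the log*entropy weighting described above.  `tf` is a
    # term-by-document matrix of raw frequencies; ndocs must be at least 2.
    import numpy as np

    def log_entropy(tf):
        tf = np.asarray(tf, dtype=float)
        ndocs = tf.shape[1]
        gf = tf.sum(axis=1, keepdims=True)                      # global frequency gf_t
        p = np.divide(tf, gf, out=np.zeros_like(tf), where=gf > 0)
        safe_p = np.where(p > 0, p, 1.0)                        # so p*log(p) -> 0 when p = 0
        entropy = -(p * np.log(safe_p)).sum(axis=1, keepdims=True) / np.log(ndocs)
        return np.log(tf + 1.0) * (1.0 - entropy)               # log(tf+1) * (1 - entropy)

With this weight, a term confined to a single document receives a global weight of 1, while a term spread evenly across all documents receives a weight near 0.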
The transformed term-document matrix was used as input to the SVD. The results of the SVD analysis, a k-dimensional real vector for each term and each document and k singular values, were stored in a database. The terms and their log*entropy weights were also stored in a database. Each of the 9 subcollections was processed separately. Because of software constraints, the initial indexing and SVD analysis were done on a random subset of 20,000-57,000 documents. The remaining documents were added into the resulting data structure as described below. (We have recently completed an SVD analysis of the complete 226,000 document by 90,000 term DOE collection, but this was not used in the experiments reported below.) Table 1 summarizes the number of terms and documents in the samples used for scaling as well as the total number of terms and documents in the databases.

    collection   sampled docs   sampled terms   added docs   total docs   total terms   ndim   database (meg)
    DOE1              50000          42221         176087       226087        42221      250        262
    WSJ1              49555          70019          49556        99111        70019      250        169
    AP1               42465          78167          42465        84930        78167      250        163
    ZIFF1             37590          60565          37590        75180        60565      250        135
    FR1               26207          54713              0        26207        54713      250         80
    WSJ2              50000          76080          24520        74520        76080      235        141
    AP2               50000          82997          29923        79923        82997      235        153
    ZIFF2             56920          72197              0        56920        72197      235        121
    FR2               20108          48728              0        20108        48728      235         64
    totals           382845         585687         360141       742986       585687

    Table 1. Summary of the 9 subcollections

NOTES
1. In the union of the 9 subcollections, there were 585687 word tokens and 200785 word types.
2. In general, database size will be approximately (ndocs + nterms) * ndim * 4 bytes.
3. The total combined database size was 1288 meg (750000 docs and 585000 terms). If a single combined database had been used, the total database size would have been smaller because many terms are now represented in more than one database. There were only about 200,000 unique terms, so a combined database would have been about 930 meg. The fact that the number of terms does not grow linearly with the number of documents will help make large SVD calculations possible.

2.1.2 SVD

The SVD program takes the log*entropy transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix. The result of the SVD analysis is a reduced-dimension vector for each term and each document, and a vector of the singular values. For TREC, we computed a separate SVD for each of the 9 subcollections. The number of dimensions, k, ranged from 235 to 310.

2.1.3 Adding new documents and terms

As noted above, the initial indexing and SVD were typically performed on a random sample of documents from each subcollection. The documents not included in the sample were "folded in" to the database. These documents were located at the weighted vector sum of their constituent terms. That is, the vector for a new document was computed using the term vectors for all terms in the document. These term vectors were combined using the appropriate term weights, and the singular values were used to differentially weight each dimension. (Details are given in Deerwester et al., 1990, p. 399.) For documents that are actually present in the term-document matrix, the derived vector corresponds exactly to the document vector given by the SVD. New terms can be added in an analogous fashion. The vector for a new term is computed using the document vectors of all documents in which the term appears. For the TREC experiments, only new documents, not terms, were added. The sizes of the complete databases (including all documents which were added) are summarized in Table 1.

When adding documents and terms in this manner, we assume that the derived "semantic space" is fixed and that new items can be fit into it. In general, this is not the same space that one would obtain if a new SVD were calculated using both the original and new documents. In previous experiments, we found that sampling and scaling 50% of the documents, and "folding in" the remaining documents, resulted in performance that was indistinguishable from that observed when all documents were scaled.
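The folding-in computation itself is small. The sketch below shows the form it takes, following the description in Deerwester et al. (1990); the variable names are illustrative, and the singular values are assumed here to be stored as a length-k vector.

    # Sketch of "folding in" a document that was not part of the SVD sample.
    # Tk: term-by-k matrix of term vectors; sk: the k singular values;
    # weighted_tf: the log*entropy-weighted term frequencies of the new document,
    # one entry per vocabulary term.  Illustrative names, not the TREC code.
    import numpy as np

    def fold_in_document(weighted_tf, Tk, sk):
        # Weighted vector sum of the document's term vectors, with each dimension
        # rescaled by the corresponding singular value: d_hat = x' Tk inv(Sk).
        return (weighted_tf @ Tk) / sk

    # A new term would be folded in symmetrically, from the vectors of the
    # documents in which it occurs; only new documents were added for TREC.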
2.2 Timing data

For the TREC experiments, all the pre-processing and retrieval was done on a Sparc2 with 384 meg of RAM. The SVD analyses were run on a DEC 5000 with approximately 380 meg of RAM. Table 2 provides a summary of times (in minutes) to process documents and create the necessary data structures. It is important to note that these costs are incurred only once at the beginning. Subsequent query processing does not require any new SVD calculations or database updates.

2.2.1 Pre-processing and indexing

This includes the time for sampling documents (if necessary), processing the raw ascii text, creating the raw term-document matrix, calculating the log*entropy term weights (which requires two passes), and transforming the matrix entries using the log*entropy weights. A combination of C code, awk, and shell scripts was used. The time required for this depends on the amount of raw text, the number of terms, and the number of documents.

2.2.2 SVD

The SVD program computes the best k-dimensional approximation to the transformed term-document matrix. We used a sparse, iterative single-vector Lanczos SVD code (Berry, 1992). The code is written in ANSI Fortran-77 using double precision arithmetic, and is available from Netlib. The number of singular values (dimensions) calculated for the TREC subcollections ranged from 235 to 310. As it turned out, we used only 235 or 250 dimensions for retrieval, so fewer dimensions could have been computed. Thus some of the reported SVD times are higher than necessary - in particular, the SVD times for AP1, ZIFF1, and FR1 would be approximately 20% lower if only 250 dimensions had been computed.

2.2.3 Adding new documents

The time required to add new documents includes the time to pre-process and index the text of the new documents as well as the time to compute the new document vectors.

2.2.4 I/O translation

Because several existing tools were patched together for the TREC experiments, there was some additional I/O translation involved. This will be removed soon.

    collection   index    SVD   adding new docs    I/O   TOTAL (mins)
    DOE1            49   1219               591    194           2053
    WSJ1           241   1474               404    174           2293
    AP1            271   1644               455    214           2584
    ZIFF1          241   1359               352    156           2108
    FR1            241    939                 0    133           1313
    WSJ2           427   1382               461    220           2490
    AP2            338   1210               273    218           2039
    ZIFF2          260   1452                 0    208           1920
    FR2            187    486                 0    105            778

    Table 2. Summary of LSI/SVD times (in minutes)

2.3 Retrieval

2.3.1 Query processing

Queries were automatically processed in the same way as documents. For queries derived from the topic statement, we began with the full text of each topic (all topic fields), and stripped out the SGML field identifiers. For feedback queries, we used the full text of relevant documents. We did not use: stemming, phrases, syntactic or semantic parsing, word sense disambiguation, heuristic association, spelling checking or correction, proper noun identification, complex tokenizers, a controlled vocabulary, a thesaurus, or any manual indexing. Note also that we did not use any Boolean connectors or proximity operators in query formulation. The implicit connectives, as in ordinary vector methods, fall somewhere between ORs and ANDs, but with an additional kind of "fuzziness" introduced by the dimension-reduced association matrix representation of terms and documents.

2.3.2 Adhoc queries

The topic statements were automatically processed as described above to generate a list of query terms and their frequencies. This histogram of query terms was used to form a "query vector". A query vector was the weighted vector sum of its constituent term vectors. A separate query vector was created for matching against each of the 9 databases (DOE1, WSJ1, AP1, FR1, ZIFF1, WSJ2, AP2, FR2, ZIFF2). For each subcollection, all query terms occurring in that database and their term weights were used. For example, a DOE1 query vector was created using the term weights and term vectors from the DOE1 database, and it was compared against the 226k documents in DOE1. This procedure was repeated for the remaining 8 collections, resulting in a total of 746k similarities between each adhoc query and the 746k documents in the full TREC collection. (Note that we always started with the same query terms. However, these terms usually had somewhat different weights in the different collections, and some terms were not present in some subcollections.) We used 235 dimensions and a cosine similarity measure for all collections.
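The sketch below shows the shape of this adhoc matching against a single subcollection. The dictionaries, array names, and cosine routine are illustrative assumptions, not the actual TREC software.

    # Sketch of adhoc matching against one subcollection.  term_index maps a term
    # to its row in Tk; term_weight holds that subcollection's (1 - entropy) global
    # weights; sk holds the singular values; doc_vecs holds the k-dimensional
    # document vectors.  All names are illustrative.
    import numpy as np
    from collections import Counter

    def query_vector(query_terms, term_index, term_weight, Tk, sk):
        counts = Counter(t for t in query_terms if t in term_index)
        x = np.zeros(Tk.shape[0])                       # weighted term histogram
        for term, tf in counts.items():
            x[term_index[term]] = np.log(tf + 1.0) * term_weight[term]
        return (x @ Tk) / sk                            # same folding-in as documents

    def rank_by_cosine(q, doc_vecs):
        sims = (doc_vecs @ q) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
        order = np.argsort(-sims)
        return order, sims[order]                       # best-first indices, cosines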
We submitted results from two sets of adhoc queries. The two sets of adhoc results differed only in how the information from the 9 subcollections was combined to arrive at a single ranking. In one case, we combined the information from the 9 databases by simply taking the raw cosines from the different collections and ranking them from largest to smallest. We call these the adhoc_topic cosine results. In the other case, we normalized the cosines within each subcollection before combining. That is, within each subcollection, we transformed the cosines to z-scores so that they had a mean of 0 and a standard deviation of 1. We then combined across collections using these normalized z-scores rather than the raw cosines, again ranking from largest to smallest. We call these the adhoc_topic normalized_cosine results. This method of normalizing scores offers somewhat more flexibility in combining information from many subcollections. It means, for example, that different numbers of dimensions or similarity measures could be used in the different subcollections, but combined on the basis of a comparable score. In the TREC experiments, all comparisons used the same number of dimensions, so this normalization equated real differences in the data rather than statistical artifacts of the analysis.
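A minimal sketch of the two combination strategies follows; `results` maps a subcollection name to (document id, cosine) pairs, and all names are illustrative rather than taken from the actual system.

    # Sketch of combining ranked results from the 9 subcollections: either raw
    # cosines or within-subcollection z-scores.  Illustrative names only.
    import numpy as np

    def merge(results, normalize=False):
        pooled = []
        for coll, pairs in results.items():
            scores = np.array([c for _, c in pairs], dtype=float)
            if normalize:                       # z-score: mean 0, standard deviation 1
                scores = (scores - scores.mean()) / scores.std()
            pooled.extend((coll, doc, s) for (doc, _), s in zip(pairs, scores))
        return sorted(pooled, key=lambda r: r[2], reverse=True)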
2.3.3 Routing queries

For the routing queries, we created a filter or profile for each of the 50 training topics. We submitted results from two sets of routing queries. In one case, the filter was based on just the topic statements - i.e., we treated the routing queries as if they were adhoc queries. The filter was located at the vector sum of the terms in the topic. We call these the routing_topic cosine results. In the other case, we used feedback about relevant documents from the training set. The filter in this case was derived by taking the vector sum of all relevant documents. We call these the routing_reldocs cosine results. This was an atypical variant of relevance feedback in that we replaced (rather than combined) the original topic with relevant documents. These two extremes provide baselines against which to compare methods for combining information from the original query and feedback about relevant documents.

In both cases, the filter was a single vector. New documents were matched against the filter vector and ranked in decreasing order of similarity. We have previously conducted experiments which suggest that performance can be improved if the filter is represented as several separate vectors. We did not use this method for the TREC results we submitted, but would like to do so in subsequent experiments. (See also Kane-Esrig et al., 1991, or Foltz and Dumais, 1992, for a discussion of multi-point interest profiles in LSI.) For each topic, a separate filter vector was created in 4 training databases (WSJ1, AP1, FR1, ZIFF1). Thus each of the 50 filters was represented in 4 different databases.

New documents (those from WSJ2, AP2, FR2, ZIFF2) were automatically pre-processed and indexed as described above. New documents were "folded in" to the comparable training database and compared to the filter vector in that database. For example, a new document from WSJ2 was added to the WSJ1 database and compared to the 50 filters in that database. It is important to note that only term vectors and term weights from the training subcollections were used in indexing and comparing these new documents to the routing filters. We used 250 dimensions and a cosine similarity measure for all routing comparisons. Results from the different training subcollections were combined using raw cosine values.

2.3.4 Matching/retrieval times

Both queries (or filters) and documents were represented as reduced-dimension vectors. The cosine between each query and every document was computed, and documents were ranked in decreasing order of similarity to the query. The cosines were computed using 235 dimensions for the adhoc queries and 250 dimensions for the routing queries. Although there are many fewer dimensions than in standard vector retrieval, the entries are almost all non-zero, so inverted indices are not useful. This means that each query must be compared to every document. For 250-dimensional vectors, about 50,000 cosines can be computed per minute on a Sparc2. This time includes both comparison time and ranking time, and assumes that all document vectors are pre-loaded into memory. For the adhoc queries, the time to compare a query to the 742,986 documents was about 12 minutes if all comparisons were sequential. It is straightforward to split this matching across several machines or to use parallel hardware. Preliminary experiments using a 16,000 PE MasPar showed that 50,000 cosines could be computed and sorted in 1 second. For the routing queries, the time to compare a new document to the filters (50 filters in each of 4 databases) was about 0.2 sec on a Sparc2 even when the term and filter vectors were read from disk.

3. Improving performance

3.1 Time

The LSI/SVD system was built as a research prototype to investigate many different information retrieval and interface issues. Retrieval efficiency was not a central concern, because we first wanted to assess whether the method worked before worrying about efficiency, and because the initial applications of LSI involved much smaller databases of a few thousand documents. Almost no effort went into re-designing the tools to work efficiently for the large TREC databases. There are some obvious ways to decrease indexing, SVD, and retrieval time, and we discuss some of them below.

3.1.1 Pre-processing, SVD, and database creation

About 10% of the time is spent in unnecessary I/O translation (e.g., transposing large matrices), and this will be eliminated soon. About 60%-70% of the time is spent in the SVD. SVD algorithms get faster all the time. The sparse, iterative algorithm we now use is about 100 times faster than the method we used initially. There are the usual speed-memory tradeoffs in the SVD algorithms, so time can probably be decreased somewhat by using a different algorithm and more memory. Parallel algorithms will help a little, but probably only by a factor of 2. Finally, all calculations are now done in double precision, so both time and memory could be decreased by using single precision.
Preliminary experiments with smaller IR test collections suggest that this decrease in precision will not lead to numerical problems for the SVD algorithm. It is important to note that the pre-processing and SVD analyses are one-time-only costs for relatively stable domains. We have also found that at least as many new items as were originally scaled can be added to an existing SVD database without redoing the SVD scaling or seriously diminishing retrieval effectiveness.

3.1.2 Query construction and retrieval

Some calculations (e.g., scalings of various sorts) were done on the fly but could easily be precomputed. In addition, all calculations were done in floating point, and could be sped up using integer arithmetic. By dropping all very low vector values, sparse vectors could be constructed to take advantage of efficient methods currently used by vector systems. Query vectors were compared to every document. Document clustering could be used to reduce the number of comparisons. Query matching can also be improved tremendously by simply using more than one machine or parallel hardware. Parallel query matching produces much larger gains than any of the modifications discussed above. Using a 16,000 PE MasPar, with no attempt to optimize the data storage or sorting, we decreased the time required to match a 250-dimensional query vector against all document vectors and sort by a factor of 60 to 100.
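Two of these ideas - precomputing the per-document scalings so that matching reduces to a single matrix-vector product, and zeroing near-zero entries to obtain sparse vectors - are sketched below. The threshold value and the names are assumptions made for illustration, not measurements from these experiments.

    # Sketch of two possible speedups: precompute unit-length document vectors so
    # each query needs one matrix-vector product, and drop near-zero entries so
    # sparse machinery could be reused.  Threshold and names are assumptions.
    import numpy as np

    def precompute_unit_docs(doc_vecs):
        norms = np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        return doc_vecs / norms                     # done once, reused per query

    def match(unit_doc_vecs, q):
        return np.argsort(-(unit_doc_vecs @ (q / np.linalg.norm(q))))

    def sparsify(vectors, threshold=0.01):
        v = np.array(vectors, dtype=float)
        v[np.abs(v) < threshold] = 0.0              # candidate for sparse storage
        return v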
3.2 Accuracy

This section on accuracy is divided into two main parts - one examining the results of the two adhoc and two routing runs we submitted, and the other looking in detail at some failures of the LSI system. Compared to other systems, LSI performance was average on the adhoc topics and somewhat below average on the routing topics (though see the section on misses for a discussion of sizable improvements). Because there were so many differences between systems (tokenization, query construction, representation, matching, amount of human effort, etc.) it is difficult to isolate performance differences to specific, theoretically interesting components. For this reason, we focus on our own experimental results and failure analyses as a first step in understanding and improving performance.

3.2.1 LSI experiments

3.2.1.1 Adhoc - normalization experiment. We submitted results from two sets of adhoc queries. The two sets of adhoc query results differed only in how the similarities (cosines) from the 9 subcollections were combined to arrive at a single ranking. In one case, adhoc_topic cosine, we simply used the raw cosines from the different collections. In the other case, adhoc_topic normalized_cosine, we normalized the cosines within each subcollection before combining. The differences in accuracy between these two methods were not very large. The raw cosine method of combining was about 15% better overall than the normalized cosine method (2786 vs. 2469 relevant articles, and .1274 vs. .1100 11-pt precision). In general, some form of normalization is needed to correct for measurement artifacts (e.g., lower mean cosine in higher dimensions) when combining scores from many different subcollections. Since we used the same number of dimensions for comparisons in each subcollection, this correction was unnecessary in the present experiments. Normalization appeared to have some undesirable consequences for a few topics. Because normalization subtracts the mean cosine and divides by the standard deviation, the same raw cosine will have a higher normalized score if it comes from a subcollection with a low mean cosine and low variance - but these are precisely the subcollections that we probably want to avoid!

3.2.1.2 Routing - feedback experiment. For the routing queries, we created two filters or queries for each of the 50 training topics. In one case, routing_topic cosine, the routing query was based on just terms in the topic statements, as if it had been an adhoc query. In the other case, routing_reldocs cosine, we used feedback about relevant documents from the training set and located the filter at the vector sum of the relevant documents. Our intent is to use these two runs as baselines against which alternative methods for combining the original query and relevant documents can be compared. Somewhat surprisingly, the query using just the topic terms was about 25% more accurate than the feedback query (2234 vs. 1837 relevant articles; .1235 vs. .0972 11-pt precision). We suspect that part of the problem was attributable to the small number and inaccuracy of relevance judgements in the initial training set. This had substantial impact on performance for some topics because our feedback queries were based only on the relevant articles (and ignored the original topic description). For Topic 050, for example, there was only one relevant article, and it did not appear to us to be relevant to the topic - the topic was about "potential military interest in virtual reality", and the so-called relevant article was about "Denmark's crisis on nuclear power threatening its membership in NATO". Not surprisingly, using only this article as a query, no relevant articles about virtual reality were returned. Now that we have a larger number of hopefully more accurate relevance judgements, we will repeat this basic comparison. We will then use these two baseline runs to explore: a) combining the relevant documents and the original topic; b) selecting only some relevant documents and/or discriminating terms; and c) representing the query vector as several points of interest rather than a single average.

3.2.2 Failure analyses

In order to better understand retrieval performance we examined two kinds of retrieval failures: false alarms and misses. False alarms are documents that LSI ranks highly but that are judged to be irrelevant. Misses are relevant documents that are not in the top 200 returned by LSI. The observations presented below are based on preliminary analyses of some topics on which LSI performed poorly. Although we suggest methods for improving performance, most have not been tested systematically on the entire TREC collection, although we plan to do so.

3.2.2.1 False alarms. The most common reason for false alarms (accounting for approximately 50% of those we examined) was lack of specificity. These highly ranked but irrelevant articles were generally about the topic of interest but did not meet some of the restrictions described in the topic. Many topics required the kind of detailed processing or fact-finding that the LSI system was not designed to address. Precision of LSI matching can be increased by many of the standard techniques - proper noun identification, use of syntactic or statistically-derived phrases, or a two-pass approach involving a standard initial global matching followed by a more detailed analysis of the top few thousand documents.
Salton and Buckley (SMART's global and local matching), Evans (CLARIT's evoke and discriminate strategy), Nelson (ConQuest's global match followed by the use of locality of information), and Jacobs, Krupka and Rau (GE's pre-filter followed by a variety of more stringent tests) all used two-pass approaches to good advantage in the TREC tests. We intend to try some of these methods for TREC-2, and will focus on general-purpose, completely automatic methods that do not have to be modified for each new domain or query restriction.

Another common cause of false alarms appears to be inappropriate query pre-processing. The use of negation is the best example of this problem. About 20% of the TREC topics contained explicit negations. LSI included negated words in the query along with all the other words. Topic 094, about computer-aided crime, also stated that articles that simply mentioned the spread of a computer virus or worm were NOT relevant. The first 20 documents that LSI returned were all about computer viruses! Another example of inappropriate query processing involved the use of logical connectives. LSI does not handle Boolean combinations of words, and sometimes returned articles covering only a subset of ANDed topics.

Finally, it is not at all clear why about 20% of the false alarms were returned by LSI. Since LSI uses a statistically-derived "semantic" space and not surface-level word overlap for matching queries to documents, it is sometimes difficult to understand why a particular document was returned. One advantage of the LSI method is that documents can match queries even when they have no words in common; but this can also produce some spurious hits. Topic 066, about natural language processing technology, returned several articles about chip processing technologies and high technology products in general. Another reason for false alarms could be inappropriate word sense disambiguation. LSI queries were located at the weighted vector sum of the query words, so words were "disambiguated" to some extent by the other query words. Similarly, the initial SVD analysis used the context of other words in articles to determine the location for each word in the LSI space. However, since each word has only one location, a word with several distinct senses can end up "in the middle of nowhere". A related possibility concerns long articles. Lengthy articles which talk about many distinct subtopics were averaged into a single document vector, and this can sometimes produce spurious matches. Breaking larger documents into smaller subsections and matching on these might help.

3.2.2.2 Misses. For this analysis we examined a random subset of relevant articles that were not in the top 200 returned by LSI. Many of the relevant articles were fairly highly ranked by LSI, but there were also some notable failures that would be seen only by the most persistent readers. So far, we have not systematically distinguished between misses that "almost made it" and those that were much further down the list. About 40% of the misses we examined represent articles that were primarily about a different topic than the query, but contained a small section that was relevant to the query. Because documents are located at the average of their terms in LSI space, they will generally be near their dominant theme, and this is a desirable feature of the LSI representation. Some kind of local matching should help in identifying less central themes in documents.
Another 40% of the misses appear to be the result of inappropriate selection of subcollections. Recall that we analyzed 9 subcollections separately and combined the similarities later to arrive at a single ranked list. The different subcollections sometimes had different densities of documents on some topics. This is most evident when considering computer-related topics. For general collections like AP or WSJ, relatively few of the articles were about computers, and we suspect that few of the 235-250 dimensions in the LSI semantic space were devoted to distinguishing among such documents. Thus, the similarities of these documents to queries about computers were relatively high and undifferentiated. For the ZIFF collections, on the other hand, most of the LSI dimensions were used to represent differences among computer concepts. Similarities of the top few hundred articles to computer queries were lower on average, but much finer distinctions among subtopics were possible. One consequence of this was that, when combining across collections, few articles from the ZIFF subcollections were included for queries about computer-related topics! Different term weights in the different subcollections also contributed to this problem. Inappropriate subcollection selection accounts for many of LSI's failures on computer-related topics.

Consider, for example, Topic 037 about IBM's SAA standards. LSI performs very poorly compared to other systems on this topic, returning the fewest relevant articles (19) and having the lowest 11-pt precision (.0088). Summing over all systems for this topic, 68% of the total number of returned articles (554/812) and 99% of the relevant articles (444/449) were from ZIFF2. For LSI, however, only 17 of the top 200 articles (8%) were from ZIFF2; all 17 of these articles were relevant. When a comparable proportion of LSI documents were selected from ZIFF2, the number relevant increased to 175 and the 11-pt precision increased to .3532. This performance places LSI slightly above the median for Topic 037.

We performed the same analysis for all topics in which more than 33% of the total returned articles were from ZIFF. There were 26 such topics (19 adhoc topics from 026-050, and 7 routing topics). For these topics, the mean percent of ZIFF articles chosen by all systems was 59%, compared with 9% for LSI. When comparable proportions of ZIFF articles were selected for LSI, the average number of relevant documents increased from 26 to 58, and the 11-pt precision increased from .0554 to .1111. A total of 700 new relevant documents were found for the 19 adhoc topics, and 125 new relevant documents were found for the 7 routing topics. Performance improvements were observed for 23 of the 26 topics - two of the topics in which there were no improvements were about AT&T products, where poor pre-processing omitted AT&T from the LSI query (see below).

While these results are encouraging, the problem of how to select appropriate subcollections is not solved. For the routing topics, we could use training data to set some a priori mixture of articles from various subcollections. This strategy is not, however, generally applicable for adhoc queries. We will examine more appropriate ways of combining across subcollections to take distributional effects like this into account. Alternatively, we could use randomly selected documents (rather than topically organized ones) to create the subcollections. Finally, we could use a single large combined scaling in which there would be no need to combine across subcollections.
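The exact re-selection procedure used in the analysis above is not spelled out, so the following is only one simple way a per-subcollection quota could be imposed on a merged ranking; the function and its interface are illustrative assumptions.

    # One simple way to impose a target proportion of documents from a given
    # subcollection when forming a fixed-size return set.  Illustrative sketch,
    # not the procedure actually used in the analysis above.
    def select_with_quota(ranked, target_coll, proportion, n=200):
        # ranked: merged list of (collection, doc_id, score), best first.
        quota = int(round(proportion * n))
        from_target = [r for r in ranked if r[0] == target_coll][:quota]
        rest = [r for r in ranked if r[0] != target_coll][:n - len(from_target)]
        return sorted(from_target + rest, key=lambda r: r[2], reverse=True)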
Finally, some misses were attributable to poor text (and query) pre-processing and tokenization. Since we do not keep single letters or use a database of company and location names, several important acronyms like U.S. and AT&T disappeared completely from both articles and queries! Not surprisingly, this resulted in many missed documents. We noticed that many of the top performing automatic systems used SMART's pre-processing, and we hope to do so as well for TREC-2. This will allow us to better understand the usefulness of LSI per se, without many of the additional confounds introduced by different indexing.

3.3 Open experimental issues

The results of the failure analyses suggest several directions to pursue for TREC-2, including: improving pre-processing and tokenization; exploring some precision-enhancing methods; and developing methods for more effectively combining across subcollections. We also hope to explore three additional issues.

3.3.1 Separate vs. combined scaling

We used 9 separate subscalings for the TREC experiments. Our decision to use the 9 subcollections was largely a pragmatic one initially. We would like to create a single large scaling and compare the results with those obtained using the subcollections.

3.3.2 Centroid query vs. many separate points of interest

A single vector was used to represent each query. In some cases the vector was the average of terms in the topic statement, and in other cases the vector was the average of previously identified relevant documents. A single query vector can be inappropriate if interests are multifaceted and these facets are not near each other in the LSI space. In the case of the routing queries, for example, we could match new documents against each of the previously identified relevant documents separately rather than against their average. We have developed techniques that allow us to match using a controllable compromise between averaged and separate vectors (Kane-Esrig et al., 1991). We did not use this method for TREC, but would like to do so for TREC-2.
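A much-simplified sketch of the "separate points of interest" idea follows: each relevant document is kept as its own point and a new document is scored by its best cosine to any point. This is an illustrative variant only, not the relevance density method of Kane-Esrig et al. (1991), and the names are assumptions.

    # Simplified sketch of a multi-point interest profile: keep each relevant
    # document vector as a separate point and score a new document by its maximum
    # cosine to any point.  Illustrative only; not the relevance density method.
    import numpy as np

    def multipoint_score(new_doc_vec, interest_points):
        d = new_doc_vec / np.linalg.norm(new_doc_vec)
        P = np.asarray(interest_points, dtype=float)
        P = P / np.linalg.norm(P, axis=1, keepdims=True)
        return float((P @ d).max())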
3.3.3 Interactive interfaces

All LSI evaluations were conducted using a non-interactive system in essentially batch mode. It is well known that one can have the same underlying retrieval and matching engine, but achieve very different retrieval success using different interfaces. We would like to examine the performance of real users with interactive interfaces. A number of interface features could be used to help users make faster (and perhaps more accurate) relevance judgements, or to help them explicitly reformulate queries. (See Dumais and Schmitt, 1991, for some preliminary results on query reformulation and relevance feedback.) Another interesting possibility involves returning something richer than a rank-ordered list of documents to users. For example, a clustering and graphical display of the top-k documents might be quite useful. We have done some preliminary experiments using clustered return sets, and would like to extend them to the TREC-2 collection. The general idea is to provide people with useful interactive tools that let them make good use of their knowledge and skills, rather than attempting to build all the smarts into the database representation or matching components.

4. Onward to TREC-2

We were quite pleased that we were able to use many of the existing LSI/SVD tools on the TREC collection. The most important finding in this regard was that the large, sparse SVD problems could be computed without numerical or convergence problems. The computations were not fast, but a day of CPU on existing workstations (for each subcollection) is certainly feasible, especially given that these calculations are only done once at the beginning to create the database. We have already computed some larger SVDs (100k x 200k) and would like to take advantage of this for TREC-2.

In terms of accuracy, LSI performance was reasonable. There are some obvious ways to improve the initial pre-processing and indexing, and we will do so. We would like to use indexing methods that are as similar as possible to other automatic vector methods, so that we can examine the contribution of LSI per se. We also now have many of the basic tools in place and should be able to conduct more experiments comparing various indexing and query matching ideas using the same underlying LSI engine.

LSI was designed as a method to increase recall, especially for the short queries that users typically generate. For the TREC application we would like to explore some precision-enhancing tools as well. Some of these will probably consist of more refined matching algorithms, but we also hope to move in the direction of interactive interfaces. We see this as an effective way of combining human and machine intelligence.

5. References

[1] Berry, M. W. Large scale singular value computations. International Journal of Supercomputer Applications, 1992, 6(1), 13-49.
[2] Cullum, J. K. and Willoughby, R. A. Lanczos Algorithms for Large Symmetric Eigenvalue Computations - Vol. 1: Theory (Chapter 5: Real rectangular matrices). Birkhauser, Boston, 1985.
[3] Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6), 391-407.
[4] Dumais, S. T. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 1991, 23(2), 229-236.
[5] Dumais, S. T. and Schmitt, D. G. Iterative searching in an online database. In Proceedings of the Human Factors Society 35th Annual Meeting, 1991, 398-402.
[6] Foltz, P. W. and Dumais, S. T. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, Dec. 1992, 35(12), 51-60.
[7] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of SIGIR, 1988, 465-480.
[8] Kane-Esrig, Y., Streeter, L., Dumais, S. T., Keese, W. and Casella, G. The relevance density method for multi-topic queries in information retrieval. In Proceedings of the 23rd Symposium on the Interface, E. Keramidas (Ed.), 1991, 407-410.
[9] Salton, G. and McGill, M. J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

Appendix

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). We take a large term-document matrix and decompose it into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination.
More formally, any rectangular matrix X, for example a t x d matrix of terms and documents, can be decomposed into the product of three other matrices:

       X     =    T0     S0     D0'
    (t x d)    (t x r) (r x r) (r x d)

such that T0 and D0 have orthonormal columns, S0 is diagonal, and r is the rank of X. This is the so-called singular value decomposition of X. If only the k largest singular values of S0 are kept along with their corresponding columns in the T0 and D0 matrices, and the rest deleted (yielding matrices S, T and D), the resulting matrix X̂ is the unique matrix of rank k that is closest in the least squares sense to X:

       X     ≈     X̂    =     T      S      D'
    (t x d)     (t x d)    (t x k) (k x k) (k x d)

The idea is that the X̂ matrix, by containing only the first k independent linear components of X, captures the major associational structure in the matrix and throws out noise. It is this reduced model, usually with k ≈ 100, that we use to approximate the term-to-document association data in X. Since the number of dimensions in the reduced model (k) is much smaller than the number of unique terms (t), minor differences in terminology are ignored. In this reduced model, the closeness of documents is determined by the overall pattern of term usage, so documents can be near each other regardless of the precise words that are used to describe them, and their description depends on a kind of consensus of their term meanings, thus dampening the effects of polysemy. In particular, this means that documents which share no words with a user's query may still be near it if that is consistent with the major patterns of word usage. We use the term "semantic" indexing to describe our method because the reduced SVD representation captures the major associative relationships between terms and documents.

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a k-dimensional vector representing the location of each term and document in the k-dimensional representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query. We make no attempt to interpret the underlying dimensions or factors, nor to rotate them to some intuitively meaningful orientation. The analysis does not require us to be able to describe the factors verbally, but merely to be able to represent terms, documents and queries in a way that escapes the unreliability, ambiguity and redundancy of individual terms as descriptors.

Choosing the appropriate number of dimensions for the LSI representation is an open research question. Ideally, we want a value of k that is large enough to fit all the real structure in the data, but small enough so that we do not also fit the sampling error or unimportant details. If too many dimensions are used, the method begins to approximate standard vector methods and loses its power to represent the similarity between words. If too few dimensions are used, there is not enough discrimination among similar words and documents. We typically find that performance improves as k increases for a while, and then decreases (Dumais, 1991).
That LSI typically works well with a relatively small (compared to the number of unique terms) number of dimensions shows that these dimensions are, in fact, capturing a major portion of the meaningful structure.
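As a small numerical check of the appendix identities (purely illustrative; the matrix here is random, not TREC data), the sketch below verifies that X = T0 S0 D0' holds exactly and that the truncated T S D' is the best rank-k approximation in the least-squares sense, with squared error equal to the sum of the discarded squared singular values.

    # Numerical check of the appendix identities on a small random matrix.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.random((12, 8))                        # a small t x d matrix

    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    assert np.allclose(X, T0 @ np.diag(s0) @ D0t)  # X = T0 S0 D0' exactly

    k = 3
    X_hat = T0[:, :k] @ np.diag(s0[:k]) @ D0t[:k, :]
    # Least-squares optimality: the squared Frobenius error of the rank-k
    # truncation equals the sum of the squared discarded singular values.
    err = np.linalg.norm(X - X_hat, "fro") ** 2
    assert np.isclose(err, np.sum(s0[k:] ** 2))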