Latent Semantic Indexing (LSI) and TREC-2

Susan T. Dumais
Bellcore
445 South St.
Morristown, NJ 07960
std@bellcore.com

1. Overview of Latent Semantic Indexing

Latent Semantic Indexing (LSI) is an extension of the vector retrieval method (e.g., Salton & McGill, 1983) in which the dependencies between terms are explicitly taken into account in the representation and exploited in retrieval. This is done by simultaneously modeling all the interrelationships among terms and documents. We assume that there is some underlying or "latent" structure in the pattern of word usage across documents, and use statistical techniques to estimate this latent structure. A description of terms, documents and user queries based on the underlying "latent semantic" structure (rather than surface level word choice) is used for representing and retrieving information. One advantage of the LSI representation is that a query can be very similar to a document even when they share no words.

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985), to model the associative relationships. A large term-document matrix is decomposed into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. Instead of representing documents and queries directly as sets of independent words, LSI represents them as continuous values on each of the k orthogonal indexing dimensions. Since the number of factors or dimensions is much smaller than the number of unique terms, words will not be independent. For example, if two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. The SVD technique can capture such structure better than simple term-term or document-document correlations and clusters. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. LSI is a completely automatic method. The Appendix provides a brief overview of the mathematics underlying the LSI/SVD method. (Deerwester et al., 1990, and Furnas et al., 1988 present additional mathematical details and examples.)

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a vector representing the location of each term and document in the k-dimensional LSI representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Retrieval typically proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query. However, since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained.

The LSI method has been applied to many of the standard IR collections with favorable results. Using the same tokenization and term weightings, the LSI method has equaled or outperformed standard vector methods and other variants in almost every case, and was as much as 30% better in some cases (Deerwester et al., 1990). As with the standard vector method, differential term weighting and relevance feedback both improve LSI performance substantially (Dumais, 1991).
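To make the reduced-dimension representation concrete, the following minimal sketch (not the code used in these experiments; the toy matrix, vocabulary, and value of k are invented for illustration) decomposes a small term-document matrix with an SVD, keeps k factors, and ranks documents against a one-word query by cosine similarity in the reduced space.

```python
# Illustrative sketch of LSI retrieval with a truncated SVD (toy data, not the authors' system).
import numpy as np

# toy term-document matrix: rows = terms, columns = documents (counts are invented)
X = np.array([
    [2, 0, 1, 0],   # "car"
    [1, 0, 2, 0],   # "automobile"
    [0, 3, 0, 1],   # "stock"
    [0, 2, 0, 2],   # "market"
    [1, 1, 1, 1],   # "report"
], dtype=float)

k = 2                                          # retained factors (100-300 in the paper)
T, s, Dt = np.linalg.svd(X, full_matrices=False)
T_k, s_k, D_k = T[:, :k], s[:k], Dt[:k, :].T   # term vectors, singular values, doc vectors

# in-sample documents are represented by rows of D_k scaled by the singular values,
# which equals the (weighted) sum of their term vectors: X.T @ T_k
doc_vecs = D_k * s_k

# a query is folded in the same way, as the sum of its term vectors
query_terms = np.array([1, 0, 0, 0, 0], dtype=float)   # the single word "car"
q_vec = query_terms @ T_k

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# rank all documents by similarity; a document using only "automobile" can still
# score well against the query "car" because the two terms occur in similar contexts
ranking = sorted(range(doc_vecs.shape[0]),
                 key=lambda d: cosine(q_vec, doc_vecs[d]), reverse=True)
print(ranking)
```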
LSI has also been applied in experiments on relevance feedback (Dumais and Schmitt, 1991), and in filtering applications (Foltz and Dumais, 1992). The recent MatchPlus system described by Gallant et al. (1992) is related to LSI. Both systems model the relationships between terms by looking at the similarity of the contexts in which words are used, and exploit these associations to improve retrieval. Both systems use a reduced-dimension vector representation, but differ in how the term, document and query vectors are formed.

2. LSI and TREC-1

We used the TREC-1 conference as an opportunity to "scale up" our tools, and to explore the LSI dimension-reduction ideas using a very rich corpus of word usage. We were pleased that we were able to use many of the existing LSI/SVD tools on the large collection. (See Dumais, 1993 for details.) We were able to compute the SVDs of 50k docs x 75k words matrices without numerical or convergence problems on a standard DEC 5000 or Sparc 10 workstation. Because of these limits on the size of the matrices we could handle, we divided the TREC-1 documents into 9 separate subcollections (AP1, DOE1, FR1, WSJ1, ZIFF1, AP2, FR2, WSJ2, ZIFF2) and worked with these. There are also some theoretical reasons why working with subcollections makes sense. By using topically coherent subcollections one can get better discrimination among documents. When the ZIFF subcollection is analyzed separately, for example, 200 or so dimensions can be devoted to differences among computer-related topics. When these same documents are part of a large corpus, many fewer indexing dimensions will be devoted to discriminating among them.

In terms of accuracy, LSI performance in TREC-1 was about average. There were some obvious problems with our initial pre-processing of documents (e.g., "U.S." and "A.T.T." were omitted), and there were some unanticipated problems in combining across subcollections. Since many of the top performing automatic systems used SMART's preprocessing, we chose to do so as well for TREC-2. In addition, using the SMART software for this purpose allows us to compare LSI with the comparable vector method, so that we can examine the contribution of LSI per se. We also planned on comparing an LSI analysis using subcollections (from TREC-1) with an LSI analysis of the entire collection for TREC-2.

We had high hopes of being able to build on our TREC-1 work for TREC-2. In practice, however, the changes in the pre-processing algorithm, and the decision to use a single combined LSI analysis, resulted in our "starting from scratch" in many respects. We have completed some experiments for TREC-2, but we did not get as far as we would have liked, especially for the adhoc topics.

3. LSI and TREC-2

3.1 Pre-processing

We used the SMART system for pre-processing the documents and queries. (The SMART system, version 11.0, was made available through the SMART group at Cornell University. Chris Buckley was especially generous in consultations about how to get the software to do somewhat non-standard things.) Some markups (e.g. <> delimiters) were removed, and all hand-indexed entries were removed from the WSJ and ZIFF collections. Upper case characters were translated into lower case, punctuation was removed, and white spaces were used to delimit terms. The SMART stop list of 571 words was used as is. The SMART stemmer (a modified Lovins algorithm) was used without modification to strip word endings. We did not use: phrases, proper noun identification, word sense disambiguation, a thesaurus, syntactic or semantic parsing, spelling checking or correction, complex tokenizers, a controlled vocabulary, or any manual indexing.
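The sketch below illustrates the general shape of this kind of pre-processing, but it is only a stand-in: the stop list and the crude_stem function are invented here for illustration and are not the SMART stop list or the modified Lovins stemmer actually used.

```python
# Simplified illustration of the pre-processing described above (NOT the SMART system):
# lower-casing, punctuation removal, whitespace tokenization, stop-word removal, and a
# crude suffix-stripping stand-in for a stemmer.
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "that"}  # tiny stand-in list

def crude_stem(word: str) -> str:
    """Very rough suffix stripping; the experiments used a modified Lovins stemmer instead."""
    for suffix in ("ations", "ation", "ings", "ing", "ies", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text: str) -> list[str]:
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)          # strip punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [crude_stem(t) for t in tokens]

doc = "Production of fiber optics equipment is expanding in the equipment markets."
print(Counter(tokenize(doc)))   # term frequencies for one document
```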
The result of this pre-processing can be thought of as a term-document matrix, in which each cell entry indicates the frequency with which a term appears in a document. The entries in the term-document matrix were then transformed using an "ltc" weighting. The "ltc" weighting takes the log of individual cell entries, multiplies each entry for a term (row) by the IDF weight of the term, and then normalizes the document (column) length.

We began by processing the 742358 documents from CD-1 and CD-2. Using the minimal pre-processing described above, there were 960765 unique tokens, 512251 unique stems, and 81901331 non-zero entries in the term-document matrix. 742331 documents contained at least one term. To decrease the matrix to a size we thought we could handle, we removed tokens occurring in fewer than 5 documents. This resulted in 781421 unique tokens, 104533 unique stemmed words, and 81252681 non-zero entries. We used the resulting 742331 document x 104533 term matrix as the starting point for results reported in this paper. The "ltc" weights were computed on this matrix.

3.2 SVD analysis

The "ltc" matrix described above was used as input to the SVD algorithm. The SVD program takes the "ltc" transformed term-document matrix as input, and calculates the best "reduced-dimension" approximation to this matrix. The result of the SVD analysis is a reduced-dimension vector for each term and each document, and a vector of the singular values. The number of dimensions, k, was between 200 and 300 in our experiments. This reduced-dimensional representation is used for retrieval. The cosine between term-term, document-document, or term-document vectors is used as the measure of similarity between them.

We were recently able to compute the SVD analysis of the full 742k x 104k matrix described above, but not in time to include the results in this paper. For the runs submitted, we used a sample of documents from the above matrix. When appropriate, the documents that were not sampled were "folded in" to the resulting reduced-dimension LSI space. In all cases, we used the weights from the 742k x 104k matrix (and did not recompute them for our samples).

For the routing experiments, we used the subset of documents for which we had relevance judgements. There were 68385 unique documents with relevance judgements. The SVD analysis was computed on the relevant 68385 document x 88112 term subset of the above matrix, containing 14461782 non-zero cells. A 204-dimensional approximation took 18 hrs of CPU time to compute on a Sparc 10 workstation. This 204-dimensional representation was used for matching and retrieval.

For the adhoc experiments, we took a random sample of 70000 documents. A reduced-dimensional SVD approximation was computed on a 69997 document x 82968 term matrix (7666044 non-zeros). The resulting 199-dimensional representation was used for retrieval. The 672331 documents not included in this sample were "folded in" to the 199-dimension LSI space, and the adhoc queries were compared against all 742k documents. These "folded in" documents were located at the weighted vector sum of their constituent terms. That is, the vector for a new document was computed using the term vectors for all terms in the document.
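The following short sketch illustrates an "ltc"-style weighting as described above: a log transform of the raw cell entries, an IDF factor per term row, and length normalization of each document column. The exact SMART formula may differ in detail, and the ltc_weight helper and toy counts are invented for illustration.

```python
# Hedged sketch of an "ltc"-style weighting on a term x document count matrix.
import numpy as np

def ltc_weight(counts: np.ndarray) -> np.ndarray:
    """counts: terms x documents matrix of raw frequencies."""
    n_docs = counts.shape[1]
    df = np.count_nonzero(counts, axis=1)                 # document frequency of each term
    idf = np.log(n_docs / np.maximum(df, 1))              # "t": IDF factor per term (row)
    logged = np.zeros_like(counts, dtype=float)
    nz = counts > 0
    logged[nz] = 1.0 + np.log(counts[nz])                 # "l": log of individual cell entries
    weighted = logged * idf[:, np.newaxis]                # scale each term row by its IDF
    norms = np.linalg.norm(weighted, axis=0)              # "c": normalize each document column
    return weighted / np.maximum(norms, 1e-12)

counts = np.array([[2., 0., 1.],
                   [0., 3., 0.],
                   [1., 1., 4.]])
print(np.round(ltc_weight(counts), 3))
```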
For documents that are actually present in the term-document matrix, this derived vector corresponds exactly to the document vector given by the SVD. New terms can be added in an analogous fashion. The vector for new terms is computed using the document vectors of all documents in which the term appears. When adding documents and terms in this manner, we assume that the derived "semantic space" is fixed and that new items can be fit into it. In general, this is not the same space that one would obtain if a new SVD were calculated using both the original and new documents. In previous experiments, we found that sampling and scaling 50% of the documents, and "folding in" the remaining documents, resulted in performance that was indistinguishable from that observed when all documents were scaled. Here, however, the scaling is based on less than 10% of the total corpus. We also had (from TREC-1) LSI analyses of the 9 subcollections in CD-1 and CD-2.

3.3 Queries and retrieval

Queries were automatically processed in the same way as documents. For queries derived from the topic statement, we began with the full text of each topic (all topic fields), and stripped out the SGML field identifiers. For feedback queries, we used the full text of relevant documents. A query vector (or new document in the case of routing) indicating the frequency with which each term appears in the query was automatically generated for each topic. The query was transformed using SMART's "ltc" weighting. Note that we did not use any Boolean connectors or proximity operators in query formulation. (The implicit connectives, as in ordinary vector methods, fall somewhere between ORs and ANDs, but with an additional kind of "fuzziness" introduced by the dimension-reduced association matrix representation of terms and documents.)

The terms in the query are used to identify a vector in the LSI space; recall that each term has a vector representation in the space. A query is simply located at the weighted vector sum of its constituent term vectors. The cosine between the query vector and every document vector is computed, and documents are ranked in decreasing order of similarity to the query. (Although there are many fewer dimensions than in standard vector retrieval, the entries are almost all non-zero, so inverted indices are not useful. This means that each query must be compared to every document.) For 200-dimensional vectors, about 60,000 cosines can be computed per minute on a Sparc 2. This time includes both comparison time and ranking time, and assumes that all document vectors are pre-loaded into memory. For the adhoc queries, the time to compare a query to the 743k documents was about 10 minutes if all comparisons were sequential. It is, however, straightforward to split this matching across several machines or to use parallel hardware since all documents are independent. Preliminary experiments using a 16,000 PE MasPar showed that 60,000 cosines could be computed and sorted in less than 1 second.

It is important to note that all steps in the LSI analysis are completely automatic and involve no human intervention. Documents are automatically processed to derive a term-document matrix. This matrix is decomposed by the SVD software, and the resulting reduced-dimension representation is used for retrieval. While the SVD analysis is somewhat costly in terms of time for large collections, it needs to be computed only once at the beginning to create the reduced-dimension database. (The SVD takes only about 2 minutes on a Sparc 10 for a 2k x 5k matrix, but this time increases to about 18 hours for a 60k x 80k matrix.)
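To make the folding-in and query matching described in this section concrete, here is a hedged sketch: a new document (or query) is placed at the weighted vector sum of the vectors of the terms it contains, and similarity is measured by the cosine. The term vectors here are random stand-ins for the SVD-derived vectors, and the fold_in helper name and sizes are invented for illustration.

```python
# Illustrative sketch of "folding in" new items into an existing LSI space (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)
n_terms, k = 1000, 200
term_vecs = rng.normal(size=(n_terms, k))     # stand-in for the SVD-derived term vectors

def fold_in(term_weights: np.ndarray, term_vecs: np.ndarray) -> np.ndarray:
    """Locate a new document (or query) at the weighted sum of its term vectors."""
    return term_weights @ term_vecs           # (n_terms,) @ (n_terms, k) -> (k,)

# a new document, represented by weighted term frequencies over the existing vocabulary;
# terms outside that vocabulary are simply ignored
new_doc_weights = np.zeros(n_terms)
new_doc_weights[[3, 17, 256]] = [0.7, 0.5, 0.2]

query_weights = np.zeros(n_terms)
query_weights[[3, 42]] = [1.0, 1.0]

d = fold_in(new_doc_weights, term_vecs)
q = fold_in(query_weights, term_vecs)
sim = d @ q / (np.linalg.norm(d) * np.linalg.norm(q))   # cosine similarity in the LSI space
print(round(float(sim), 3))
```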
3.4 TREC-2: Routing experiments

For the routing queries, we created a filter or profile for each of the 50 training topics. We submitted results from two sets of routing queries. In one case, the filter was based on just the topic statements - i.e., we treated the routing queries as if they were adhoc queries. The filter was located at the vector sum of the terms in the topic. We call these the routing-topic (lsir1) results. This method makes no use of the training data, representing the topic as if it was an adhoc query. In the other case, we used information about relevant documents from the training set. The filter in this case was derived by taking the vector sum or centroid of all relevant documents. We call these the routing-reldocs (lsir2) results. There were an average of 328 relevant documents per topic, with a range of 40 to 896. This is a somewhat unusual variant of relevance feedback; we replace (rather than combine) the original topic with relevant documents, and we do not downweight terms that appear in non-relevant documents. These two extremes provide baselines against which to compare other methods for combining information from the original query and feedback about relevant documents.

In both cases, the filter was a single vector. New documents were matched against the filter vector and ranked in decreasing order of similarity. The new documents (336306 documents from CD-3) were automatically processed as described in section 3.2 above. It is important to note that only terms from the CD-1 and CD-2 training collection were used in indexing these documents. Each new document is located at the weighted vector sum of its constituent term vectors in the 204-dimension LSI space (in just the same way as queries are handled). New documents were compared to each of the 50 routing filter vectors using a cosine similarity measure in 204 dimensions. The 1000 best matching documents for each filter were submitted to NIST for evaluation.

3.4.1 Results

The main results of the lsir1 and lsir2 runs are shown in Table 1. The two runs differ only in how the profile vectors were created - using the weighted average of the words in the topic statement for lsir1 (routing-topic), and using the weighted average of all relevant documents from the training collection (CD-1 and CD-2) for lsir2 (routing-reldocs). Not surprisingly, the lsir2 profile vectors which take advantage of the known relevant documents do better than the lsir1 profile vectors that simply use the topic statement on all measures of performance. The improvement in average precision is 31% (.2622 vs. .3442). Users would get an average of 1 additional relevant document in the top 10 returned using the lsir2 method for filtering.

Table 1
              lsir1         lsir2        r1+r2
              (topic wds)   (rel docs)   (sum r1 r2)
Rel_ret       6522          7155         7367
Avg prec      .2622         .3442        .3457
Pr at 100     .3799         .4524        .4394
Pr at 10      .5480         .6660        .6620
R-prec        .3050         .3804        .3786
Q >= Median   27 (4)        40 (9)       42 (6)
Q < Median    23 (0)        10 (0)        8 (0)

Table 1: LSI Routing Results. Comparison of topic words vs. relevant documents as routing filters.

Compared to other TREC-2 systems, LSI does reasonably well, especially for the routing-reldocs (lsir2) run (and the r1+r2 run to be discussed below). In the case of lsir2, LSI is at or above the median performance for 40 of the 50 topics, and has the best score for 9 topics.
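The sketch below shows schematically how the three routing profiles compared here could be formed, assuming each topic and each training document already has a vector in the k-dimensional LSI space: lsir1 uses the topic vector alone, lsir2 the centroid of the known relevant documents, and r1+r2 their equal-weight sum. The helper names, the random stand-in vectors, and details such as whether components are normalized before summing are assumptions for illustration only.

```python
# Schematic sketch of routing profile construction and filtering in an LSI space (toy data).
import numpy as np

def routing_profiles(topic_vec: np.ndarray, rel_doc_vecs: np.ndarray) -> dict:
    centroid = rel_doc_vecs.mean(axis=0)          # lsir2: centroid of relevant documents
    return {
        "lsir1": topic_vec,                       # topic statement only
        "lsir2": centroid,                        # relevant documents only
        "r1+r2": topic_vec + centroid,            # simple equal-weight combination
    }

def rank_new_documents(profile: np.ndarray, new_doc_vecs: np.ndarray) -> np.ndarray:
    """Rank incoming documents by cosine similarity to a routing profile."""
    sims = new_doc_vecs @ profile / (
        np.linalg.norm(new_doc_vecs, axis=1) * np.linalg.norm(profile) + 1e-12)
    return np.argsort(-sims)

rng = np.random.default_rng(1)
k = 204
topic = rng.normal(size=k)
rel_docs = rng.normal(size=(40, k))               # e.g., 40 known relevant documents
new_docs = rng.normal(size=(500, k))              # incoming documents to be filtered
for name, prof in routing_profiles(topic, rel_docs).items():
    print(name, rank_new_documents(prof, new_docs)[:5])
```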
LSI performs about average for the routing-topic (lsir1) run even though no information from the training set was used in forming the routing vectors in this case (except, of course, for the global term weights). We have also performed similar comparisons between the original query and the centroid of all relevant documents for some of the standard IR test collections (Med, CISI, Cranfield, CACM, Time). In these cases, we found an average improvement of 107% when the query was replaced by the centroid of all relevant documents. The improvement was 67% when the top three relevant documents were used, and 33% when just the first relevant document was used. The smaller advantages observed in TREC-2 are partially due to statistical artifacts, and partially to the TREC topics, which are much richer need statements than the usual IR queries. (We also examined topic and reldocs profiles in TREC-1. Somewhat surprisingly, the query using just the topic terms was about 25% more accurate than the query using relevant documents from training. This is attributable to the small number and inaccuracy of relevance judgements in the initial training set for TREC-1. This had substantial impact on performance for some topics because our reldocs queries were based only on the relevant articles and ignored the original topic description.)

The lsir1 and lsir2 runs provide baselines against which various combinations of query information and relevant document information can be measured. We have tried a simple combination of the lsir1 and lsir2 profile vectors, in which both components have equal weight. That is, we took the sum of the lsir1 and lsir2 profile vectors for each of the topics and used this as a profile vector. The results of this analysis are shown in the third column of the table, labeled r1+r2. This combination does somewhat better than the centroid of the relevant documents in the total number of relevant documents returned and in average precision. (We returned fewer than 1000 documents for 5 of the topics and not all documents returned by the r1+r2 method had been judged for relevance, so we suspect that performance could be improved a bit more.) For 27 of the topics, r1+r2 was better than the maximum of the other two methods. It was never more than about 10% worse than the best method. Thus it appears that this combination takes advantage of the best of both methods.

The r1+r2 method, which combines a query vector with a vector representing the centroid of all relevant documents, is a kind of relevance feedback. This is an unusual variant of relevance feedback since all the words in relevant documents are used, words in non-relevant documents are not down-weighted, and query terms are not re-weighted. Interestingly, this method appears to produce improvements that are comparable to those obtained by Buckley, Allan and Salton (1993) using more traditional relevance feedback methods. Average precision for the r1+r2 method is 31% better than for lsir1 which used only the topic words (.3457 vs. .2622), and this is quite similar to the 38% improvement reported by Buckley, Allan and Salton (1993) for their richest routing query expansion method.

The lsir2 method is generally better than the lsir1 method, but there is substantial variability across topics. The topics on which there are the largest differences are generally those in which the cosine between the lsir1 and lsir2 topic vectors is smallest. The cosines between corresponding topic vectors range from .87 to .54.
The lsir2 method is substantially better on topics: 71 (incursions by foreign military or guerrilla groups), 73 (movement of people from one country to another), 87 (criminal actions against officers of failed financial institution), 94 (crime perpetrated with the aid of a computer), and 98 (production of fiber optics equipment). There are a few topics for which lsir1 is substantially better than lsir2: 63 (machine translation system), 65 (information retrieval system), 85 (actions against corrupt public officials), and 95 (computer application to crime solving). It is not entirely clear what distinguishes between these topics, especially topics 94 and 95, for example.

We have not yet had time to look in detail at the failures of the LSI system. We will examine both misses and false alarms in more detail. A preliminary examination of a few topics suggests that lack of specificity is the main reason for false alarms (highly ranked but irrelevant documents). This is not surprising because LSI was designed as a recall-enhancing method, and we have not added precision-enhancing tools, although it would be easy to do so.

We would also like to examine some query splitting ideas. We have previously conducted experiments which suggest that performance can be improved if the filter is represented as several separate vectors. We did not use this method for the TREC-2 results we submitted, but would like to do so. (See also Kane-Esrig et al., 1991 or Foltz and Dumais, 1992, for a discussion of multi-point interest profiles in LSI.)

3.5 TREC-2: Adhoc experiments

We submitted two sets of adhoc queries - lsiasm and lsia1. We had intended to compare the new SMART pre-processing (lsiasm) and a single LSI space (lsia1) with our old TREC-1 pre-processing and 9 separate subcollection spaces. Unfortunately, there were some serious errors in our translation between internal document numbers and the <DOCNO> labels (documents not in the LSI scaling were mislabeled), so the lsia1 run results are incomplete and misleading. We have corrected this translation problem, and the corrected results are labeled lsia1*. These results are summarized in Table 2. We have not yet completed the comparison against the 9 separate subspaces from TREC-1.

Table 2
              lsiasm    lsia1      lsia1*
                        (error)    (correct)
Rel_ret       7869      4756       6987
Avg prec      .3018     .1307      .2505
Pr at 100     .4306     .2664      .3922
Pr at 10      .5020     .3340      .5100
R-prec        .3580     .1937      .3069
Q >= Median   37 (2)    16 (1)     25 (1)
Q < Median    13 (0)    34 (7)     25 (0)

Table 2: LSI Adhoc Results. Comparison of standard vector method with LSI (corrected version, but missing relevance judgements).

In terms of absolute levels of performance, both lsiasm and lsia1* are about average. The SMART results (lsiasm) are somewhat worse than the TREC-2 SMART results reported by Buckley et al., Fuhr et al., or Voorhees, but this is because we used slightly different pre-processing options and did not include phrases. Although it is generally difficult to compare across systems, the SMART (lsiasm) and LSI (lsia1*) runs can meaningfully be compared since both use the same pre-processing.
The starting term-document matrix was the same in both cases. Much to our disappointment, the reduced-dimension LSI performance appears to be somewhat worse than the comparable SMART vector method. However, it is important to realize that many of the documents returned by lsia1* were not judged for relevance because they were not submitted as an official run. Table 3 shows the number of documents for which there are no judgements.

Table 3
              lsiasm    lsia1*    lsiasm     lsia1*
              top 100   top 100   top 1000   top 1000
relevant      2153      2122      7869       6987
not-relevant  2847      1961      15559      12230
not-judged    0         927       26572      30694

Table 3: Summary of missing relevance judgements for standard vector method and LSI.

Consider the results for just the top 100 documents for each query (i.e., the documents judged by the NIST assessors). For lsiasm, all 5000 documents were judged since this was an official run, and 2153 were relevant. For lsia1*, only 4073 documents were judged and almost as many, 2122, were relevant. Thus, if only 31 of the 927 unjudged lsia1* documents are relevant, LSI performance would be comparable to SMART performance, and if more than 31 were relevant, LSI performance would be somewhat better. Similarly for the top 1000 documents, lsia1* had more than 4000 more documents without relevance judgements than did lsiasm.

Because the missing relevance judgements make direct comparisons between SMART and LSI difficult, we decided to look at performance for just the documents for which we had relevance judgements. That is, we looked at performance considering just the 38175 unique documents for which we have adhoc relevance judgements. These results are shown in Table 4.

Table 4
              lsiasm    lsia1*
              38175     38175
Rel_ret       9493      9596
Avg prec      .3700     .3789
Pr at 100     .4306     .4466
Pr at 10      .5020     .5220
R-prec        .3977     .3995

Table 4: LSI Adhoc Results. Comparison of standard vector method with LSI using only documents for which relevance judgements were available.

The most striking aspect of these results is the higher overall levels of performance. This is to be expected since we are only considering the 38175 documents for which we have relevance judgements, and there are 700k fewer documents than in the official results. Considering only this subset of documents, there is a small advantage for LSI compared to the SMART vector method. Taken together with the results for just the top 100 documents, these results suggest that LSI can outperform a straightforward vector method.

We were somewhat disappointed at the relatively small difference between LSI and a comparable vector method in the TREC environment, given that we have consistently observed larger advantages previously. The most likely reason for this is that the TREC topics are much richer and more detailed descriptions of searchers' needs than are found in typical IR requests. The average TREC query has 51 unique words in it, and many of these are very specific names. Since LSI is primarily a recall enhancing method, it has little effect on these already very rich queries. This is much the same conclusion that groups who tried various kinds of query expansion (another recall enhancing method) reached - e.g., see Voorhees 1993.

We tried one analysis using the new "summary" field for each topic. This field alone is used as a much shorter topic statement (often quite similar to the description field) that covers the relevance assessments. These results are summarized in Table 5. As expected, overall performance is lower than with the complete topics. More interestingly, the difference between LSI and the standard vector method is now larger - 16% in average precision. This is still a somewhat smaller advantage than we have seen in previous experiments with smaller test collections, but even the summary queries have an average of 11 unique words in them.

Table 5
              lsiasm    lsia1*
              38175     38175
Rel_ret       8043      8676
Avg prec      .2589     .3008
Pr at 100     .3344     .3710
Pr at 10      .3420     .3680
R-prec        .3114     .3386

Table 5: LSI Adhoc Results - Summary Topics. Comparison of standard vector method with LSI using only documents for which relevance judgements were available, using only the Summary field.

The reduced-dimension LSI vector retrieval method can offer performance advantages compared to a standard vector method for the large TREC collection. The advantages are larger with shorter queries, as expected. The exact nature of this advantage (e.g., which documents are retrieved by LSI but not the standard vector method) needs to be examined in more detail.
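The sketch below illustrates the "judged documents only" style of analysis behind Tables 4 and 5: unjudged documents are removed from a ranking before precision measures are computed. It is an illustrative re-implementation with invented helper names and toy data, not the official evaluation software.

```python
# Toy illustration of restricting a ranking to judged documents before scoring.
def restrict_to_judged(ranking, judgements):
    """Keep only documents that appear in the (incomplete) judgement set."""
    return [d for d in ranking if d in judgements]

def precision_at(ranking, judgements, cutoff):
    """Precision at a cutoff over a ranking already restricted to judged documents."""
    top = ranking[:cutoff]
    return sum(judgements[d] for d in top) / float(cutoff)

# a ranked list of document ids and (incomplete) binary relevance judgements
ranking = ["d7", "d2", "d9", "d4", "d1", "d5", "d3"]
judgements = {"d2": 1, "d4": 0, "d1": 1, "d5": 0, "d3": 1}   # d7 and d9 were never judged

judged_only = restrict_to_judged(ranking, judgements)
print(judged_only)                                  # ['d2', 'd4', 'd1', 'd5', 'd3']
print(precision_at(judged_only, judgements, 3))     # 2 relevant in the top 3 judged docs
```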
4. Improving performance

4.1 Improving performance - Speed

The LSI/SVD system was built as a research prototype to investigate many different information retrieval and interface issues. Retrieval efficiency was not a central concern, because we first wanted to assess whether the method worked before worrying about efficiency, and because the initial applications of LSI involved much smaller databases of a few thousand documents. Almost no effort went into re-designing the tools to work efficiently for the large TREC databases.

4.1.1 SVD

SVD algorithms get faster all the time. The sparse, iterative algorithm we now use is about 100 times faster than the method we used initially (Berry, 1992). There are the usual speed-memory tradeoffs in the SVD algorithms, so time can probably be decreased some by using a different algorithm and more memory. Parallel algorithms will help a little, but probably only by a factor of 2. Finally, all calculations are now done in double precision, so both time and memory could be decreased by using single precision computations. Preliminary experiments with smaller IR test collections suggest that this decrease in precision will not lead to numerical problems for the SVD algorithm. It is important to note that the pre-processing and SVD analyses are one-time-only costs for relatively stable domains.

4.1.2 Retrieval

Query vectors are compared to every document. This process is linear in the number of documents in the database, and can be quite slow for large databases. Although there are no practical and efficient algorithms for finding nearest neighbors in 200- or 300-dimensional spaces, there are several methods which could be used to speed retrieval.

a) Document clustering could be used to reduce the number of comparisons, but accuracy would probably suffer some. We have explored several heuristics for clustering, but none are particularly effective when high levels of accuracy are maintained.

b) HNC's MatchPlus system (Gallant et al., 1993) uses another approach to reduce the number of alternative documents that must be matched. They use an initial keyword match to eliminate many documents and calculate reduced-dimension vector scores for only the subset of documents meeting the initial keyword restriction. This may be a reasonable alternative for long queries (like TREC), but we believe that recall would be reduced for short queries. In addition, two data structures need to be maintained.

c) Query matching can also be improved tremendously by simply using more than one machine or parallel hardware. Using a 16,000 PE MasPar, with no attempt to optimize the data storage or sorting, we decreased the time required to match a 200-dimensional query vector against all document vectors and sort by a factor of 60 to 100.
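As an illustration of the brute-force matching discussed in 4.1.2, the sketch below scores one query against every document with a single matrix-vector product over pre-normalized document vectors, which is linear in the number of documents and easy to split across machines by partitioning the document matrix. The sizes and the top_n helper are invented for illustration, and this is not the code used in the experiments.

```python
# Sketch of exhaustive cosine matching over pre-normalized LSI document vectors (toy sizes).
import numpy as np

rng = np.random.default_rng(2)
n_docs, k = 50_000, 200
doc_mat = rng.normal(size=(n_docs, k))
doc_mat /= np.linalg.norm(doc_mat, axis=1, keepdims=True)    # normalize once, up front

def top_n(query_vec: np.ndarray, doc_mat: np.ndarray, n: int = 1000) -> np.ndarray:
    q = query_vec / np.linalg.norm(query_vec)
    scores = doc_mat @ q                          # one cosine per document (linear scan)
    top = np.argpartition(-scores, n)[:n]         # unordered top n
    return top[np.argsort(-scores[top])]          # ordered top n

query = rng.normal(size=k)
print(top_n(query, doc_mat)[:10])                 # ids of the ten best-matching documents
```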
4.2 Improving Performance - Accuracy

We have only begun to look at a large number of parametric variations that might improve LSI performance. One important variable for LSI retrieval is the number of dimensions in the reduced-dimension space. In previous experiments we have found that performance improves as the number of dimensions is increased up to 200 or 300 dimensions, and decreases slowly after that to the level observed for the standard vector method (Dumais, 1991). We have examined TREC-2 performance using fewer dimensions than reported above (204 for the routing queries and 199 for the adhoc queries) and consistently found worse performance. Thus, it looks like we could improve performance simply by increasing the number of dimensions some. Unfortunately, this requires rerunning the SVD.

We also noticed that many of the adhoc queries contained "NOTs". Since LSI does not use any Boolean logic and represents a query as the vector sum of its constituent terms, we thought that removing this information might help. We modified the topic statements by hand to remove negated phrases. Performance improved by less than 2%.

We still need to experiment with different term weighting methods. For the routing and adhoc experiments we used SMART's "ltc" weighting for both the corpus of documents and the queries. Buckley and Salton's TREC-1 paper suggests that alternative weightings may be more effective for the large TREC document collection. Reweighting the query vectors is easy. Reweighting the document collection is more difficult, because this changes the term-document matrix and a new SVD is required.

For the routing queries we would like to try several alternative methods of combining information from the original query and the relevant documents to take better advantage of the good training data that is available. We expect term re-weighting and the use of negative information (e.g., down weighting terms from non-relevant documents) to improve performance some.

In order to better understand retrieval performance we have begun to examine two kinds of retrieval failures: false alarms and misses. False alarms are documents that LSI ranks highly that are judged to be irrelevant. Misses are relevant documents that are not in the top 1000 returned by LSI.

4.2.1 False Alarms

The most common reason for false alarms was lack of specificity. These highly ranked but irrelevant articles were generally about the topic of interest but did not meet some of the restrictions described in the topic statement. Many topics required this kind of detailed processing or fact-finding that the LSI system was not designed to address. Precision of LSI matching can be increased by many of the standard techniques - proper noun identification, use of syntactic or statistically-derived phrases, or a two-pass approach involving a standard initial global matching followed by a more detailed analysis of the top few thousand documents. Buckley and Salton (1992, SMART's global and local matching), Evans et al. (1992, CLARIT's evoke and discriminate strategy), Nelson (1992, ConQuest's global match followed by the use of locality of information), and Jacobs, Krupka and Rau (1992, GE's pre-filter followed by a variety of more stringent tests) all used two-pass approaches to good advantage in TREC-1 or TREC-2. We would like to try some of these methods, and will focus on general-purpose, completely automatic methods that do not have to be modified for each new domain or query restriction.
Another possible reason for false alarms appears to be the result of inappropriate query pre-processing. The use of negation is the best example of this problem. 32 of the 50 adhoc queries contain some negation in the topic statement. Some preliminary experiments (described briefly above) found only a small improvement in performance when negated information was manually removed from the topics. Another example of inappropriate query processing involved the use of logical connectives. LSI does not handle Boolean combinations of words, and often returned articles covering only a subset of ANDed topics. Often one aspect of the query appears to dominate (typically the one described by the terms with high weights). Limiting the contribution of any one term to the overall similarity score might help this problem.

Finally, it is not at all clear why about 20% of the false alarms were returned by LSI. Since LSI uses a statistically-derived "semantic" space and not surface-level word overlap for matching queries to documents, it is sometimes difficult to understand why a particular document was returned. One advantage of the LSI method is that documents can match queries even when they have no words in common; but this can also produce some spurious hits. Another reason for false alarms could be inappropriate word sense disambiguation. LSI queries are located at the weighted vector sum of the words, so words are "disambiguated" to some extent by the other query words. Similarly, the initial SVD analysis used the context of other words in articles to determine the location for each word in the LSI space. However, since each word has only one location, it sometimes appears as if it is "in the middle of nowhere". A related possibility concerns long articles. Lengthy articles which talk about many distinct subtopics were averaged into a single document vector, and this can sometimes produce spurious matches. Breaking larger documents into smaller subsections and matching on these might help.

4.2.2 Misses

For this analysis we will examine a random subset of relevant articles that were not in the top 1000 returned by LSI. Many of the relevant articles were fairly highly ranked by LSI, but there were also some notable failures that would be seen only by the most persistent readers. So far, we have not systematically distinguished between misses that "almost made it" and those that were much further down the list. Most of the misses we examined represent articles that were primarily about a different topic than the query, but contained a small section that was relevant to the query. Because documents are located at the average of their terms in LSI space, they will generally be near the dominant theme, and this is a desirable feature of the LSI representation. Some kind of local matching should help in identifying less central themes in documents. Some misses were also attributable to poor text (and query) pre-processing and tokenization.

4.3 Open issues

On the basis of preliminary failure analyses we would like to explore some precision-enhancing methods. We would also like to explore three additional areas.

4.3.1 Separate vs. combined scaling

We used 9 separate subscalings for the TREC-1 experiments. For TREC-2 we used a single scaling (based on a very small sample). We have also recently finished a complete scaling and will compare this with the subcollection scalings and the sampled full scaling.

4.3.2 Centroid query vs. many separate points of interest
A single vector was used to represent each query. In some cases the vector was the average of terms in the topic statement, and in other cases the vector was the average of previously identified relevant documents. A single query vector can be inappropriate if interests are multifaceted and these facets are not near each other in the LSI space. We have developed techniques that allow us to match using a controllable compromise between averaged and separate vectors (Kane-Esrig et al., 1991). In the case of the routing queries, for example, we could match new documents against each of the previously identified relevant documents separately rather than against their average.

4.3.3 Interactive interfaces

All LSI evaluations were conducted using a non-interactive system in essentially batch mode. It is well known that one can have the same underlying retrieval and matching engine, but achieve very different retrieval success using different interfaces. We would like to examine the performance of real users with interactive interfaces. A number of interface features could be used to help users make faster (and perhaps more accurate) relevance judgements, or to help them explicitly reformulate queries. (See Dumais and Schmitt, 1991, for some preliminary results on query reformulation and relevance feedback.) Another interesting possibility involves returning something richer than a rank-ordered list of documents to users. For example, a clustering and graphical display of the top-k documents might be quite useful. We have done some preliminary experiments using clustered return sets, and would like to extend this work to the TREC collections. The general idea is to provide people with useful interactive tools that let them make good use of their knowledge and skills, rather than attempting to build all the smarts into the database representation or matching components of the system.

5. Onward to TREC-3

We were quite pleased that we were able to use many of the existing LSI/SVD tools on the TREC-1 and TREC-2 collections. The most important finding in this regard was that the large, sparse SVD problems could be computed without numerical or convergence problems. We modified the preprocessing substantially for TREC-2, now have many of the basic tools in place, and should be able to conduct more experiments comparing various indexing and query matching ideas using the same underlying LSI engine. Bigger SVDs, faster query matching, improving precision, and interactive interface issues are the major areas targeted for improvement.

6. References

[1] Berry, M. W. Large scale singular value computations. International Journal of Supercomputer Applications, 1992, 6(1), 13-49.

[2] Buckley, C., Allan, J., and Salton, G. Automatic routing and ad-hoc retrieval using SMART: TREC 2. To appear in: Proceedings of TREC-2.

[3] Buckley, C. and Salton, G. Automatic retrieval with locality information using SMART. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992, 59-72.

[4] Cullum, J.K. and Willoughby, R.A. Lanczos algorithms for large symmetric eigenvalue computations - Vol 1 Theory, (Chapter 5: Real rectangular matrices). Birkhauser, Boston, 1985.

[5] Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 1990, 41(6), 391-407.

[6] Dumais, S. T. Improving the retrieval of information from external sources. Behavior Research Methods, Instruments and Computers, 1991, 23(2), 229-236.

[7] Dumais, S. T. LSI meets TREC: A status report. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 137-152.

[8] Dumais, S. T. and Schmitt, D. G. Iterative searching in an online database. In Proceedings of the Human Factors Society 35th Annual Meeting, 1991, 398-402.

[9] Evans, D., Lefferts, R., Grefenstette, G., Handerson, S., Hersh, W., and Archbold, A. CLARIT TREC design, experiments and results. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992, 251-286.

[10] Foltz, P. W. and Dumais, S. T. Personalized information delivery: An analysis of information filtering methods. Communications of the ACM, Dec. 1992, 35(12), 51-60.

[11] Furnas, G. W., Deerwester, S., Dumais, S. T., Landauer, T. K., Harshman, R. A., Streeter, L. A., and Lochbaum, K. E. Information retrieval using a singular value decomposition model of latent semantic structure. In Proceedings of SIGIR, 1988, 465-480.

[12] Gallant, S., Hecht-Nielsen, R., Caid, W., Qing, K., Carleton, J., and Sudbeck, D. TIPSTER panel - HNC's MatchPlus system. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992, 107-112.

[13] Jacobs, P., Krupka, G., and Rau, L. A Boolean approximation method for query construction and topic assignment in TREC. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992, 297-308.

[14] Kane-Esrig, Y., Streeter, L., Dumais, S. T., Keese, W. and Casella, G. The relevance density method for multi-topic queries in information retrieval. In Proceedings of the 23rd Symposium on the Interface, E. Keramidas (Ed.), 1991, 407-410.

[15] Nelson, P. Site report for the Text REtrieval Conference. In D. Harman (Ed.) The First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992, 287-296.

[16] Salton, G. and McGill, M.J. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.

[17] Voorhees, E. On expanding query vectors with lexically related words. To appear in: Proceedings of TREC-2.

7. Appendix

Latent Semantic Indexing (LSI) uses singular-value decomposition (SVD), a technique closely related to eigenvector decomposition and factor analysis (Cullum and Willoughby, 1985). We take a large term-document matrix and decompose it into a set of k, typically 100 to 300, orthogonal factors from which the original matrix can be approximated by linear combination. More formally, any rectangular matrix X, for example a t x d matrix of terms and documents, can be decomposed into the product of three other matrices:

X = T0 S0 D0'

such that T0 and D0 have orthonormal columns, S0 is diagonal, and r is the rank of X. This is the so-called singular value decomposition of X. If only the k largest singular values of S0 are kept along with their corresponding columns in the T0 and D0 matrices, and the rest deleted (yielding matrices S, T and D), the resulting matrix X̂ is the unique matrix of rank k that is closest in the least squares sense to X:

X ≈ X̂ = T S D'

The idea is that the X̂ matrix, by containing only the first k independent linear components of X, captures the major associational structure in the matrix and throws out noise. It is this reduced model, usually with k around 100, that we use to approximate the term to document association data in X.
Since the number of dimensions in the reduced model (k) is much smaller than the number of unique terms (t), minor differences in terminology are ignored. In this reduced model, the closeness of documents is determined by the overall pattern of term usage, so documents can be near each other regardless of the precise words that are used to describe them, and their description depends on a kind of consensus of their term meanings, thus dampening the effects of polysemy. In particular, this means that documents which share no words with a user's query may still be near it if that is consistent with the major patterns of word usage. We use the term "semantic" indexing to describe our method because the reduced SVD representation captures the major associative relationships between terms and documents.

One can also interpret the analysis performed by SVD geometrically. The result of the SVD is a k-dimensional vector representing the location of each term and document in the k-dimensional representation. The location of term vectors reflects the correlations in their usage across documents. In this space the cosine or dot product between vectors corresponds to their estimated similarity. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. Retrieval proceeds by using the terms in a query to identify a point in the space, and all documents are then ranked by their similarity to the query.

We make no attempt to interpret the underlying dimensions or factors, nor to rotate them to some intuitively meaningful orientation. The analysis does not require us to be able to describe the factors verbally but merely to be able to represent terms, documents and queries in a way that escapes the unreliability, ambiguity and redundancy of individual terms as descriptors.

Choosing the appropriate number of dimensions for the LSI representation is an open research question. Ideally, we want a value of k that is large enough to fit all the real structure in the data, but small enough so that we do not also fit the sampling error or unimportant details. If too many dimensions are used, the method begins to approximate standard vector methods and loses its power to represent the similarity between words. If too few dimensions are used, there is not enough discrimination among similar words and documents. We find that performance improves as k increases for a while, and then decreases (Dumais, 1991). That LSI typically works well with a relatively small (compared to the number of unique terms) number of dimensions shows that these dimensions are, in fact, capturing a major portion of the meaningful structure.