Automatic Routing and Ad-hoc Retrieval Using SMART: TREC 2

Chris Buckley*, James Allan, and Gerard Salton

*Department of Computer Science, Cornell University, Ithaca, NY 14853-7501. This study was supported in part by the National Science Foundation under grant IRI 89-15847.

Abstract

The Smart information retrieval project emphasizes completely automatic approaches to the understanding and retrieval of large quantities of text. We continue our work in the TREC 2 environment, performing both routing and ad-hoc experiments. The ad-hoc work extends our investigations into combining global similarities, giving an overall indication of how a document matches a query, with local similarities identifying a smaller part of the document which matches the query. The performance of the ad-hoc runs is good, but it is clear we are not yet taking full advantage of the available local information.

Our routing experiments use conventional relevance feedback approaches to routing, but with a much greater degree of query expansion than was done in TREC 1. The length of a query vector is increased by a factor of 5 to 10 by adding terms found in previously seen relevant documents. This approach improves effectiveness by 30-40% over the original query.

Introduction

For over 30 years, the Smart project at Cornell University has been interested in the analysis, search, and retrieval of heterogeneous text databases, where the vocabulary is allowed to vary widely, and the subject matter is unrestricted. Such databases may include newspaper articles, newswire dispatches, textbooks, dictionaries, encyclopedias, manuals, magazine articles, and so on. The usual text analysis and text indexing approaches that are based on the use of thesauruses and other vocabulary control devices are difficult to apply in unrestricted text environments, because the word meanings are not stable in such circumstances and the interpretation varies depending on context. The applicability of more complex text analysis systems that are based on the construction of knowledge bases covering the detailed structure of particular subject areas, together with inference rules designed to derive relationships between the relevant concepts, is even more questionable in such cases. Complete theories of knowledge representation do not exist, and it is unclear what concepts, concept relationships, and inference rules may be needed to understand particular texts.[11]

Accordingly, a text analysis and retrieval component must necessarily be based primarily on a study of the available texts themselves. Fortunately very large text databases are now available in machine-readable form, and a substantial amount of information is automatically derivable about the occurrence properties of words and expressions in natural-language texts, and about the contexts in which the words are used. This information can help in determining whether a query and a text are semantically homogeneous, that is, whether they cover similar subject areas. When that is the case, the text can be retrieved in response to the query.

Automatic Indexing

In the Smart system, the vector-processing model of retrieval is used to transform both the available information requests as well as the stored documents into vectors of the form

    $$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$

where $D_i$ represents a document (or query) text and $w_{ik}$ is the weight of term $T_k$ in document $D_i$.
A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are available for the representation of the information.

In choosing a term weighting system, low weights should be assigned to high-frequency terms that occur in many documents of a collection, and high weights to terms that are important in particular documents but unimportant in the remainder of the collection. The weight of terms that occur rarely in a collection is relatively unimportant, because such terms contribute little to the needed similarity computation between different texts.

A well-known term weighting system following that prescription assigns weight $w_{ik}$ to term $T_k$ in query $Q_i$ in proportion to the frequency of occurrence of the term in $Q_i$, and in inverse proportion to the number of documents to which the term is assigned.[12, 10] Such a weighting system is known as a tf x idf (term frequency times inverse document frequency) weighting system. In practice the query lengths, and hence the number of non-zero term weights assigned to a query, vary widely. To allow a meaningful final retrieval similarity, it is convenient to use a length normalization factor as part of the term weighting formula. A high-quality term weighting formula for $w_{ik}$, the weight of term $T_k$ in query $Q_i$, is

    $$w_{ik} = \frac{(\log(f_{ik}) + 1.0) \cdot \log(N/n_k)}{\sqrt{\sum_{j=1}^{t}\left[(\log(f_{ij}) + 1.0) \cdot \log(N/n_j)\right]^2}}  \qquad (1)$$

where $f_{ik}$ is the occurrence frequency of $T_k$ in $Q_i$, $N$ is the collection size, and $n_k$ the number of documents with term $T_k$ assigned. The factor $\log(N/n_k)$ is an inverse collection frequency ("idf") factor which decreases as terms are used widely in a collection, and the denominator in expression (1) is used for weight normalization. This particular form will be called "ltc" weighting within this paper. The weights assigned to terms in documents are much the same. In practice, for both effectiveness and efficiency reasons, the idf factor in the documents is dropped.[1]

The terms $T_k$ included in a given vector can in principle represent any entities assigned to a document for content identification. In the Smart context, such terms are derived by a text transformation of the following kind:[10]

1. recognize individual text words

2. use a stop list to eliminate unwanted function words

3. perform suffix removal to generate word stems

4. optionally use term grouping methods based on statistical word co-occurrence or word adjacency computations to form term phrases (alternatively syntactic analysis computations can be used)

5. assign term weights to all remaining word stems and/or phrase stems to form the term vector for all information items.

Once term vectors are available for all information items, all subsequent processing is based on term vector manipulations.

The fact that the indexing of both documents and queries is completely automatic means that the results obtained are reasonably collection independent and should be valid across a wide range of collections. No human expertise in the subject matter is required for either the initial collection creation, or the actual query formulation.

Phrases

The same phrase strategy (and phrases) used in TREC 1 ([1]) is used for TREC 2. Any pair of adjacent non-stopwords is regarded as a potential phrase. The final list of phrases is composed of those pairs of words occurring in 25 or more documents of the initial TREC 1 document set (D1, the TREC 1 initial collection). Phrase weighting is again a hybrid scheme where phrases are weighted with the same scheme as single terms, except that normalization of the entire vector is done by dividing by the length of the single-term subvector only. In this way, the similarity contribution of the single terms is independent of the quantity or quality of the phrases.
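As a concrete illustration of this indexing pipeline, the Python sketch below tokenizes a short query text, removes stopwords, forms adjacent-pair phrase candidates, and applies the "ltc" weighting of equation (1) with the hybrid phrase normalization just described. It is a minimal sketch, not the SMART implementation (which is written in C): the stop list, the absence of stemming, and the df dictionary standing in for the D1 collection statistics and the 25-document phrase list are all simplifying assumptions, and every name is illustrative.

```python
import math
import re
from collections import Counter

# A toy stop list; SMART uses a much larger one, and also applies suffix
# removal (stemming), which is omitted here.
STOPWORDS = {"the", "of", "a", "in", "and", "to", "is"}

def tokenize(text):
    """Steps 1-2: recognize individual words and drop stopwords."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def phrase_pairs(tokens):
    """Step 4: any pair of adjacent non-stopwords is a candidate phrase."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def ltc_weights(terms, N, df, phrases=()):
    """Equation (1): (log(tf) + 1) * log(N / n_k), cosine-normalized.

    Per the hybrid phrase scheme, the normalization length is computed from
    the single-term sub-vector only; phrases are weighted the same way but
    do not contribute to the length.  Terms absent from df (e.g. phrases not
    on the 25-document list) are simply dropped.
    """
    def raw(counts):
        return {t: (math.log(tf) + 1.0) * math.log(N / df[t])
                for t, tf in counts.items() if t in df}

    single = raw(Counter(terms))
    length = math.sqrt(sum(w * w for w in single.values())) or 1.0
    vector = {t: w / length for t, w in single.items()}
    vector.update({t: w / length for t, w in raw(Counter(phrases)).items()})
    return vector

# Hypothetical collection statistics: N documents, df[t] = number of
# documents containing term t (stands in for the real D1 statistics).
N = 1000
df = {"salt": 60, "peace": 40, "treaty": 25, "peace treaty": 25}
tokens = tokenize("the SALT peace treaty")
query_vector = ltc_weights(tokens, N, df, phrases=phrase_pairs(tokens))
```

Normalizing by the single-term length only means that adding phrases can only add to a vector's similarity contribution without disturbing the single-term contribution, which is the point of the hybrid scheme.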
Text Similarity Computation

When the text of document $D_i$ is represented by a vector of the form $(d_{i1}, \ldots, d_{it})$ and query $Q_j$ by the vector $(q_{j1}, \ldots, q_{jt})$, a similarity ($S$) computation between the two items can conveniently be obtained as the inner product between the corresponding weighted term vectors as follows:

    $$S(D_i, Q_j) = \sum_{k=1}^{t} (d_{ik} \cdot q_{jk})  \qquad (2)$$

Thus, the similarity between two texts (whether query or document) depends on the weights of coinciding terms in the two vectors.

Information retrieval and text linking systems based on the use of global text similarity measures such as that of expression (2) will be successful when the common terms in the two vectors are in fact used in semantically similar ways. In many cases it may happen that highly-weighted terms that contribute substantially to the text similarity are semantically distinct. For example, a sound may be an audible phenomenon, or a body of water.

TREC 1 ([1]) demonstrated that local contexts could be used to disambiguate word senses, for example rejecting documents about "industrial salts" when given a query about the "SALT peace treaty". Overall, however, the improvement in effectiveness due to local matching was minimal in TREC 1. One reason for this is the richness of the TREC queries: global text matching is almost invariably sufficient for disambiguation. Another reason is the homogeneity of the queries. They deal primarily with two subjects: finance, and science and technology. Within a single subject area vocabulary is more standardized and ambiguity is therefore minimized.

One other potential reason for the unexpectedly slight improvement is that most of the information from local matches is simply being thrown away. Local matches are used as a filter to reject documents that do not satisfy a local criterion: the overall global similarity used for ranking is changed only by the addition of a constant indicating the local match criterion was satisfied. The positive information that a long document might have a single paragraph which very closely matched the query is ignored. For TREC 2, we look at combining global and local similarities into a single final similarity to be used for ranking purposes.

The other focus of our TREC 2 work is taking advantage of the vast quantity of relevance judgements available for the routing experiments. In TREC 1, the relevance information was fragmentary and even occasionally incorrect. It was hard to use this information in a reasonable fashion. Happily, the results of the TREC 1 experiments furnished a large number of very good relevance judgements to be used for TREC 2. Conventional vector-space feedback methods of query expansion and re-weighting are tuned for the TREC environment in the routing portion of TREC 2.

System Description

The Cornell TREC experiments use the SMART Information Retrieval System, Version 11, and are run on a dedicated Sun Sparc 2 with 64 Mbytes of memory and 5 Gbytes of local disk.

SMART Version 11 is the latest in a long line of experimental information retrieval systems, dating back over 30 years, developed under the guidance of G. Salton. Version 11 is a reasonably complete re-write of earlier versions, and was designed and implemented primarily by C. Buckley. The new version is approximately 44,000 lines of C code and documentation.

SMART Version 11 offers a basic framework for investigations into the vector space and related models of information retrieval. Documents are fully automatically indexed, with each document representation being a weighted vector of concepts, the weight indicating the importance of a concept to that particular document (as described above). The document representatives are stored on disk as an inverted file. Natural language queries undergo the same indexing process. The query representative vector is then compared with the indexed document representatives to arrive at a similarity (equation (2)), and the documents are then fully ranked by similarity.
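That retrieval step is compact enough to sketch. The Python fragment below is a minimal illustration, not the actual SMART C implementation: it assumes documents have already been indexed into {term: weight} dictionaries (as in the previous sketch), builds a toy inverted file, and accumulates the inner product of equation (2) per document, returning the documents best-first. All names are illustrative.

```python
from collections import defaultdict

def build_inverted_file(doc_vectors):
    """doc_vectors: {doc_id: {term: weight}}  ->  {term: [(doc_id, weight), ...]}"""
    inverted = defaultdict(list)
    for doc_id, vector in doc_vectors.items():
        for term, weight in vector.items():
            inverted[term].append((doc_id, weight))
    return inverted

def rank(query_vector, inverted_file):
    """Equation (2): sum the products of coinciding term weights, accumulated
    document by document through the inverted file; return documents best-first."""
    scores = defaultdict(float)
    for term, q_weight in query_vector.items():
        for doc_id, d_weight in inverted_file.get(term, ()):
            scores[doc_id] += q_weight * d_weight
    return sorted(scores.items(), key=lambda item: -item[1])
```

Only documents sharing at least one term with the query are ever touched, which is what makes the inverted file organization efficient for short queries; as the routing timing discussion below notes, massive query expansion erodes much of that advantage.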
Ad-hoc Results

Cornell submitted two runs in the ad-hoc category. The first, crnlV2, is a very simple vector comparison. The second, crnlL2, makes use of simplified least squares analysis and a training set to combine global similarity and part-wise similarities in a meaningful ratio. Both systems performed at or above the median in almost all queries, as can be seen in Table 1. The crnlV2-b run is the same as the official crnlV2 run, but with an error in the experimental procedure corrected (discussed below).

    Run       Best   >= median   < median
    crnlR1      7       40           3
    crnlC1      5       45           0

Evaluation measures in Table 6 for both the official and some non-official runs show the importance of query expansion. Run 1 is the base case: the original query only (ltc weights).

    X   number of single terms to add (possible values 0 to 500)
    Y   number of phrases to add (0 to 100)
    A   relative importance of original query (fixed at 8)
    B   relative importance of average weight in relevant documents (4 to 48)
    C   relative importance of average weight in non-relevant documents (0 to 16)
    P   relative importance of phrases in final retrieval as compared to single terms (0, 0.5, or 1.0)

    Table 4: Parameters of routing

Just re-weighting the query terms according to Rocchio's algorithm gives a 7% improvement. Adding a few terms (20 single terms + 10 phrases) gives a 17% improvement over the base case, and expanding by 350 (300+50) terms results in a 38% improvement.

The official run crnlC1 is actually a bit disappointing. It only results in a 3% improvement over the crnlR1 run, which is not very significant considering the effort required. Few people are going to keep track of 158 test runs on a per-query basis. It may be practical to keep track of 4 or so main query variants, but then the improvement would probably be less than 3%. We are conducting experiments in this area currently.

An open question is the effectiveness of varying the feedback approach itself between queries. Preliminary experiments using Fuhr's RPI ([3]) weighting schemes in addition to the Rocchio variants show larger improvements. In general, RPI (and the other probabilistic models) perform noticeably better than Rocchio if there is very little query expansion, though quite a bit worse under massive expansion. We expect that the combination of RPI for those queries with little expansion and Rocchio for other queries will work well.

One benefit of the crnlC1 run not entirely represented by the evaluation figures is that retrieval performance is more even. Potential mismatches between feedback method and query are far less likely. crnlC1 does reasonably on all the queries (above the median system for every query when compared against the other systems).
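To make the expansion process concrete, the following Python sketch implements a Rocchio-style expansion in terms of the Table 4 parameters. It is a simplified sketch of the approach described here, not the SMART code: vectors are plain {term: weight} dictionaries, phrases are marked by an embedded space (as in the earlier sketches), and the per-query term statistics described in the next section are reduced to averaging full document vectors.

```python
from collections import defaultdict

def rocchio_expand(orig_query, rel_vectors, nonrel_vectors,
                   A=8, B=16, C=4, X=300, Y=50):
    """Rocchio-style expansion using the Table 4 parameters.

    A term's new weight is
        A * weight in original query
      + B * average weight in relevant documents
      - C * average weight in non-relevant documents,
    and besides the original query terms only the X best new single terms
    and the Y best new phrases are retained.
    """
    def average(vectors):
        totals = defaultdict(float)
        for vector in vectors:
            for term, weight in vector.items():
                totals[term] += weight
        n = max(len(vectors), 1)
        return {term: total / n for term, total in totals.items()}

    rel_avg = average(rel_vectors)
    nonrel_avg = average(nonrel_vectors)

    expanded = {}
    for term in set(orig_query) | set(rel_avg):
        weight = (A * orig_query.get(term, 0.0)
                  + B * rel_avg.get(term, 0.0)
                  - C * nonrel_avg.get(term, 0.0))
        if weight > 0:
            expanded[term] = weight

    # Keep every original query term plus the X best new single terms and
    # the Y best new phrases (phrases carry an embedded space here).
    added = sorted((t for t in expanded if t not in orig_query),
                   key=lambda t: -expanded[t])
    singles = [t for t in added if " " not in t][:X]
    phrases = [t for t in added if " " in t][:Y]
    keep = set(orig_query) | set(singles) | set(phrases)
    return {term: weight for term, weight in expanded.items() if term in keep}
```

With the defaults shown (A.B.C = 8.16.4, X.Y = 300.50) this corresponds to the crnlR1 setting of Table 6; the crnlC1 run instead varies these parameters from query to query, as listed in Table 5.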
Routing Implementation and Timing

The original routing queries are automatically indexed from the query text, and weighted using the "ltc" weighting scheme (equation (1)). Collection frequency information used for the idf factors is gathered from the D12 documents only.

Relevance information about potential query terms is gathered and stored on a per-query basis. For each query, statistics (including relevant and non-relevant frequency and total "ltc" weights) are kept about the 1000 most frequently occurring terms in the D12 relevant documents. For TREC 2, this is done by a batch run taking about 90 CPU minutes. In practice, this would be done incrementally as each document was compared to the query and judged. The statistics amounted to about 40,000 bytes per query. Using these statistics, and the decided-upon parameters for the feedback process (A, B, etc.), actual construction of the final query takes about 0.5 seconds per query.

Retrieval times vary tremendously with the length of the query. We ran in batch mode, constructing an inverted file for the entire D3 document set ("lnc" document weights) and then comparing a query against that inverted file. Not only is this not what would be done in practice, but it is much less efficient than would be done in practice given our massive expansion of queries: for each query in crnlR1, well over half the entire inverted file was read! CPU time per query ranged from about 5 seconds (no expansion) to 65 seconds (expansion by 500 terms).

    Query   Phrase   Single   Phrase   Orig    Relevant   Non-relevant
            Weight   Expand   Expand   Query
     51      0.0      500                        24          4
     52      0.0      200       *        8       16          0
     53      1.0      500      50        8       48          4
     54      0.0      200                8       36          4
     55      0.0      200                8       24          8
     56      1.0        0       0        8        8          4
     57      0.0      500                8       24          4
     58      1.0      500       0        8       48          4
     59      0.5      100      10        8        8          4
     60      1.0        0       0        8        8          4
     61      0.0        0                8        8          4
     62      0.0      500                8       24          4
     63      1.0        0       0        8        8          4
     64      0.0      500                8       16          4
     65      1.0      200       0        8       16          4
     66      1.0      200       0        8       16          4
     67      0.0      500                8       16          4
     68      1.0      300       0        8       48          4
     69      1.0      500       0        8       24          4
     70      1.0      500      50        8       48          4
     71      0.0      500                8        8          4
     72      1.0      300       0        8       48          4
     73      0.0      500                8       16          4
     74      1.0      100     100        8        8          4
     75      0.0      100                8       24          6
     76      1.0      300       0        8       24          4
     77      0.0      500                8        8          4
     78      0.0        0                8       24          4
     79      0.0      500                8       24          4
     80      1.0      500       0        8       24          4
     81      1.0      500       0        8       32          4
     82      1.0      500       0        8       32          4
     83      0.0      500                8       24          4
     84      0.0      100                8       36          4
     85      1.0      100      10        8       24          4
     86      0.0       30                8       24          4
     87      1.0      500       0        8       48          4
     88      1.0      500      50        8       48          4
     89      1.0      500       0        8       48          4
     90      0.0      500                8       16          4
     91      1.0        0       0        8       24          4
     92      0.0      100                8       36          4
     93      1.0      100      50        8        8          4
     94      0.0      200                8       36          4
     95      0.0      200                8       16          8
     96      1.0      300      50        8       48          4
     97      0.0      200                8       36          4
     98      0.0      500                8       24          4
     99      0.0       50                8       24          4
    100      1.0      500      50        8       48          4

    Table 5: Optimum routing parameters, query-by-query

    Run                 X.Y      A.B.C    R-prec   Total Rel   recall-prec
    1. no fdbk          0.0      8.0.0     3382      6509         2869
    2. no expand        0.0      8.8.4     3531      6849         3087
    3. little expand    20.10    8.8.4     3756      7192         3345
    4. crnlR1           300.50   8.16.4    4273      7764         3952
    5. crnlC1           varies   varies    4367      7808         4091

    Table 6: Routing evaluation

Conclusion

No firm conclusions can be reached regarding the usefulness of combining local and global similarities in the TREC environment. In some limited circumstances minor improvements can be obtained, but in general we have not (yet!) been able to take advantage of the local information we know should be useful. For TREC 2, this failure is ameliorated by the base level performance of the global run. If the correct weights are used, the effectiveness of automatic indexing is extremely good.
Automatic massive query expansion proves to be very effective for routing. Conventional relevance feedback techniques are used to weight the expanded queries. Parameters for the relevance feedback algorithms are estimated both over all the queries and for each query individually. The individual query estimation performs better (3-4%), but by an insufficient amount to be convincing.

References

[1] Chris Buckley, Gerard Salton, and James Allan. Automatic retrieval with locality information using SMART. In D. K. Harman, editor, Proceedings of the First Text REtrieval Conference (TREC-1), pages 59-72. NIST Special Publication 500-207, March 1993.

[2] James P. Callan and W. Bruce Croft. An evaluation of query processing strategies using the TIPSTER collection. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 347-355, June 1993.

[3] Norbert Fuhr. Models for retrieval with probabilistic indexing. Information Processing and Management, 25(1):55-72, 1989.

[4] Norbert Fuhr and Chris Buckley. Automatic structuring of text files. ACM Transactions on Information Systems, 9(3):223-248, 1991.

[5] Norbert Fuhr and Chris Buckley. Optimizing document indexing and search term weighting based on probabilistic models. In D. K. Harman, editor, Proceedings of the First Text REtrieval Conference (TREC-1), pages 89-99. NIST Special Publication 500-207, March 1993.

[6] D. Harman. Towards interactive query expansion. In Proceedings of the Eleventh International Conference on Research and Development in Information Retrieval, pages 321-331. Association for Computing Machinery, 1988.

[7] Marti A. Hearst and Christian Plaunt. Subtopic structuring for full-length document access. In Proceedings of the Sixteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 59-68, June 1993.

[8] E. Ide. New experiments in relevance feedback. In Gerard Salton, editor, The SMART Retrieval System - Experiments in Automatic Document Processing, chapter 16. Prentice Hall, Englewood Cliffs, NJ, 1971.

[9] J. J. Rocchio. Relevance feedback in information retrieval. In Gerard Salton, editor, The SMART Retrieval System - Experiments in Automatic Document Processing. Prentice Hall, Englewood Cliffs, NJ, 1971.

[10] Gerard Salton. Automatic Text Processing - the Transformation, Analysis and Retrieval of Information by Computer. Addison-Wesley Publishing Co., Reading, MA, 1989.

[11] Gerard Salton. Developments in automatic text retrieval. Science, 253:974-980, August 1991.

[12] Gerard Salton and Chris Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513-523, 1988.

[13] Gerard Salton and Chris Buckley. Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, 41(4):288-297, 1990.

[14] Gerard Salton and Chris Buckley. Automatic text structuring and retrieval: Experiments in automatic encyclopedia searching. In Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-30, 1991.

[15] Gerard Salton and Chris Buckley. Global text matching for information retrieval. Science, 253:1012-1015, August 1991.

[16] Gerard Salton, Chris Buckley, and James Allan. Automatic structuring of text files. Electronic Publishing, 5(1):1-17, March 1992.
[17] Craig Stanfill and David L. Waltz. Statistical methods, artificial intelligence, and information retrieval. In Paul S. Jacobs, editor, Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval. Lawrence Erlbaum Associates, Inc., Hillsdale, NJ, 1971.