Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models

Norbert Fuhr*    Chris Buckley†

*University of Dortmund, Informatik VI, P.O. Box 500500, W-4600 Dortmund 50, Germany, fuhr@ls6.informatik.uni-dortmund.de
†Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853, USA, chrisb@cs.cornell.edu

Abstract

We describe the application of probabilistic indexing and retrieval methods to the TREC material. For document indexing, we apply a description-oriented approach which uses relevance feedback information from previous queries run on the same collection. This method is also very flexible w.r.t. the underlying document representation. In our experiments, we consider single words and phrases and use polynomial functions for mapping the statistical parameters of these terms onto probabilistic indexing weights. Based on these weights, a linear (utility-theoretic) retrieval function is applied when no relevance feedback data is available for the specific query. Otherwise, the retrieval-with-probabilistic-indexing model can be used. The experimental results show excellent performance in both cases, but also indicate possible improvements.

1 Learning in IR

[Figure 1: Learning approaches in IR. Dimensions: documents, queries, terms; routing queries: search term weighting learned from relevance feedback; ad-hoc queries: description-oriented indexing.]

Figure 1 shows two major learning approaches that are used in IR, both of which are applicable to the tasks to be performed within TREC. For the routing queries, we have relevance feedback data for some documents w.r.t. a specific query, and then the system has to rank further documents for the same query. As indicated by the third dimension, our knowledge is restricted to the terms we have seen in the learning sample. If the new documents contained totally different terms, the learning sample would be of no use for coping with these documents. For dealing with the routing queries, mostly search term weighting methods are applied, which are well established in IR.

For the ad-hoc queries, the task is more difficult: given relevance information for some query-document pairs, this information has to be exploited in order to rank new documents w.r.t. new queries; furthermore, most of these new query-document pairs will involve new terms. As a method for dealing with this type of task, description-oriented indexing has been developed ([Fuhr & Buckley 91]). The major concept of this approach is abstraction. For the routing queries, we abstract from specific documents by regarding the presence or absence of terms. In description-oriented indexing, we have to abstract in addition from specific queries and terms. This can be done by regarding features of these objects instead of the objects themselves. Similar to pattern recognition methods, documents, queries and terms are described by sets of features here. This way, new documents, queries and terms can be mapped onto sets of features which we have already seen in the learning sample.

In principle, description-oriented learning could be combined with most IR models. Only the probabilistic model, however, relates directly to retrieval quality. The probability ranking principle described in [Robertson 77] states that optimum retrieval performance is achieved when documents are ranked according to descending values of their probability of relevance.
So our probabilistic approach uses the learning data in order to optimize retrieval quality for the test sample. This statement also holds for our work with the routing queries, where we apply the retrieval-with-probabilistic-indexing (RPI) model for estimating the query term weights.

In the following section, we describe the description-oriented document indexing method which yields probabilistic weights for terms w.r.t. documents. In order to use these weights in retrieval, two methods are applied here. When no relevance information for the specific query is available, a utility-theoretic retrieval function can be used, where the utility weights for the terms from the query are derived by some heuristics. With relevance feedback data, however, the RPI model can be applied. Experiments with the ad-hoc queries are described in section 3.1, followed by the presentation of our work with the routing queries in section 3.2.

2 Probabilistic document indexing

2.1 General approach

We first give a brief outline of the description-oriented indexing approach, which is presented in full detail in [Fuhr & Buckley 91]. Based on the binary independence indexing model ([Maron & Kuhns 60], [Fuhr 89]), one can define probabilistic document indexing weights as follows: let d_m denote a document, t_i a term and R the fact that a query-document pair is judged relevant; then P(R|t_i, d_m) denotes the probability that document d_m will be judged relevant w.r.t. an arbitrary query that contains term t_i. These weights can hardly be estimated directly, since there will not be enough relevance information available for a specific document. Instead, the description-oriented indexing approach is applied, where the indexing task is divided into two steps, namely a description step and a decision step.

In the description step, term-document pairs (t_i, d_m) are mapped onto so-called relevance descriptions x(t_i, d_m). The elements x_i of the relevance description contain values of features of t_i, d_m and their relationship, e.g.

    tf          = within-document frequency (wdf) of t_i in d_m,
    logidf      = log(inverse document frequency of t_i),
    lognumterms = log(number of different terms in d_m),
    imaxtf      = 1/(maximum wdf of a term in d_m).

In the experiments described in the following, we only use these simple parameters. However, it should be emphasized that the concept of relevance description is a means for coping with rather complex forms of document representations, e.g. when advanced natural language processing techniques are applied.

In the decision step, probabilistic indexing weights of the form P(R|x(t_i, d_m)) are estimated. Instead of the probability P(R|t_i, d_m), which relates to a specific document and a specific term, P(R|x(t_i, d_m)) is the probability that an arbitrary term-document pair having relevance description x will be involved in a relevant query-document relationship. This probability is estimated by a so-called indexing function u(x). Different regression methods or probabilistic classification algorithms can serve as indexing function. Here, we use polynomial regression. For this purpose, a polynomial structure v(x) = (v_1, ..., v_L) has to be defined as

    v(x) = (1, x_1, x_2, ..., x_N, x_1^2, x_1*x_2, ...),

where N is the number of dimensions of x. The class of polynomials is given by the components x_i^l * x_j^m * x_k^n * ... (i, j, k, ... ∈ [1, N]; l, m, n, ... > 0) which are to be included in the polynomial.
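To make the description step concrete, the following sketch (in Python, with hypothetical helper names; not the SMART implementation) computes the four features above for one term-document pair and maps them onto a polynomial vector v(x). The particular set of polynomial components chosen here is only one possible choice of polynomial structure.

    import math
    from collections import Counter

    def relevance_description(term, doc_terms, num_docs, doc_freq):
        """The four features of the relevance description x(t_i, d_m) for one term-document
        pair. doc_terms is the list of (stemmed) terms of the document, doc_freq the
        document frequency of `term` in the collection."""
        counts = Counter(doc_terms)
        tf = counts[term]                         # within-document frequency of t_i
        logidf = math.log(num_docs / doc_freq)    # log of the inverse document frequency
        lognumterms = math.log(len(counts))       # log of the number of different terms in d_m
        imaxtf = 1.0 / max(counts.values())       # 1 / maximum wdf of a term in d_m
        return [tf, logidf, lognumterms, imaxtf]

    def polynomial_structure(x):
        """One possible polynomial structure v(x): the constant 1, the linear components,
        and all products of two components (including the squares)."""
        v = [1.0] + list(x)
        v += [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
        return v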
The indexing function now yields u(x) = a^T * v(x), where a = (a_1, ..., a_L)^T is the coefficient vector which has to be estimated in a separate training phase preceding the application of the indexing function. So P(R|x) is approximated by the polynomial

    u(x) = a_1 + a_2*x_1 + a_3*x_2 + ... + a_{N+1}*x_N + a_{N+2}*x_1^2 + a_{N+3}*x_1*x_2 + ...

In the training phase, a learning sample is derived from relevance feedback data in the following way: for every query-document pair (q_k, d_m), a learning sample element (x(t_i, d_m), y) is generated for each term t_i common to q_k and d_m, with y = 1 if d_m is relevant w.r.t. q_k, and y = 0 otherwise. Given this set of learning sample elements, a linear regression method is applied in order to minimize the squared error (y - u(x))^2 (see [Fuhr & Buckley 91] for the details). The result of the regression method is the coefficient vector a which defines the indexing function. In the application phase, for each term occurring in a document, a relevance description is constructed first and then the indexing function gives the indexing weight of the term in the document.
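The training and application phases might be sketched as follows, assuming numpy and the hypothetical polynomial_structure helper above; this is only an illustration of the least-squares step, not the implementation used for the experiments.

    import numpy as np

    def train_indexing_function(sample, poly):
        """sample: list of (x, y) learning sample elements, where x = x(t_i, d_m) for a term
        common to query and document, and y is the relevance decision (1 relevant, 0 not).
        poly: function mapping x onto the polynomial vector v(x).
        Returns the coefficient vector a minimizing the squared error sum of (y - a.v(x))^2."""
        V = np.array([poly(x) for x, _ in sample])          # one row v(x) per sample element
        y = np.array([label for _, label in sample], float)
        a, *_ = np.linalg.lstsq(V, y, rcond=None)
        return a

    def indexing_weight(a, x, poly):
        """Application phase: u(x) = a^T * v(x), the estimate of P(R | x(t_i, d_m))."""
        return float(np.dot(a, poly(x)))

With only the constant and the linear components included in v(x), this setup yields linear indexing functions of the kind quoted in section 2.2 below.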
2.2 Experiments

We developed two types of document indexing, one with single terms only and one with both single words and phrases. For the single words, we initially used a linear indexing function, which gave us the function

    u(x(t_i, d_m)) = -0.0019844 + 0.0000296*tf + 0.0001079*imaxtf + 0.0003519*logidf + 0.0003068*lognumterms.

Retrieval experiments with this type of document indexing, however, revealed that the few extremely long documents (Federal Register documents with up to 2.6 MB) were always retrieved in the top ranks. This effect was due to the positive coefficient of lognumterms, which gave indexing weights larger than 1 for the terms in these documents, thus yielding high retrieval status values for these documents w.r.t. most queries. Obviously a linear indexing function is not well suited for coping with such extreme distributions of the elements of the relevance description. Besides the effect on the few very long documents, other documents are also affected by this distribution, for the following reason: many terms in the long documents will be assigned indexing weights larger than 1. For the computation of the squared error that is to be minimized, the quadratic difference (y - u(x))^2 between the indexing weight u(x) and the actual relevance decision y (0 or 1) in the learning sample is considered. Even if y = 1, there is still the difference (u(x) - 1)^2 for u(x) > 1. In order to reduce this error and thus the overall error, a larger error for other indexing weights is accepted. There are three possible strategies for coping with this problem:

* Some experiments performed with other material have shown that overall indexing quality can be improved by excluding outliers from the learning sample ([Pfeifer 91]). As outliers, not only those pairs with weights lying on the "correct side" of the interval [0,1] (i.e. either y = 0 and u(x) < 0 or y = 1 and u(x) > 1) should be regarded; also the exclusion of those pairs with weights lying on the "wrong side" (i.e. either y = 0 and u(x) > 1 or y = 1 and u(x) < 0) yields better results.

* Using logistic functions instead of linear functions overcomes the problem of indexing weights outside the interval [0,1]. However, we have some experimental evidence ([Pfeifer 90]) that even in this case the removal of outliers from the learning sample (where the outliers are defined w.r.t. a linear indexing function) improves the indexing quality.

* For the experiments described here, we switched from linear to polynomial functions, which also gave fairly good results.

The function finally used for indexing with single terms only is

    u(x) = 0.00042293 + 0.00150083*tf*logidf*imaxtf - 0.00150665*tf*imaxtf + 0.00010465*logidf - 0.00122627*lognumterms*imaxtf.

For indexing with phrases, we first had to derive the set of phrases to be considered. We took all adjacent non-stopwords that occurred at least 25 times in the D1 (training) document set. Then an indexing function common to single words and phrases was developed by introducing additional binary factors is_single (is_phrase) having the value 1 if the term is a single word (a phrase) and 0 otherwise. The parameters tf and logidf were defined for phrases in the same way as for single words, and in the computation of imaxtf and lognumterms for a document both phrases and single words were considered. This procedure produced the indexing function

    u(x) = 0.00034104 + 0.00141097*is_single*tf*logidf*imaxtf - 0.00119826*is_single*tf*imaxtf + 0.00014122*is_single*logidf - 0.00120413*lognumterms*imaxtf + 0.00820515*is_phrase*tf*logidf*imaxtf - 0.04188466*is_phrase*tf*imaxtf + 0.00114585*is_phrase*logidf.

For each phrase occurring in a document, indexing weights for the phrase as well as for its two components (as single words) were computed. In the following, we will refer to the indexing with single words only as "word indexing" and to the indexing using both single words and phrases as "phrase indexing".

3 Retrieval

For the probabilistic document indexing weights as described above, there are specific retrieval functions which yield a probabilistic or utility-theoretic ranking of documents w.r.t. a query. These functions differ in whether or not they consider relevance feedback information for the specific query. For this reason different retrieval functions were applied for the ad-hoc queries and the routing queries.

3.1 Ad-hoc queries

For the ad-hoc queries, we used a linear utility-theoretic retrieval function described in [Wong & Yao 89]. Let q_k^T denote the set of terms occurring in the query, d_m^T the set of document terms and u_im the indexing weight u(x(t_i, d_m)). If c_ik gives the utility of term t_i for the actual query q_k, then the utility of document d_m w.r.t. query q_k can be computed by the retrieval function

    ρ(q_k, d_m) = Σ_{t_i ∈ q_k^T ∩ d_m^T} c_ik * u_im.     (1)

The only problem here is the estimation of the utility weights c_ik, for which we do not have a theoretically justified method. As a heuristic approach, we used the SMART tf.idf weights, where tf here denotes the number of occurrences of the term t_i in the query q_k [Salton & Buckley 88]. We assume that there are other choices for this parameter which could significantly improve retrieval quality.
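As an illustration of retrieval function (1), the following sketch scores one document, assuming the probabilistic indexing weights u_im are already available per document and using unnormalized tf.idf weights as the heuristic c_ik (SMART's ntc cosine normalization is omitted; all names are illustrative).

    import math

    def query_term_utilities(query_terms, num_docs, doc_freq):
        """Heuristic utility weights c_ik: tf.idf of each query term, where tf is the number
        of occurrences of the term in the query (normalization omitted in this sketch)."""
        c = {}
        for t in query_terms:                     # query_terms: query tokens with repetitions
            c[t] = c.get(t, 0) + 1
        return {t: n * math.log(num_docs / doc_freq[t]) for t, n in c.items()}

    def utility_rsv(c, doc_weights):
        """Retrieval function (1): sum of c_ik * u_im over the terms common to query and
        document; doc_weights maps document terms to their probabilistic indexing weights u_im."""
        return sum(c[t] * doc_weights[t] for t in c.keys() & doc_weights.keys())

Ranking then simply means computing utility_rsv for every document and sorting by descending retrieval status value.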
The retrieval function (1) was applied for both kinds of document indexing, where words only were considered for run fuhra1 and both words and phrases in the run fuhrp1. Figure 2 shows the recall-precision curves for both runs in comparison to a standard SMART tf.idf run.

[Figure 2: Recall-precision curves for probabilistic indexing in comparison to the tf.idf run; curves for tf.idf, fuhra1 (words) and fuhrp1 (phrases); precision plotted against recall.]

It can be seen that both probabilistic indexing methods clearly outperform the tf.idf run, with the exception of the low-recall end for the phrase run (see below). Comparing word indexing with phrase indexing shows mixed results (see also Table 1): for the recall-precision averages, phrases perform better, again with the exception of the low-recall end. More information is given by the recall and precision values at different numbers of documents retrieved (see appendix of this volume). Precision for phrases is worse than or about equal to that of words, but recall is always better. This means that phrases perform better for narrow queries (with a small number of relevant documents).

    run                                    tf.idf    fuhra1 (words)    fuhrp1 (phrases)
    average precision for recall points:
      11-pt Avg.                           0.1750    0.1943            0.2054
      3-pt Avg.                            0.1159    0.1399            0.1664
    query-wise comparison with median:
      11-pt Avg.                                     32:14             32:17
      Prec. @ 100 docs                               25:21             35:13
    best/worst results:
      11-pt Avg.                                     6(2)/1(1)         4(1)/2
      Prec. @ 100 docs                               2(2)/2            8(3)/1(2)
    single words vs. phrases:
      11-pt Avg.                                                       25:25
      Prec. @ 100 docs                                                 21:28

    Table 1: Results for ad-hoc queries

In general, one would expect that using phrases in addition to single words would never decrease precision. We blame the current definition of the retrieval function (1) for the observed behaviour. This function assumes that all the terms considered are independent, which is obviously not true when we consider a phrase in addition to its two components as terms. Since the size of the error made here depends on the indexing weights of the components, this fact may explain the precision decrease for the highest ranked documents (i.e. at low recall levels). A better strategy would be to ignore the components in case the phrase occurs in the document.

The query-wise comparison with the median (see Table 1) shows a large scattering of the results for the single queries. A preliminary analysis has indicated that the relative performance for a specific query depends significantly on whether or not the query statement contains negated terms. For example, topic 86 contains "... it is a bank, as opposed to another financial institution such as a savings and loan or credit union ..." and topic 87 reads "... Civil actions, such as creditor claims filed for recovery of losses, are not relevant ...". In both cases, our system yields one of the worst results (for word indexing). Since our query indexing procedure extracts only the terms from the query and is not able to recognize negation, the system explicitly searches for documents containing these negated terms. Theoretically, it is obvious that negated terms should be given a negative utility weight. However, the examples cited here show that it is a difficult task to recognize negation automatically.

3.2 Routing queries

For the routing queries, the retrieval-with-probabilistic-indexing (RPI) model described in [Fuhr 89] and [Fuhr 92] was applied. This model combines query-specific relevance feedback information with probabilistic document indexing in order to estimate query term weights. Let p_ik denote the average indexing weight of term t_i in the documents judged relevant w.r.t. query q_k, and r_ik the average indexing weight of t_i in the nonrelevant documents. The query term weight is then computed by the formula

    c_ik = p_ik * (1 - r_ik) / (r_ik * (1 - p_ik)) - 1

and the RPI retrieval function yields

    ρ(q_k, d_m) = Σ_{t_i ∈ q_k^T ∩ d_m^T} log(c_ik * u_im + 1).     (2)
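A sketch of the RPI weighting and of retrieval function (2), assuming that the indexing weights of the judged documents are available and that every query term has at least one relevant and one nonrelevant judged document with nonzero average weight (names are illustrative, not those of the actual implementation):

    import math

    def rpi_query_term_weight(rel_weights, nonrel_weights):
        """c_ik = p_ik(1 - r_ik) / (r_ik(1 - p_ik)) - 1, where p_ik and r_ik are the average
        indexing weights of t_i in the judged relevant and nonrelevant documents."""
        p = sum(rel_weights) / len(rel_weights)
        r = sum(nonrel_weights) / len(nonrel_weights)
        return p * (1 - r) / (r * (1 - p)) - 1

    def rpi_rsv(c, doc_weights):
        """RPI retrieval function (2): sum of log(c_ik * u_im + 1) over the common terms."""
        return sum(math.log(c[t] * doc_weights[t] + 1) for t in c.keys() & doc_weights.keys())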
In our experiments, due to lack of time, we used the standard SMART tf.idf document indexing [Salton & Buckley 88] (single words only) instead of the probabilistic indexing described above. After an initial retrieval run with tf.idf weights for both queries and documents, relevance feedback information was used for computing the feedback query term weights c_ik. Only the terms occurring in the query were considered here, so no query expansion took place. Theoretically, the RPI formula can also be applied in case of query expansion. However, the additional terms should be treated differently when estimating their query term weights. This problem has not been investigated yet for the RPI model.

    # queries: 49
    query-wise comparison with median:
      11-pt Avg.            43:5
      Prec. @ 100 docs      43:5
    best/worst results:
      11-pt Avg.            12/1
      Prec. @ 100 docs      16/1

    Table 2: Results for routing queries

Table 2 shows the results for the run fuhra2 with routing queries. It can be seen that this approach works very well for almost every query. The single worst result is for topic 50, where we did not retrieve any relevant document; this outcome is due to the fact that there was only one relevant document in the training sample, which is obviously not sufficient for probabilistic parameter estimation.

A Operational details of runs

A.1 Overall

All runs were done completely automatically, treating the text portions of both documents and queries as flat text without structure. This made everything much simpler, but was not ideal given the complexity in both form and meaning of the queries. Recall-precision definitely suffered. All actual indexing (as opposed to weighting) and retrieval was done with the Cornell SMART Version 11.0 system, using the standard SMART procedures (e.g., stopwords, stemming, inverted file retrieval). All runs were made on a Sun Sparc 2 with 64 Mbytes of memory. All times reported are CPU time.

A.2 Ad-hoc runs

Two automatic ad-hoc runs were done: one in which the documents and queries were indexed with single terms only, and one in which they were indexed with both single terms and adjacency phrases. The overall procedure for both runs was:

1. Index D1 (the learning document set) and Q1 (the learning query set).
2. For each document d ∈ D1:
   2.1 For each query q ∈ Q1:
       2.1.1 Determine the relevance value r of d to q.
       2.1.2 For each term t in common between q^T (set of query terms) and d^T (set of document terms):
             2.1.2.1 Find the values of the elements of the relevance description involved in this run and add these values plus the relevance information to the least squares matrix being constructed.
3. Solve the least squares matrix to find the coefficient vector a.
4. Index D1 ∪ D2 (both sets of documents together) with term-frequency weights.
5. For each document d ∈ D1 ∪ D2:
   5.1 For each term t ∈ d^T:
       5.1.1 Find the values of the relevance description x(t, d) involved in the run.
       5.1.2 Give t the weight a^T * v(x(t, d)), where a is the coefficient vector determined in step 3.
   5.2 Add d to the inverted file.
6. Weight Q2 (test query set) with tf.idf weights (ntc variant).
7. Run an inner product inverted file similarity match of Q2 against the inverted file formed in step 5, retrieving the top 200 documents.

In an operational environment, the learning document set and the test document set would be the same, so step 1 (or step 4) would be omitted. Once coefficients have been found, they should remain valid unless the character of the collection changes. So new documents can be added to a dynamic collection by just going through step 5 for each new document, as in the sketch below.
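A minimal sketch of step 5 for one new document, building on the hypothetical relevance_description and indexing_weight helpers sketched in section 2; the actual runs use SMART's inverted file structures rather than a Python dictionary.

    def add_document(doc_id, doc_terms, a, poly, num_docs, doc_freq, inverted_file):
        """Step 5 for one new document: compute the relevance description x(t, d) of every
        document term, weight the term with a^T * v(x(t, d)), and append the posting to the
        inverted file (here simply a dict mapping terms to lists of (doc_id, weight))."""
        for term in set(doc_terms):
            x = relevance_description(term, doc_terms, num_docs, doc_freq[term])
            inverted_file.setdefault(term, []).append((doc_id, indexing_weight(a, x, poly)))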
The algorithm above is different from the one implemented in the past for this learning approach. Earlier versions iterated over queries (instead of documents) in step 2 and used inverted files for speed (with the document vectors still being needed). However, the size of TREC required a reformulation, since the document vectors and inverted files could not be kept in memory.

The coefficient determination is valid only if the relevance judgements used are representative of the entire set of relevance judgements. In this first TREC, the initial judgements are fragmentary and not very representative; it is impossible to tell how much this affects things.

A.3 Single term automatic ad-hoc run

The single term ad-hoc run used 5 factors (described in section 2.1):

* constant
* tf * logidf * imaxtf
* tf * imaxtf
* logidf
* lognumterms * imaxtf

Determining the coefficients for the factors in the single term run took about 1.7 hours (steps 2 + 3 above). Construction of the inverted file (steps 4 + 5) containing the factor-weighted terms took about 6.4 hours from scratch. Note that this is only about 1.0 hours longer than construction of a normal (for SMART) tf.idf weighted inverted file. Query indexing and weighting used the normal SMART procedures and took about 1.5 seconds. The fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query, with no distinction made between fields. Retrieval plus ranking took 383 seconds.

A.4 Phrase automatic ad-hoc run

The phrases being used were two-term SMART adjacency phrases: adjacent non-stopwords, term components stemmed, that occurred at least 25 times in the D1 document set. The term components were put into alphabetical order, so the text phrases "information retrieval" and "retrieving information" both mapped to the same phrase concept. The phrases were treated as a separate ctype within an indexed vector, and had their own dictionary and inverted file separate from those of the single terms.

Determination of phrases took 5.8 hours, finding 4,700,000 phrases occurring in D1 at least once. Of those phrases, 158,000 occurred at least 25 times. These phrases were then put into a dictionary and used as a controlled vocabulary for phrases when doing the indexing of D1 (step 1) and D1 ∪ D2 (step 4). The single term indexing remained exactly as it was in the single term run (terms occurring in phrases were not removed from the vector). The phrase ad-hoc run used 8 factors (described in section 2.2):

* constant
* is_single * tf * logidf * imaxtf
* is_single * tf * imaxtf
* is_single * logidf
* lognumterms * imaxtf
* is_phrase * tf * logidf * imaxtf
* is_phrase * tf * imaxtf
* is_phrase * logidf

Determining the coefficients for the factors in the phrase run took about 2.4 hours (steps 2 + 3 above). Construction of the inverted file (steps 4 + 5) containing the factor-weighted terms took about 12.6 hours from scratch. Note that this is only about 1.7 hours longer than construction of a normal (for SMART) tf.idf weighted inverted file with phrases. Query indexing and weighting used the normal SMART procedures and took about 2.7 seconds. The fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query, with no distinction made between fields. Retrieval plus ranking took 374 CPU seconds. (This was less than the single term run, but for no apparent reason. Perhaps the machine was less loaded.)
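The phrase construction described at the beginning of this subsection might be sketched as follows (stemmer and stopword list assumed to be given; this is not the actual SMART adjacency phrase code):

    from collections import Counter

    def collect_phrases(documents, stopwords, stem, min_freq=25):
        """Derive the controlled vocabulary of two-term adjacency phrases: adjacent
        non-stopwords, components stemmed and put into alphabetical order, kept only if
        they occur at least min_freq times in the document set."""
        counts = Counter()
        for tokens in documents:                  # each document given as a list of tokens
            # adjacency is taken here after stopword removal; whether a stopword should
            # instead break adjacency is an implementation detail not specified in the text
            words = [stem(t) for t in tokens if t not in stopwords]
            for first, second in zip(words, words[1:]):
                counts[tuple(sorted((first, second)))] += 1
        return {phrase for phrase, n in counts.items() if n >= min_freq}

With the components sorted alphabetically, "information retrieval" and "retrieving information" indeed map to the same phrase concept, as described above.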
A.5 Automatic routing run

There was one automatic routing run done. It was totally unconnected to the factor weighting approach described above. Basically, it was very easy to implement and run, so we decided we might as well submit it. The actual weighting function had to be programmed, but even so less than 3 person-days in total was spent on routing.

The routing experiment was treated almost exactly like a normal relevance feedback experimental run. The overall procedure was:

1. Index query set Q1 and document set D1 with tf.idf weights.
2. For each query q ∈ Q1:
   2.1 For each term t ∈ q^T (set of query terms):
       2.1.1 Reweight term t using the RPI relevance weighting formula and the (fragmentary) relevance information supplied.
3. Index document set D2 with tf.idf weights. Note that the collection frequency information used in the idf weight was derived from occurrences in D1 only (in actual routing the collection frequencies within D2 would not be known).
4. Run the reweighted queries of Q1 (step 2) against the inverted file (step 3), returning the top 200 documents for each query.

This approach differs from true routing in that:

A. All documents of D2 were indexed at once instead of indexing them individually and comparing them against each query sequentially. Thus the indexing/retrieval times obtained for the above algorithm are pretty meaningless.

B. True routing is really a binary decision, most often implemented as a similarity threshold, and retrieving the top 200 documents (in ranked order) would not normally be done. However, this difference from true routing is required for evaluation purposes, and was thus required for TREC.

Note that the approach was completely automatic, with the queries and documents treated as flat text (no structure). Differing from the ad-hoc runs above, the queries were indexed including the words from all topic sections. It is unknown what effect the fragmentary relevance information had on the query reformulation. The strength of the effect will depend on whether the top similarity documents according to the tf.idf weights used had been judged and included in the fragmentary judgements.

Step 1 took 3.0 hours, step 2 about 1304 seconds, step 3 about 1.9 hours, and step 4 about 312 seconds.

References

Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems 9(3), pages 223-248.

Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1), pages 55-72.

Fuhr, N. (1992). Integration of Probabilistic Fact and Text Retrieval. In: Belkin, N.; Ingwersen, P.; Pejtersen, M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 211-222. ACM, New York.

Maron, M.; Kuhns, J. (1960). On Relevance, Probabilistic Indexing, and Information Retrieval. Journal of the ACM 7, pages 216-244.

Pfeifer, U. (1990). Development of Log-Linear and Linear-Iterative Indexing Functions (in German). Diploma thesis, TH Darmstadt, FB Informatik, Datenverwaltungssysteme II.

Pfeifer, U. (1991). Entwicklung linear iterativer und logistischer Indexierungsfunktionen. In: Fuhr, N. (ed.): Information Retrieval, pages 23-37. Springer, Berlin et al.

Robertson, S. (1977). The Probability Ranking Principle in IR. Journal of Documentation 33, pages 294-304.

Salton, G.; Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5), pages 513-523.

Wong, S.; Yao, Y. (1989). A Probability Distribution Model for Information Retrieval. Information Processing and Management 25(1), pages 39-53.