Full Text Retrieval based on Probabilistic Equations with Coefficients fitted by Logistic Regression

Wm. S. Cooper, Aitao Chen, Fredric C. Gey
S.L.I.S., University of California, Berkeley, CA 94720

ABSTRACT

The experiments described here are part of a research program whose objective is to develop a full-text retrieval methodology that is statistically sound and powerful, yet reasonably simple. The methodology is based on the use of a probabilistic model whose parameters are fitted empirically to a learning set of relevance judgements by logistic regression. The method was applied to the TIPSTER data with optimally relativized frequencies of occurrence of match stems as the regression variables. In a routing retrieval experiment, these were supplemented by other variables corresponding to sums of logodds associated with particular match stems.

Introduction

The full-text retrieval design problem is largely a problem in the combination of statistical evidence. With this as its premise, the Berkeley group has concentrated on the challenge of finding a statistical methodology for combining retrieval clues in as powerful a way as possible, consistent with reasonable analytic and computational simplicity. Thus our research focus has been on the general logic of how to combine clues, with no attempt made at this stage to exploit as many clues as possible. We feel that if a straightforward statistical methodology can be found that extracts a maximum of retrieval power from a few good clues, and the methodology is clearly hospitable to the introduction of further clues in future, progress will have been made.

We join Fuhr and Buckley (1991, 1992) in thinking that an especially promising path to such a methodology is to combine a probabilistic retrieval model with the techniques of statistical regression. Under this approach a probabilistic model is used to deduce the general form that the document-ranking equation should take, after which regression analysis is applied to obtain empirically based values for the constants that appear in the equation. In this way the probabilistic theory is made to constrain the universe of logically possible retrieval rules that could be chosen, and the regression techniques complete the choice by optimizing the model's fit to the learning data.

The probabilistic model adopted by the Berkeley group is derived from a statistical assumption of `linked dependence'. This assumption is weaker than the historic independence assumptions usually discussed. In its simplest form the Berkeley model also differs from most traditional models in that it is of `Type 0' -- meaning that the analysis is carried out w.r.t. sets of query-document pairs rather than w.r.t. particular queries or particular documents. (For a fuller explanation of this typology see Robertson, Maron & Cooper 1982.) But when relevance judgement data specialized to the currently submitted search query is available, say in the form of relevance feedback or routing history data, the model is flexible enough to accommodate it (resulting in `Type 2' retrieval).

Logistic regression (see e.g. Hosmer & Lemeshow 1989) is the most appropriate type of regression for this kind of IR prediction. Although standard multiple regression analysis has been used successfully by others in comparable circumstances (Fuhr & Buckley op. cit.), we believe logistic regression to be logically more appropriate for reasons set forth elsewhere (Cooper, Dabney & Gey 1992).
Logistic regression, which accepts binary training data and yields probability estimates in the form of logodds values, goes hand in glove with a probabilistic IR model that is to be fitted to binary relevance judgement data and whose predictor variables are themselves logodds.

The Fundamental Equation

As the starting point of our probabilistic retrieval model we adopt (with reservations to be explained presently) a statistical simplifying assumption called the `Assumption of Linked Dependence' (Cooper 1991). Intuitively, it states that in the universe of all query-document pairs, the degree of statistical dependency that exists between properties of pairs in the subset of all relevance-related pairs is linked in a certain way with the degree of dependency that exists between the same pair-properties in the subset of all nonrelevance-related pairs. Under at least one crude measure of degree of dependence (specifically, the ratio of the joint probability to the product of individual probabilities), the linkage in question is one of identity. That is, the claim made by the assumption is that the degree of the dependency is the same in both sets. For N pair-properties A_1, ..., A_N, the mathematical statement of the assumption is as follows.

ASSUMPTION OF LINKED DEPENDENCE:

\frac{P(A_1,\ldots,A_N \mid R)}{P(A_1,\ldots,A_N \mid \bar R)} = \frac{P(A_1 \mid R)}{P(A_1 \mid \bar R)} \times \cdots \times \frac{P(A_N \mid R)}{P(A_N \mid \bar R)}

If one thinks of a query and document drawn at random, R is the event that the document is relevant to the query, while each clue A_i is some respect in which the document is similar or dissimilar to, or somehow related to, the query. For most clue-types likely to be of interest this simplifying assumption is at best only approximately true. Though not as implausible as the strong conditional independence assumptions traditionally discussed in probabilistic IR, still it should be regarded with skepticism. In practice, when the number N of clues to be combined becomes large the assumption can become highly unrealistic, with a distorting tendency that often causes the probability of relevance estimates produced by the resulting retrieval model to be grossly overstated for documents near the top of the output ranking. But it will be expedient here to adopt the assumption anyway, on the understanding that later on we shall have to find a way to curb its probability-inflating tendencies.

From the Assumption of Linked Dependence it is possible to derive the basic equation underlying the probabilistic model:

Equation (1):

\log O(R \mid A_1,\ldots,A_N) = \log O(R) + \sum_{i=1}^{N} \bigl[\log O(R \mid A_i) - \log O(R)\bigr]

where for any events E and E' the odds O(E | E') is by definition P(E | E') / P(Ē | E'). Using this equation, the evidence of N separate clues can be combined as shown on the right side to yield the logodds of relevance based on all clues, shown on the left. Query-document pairs can be ranked by this logodds estimate, and a ranking of documents for a particular query by logodds is of course a probability ranking in the IR sense. For further discussion of the linked dependence assumption and a formal derivation of Eq. (1) from it, see Cooper (1991) and Cooper, Dabney & Gey (1992). An empirical investigation of independence and dependence assumptions in IR is reported by Gey (1993).
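To make the combination rule of Eq. (1) concrete, the following minimal Python sketch (our illustration, not part of the original system; the function names and the toy numbers are invented) adds the per-clue logodds increments to the prior logodds and converts the result back to a probability of relevance.

import math

def combined_log_odds(prior_log_odds, clue_log_odds):
    # Eq. (1): log O(R|A1,...,AN) = log O(R) + sum_i [log O(R|Ai) - log O(R)]
    return prior_log_odds + sum(lo - prior_log_odds for lo in clue_log_odds)

def probability_from_log_odds(log_odds):
    # Convert logodds back to a probability: p = O / (1 + O).
    odds = math.exp(log_odds)
    return odds / (1.0 + odds)

# Toy example: prior odds of relevance 1:99, and three clues each of which,
# taken alone, raises the odds of relevance.
prior = math.log(1 / 99)
clues = [math.log(1 / 9), math.log(1 / 19), math.log(1 / 4)]
estimate = combined_log_odds(prior, clues)
print(estimate, probability_from_log_odds(estimate))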
Application to Term Properties

We shall consider as a first application of the modeling equation the problem of how to exploit the properties of match terms. By a `match' term we mean a stem (or word, phrase, etc.) that occurs in the query and also, in identical or related form, in the document to which the query is being compared. The retrieval properties of a match term could include its frequency characteristics in the query, in the document, or in the collection as a whole, its grammatical characteristics, the type or degree of `match' involved, etc. If there are M match terms that relate a certain query to a certain document, Eq. (1) becomes applicable with N set to M. Each of the properties A_1, ..., A_M then represents a composite clue or set of properties concerning one of the query-document pair's match terms.

Suppose a `learning set' of human judgements of relevance or nonrelevance is available for a sample of representative query-document pairs. (However, for the time being we assume no such learning data is available for the very queries on which the retrieval is to be performed.) Logistic regression can be applied to the data to develop an equation capable of using the match term clues to estimate, for any query-document pair, values for each of the expressions on the right side of Eq. (1). Eq. (1) then yields a probability estimate for the pair.

TABLE I: Equations to Approximate the Components of Eq. (1)

\log O(R) \approx b_0 + b_1 M
\log O(R \mid A_1) - \log O(R) \approx [a_0 + a_1 X_{1,1} + \cdots + a_K X_{1,K} + a_{K+1} M] - [b_0 + b_1 M]
\vdots
\log O(R \mid A_M) - \log O(R) \approx [a_0 + a_1 X_{M,1} + \cdots + a_K X_{M,K} + a_{K+1} M] - [b_0 + b_1 M]
------------------------------------------------------------------------
\log O(R \mid A_1,\ldots,A_M) \approx a_0 M + a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + a_{K+1} M^2 - (M-1)[b_0 + b_1 M]

What must be done, then, is to find approximating expressions for the various components on the right side of Eq. (1). Table I (above) shows the interrelationships among the quantities involved. In this array, the material to the left of the column of approximation signs just restates Eq. (1), for Eq. (1) asserts that the expressions above the horizontal line sum to the expression below the line. On the right of each approximation sign is a simple formula that might be used to approximate the expression on the left. (More complex formulae could of course also be considered.) Each right-side formula involves a linear combination of K variables X_{m,1}, ..., X_{m,K} corresponding to the K retrieval properties (frequency values, etc.) used to characterize the term in question. The idea is to apply logistic regression to find the values for the coefficients (the a's and b's) on the right side that produce the best regression fit to the learning sample. Once these constants have been determined, retrieval can be carried out by summing the right sides to obtain the desired estimate of log O(R | A_1, ..., A_M) for any given query-document pair.

It may not be immediately apparent why terms in M have been included in the approximating formulas on the right. In Eq. (1), the number N of available retrieval clues is actually part of the conditionalizing information, a fact that could have been stressed by writing O(R | N) in place of O(R) and O(R | A_i, N) in place of O(R | A_i). So the approximating formula for log O(R) is really an approximation for log O(R | M) and must involve a term in M. (Simple functions of M such as \sqrt{M} or log M might also be considered.) Similar remarks apply to the formulas for approximating log O(R | A_m).
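The row-and-column structure of Table I can be sketched in a few lines of Python. This is our illustration only; it assumes coefficients a = [a0, a1, ..., aK, aK+1] and b = [b0, b1] have already been fitted, and it simply sums the rows of the table to obtain the estimate for one query-document pair.

def table1_log_odds(term_features, a, b):
    # term_features: one length-K vector of clue values per match term.
    # a: per-term coefficients [a0, a1, ..., aK, aK+1];  b: [b0, b1].
    M = len(term_features)
    prior_row = b[0] + b[1] * M                  # approximates log O(R)
    log_odds = prior_row
    for x in term_features:                      # one row of Table I per match term
        term_row = a[0] + sum(ai * xi for ai, xi in zip(a[1:-1], x)) + a[-1] * M
        log_odds += term_row - prior_row         # log O(R|Am) - log O(R)
    return log_odds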
There are two approaches to the problem of finding coefficients to use in the approximating expressions. The first is what might be called the `triples-then-pairs' approach. It starts out with a logistic regression analysis performed directly on a sample of query-document-term triples constructed from the learning set. In the sample, each triple is accompanied by the clue-values associated with the match term in the query and document, and by the human relevance judgement for the query and document. The clue-values are used as the values of the independent variables in the regression, and the relevance judgements as the values of the dependent variable. After the desired coefficients have been obtained, the resulting regression equation can be applied to evaluate the left-side expressions, which can in turn be summed down the M terms to obtain a preliminary estimate of log O(R | A_1, ..., A_M).

A second regression analysis is then performed on a sample of query-document pairs, using the preliminary logodds prediction from the first (triples) analysis as an independent variable. This second analysis serves the purpose of correcting for distortions introduced by the Assumption of Linked Dependence. It also allows retrieval properties not associated with particular match terms, such as document length, age, etc., to be introduced. The triples-then-pairs approach is the one used by the Berkeley group in its TREC-1 research (Cooper, Gey & Chen 1992). The theory behind it is presented in more detail in Cooper, Dabney & Gey (1992).

The second way of determining the coefficients -- the one used in TREC-2 -- is the `pairs-only' approach. It requires only one regression analysis, performed on a sample of query-document pairs. It is based on the trivial observation that on the right side of the above array, instead of adding across rows and then down the resulting column of sums, one can equivalently add down columns and then across the resulting row of sums. Under either procedure the grand total value for log O(R | A_1, ..., A_M) will be the same. Summing down the columns and then across the totals gives the expression shown in the bottom line of the array. It simplifies to

\log O(R \mid A_1,\ldots,A_M) \approx b_0 + a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + (a_0 - b_0 + b_1) M + (a_{K+1} - b_1) M^2

Since there is no need to keep the a coefficients segregated from the b coefficients to get a predictive equation, this suggests the adoption of a regression equation of the form

\log O(R \mid A_1,\ldots,A_M) \approx C_0 + C_1 \sum_{m=1}^{M} X_{m,1} + \cdots + C_K \sum_{m=1}^{M} X_{m,K} + C_{K+1} M + C_{K+2} M^2

The coefficients C_0, ..., C_{K+2} may be found by a logistic regression on a sample of query-document pairs constructed from the learning sample. In the sample each pair is accompanied by its K different X_{m,k}-values, each already summed over all match terms for the pair, the values of M and M^2, and (to serve as the dependent variable in the regression) the human relevance judgement for the pair.
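A hedged sketch of the pairs-only fit follows. The feature construction mirrors the regression equation just given (column sums plus M and M^2); scikit-learn's LogisticRegression is used purely as a convenient stand-in for the statistical packages actually employed, and the helper names are ours.

import numpy as np
from sklearn.linear_model import LogisticRegression

def pair_features(term_features):
    # Collapse one pair's per-term clue values (an M x K array) into the
    # summed variables of the pairs-only design, plus M and M**2.
    X = np.asarray(term_features, dtype=float)
    M = X.shape[0]
    return np.concatenate([X.sum(axis=0), [M, M ** 2]])

def fit_pairs_only(pairs, judgements):
    # pairs: list of per-term feature arrays, one per judged query-document pair.
    # judgements: binary relevance judgements (1 = relevant), the dependent variable.
    design = np.vstack([pair_features(p) for p in pairs])
    model = LogisticRegression().fit(design, judgements)
    return model.intercept_[0], model.coef_[0]   # C0, and C1 ... CK+2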
But if only one level of regression analysis is to be performed, where is the correction for the Assumption of Linked Dependence to take place? That assumption causes mischief because it creates a tendency for the predicted logodds of relevance to increase roughly linearly with the number of match terms, whereas the true increase is less than linear. This can be corrected by modifying the variables in such a way that their values rise less rapidly than the number of match terms as the number of match terms increases. The variables can, for instance, be multiplied by some function f(M) that drops gently with increasing M, say 1/\sqrt{M+1} or 1/(1 + \log M). The exact form of the function can be decided during the course of the regression analysis. With such a damping factor included, the foregoing regression equation becomes

Equation (2):

\log O(R \mid A_1,\ldots,A_M) \approx C_0 + C_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + C_K f(M) \sum_{m=1}^{M} X_{m,K} + C_{K+1} M + C_{K+2} M^2

In our experiments, this simple modification was found to improve the regression fit and the precision/recall performance. It would appear therefore to be a worthwhile refinement of the basic model. Note, however, that this adjustment only removes a general bias. It does nothing to exploit the possibility of measuring dependencies between particular stems to improve retrieval effectiveness. To exploit individual dependencies would be desirable in principle, but would require a substantial elaboration of the model for what might well turn out to be an insignificant improvement in effectiveness (for discussion see Cooper 1991).

Optimally Relativized Frequencies

The philosophy of the project called for the use of a few well-chosen retrieval clues. The most obvious clues to be exploited in connection with match terms are their frequencies of occurrence in the query and document. What is not so obvious is the exact form the frequencies should take. For instance, should they be absolute or relative frequencies, or something in between?

The IR literature mentions two ways in which frequencies might be relativized for use in retrieval. The first is to divide the absolute frequency of occurrence of the term in the query or document by the length of the query or document, or some parameter closely associated with length. The second is to divide the relative frequency so obtained by the relative frequency of occurrence of the term in the entire collection considered as one long running text. Both kinds of relativization seem potentially beneficial, but the question remains whether these relativizations, if they are indeed helpful, should be carried out in full strength, or whether some sort of blend of absolute and relative frequencies might serve better.

To answer this question, regression techniques were used in a side investigation to discover the optimal extent to which each of the two kinds of relativization might best be introduced. To investigate the first kind of relativization -- relativization with respect to query or document length -- a semi-relativized stem frequency variable of the form

\frac{\text{absolute frequency}}{\text{document length} + C}

was adopted. If the constant C is chosen to be zero, one has full relativization, whereas a value for C much greater than the document length will cause the variable to behave in the regression analysis as though it were an absolute frequency. Several logistic regressions were run to discover by trial and error the value of C that produced the best regression fit to the TIPSTER learning data and the highest precision and recall. It was found that a value of around C = 35 for queries, and C = 80 for documents, optimized performance.
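The semi-relativized variable and the trial-and-error search over C can be sketched as follows; fit_and_score is a hypothetical stand-in for one complete regression run returning a goodness-of-fit figure, and is not something specified in the paper.

def semi_relativized(freq, length, C):
    # C = 0 gives a fully relative frequency; a C much larger than the
    # length makes the variable behave like an absolute frequency.
    return freq / (length + C)

def best_constant(candidates, fit_and_score):
    # e.g. best_constant([0, 10, 20, 35, 50, 80, 150], fit_and_score)
    return max(candidates, key=fit_and_score)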
For the relativization with respect to the global relative frequency in the collection, the relativizing effect of the denominator was weakened by a slightly different method -- by raising it to a power less than 1.0. For the document frequencies, for instance, a variable of the form

\frac{\text{absolute frequency in document} / (\text{document length} + 80)}{[\text{relative frequency in collection}]^{D}}

with D < 1 was in effect used. Actually, it was the logarithm of this expression that was ultimately adopted, which allowed the variable to be broken up into a difference of two logarithmic expressions. The optimal value of the power D was therefore obtainable in a single regression as the coefficient of the logged denominator.

Thus the variables ultimately adopted consisted in essence of sums over two optimally relativized frequencies (`ORF's) -- one for match stem frequency in the query and one for match stem frequency in the document. Because of the logging this breaks up mathematically into three variables. A final logistic regression using sums over these variables as indicated in Eq. (2) produced the ranking equation

Equation (3):

\log O(R \mid A_1,\ldots,A_M) \approx -3.51 + \frac{1}{\sqrt{M+1}}\,\Phi + 0.0929\,M

where \Phi is the expression

37.4 \sum_{m=1}^{M} X_{m,1} + 0.330 \sum_{m=1}^{M} X_{m,2} - 0.1937 \sum_{m=1}^{M} X_{m,3}

Here

X_{m,1} = number of times the m'th stem occurs in the query, divided by (total number of all stem occurrences in the query + 35);

X_{m,2} = number of times the m'th stem occurs in the document, divided by (total number of all stem occurrences in the document + 80), quotient logged;

X_{m,3} = number of times the m'th stem occurs in the collection, divided by the total number of all stem occurrences in the collection, quotient logged;

M = number of distinct stems common to both query and document.

Although Eq. (2) calls for an M^2 term as well, such a term was found not to make a statistically significant contribution and so was eliminated.

Eq. (3) provided the ranking rule used in the ad hoc run (labeled `Brkly3') and the first routing run (`Brkly4') submitted for TREC-2. The equation is notable for the sparsity of the information it uses. Essentially it exploits only two ORF values for each match stem, one for the stem's frequency in the query and the other for its frequency in the document. Other variables were tried, including the inverse document frequency (both logged and unlogged), a variable consisting of a count of all two-stem phrase matches, and several variables for measuring the tendency of the match stems to bunch together in the document. All of these exhibited predictive power when used in isolation, but were discarded because in the presence of the ORF's none produced any detectable improvement in the regression fit or the precision/recall performance. Some attempts at query expansion using the WordNet thesaurus also failed to produce noticeable improvement, even when care was taken to create separate variables with separate coefficients for the synonym-match counts as opposed to the exact-match counts.

The quality of a retrieval output ranking matters most near the top of the ranking where it is likely to affect the most users, and matters hardly at all far down the ranking where hardly any users are apt to search. Because of this it is desirable to adopt a sampling methodology that produces an especially good regression fit to the sample data for documents with relatively high predicted probabilities of relevance. In an attempt to achieve this emphasis, the regression analysis was carried out in two steps. A logistic regression was first performed on a stratified sample of about one-third of the then-available TIPSTER relevance judgements to obtain a preliminary ranking rule. This preliminary equation was then applied as a screening rule to the entire available body of judged pairs, and for each query all but the highest-ranked 500 documents were discarded. The resulting set of 50,000 judged query-document pairs (100 topics x 500 top-ranked docs per topic) served as the learning sample data for the final regression equation displayed above as Eq. (3).
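For concreteness, the ad hoc ranking rule of Eq. (3) might be coded roughly as below. This is our own sketch rather than the authors' implementation; it assumes the stem counts and lengths are available as plain dictionaries and numbers, and it uses the 1/sqrt(M+1) damping factor shown in Eq. (3).

import math

def eq3_log_odds(query_tf, doc_tf, coll_tf, query_len, doc_len, coll_len):
    # query_tf, doc_tf, coll_tf: stem -> frequency in the query, document, collection.
    # *_len: total number of stem occurrences in the query, document, collection.
    stems = [s for s in query_tf if s in doc_tf]          # match stems
    M = len(stems)
    if M == 0:
        return float("-inf")
    x1 = sum(query_tf[s] / (query_len + 35.0) for s in stems)
    x2 = sum(math.log(doc_tf[s] / (doc_len + 80.0)) for s in stems)
    x3 = sum(math.log(coll_tf[s] / coll_len) for s in stems)
    phi = 37.4 * x1 + 0.330 * x2 - 0.1937 * x3
    return -3.51 + phi / math.sqrt(M + 1) + 0.0929 * M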
Application to Query-Specific Learning Data

To this point it has been assumed that relevance judgement data is available for a sample of queries typical of the future queries for which retrieval is to be performed, but not for those very queries themselves. We consider next the contrasting situation in which the learning data that is available is a set of past relevance judgements for the very queries for which new documents are to be retrieved. Such data is often available for a routing or SDI request for which relevance feedback has been collected in the past, or for a retrospective feedback search that has already been under way long enough for some feedback to have been obtained.

For a query so equipped with its very own learning sample, it is possible to gather separate data about each individual term in the query. Such data reflects the term's special retrieval characteristics in the context of that query. For instance, for a query term T_i we may count the number n(T_i, R) of documents in the sample that contain the term and are also relevant to the query, the number n(T_i, R̄) of documents that contain the term but are not relevant to the query, and so forth. The odds that a future document will turn out to be relevant to the query, given that it contains the term, can be estimated crudely as

O(R \mid T_i) \approx \frac{n(T_i, R)}{n(T_i, \bar R)}

To refine this estimate, a Bayesian trick adapted from Good (1965) is useful. One simply adds the two figures used in expressing the prior odds of relevance (i.e. n(R) and n(R̄)) into the numerator and denominator respectively, with an arbitrary weight \rho, as follows:

O(R \mid T_i) \approx \frac{n(T_i, R) + \rho\, n(R)}{n(T_i, \bar R) + \rho\, n(\bar R)}

The smaller the value assigned to \rho, the more influence the sample evidence will have in moving the estimate away from the prior odds. The adjustment causes large samples to have a greater effect than small, as seems natural. It also forestalls certain absurdities implicit in the unadjusted formula, for instance the possibility of calculating infinite odds estimates.

Suppose now that a query contains Q distinct terms T_1, ..., T_M, T_{M+1}, ..., T_Q, numbered in such a way that the first M of them are match terms that occur in the document against which the query is to be compared. The fundamental equation can be applied by taking N to be Q, and interpreting the retrieval clues A_1, ..., A_Q as the presence or absence of particular query terms in the document. The pertinent data can be organized as shown in Table II (below), where n(T̄_m, R) denotes the number of relevant documents in the sample that do not contain the term T_m, and similarly for n(T̄_m, R̄).

TABLE II: Reinterpretation of the Components of Eq. (1) for Routing Retrieval

\log O(R) \approx \log \frac{n(R)}{n(\bar R)}
\log O(R \mid A_1) - \log O(R) \approx \log \frac{n(T_1, R) + \rho\, n(R)}{n(T_1, \bar R) + \rho\, n(\bar R)} - \log \frac{n(R)}{n(\bar R)}
\vdots
\log O(R \mid A_M) - \log O(R) \approx \log \frac{n(T_M, R) + \rho\, n(R)}{n(T_M, \bar R) + \rho\, n(\bar R)} - \log \frac{n(R)}{n(\bar R)}
\log O(R \mid A_{M+1}) - \log O(R) \approx \log \frac{n(\bar T_{M+1}, R) + \rho\, n(R)}{n(\bar T_{M+1}, \bar R) + \rho\, n(\bar R)} - \log \frac{n(R)}{n(\bar R)}
\vdots
\log O(R \mid A_Q) - \log O(R) \approx \log \frac{n(\bar T_Q, R) + \rho\, n(R)}{n(\bar T_Q, \bar R) + \rho\, n(\bar R)} - \log \frac{n(R)}{n(\bar R)}
------------------------------------------------------------------------
\log O(R \mid A_1,\ldots,A_Q) \approx \sum_{m=1}^{M} \log \frac{n(T_m, R) + \rho\, n(R)}{n(T_m, \bar R) + \rho\, n(\bar R)} + \sum_{m=M+1}^{Q} \log \frac{n(\bar T_m, R) + \rho\, n(R)}{n(\bar T_m, \bar R) + \rho\, n(\bar R)} - (Q-1) \log \frac{n(R)}{n(\bar R)}

Thus we are led to postulate, for query-specific learning data, a retrieval equation of the form

Equation (4):

\log O(R \mid A_1,\ldots,A_Q) \approx C_0 + C_1 f(Q)\,[\Phi_1 + \Phi_2 - \Phi_3]

where

\Phi_1 = \sum_{m=1}^{M} \log \frac{n(T_m, R) + \rho\, n(R)}{n(T_m, \bar R) + \rho\, n(\bar R)}

\Phi_2 = \sum_{m=M+1}^{Q} \log \frac{n(\bar T_m, R) + \rho\, n(R)}{n(\bar T_m, \bar R) + \rho\, n(\bar R)}

\Phi_3 = (Q-1) \log \frac{n(R)}{n(\bar R)}

and where f(Q) is some restraining function intended as before to subdue the inflationary effect of the Assumption of Linked Dependence. The coefficients C_0, C_1 are found by a logistic regression over a sample of query-document pairs involving many queries.
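A sketch of Eq. (4) under the assumptions just stated: the smoothing weight rho and the restraining function f are passed in explicitly, the counts for the M terms present in the document are (n(T,R), n(T,R̄)) pairs, and the counts for the Q - M absent terms are the corresponding numbers of sample documents not containing each term. The helper names are ours.

import math

def smoothed_log_odds(n_rel_with, n_nonrel_with, n_rel, n_nonrel, rho):
    # Logged, Bayesian-adjusted odds estimate for a single term clue.
    return math.log((n_rel_with + rho * n_rel) / (n_nonrel_with + rho * n_nonrel))

def eq4_log_odds(present, absent, n_rel, n_nonrel, c0, c1, rho, f):
    # present: [(n(T,R), n(T,notR)), ...] for the M query terms found in the document.
    # absent:  the same counts taken over documents NOT containing each of the
    #          remaining Q - M query terms.
    # f: restraining function of Q (use lambda q: 1.0 for no damping).
    Q = len(present) + len(absent)
    phi1 = sum(smoothed_log_odds(r, nr, n_rel, n_nonrel, rho) for r, nr in present)
    phi2 = sum(smoothed_log_odds(r, nr, n_rel, n_nonrel, rho) for r, nr in absent)
    phi3 = (Q - 1) * math.log(n_rel / n_nonrel)
    return c0 + c1 * f(Q) * (phi1 + phi2 - phi3)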
Combining Query-Specific and Query-Nonspecific Data

If a query-specific set of relevance judgements is available for the query to be processed, a larger nonspecific set may well be available at the same time. If so, the theory developed in the foregoing section can be applied in conjunction with the earlier theory to capture the benefits of both kinds of learning sets. The retrieval equation will then contain variables not only of the kind occurring in Eq. (4) but also of the Eq. (2) kind. It is convenient to formulate this equation in such a way that it contains as one of its terms the entire ranking expression developed earlier for the nonspecific learning data. For the TIPSTER data the combined equation takes the form

Equation (5):

\log O(R \mid A_1,\ldots,A_M, A'_1,\ldots,A'_Q) \approx 0.688\,\Phi_4 + 0.344\,[\Phi_1 + \Phi_2 - \Phi_3] + 0.0623

where \Phi_4 is the entire right side of Eq. (3) and \Phi_1, \Phi_2, \Phi_3 are as defined in Eq. (4). This form for the equation is computationally convenient if Eq. (3) is to be used as a preliminary screening rule to eliminate unpromising documents, with Eq. (5) in its entirety applied only to rank those that survive the screening.

Eq. (5) was used to produce the TREC-2 routing run `Brkly5'. Its coefficients were determined by a logistic regression constrained in such a way as to make the query-specific variables contribute about twice as heavily as the nonspecific, when contribution is measured by standardized coefficient size. This emphasis was largely arbitrary; finding the optimal balance between the query-specific and the general contributions remains a topic for future research. A value of 20 divided by the number of documents in the query's learning set was used for \rho. This choice too was arbitrary, and it would be interesting to try to optimize it experimentally for some typical collection (trying out, perhaps, numbers larger than 20, divided by the total number of documents in the query's learning set). No restraining function f(Q) was used in the final form of Eq. (5) because none that were tried out produced any discernible improvement in fit or retrieval effectiveness in this context.

Calibration of Probability Estimates

The most important role of the relevance probability estimates calculated by a probabilistic IR system is to rank the output documents in as effective a search order as possible. For this ranking function it is only the relative sizes of the probability estimates that matter, not their absolute magnitudes. However, it is also desirable that the absolute sizes of these estimates be at least somewhat realistic. If they are, they can be displayed to provide guidance to the users in their decisions as to when to stop searching down the ranking. This capability is a potentially important side benefit of the probabilistic approach.

One way of testing the realism of the probability estimates is to see whether they are `well-calibrated'. Good calibration means that when a large number of probability estimates whose magnitudes happen to fall in a certain small range are examined, the proportion of the trials in question with positive outcomes also falls in or close to that range.
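The calibration check described in the next paragraphs can be sketched in a few lines; this is an illustrative reconstruction of the blocking procedure, not the authors' code.

def calibration_table(estimates, judgements, block_size=1000):
    # Sort pairs by estimated probability of relevance (highest first), then
    # compare, within successive blocks, the mean estimate with the proportion
    # of pairs actually judged relevant.
    ranked = sorted(zip(estimates, judgements), key=lambda pair: pair[0], reverse=True)
    rows = []
    for start in range(0, len(ranked), block_size):
        block = ranked[start:start + block_size]
        mean_estimate = sum(p for p, _ in block) / len(block)
        proportion_relevant = sum(rel for _, rel in block) / len(block)
        rows.append((start + 1, start + len(block), mean_estimate, proportion_relevant))
    return rows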
To test the calibration of the probability predictions produced by the Berkeley approach, the 50,000 query-document pairs in the ad hoc entry Brkly3, along with their accompanying relevance probability estimates, were sorted in descending order of magnitude of estimate. Pairs for which human judgements of relevance-relatedness were unavailable were discarded; this left 22,352 sorted pairs for which both the system's probability estimates of relevance and the `correct' binary judgements of relevance were available. This shorter list was divided into blocks of 1,000 pairs each -- the thousand pairs with the highest probability estimates, the thousand with the next highest, and so forth. Within each block the `actual' probability was estimated as the proportion of the 1,000 pairs that had been judged to be relevance-related by humans. This was compared against the mean of all the system probability estimates in the block. For a well-calibrated system these figures should be approximately equal. The results of the comparison are displayed in Table III.

TABLE III: Calibration of Ad Hoc Relevance-Probability Estimates

Query-doc Pair Ranks     Mean System Probability Estimate     Proportion Actually Relevant
1 to 1,000               0.66                                  0.60
1,001 to 2,000           0.63                                  0.47
2,001 to 3,000           0.61                                  0.44
3,001 to 4,000           0.58                                  0.41
4,001 to 5,000           0.55                                  0.38
5,001 to 6,000           0.53                                  0.34
6,001 to 7,000           0.50                                  0.36
7,001 to 8,000           0.48                                  0.36
8,001 to 9,000           0.46                                  0.36
9,001 to 10,000          0.44                                  0.38
10,001 to 11,000         0.42                                  0.39
11,001 to 12,000         0.41                                  0.36
12,001 to 13,000         0.39                                  0.37
13,001 to 14,000         0.37                                  0.36
14,001 to 15,000         0.36                                  0.35
15,001 to 16,000         0.34                                  0.31
16,001 to 17,000         0.32                                  0.29
17,001 to 18,000         0.31                                  0.28
18,001 to 19,000         0.29                                  0.23
19,001 to 20,000         0.28                                  0.22
20,001 to 21,000         0.25                                  0.21
21,001 to 22,000         0.23                                  0.23
22,001 to 22,352         0.18                                  0.19

It can be seen that the system's probability predictions, while not wildly inaccurate, are generally somewhat higher than the actual proportions of relevant pairs. The same phenomenon of mild overestimation was observed when the runs Brkly4 and Brkly5 were tested for well-calibratedness in a similar way.

Since no systematic overestimation was observed when the calibration of the formula was originally tested against the learning data, it seems likely that the overestimation seen in the table is due mainly to the shift from learning data to test data. Naturally, predictive formulae that have been fine-tuned to a certain set of learning data will perform less well when applied to a new set of data to which they have not been fine-tuned. If this is indeed the root cause of the observed overestimation, it could perhaps be compensated for (at least to an extent sufficient for practical purposes) by the crude expedient of lowering all predicted probabilities to, say, around four fifths of their originally calculated values before displaying them to the users.

Computational Experience

The statistical program packages used in the course of the analysis included SAS, S, and BLSS. Of these, SAS provides the most complete built-in diagnostics for logistic regression. BLSS was found to be especially convenient for interactive use in a UNIX environment, and ended up being the most heavily used.

The prototype retrieval system itself was implemented as a modification of the SMART system, with SMART's vector-similarity subroutines replaced by the probabilistic computations of Eqs. (3) and (5).
For the runs Brkly3 and Brkly4, which used only Eq. (3), only minimal modifications of SMART were needed, and at run time the retrieval efficiency remained essentially the same as for the unmodified SMART system. This demonstrates that probabilistic retrieval need be no more cumbersome computationally than the vector processing alternatives. For Brkly5, which used Eq. (5), the modifications were somewhat more extensive and retrieval took about 20% longer.

Retrieval Effectiveness

The Berkeley system achieved an average precision over all documents (an `11-point average') of 32.7% for the ad hoc retrieval run Brkly3, and 29.0% and 35.4% respectively for the routing runs Brkly4 and Brkly5. The distinct improvement in effectiveness of Brkly5 over Brkly4 suggests that in routing retrieval the use of frequency information about individual query stems is worthwhile. At the `0 recall level' a precision of 84.7%, the highest recorded at the conference, was achieved in the ad hoc run. The high effectiveness of the Berkeley system for the first few retrieved documents may be explainable in terms of the practice, mentioned earlier, of redoing the regression analysis for the highest-ranked 500 documents for each query. This technique ensures an especially good regression fit for the query-document pairs that are especially likely to be relevant, thus emphasizing good performance near the top of the ranking where it is most important.

The generally high retrieval effectiveness of the Berkeley system should be interpreted in the light of the fact that the system probably uses less evidence -- that is, fewer retrieval clues -- than any of the other high-performing TREC-2 systems. In fact, the only clues used were the frequency characteristics of single stems (not even phrases were included). What this suggests is that the underlying probabilistic logic may have the capacity to exploit exceptionally fully whatever clues may be available.

Summary and Conclusions

The Berkeley design approach is based on a probabilistic model derived from the linked dependence assumption. The variables of the probability-ranking retrieval equation and their coefficients are determined by logistic regression on a judgement sample. Though the model is hospitable to the utilization of other kinds of evidence, in this particular investigation the only variables used were optimally relativized frequencies (ORF's) of match stems. The approach was found to have the following advantages:

1. Experimental Efficiency. Since the numeric coefficients in a regression equation are determined simultaneously in one computation, trial-and-error experimentation involving the evaluation of retrieval output to optimize parameters is largely avoidable.

2. Computational Simplicity. For ad hoc retrieval and routing retrieval that does not involve individual stem statistics, the computational simplicity and efficiency achieved by the model at run time are comparable to that of simple vector processing retrieval models. For routing retrieval that exploits individual stem frequencies the programming is somewhat more complicated and runs slightly slower.

3. Effective Retrieval. The level of retrieval effectiveness as measured by precision and recall is high relative to the simple clue-types used.

4. Potential for Well-Calibrated Probability Estimates. In-the-ballpark estimates of document relevance probabilities suitable for output display would appear to be within reach.
Acknowledgements

We are indebted to Ray Larson and Chris Plaunt for helpful systems advice, as well as to the several colleagues already acknowledged in our TREC-1 report. The workstations used for the experiment were supplied by the Sequoia 2000 project at the University of California, a project principally funded by the Digital Equipment Corporation. A DARPA grant supported the programming effort.

References

Cooper, W. S. (1991). Inconsistencies and Misnomers in Probabilistic IR. Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Chicago, 57-62.

Cooper, W. S.; Dabney, D.; Gey, F. (1992). Probabilistic Retrieval Based on Staged Logistic Regression. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Copenhagen, 198-210.

Cooper, W. S.; Gey, F. C.; Chen, A. (1993). Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression. The First Text REtrieval Conference (TREC-1). NIST Special Publication 500-207. Washington, 73-88.

Fuhr, N.; Buckley, C. (1993). Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models. The First Text REtrieval Conference (TREC-1). NIST Special Publication 500-207. Washington, 89-100.

Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems, 9, 223-248.

Gey, F. C. (1993). Probabilistic Dependence and Logistic Inference in Information Retrieval. Ph.D. Dissertation, University of California, Berkeley, California.

Good, I. J. (1965). The Estimation of Probabilities: An Essay on Modern Bayesian Methods. Cambridge, Mass.: M.I.T. Press.

Hosmer, D. W.; Lemeshow, S. (1989). Applied Logistic Regression. New York: Wiley.

Robertson, S. E.; Maron, M. E.; Cooper, W. S. (1982). Probability of Relevance: A Unification of Two Competing Models for Document Retrieval. Information Technology: Research and Development, 1, 1-21.