Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression

Wm. S. Cooper, Fredric C. Gey, Aitao Chen
S.L.I.S., University of California, Berkeley, CA 94720

ABSTRACT: In this experiment the TIPSTER test collections were used as a vehicle for evaluating an approach to probabilistic retrieval for full-text documents. The methodology in question, called `staged logistic regression,' involves two or more stages of logistic regression analysis of a learning sample of relevance judgements. The aim is to produce effective initial probability rankings of documents, without undue computational complexity at run time, by applying regression equations derived with the help of standard statistical software packages. In addition, the experiment explored the feasibility of using equations derived from training data for one document collection in a different document collection for which no training data happens to be available, and of calculating document relevance probabilities accurately enough so that they can be displayed as part of the output seen by the user. The regression equations were implemented as retrieval rules in an experimental prototype system obtained by modifying the SMART retrieval system.

Introduction

The Berkeley group's interest in participating in the NIST/TREC Conference was stimulated by the opportunity it offered to gain experience with a methodology called `Staged Logistic Regression.' This technique (hereinafter abbreviated `SLR') is a systematic approach to retrieval system design based on probabilistic and statistical principles. It has been under study at Berkeley as a possible means of achieving effective probabilistic retrieval, including acceptably accurate estimates of relevance probability, without undue computational complexity.

In order to test out the SLR methodology on the TIPSTER data base in as straightforward a fashion as possible, attention was restricted to the problem of how to form the initial document ranking -- that is, the output ranking first offered by the system to the user in response to the user's original query. This problem should not be confused with the subsequent task of exploiting any relevance judgements that may be obtainable from the user once he or she has started down the output ranking and is actively examining documents. The latter problem -- the matter of how to exploit intra-search `relevance feedback' -- has been the subject of some fruitful probabilistic investigations, but the question of how to create the initial ranking seems equally significant.

Objectives of the Experiment

A principal goal of the experiment was to investigate the retrieval effectiveness of the SLR methodology in large collections. It was hoped especially that something could be learned of the soundness and retrieval power inherent in the statistical logic that underlies the SLR method. In view of this emphasis on logical foundations, and also because of the limited time and resources at the researchers' disposal, only a few simple and commonly employed frequency statistics were used as retrieval clues. No attempt was made to exploit more elaborate types of linguistic or locational evidence that would have required the incorporation of a parser, a conflator, a disambiguator, a thesaurus, a phrase identifier, etc. It is not that the latter kinds of evidence could not be exploited effectively under the SLR approach.
Rather, in this particular experiment the idea was to see how far one could get on the basis of careful statistical logic alone.

Because of this `mainly logic' approach, the results of the experiment must be interpreted with special care. It is not the absolute retrieval effectiveness of the system that is of most interest, but rather its retrieval effectiveness relative to the amount of evidence used. If an SLR-based system can achieve with fewer clues the same level of effectiveness that under other design approaches requires more clues, the objective of creating a powerful underlying retrieval logic will have been demonstrated. Because regression methods are hospitable to the use of almost any kind of predictive evidence that can be expressed in statistical form, there is little doubt that the performance of the system could have been improved through the use of additional clue-types. In considering the present experiment, however, the reader is asked to bear in mind the philosophy of building a generalizable logical platform capable of extracting a maximum of retrieval power from whatever clues do happen to be available.

Another objective of the SLR methodology is to keep the computational aspects of the retrieval reasonably simple and efficient. Some probabilistic schemes call for elaborate programming and are not impressively efficient at run time. As a reasonable desideratum, a truly practical probabilistic method should be no more trouble to program, and should not run substantially slower, than, say, a vector-processing IR system using comparable types of evidence.

Still another goal of SLR is to partially replace traditional IR research procedures with more convenient and powerful standard statistical methods. For many years the customary experimental research paradigm in IR has been to conduct retrieval trials and apply specially invented IR effectiveness measures such as precision and recall to compare the trial results. It is reasonable to hope that much of this trial-and-error experimentation could be replaced by more efficient statistical regression analyses that use standard statistical software packages and standard measures of goodness-of-fit, thus bringing IR research more into the realm of mainstream statistical analysis.

The SLR design methodology requires the use of `training data' (`learning trials', etc.) in the form of human relevance judgements for a sample of query-document pairs representative of the collection in which the proposed IR system is to operate. It is sometimes objected that methods requiring training data are useless in situations where a system must be designed for a collection for which no training data is available, and as a practical matter none can be gathered. This objection has considerable force, but it overlooks the possibility of extrapolating the results of a regression analysis in one collection for which training data exists into another for which it does not. As the experiment developed, it cast some light on the question of whether such an extrapolation can be effected without unacceptable loss of retrieval power.

A final objective of the SLR methodology is to produce estimates of relevance probability that are reliable enough to present to the system users as part of the ranked output they receive.
Some IR research would appear to be premised on the notion that the output ordering of the collection is all that matters -- that the only purpose of generating retrieval status values (`similarity coefficients', `ranking scores', etc.) is to achieve as effective a ranking as possible. We agree that imposing an effective order of presentation on the documents is the most essential single role of the retrieval status values, but feel that in addition the numeric scores are themselves a potentially important part of the output. Their significance lies in their ability to provide the user at each point in the search with information about whether it is likely to be worthwhile to continue the search down the ranking. Clearly, such scores will be most helpful if presented in a form that most users find readily interpretable, and interpretable moreover in a sense that bears as directly as possible on the decision of whether they should stop searching. Probability-of-relevance estimates would appear to fit this prescription admirably. For reasons that will become apparent, this final objective was not attained in the present experiment. However, experiments in small collections indicate that SLR is capable of producing well-calibrated probability estimates, and doing so remains one of the general objectives of the methodology.

The SLR Methodology

The theoretical foundations of the SLR approach are presented in a recent paper by Cooper, Dabney, & Gey (1992). A synthesis and extension of earlier approaches to probabilistic retrieval, the SLR method combines the commonplace theoretical stratagem of invoking statistical simplifying assumptions with the empirical technique of applying statistical regression analysis to a learning sample. The use of statistical simplifying assumptions in IR has been explored by Maron & Kuhns (1960), Robertson & Sparck Jones (1976), Yu & Salton (1976), van Rijsbergen (1979), and others (surveyed by Maron (1984) and Bookstein (1985)). Examples of the use of regression analysis are to be found, for example, in the work of Fox (1983), Fuhr (1989), and Fuhr & Buckley (1991).

A distinguishing characteristic of SLR is that it breaks the analysis of the retrieval process down into two or more distinct steps or stages. For the present experiment a simple two-stage procedure was adopted. In the first stage a learning sample was used to develop a regression equation that combines elementary retrieval clues into composite clues. In the second stage, the same empirical data is used to derive another regression equation that combines these composite clues into an estimate of the desired relevance probability for each query-document pair. Thus the evidence bearing on the retrieval decision is organized first into sets of simple properties of particular descriptors, and then into combinations of such sets as determined by the particular descriptors common to the query and document under consideration.

This split-level approach allows for a natural separation of the available retrieval clues into two kinds. Statistical inferences based on properties of particular terms may be drawn first, while other kinds of evidence not confined to particular terms (e.g. document length, citedness, etc.) are saved for the second stage of statistical inference. An important virtue of this split-level approach is that the second-stage regression tends to correct for biases introduced by the statistical simplifying assumptions used to consolidate the results of the first stage.
A `Bare-Bones' SLR Methodology

Because the sole focus of interest in the experiment was the logic of the SLR approach, it seemed appropriate to keep all other design complications to a minimum. Except for the capacity to perform the two-level probabilistic computations requisite for SLR, therefore, the experimental system was kept as simple and automatic as possible. Thus no phrase discovery, part-of-speech tagging, disambiguation, or other linguistically sophisticated operations were incorporated, nor was a thesaurus included for the conflation of synonyms or other purposes, nor was the descriptor vocabulary structured in any way. There was no clustering, no knowledge base, no set of implicative rules, no network, nor anything else `AI-like.' All indexing was performed extractively without benefit of human intervention. No use was made of the manually assigned descriptors in the document collections that had them.

The experimental retrieval system was implemented by modifying the SMART system (Version 10), a suite of IR programs generously provided to the IR research community by researchers at Cornell University. Since the new model to be implemented was probabilistic, all features of SMART motivated by the vector space retrieval model were left unused or replaced by corresponding probabilistic operations. The SMART stop list was used as-is except for the addition of a few common words from the query vocabulary thought unlikely to be content-indicative (e.g. `document', `contains'). The SMART system's stemmer was used without modification to strip suffixes and endings off all query and document words that survived the stop list. The search algorithm used a slightly extended form of SMART's inverted file. The retrieval speed and general programming complexity of the SMART system were left substantially unaffected by this conversion to SLR-based retrieval, which meant that the objective of run-time efficiency and complexity roughly comparable to that of a vector processing system had been achieved.

The Design Equations

At the heart of the design are four statistical equations. Taken together they are capable of supplying an estimate, for any query-document pair of interest, of the probability that the document in question is relevant to the query in question. We shall take them up in their natural order of application.

Some preliminary vocabulary: a `logodds' may be regarded simply as a probability re-expressed on a special scale. The odds O(E) of an event E is by definition P(E) / (1 - P(E)). The conditional odds O(E1 | E2) is P(E1 | E2) / (1 - P(E1 | E2)). The `logodds' of E, sometimes also called a `logit', is log O(E), or in the case of conditional odds, log O(E1 | E2). Natural logarithms will be used throughout the sequel.

The first equation estimates the logodds that a document is relevant to a query, given just one composite clue relating query to document. For the TREC experiment a composite clue was defined to be a set of six frequency properties of a particular stem match, where a `stem match' is understood to be the event that some word stem has been found to occur at least once in both the query and document. Consider a composite clue Ai consisting of the six elementary properties X1, ..., X6 that describe some stem match. Let R denote the possible event that the document under consideration is in fact relevant to the query under consideration.
The first equation, derived by logistic regression from a learning sample for the Wall Street Journal (WSJ) collection, is

log O(R | Ai) = log O(R | X1, X2, X3, X4, X5, X6)
             = -7.08 + .38 X1 + .04 X2 + .77 X3 - .07 X4 + 1.05 X5 + .23 X6   (1)

where

X1 = the log of the absolute frequency of occurrence of the stem in the query; i.e. a simple count of its occurrences in the query, logged.

X2 = the log of the relative frequency of occurrence of the stem in the query; i.e. the count used in X1 divided by the query length, logged, with query length defined as the total number of occurrences of all word stems in the query.

X3 = the log of the absolute frequency of occurrence of the stem in the document.

X4 = the log of the relative frequency of occurrence of the stem in the document.

X5 = the log of the inverse document frequency of the stem in the collection; i.e. the proportion of documents containing at least one occurrence of the stem, inverted and logged.

X6 = the log of the global relative frequency of the stem in the collection; i.e. the fraction of word occurrences in the entire collection that are occurrences of the stem in question, logged.

The equation says roughly that if a TIPSTER query and a WSJ document were chosen at random, and a certain stem were found to occur in them both with frequency statistics X1, ..., X6, then in the absence of other knowledge it would be appropriate to apply this formula to estimate the logodds that the document is relevant to the query.

Next, suppose a query and document have N terms in common, leading via Eq. (1) to N different logodds estimates, one for each of the term matches A1 through AN. The second equation allows these estimates to be combined into a preliminary estimate of the logodds that the document is relevant to the query. It has the form

log O(R | A1, ..., AN) = log O(R) + Σ_{i=1,...,N} [ log O(R | Ai) - log O(R) ]   (2)

The only quantity on the right side that cannot be estimated using Eq. (1) is the prior logodds log O(R). The value -6.725 was used for this parameter.

Eq. (2) follows from the `Assumption of Linked Dependence.' Considering for simplicity only the special case of two properties, this assumption can be expressed as the equality

P(A1, A2 | R) / P(A1, A2 | not-R) = [ P(A1 | R) / P(A1 | not-R) ] x [ P(A2 | R) / P(A2 | not-R) ]

where not-R denotes the event that the document is not relevant. Intuitively this asserts that the degree of dependency that exists between properties A1 and A2 in the set of relevance-related query-document pairs is linked in a certain way with the degree of dependency that exists among the nonrelevance-related pairs. It is a weaker assumption than the independence postulates commonly encountered in the literature on probabilistic IR. For discussion and a derivation of Eq. (2) from the Linked Dependence Assumption see Cooper (1991) and Cooper, Dabney & Gey (1992, op. cit.).

The role of the third equation, as developed for this experiment at least, is to correct for deficiencies in the second equation. There are two major sources of distortion to contend with. One is that the validity of Eq. (2) depends on the Linked Dependence Assumption, a simplifying assumption that is at best only approximately true. Especially when the number N of term matches is large, Eq. (2) (if uncorrected) is capable of grossly overestimating the logodds of relevance. The other source of distortion is that Eq. (2) as it stands fails to take into account the fact that longer documents will tend to produce more term matches than shorter ones simply by chance.
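As a concrete illustration of how Eqs. (1) and (2) operate, the following short sketch applies them in Python. It is our own illustration rather than the original implementation (which was obtained by modifying SMART); the function names are hypothetical, and the inputs are assumed to be the already-logged statistics X1 through X6 defined above.

# Coefficients of Eq. (1), as fitted on the WSJ learning sample.
EQ1_INTERCEPT = -7.08
EQ1_COEFFS = (0.38, 0.04, 0.77, -0.07, 1.05, 0.23)   # multiply X1 .. X6

PRIOR_LOGODDS = -6.725   # estimated prior logodds log O(R)

def match_logodds(x):
    # Eq. (1): logodds of relevance given one stem match, where x is the
    # sequence (X1, ..., X6) of logged frequency statistics for that match.
    return EQ1_INTERCEPT + sum(b * xi for b, xi in zip(EQ1_COEFFS, x))

def preliminary_logodds(matches):
    # Eq. (2): combine the per-match estimates for the N stems shared by a
    # query and a document, under the Linked Dependence Assumption.
    z = sum(match_logodds(x) - PRIOR_LOGODDS for x in matches)
    return PRIOR_LOGODDS + z

The inner sum computed here is the quantity Z that the corrective third equation, taken up next, adjusts; Eq. (2)'s preliminary estimate is simply the prior logodds plus Z.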
If nothing were done to correct this second failing, longer documents could receive much higher relevance probability estimates than shorter ones merely by virtue of their length. This latter failing is related to a subtle criticism that can be raised against the policy of using only term matches, never mismatches, as the composite clues. Actually, a term match (i.e. term present in both query and document) is only one of four conditions that might obtain for a term vis-a-vis a given query and document. The others are: ii) term present in query but absent from document; iii) term present in document but absent from query; and iv) term absent from both query and document. Had clues of the other types been recognized, accorded their own regression equations, and allowed to make their own contribution to Z, it might not have been necessary to make any correction for document length. But because we have taken the computationally convenient shortcut of ignoring all evidence of types ii) - iv) at the first stage of the analysis, we must compensate somehow at the second stage for the distortion caused by that oversimplification.

The corrective equation, as developed for the WSJ collection through another application of logistic regression to the learning sample, is

log O(R | A1, ..., AN) = 6.08 + 3.63 log max(Z, 1) - 1.45 log L   (3)

where

Z = the value of the summation expression in the right side of Eq. (2); max(Z, 1) is the larger of Z and 1; and

L = the length of the document under consideration, expressed as the total number of stem occurrences in the document, counting separate occurrences of the same stem separately.

This estimate of log O(R | A1, ..., AN) is understood to modify and supersede the estimate produced by Eq. (2).

Since logodds are monotonically related to probabilities, a retrieval system could in principle probability-rank the documents of the collection for the user simply by presenting them in descending order of the logodds values assigned to them by Eq. (3). However, since probabilities are easier for users to interpret than logodds, the conditional logodds estimates of form log O(R | A1, ..., AN) produced by Eq. (3) were translated into ordinary conditional probability estimates with the help of a fourth equation, the identity

P(R | A1, ..., AN) = 1 / (1 + exp(-log O(R | A1, ..., AN)))   (4)

The probability estimates so obtained for the TIPSTER data appear in the `score' column of the rankings submitted by the Berkeley group.

Computational Arrangements

The computations required for storage and retrieval under the SLR methodology follow roughly the order of the four equations. Documents are indexed as follows. First, an incoming document is put through preparatory operations that include the removal of markup and other unwanted information, the deletion of stop words, and the stemming of the remaining words. Then for each stem, all stem statistics that can be calculated before the query is known -- specifically, X3, X4, X5, and X6 -- are collected. Finally, these statistics are used to compute the stem's indexing weight in the document using the formula

Doc stem weight = -7.08 + .77 X3 - .07 X4 + 1.05 X5 + .23 X6 - (-6.725)

This formula is just the right side of Eq. (1) without the terms for X1 and X2, and with the value of the prior logodds log O(R) subtracted off in preparation for the application of Eq. (2). The result is stored in the inverted file as the weight of the term for this document.

Query indexing, described below, is similar.
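Before turning to query indexing, the second-stage correction of Eq. (3) and the probability conversion of Eq. (4) can be added to the earlier sketch. As before, this is illustrative Python of our own with hypothetical names, not the SMART-based implementation; it assumes the value Z and the document length L are already in hand.

import math

# Coefficients of the second-stage (corrective) Eq. (3), WSJ learning sample.
EQ3_INTERCEPT = 6.08
EQ3_SLOPE_Z = 3.63     # multiplies log max(Z, 1)
EQ3_SLOPE_L = -1.45    # multiplies log L (document length in stem occurrences)

def corrected_logodds(z, doc_length):
    # Eq. (3): corrected logodds of relevance for a query-document pair.
    return (EQ3_INTERCEPT
            + EQ3_SLOPE_Z * math.log(max(z, 1.0))
            + EQ3_SLOPE_L * math.log(doc_length))

def relevance_probability(z, doc_length):
    # Eq. (4): translate the corrected logodds into a probability.
    return 1.0 / (1.0 + math.exp(-corrected_logodds(z, doc_length)))

Documents would then be presented to the user in descending order of relevance_probability (equivalently, of the corrected logodds, since the transformation of Eq. (4) is monotonic).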
After an incoming query has been stop-listed and stemmed, for each stem the two statistics X1 and X2 that depend upon query properties are calculated. The stem is then assigned as its weight in the query the value

Query stem weight = .38 X1 + .04 X2

This formula comprises the rest of the right side of Eq. (1).

To effect retrieval, the query is compared against each document with which it shares at least one stem. For each stem contained in both query and document, the query stem weight and the document stem weight are added, resulting in an estimate of log O(R | Ai) - log O(R) for the stem. Summing these estimates over all the matching stems yields the value of the summation expression in the right side of Eq. (2). This computation is analogous to the calculation of a vector product in the vector space model, except that corresponding query and document weights are added instead of multiplied. Calling this sum Z, Eq. (3) is applied to obtain the estimate of the logodds that the document is relevant to the query. Using Eq. (4), this logodds is translated into a probability, and the documents are sorted for the user in descending order of their estimated probabilities.

Constructing the Sample

The SLR method calls for the use of a learning sample of relevance judgements that can be assumed accurate. Among other requirements the sample must (a) be sufficiently large and include a sufficient number of queries; and (b) have a definite, complete statistical structure of some kind (e.g. that of a random or a stratified sample) that can serve as a basis for making statistical inferences. Arguably the TIPSTER data supplied for TREC experimentation fulfilled condition (a) for one or two of the five collections (WSJ and perhaps ZIFF). However, so far as could be ascertained from information made available to participants, the second condition was not satisfied for any of the five collections. The samples of relevance data that were provided had no apparent statistical structure of the kind upon which statistical reasoning is ordinarily based. The omission is understandable but unfortunate from the point of view of probabilistic retrieval.

For the present experiment the lack of a known statistical sample structure was highly problematical and almost fatal. It meant that the method of interest could not be rigorously applied to the available data. The theory on which SLR is based consists of a careful chain of statistical reasoning. If one of the links is weak, scientifically sound estimates of relevance probability cannot be expected.

Rather than abandon the experiment entirely, however, the following stopgap measure was adopted. An arbitrary assumption would be made about the structure of the sample, and the analysis would be carried out on the basis of this fictitious assumption as though it were known to be true. It was thought that this desperation measure would at least allow the investigators to gain experience in applying the method to a large data set (albeit on a hypothetical basis), and to illustrate the methodology for other interested parties. Also, it was thought that the fictitious assumption could be chosen in such a way that the experimental evidence gathered about the comparative worth of various retrieval clues would probably have at least some value. Finally, it was hoped that the output orderings might be reasonably effective in spite of the likely miscalibration of the retrieval status values as probability estimates.
While this policy allowed participation in the venture to continue, the artificiality of the constructed sample diminished to some unknown extent the retrieval effectiveness of the prototype system. This should be remembered when interpreting the results of the experiment: they do not fairly represent the SLR methodology's capabilities. Had the investigators been in a position to design their own sampling procedure, the results would presumably have been better. In the same spirit it should be kept in mind that the level of accuracy of the probability estimates themselves cannot be taken as indicative of the reliability of SLR.

The arbitrary assumption that was settled upon for the sample construction was that for the WSJ collection the universe of all query-document pairs is separable into two sets: one a small set rich in relevance-related pairs, the other a large set containing no relevance-related pairs. The supplied judgements of relevance and irrelevance for the WSJ were to be treated as though they were a random sample of one out of every ten members of the first of these two sets. This implies among other things the supposition that the professional searchers had found 10% of the relevance-related pairs that could have been found had the entire collection been examined for every query. The 10% figure was an unsupported guess.

This working assumption made it possible to construct a two-part (stratified) sample for the WSJ collection. The first portion or stratum of the sample consisted of the set of 2194 query-document pairs for which human judgements of relevance and nonrelevance were available. In the subsequent regression analysis, each of these was weighted by a factor of 10, reflecting the assumption stated above. The second part of the constructed sample consisted of a set of 1687 query-document pairs drawn essentially randomly from among all query-document pairs that shared at least one stem in common. (The latter sample was obtained by considering all query-document pairs constructible from the 52 training queries and the documents in the WSJ training collection, ordering them by query number and within query number by document number, choosing every 2444th pair from this ordering, and discarding any pairs so chosen for which there was no overlap between query and document. The result may be considered a random sample, though technically it was slightly preferable to a random sample insofar as it was stratified by query.) The pairs in this second set were assumed to lie outside the first set and to be nonrelevance-related (though not verified, this assumption was probably approximately true). Each pair was accorded a weight of 2400 in the subsequent regression analysis.

It may be worth reiterating that the necessity of specifying the number of relevance-related pairs by sheer guesswork, together with the general artificiality of the constructed sample, dashed all hope of producing well-calibrated estimates to present to the users.

The Regression Analysis

As the first stage of the regression analysis, for each query-document pair in the sample the set of all stems shared in common by the query and the document was assembled. This resulted in an expanded sample of 30,234 query-document-stem triples. For each triple, the values of the six statistics X1, ..., X6 were calculated and recorded along with the relevance judgement associated with the query and document in question.
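The six statistics are simple functions of raw counts. The following small sketch (hypothetical Python of our own, with argument names of our choosing) indicates how they could be computed for one stem match; natural logarithms are used throughout, as in the equations above.

import math

def stem_statistics(qtf, query_len, dtf, doc_len, df, ctf, n_docs, n_tokens):
    # qtf, dtf           : occurrences of the stem in the query / document
    # query_len, doc_len : total stem occurrences in the query / document
    # df                 : number of documents containing the stem
    # ctf                : total occurrences of the stem in the collection
    # n_docs, n_tokens   : documents and word occurrences in the collection
    x1 = math.log(qtf)               # absolute frequency in query, logged
    x2 = math.log(qtf / query_len)   # relative frequency in query, logged
    x3 = math.log(dtf)               # absolute frequency in document, logged
    x4 = math.log(dtf / doc_len)     # relative frequency in document, logged
    x5 = math.log(n_docs / df)       # inverse document frequency, logged
    x6 = math.log(ctf / n_tokens)    # global relative frequency, logged
    return (x1, x2, x3, x4, x5, x6)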
A weighted logistic regression analysis was performed on this data, with X1, ..., X6 serving as the independent variables and the binary relevance judgements as the dependent variable. The result was the set of coefficients in Eq. (1). In other words, Eq. (1) is the regression equation that was found to be the logistic equation of maximum likelihood (the `best fit') for the data in the sample.

How were the six retrieval clue-types chosen? Actually the aforedescribed regression analysis was performed repeatedly for various types and combinations of independent variables before these six were settled upon. Of those tried out, these turned out in combination to exhibit the most predictive power -- that is, to offer the best prospects of yielding useful estimates of logodds of relevance.

`Predictive power' was judged according to several interrelated criteria. One was the extent to which a model based on a clue-combination under investigation was found to fit the data, as measured by the -2 Log Likelihood statistic or one of its minor variants such as the Akaike Information Criterion. Another was the extent to which the ordering imposed on the triples by the logodds estimates assigned to them by the model resembled the ideal ordering in which all relevant triples precede all nonrelevant triples. This was measured statistically using various rank correlation coefficients, including Kendall's Tau-a and the Goodman-Kruskal Gamma. Still another property that was taken into account was the shape of a variable's graph when the observed logodds was plotted against the variable values arranged along the X-axis in deciles. In such a graph, a shape that roughly resembles a straight line is considered desirable. In some cases it was found that pretransforming a variable by taking its logarithm helped to produce such a straight line. This was in fact one of the motivations for logging the variables. Another motive for doing so was the observation that it seemed to improve the general fit of the model as measured by the other indicators. Fortunately, the various criteria used to optimize the choice of variables were rarely in qualitative conflict. But there was insufficient time to explore all variables of potential interest, or to investigate more elaborate possibilities such as interaction terms.

All regression analyses were run using the SAS statistical package (Version 6.06) on an IBM 3090 mainframe computer with accelerated math capabilities. The SAS package automatically supplies most of the diagnostic statistics mentioned above. Detailed discussions of logistic model building can be found in works on logistic regression (Hosmer & Lemeshow 1989; Collett 1991).

It is worth noting that all the choices among possible predictor variables were made, and their weights in the equation determined, as part of the regression analysis. There was no recourse to traditional retrieval trials based on precision, recall, and the like. The regression procedures were found to be more convenient and efficient than experimentation of the usual kind, in which there is no way other than trial-and-error to converge on optimal numeric coefficients. Ideally one might think of combining the two techniques -- that is, of using traditional methods to confirm the findings of the regression studies. However, time pressures prevented the pursuit of that luxury.
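The analyses themselves were run in SAS; purely as an illustration, a weighted logistic fit of the same general form could be expressed in Python with statsmodels roughly as follows. The data here are synthetic placeholders standing in for the real triples, and the variable names are ours.

import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the expanded sample of query-document-stem
# triples (the real analysis used 30,234 triples with the six logged
# statistics X1..X6, a 0/1 relevance judgement per parent pair, and a stratum
# weight of 10 for judged pairs or 2400 for the randomly drawn pairs).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 6))                      # stands in for X1..X6
y = rng.integers(0, 2, size=n)                   # stands in for judgements
w = np.where(rng.random(n) < 0.5, 10.0, 2400.0)  # stands in for stratum weights

# Weighted logistic regression: maximum-likelihood fit of the form of Eq. (1).
model = sm.GLM(y, sm.add_constant(X),
               family=sm.families.Binomial(),
               freq_weights=w)
result = model.fit()
print(result.params)   # intercept and coefficients for X1..X6
print(result.aic)      # Akaike Information Criterion, one of the fit criteria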
Next, Eq. (1) was applied to each triple in the sample to calculate the estimate of the logodds of relevance log O(R | Ai) for the query and document in question, given the characteristics of the match stem. Then in accordance with Eq. (2) the estimated prior logodds of relevance log O(R) = -6.725 was subtracted from each of these estimates, and the differences summed within each query-document pair to obtain the value of Z for the pair.

For the second-stage regression, the value of Z calculated as just described, together with the length L of the document, were recorded for each query-document pair in the sample. From these the value of log(Z / L^0.4) was calculated for each pair. (To ensure that the logarithm would always be well defined, Z was first replaced by max(Z, 1).) Using this ratio as the independent variable and the binary relevance judgements as the dependent variable, a second weighted logistic regression was run. This produced two coefficients -- an intercept and a slope -- specifying a regression equation. Simple manipulations were used to transform it into the form displayed earlier as Eq. (3). (Since log(max(Z, 1) / L^0.4) = log max(Z, 1) - 0.4 log L, a fitted slope of 3.63 on this single variable yields the coefficient 3.63 for log max(Z, 1) and -0.4 x 3.63 ≈ -1.45 for log L in Eq. (3).)

Here are some of the considerations that led to the use of log(Z / L^0.4) as the variable on which to regress. If one could trust the Linked Dependence Assumption completely, and if nonmatch as well as match events had been taken into explicit account in the computation of Z, one might have tried letting Z stand as-is as the desired logodds estimate. But because such is not the situation, this would fail to correct for dependency distortion and would give longer documents an unfair advantage. Thus it seemed advisable to perform a corrective linear transformation on Z, and moreover to normalize Z first by dividing it by some simple function of document length. It was found by trial and error that dividing Z by L raised to a power of around 0.4 seemed to remove most of the visible bias toward either very short documents or very long documents in the five collections. Logging the entire expression was found to improve the fit to the sample data.

It would have been appropriate to include a correction for query length analogous to the one developed for document length. However, for lack of time the necessary analysis could not be carried out.

Extrapolation to Other Collections

The regression analysis was confined to the WSJ data because relevance judgements were not available for most of the other collections in sufficient quantities, or for enough of the training queries, to make regression feasible. This circumstance brought with it the problem of how to extrapolate the WSJ retrieval rules to the remaining four collections.

Speaking generally, the extension of retrieval formulae to other collections is a significant problem throughout the IR field. One would like to know how to transfer design parameters from one collection, for which there is enough relevance data, to another collection for which there may be too little data or none. If the transfer could be accomplished without too much loss of predictive power, the almost exclusive use by IR experimenters of special `test collections' could be justified more easily. We welcomed the dearth of TIPSTER relevance data for some of the collections as an opportunity to explore this problem. To simplify the challenge and confront it in its starkest form, we elected to ignore entirely even such data as were available for collections other than WSJ.
The method used for the extrapolation was based on the well-known statistical concept of standardization of variables. The standardized value of a variable in a population is obtained by subtracting from its observed value the variable's mean value in the population, then dividing this difference by its standard deviation in the population. The new standardized values have a mean of zero and a standard deviation of one. The working assumption that was made was that a regression equation such as Eq. (1) can be carried over and applied in another collection provided all variables involved have first been standardized in both collections.

Although no variable values were actually standardized, the coefficients in Eq. (1) were recalculated for each of the other four collections in such a way as to create the same effect. The values for the population means and standard deviations used in the recalculation of the coefficients were taken from random samples of triples taken from the five collections. The samples were comparable to those in the `random' subsample of WSJ query-document triples described earlier.

The algebraic details of the transformation process will not be presented here, but the resulting modifications of the right side of the earlier WSJ form of Eq. (1) may be of interest. They are

For AP:   -7.21 + .40 X1 + .04 X2 + .88 X3 - .10 X4 + 1.09 X5 + .25 X6
For DOE:  -7.51 + .44 X1 + .05 X2 + 1.18 X3 - .12 X4 + .94 X5 + .24 X6
For FR:   -6.83 + .44 X1 + .04 X2 + .57 X3 - .06 X4 + 1.13 X5 + .24 X6
For ZIFF: -6.95 + .38 X1 + .04 X2 + .68 X3 - .07 X4 + 1.03 X5 + .21 X6

As an illustration of what the transformation of coefficients has accomplished, one sees that the coefficient for X3, the absolute frequency of the match term in the document, is largest for the DOE collection and smallest for the FR collection. This serves to compensate for the fact that the average document length is smallest in DOE and largest in FR.

No modifications were made to the coefficients in the second-stage regression equation, Eq. (3), when applying it in other collections. Because adjustments had already been made for inter-collection differences in the first-stage equation, the investigators were not convinced that further adjustments would be profitable at the second stage. Indeed, we are not entirely confident that the adjustments are a good idea even for all the variables at the first stage, but thought it worth the experiment.

Effectiveness Scores

In the TREC evaluations the 11-point average recall calculated for ad hoc retrieval by the Berkeley system was 0.151; the average number of relevant documents retrieved in the top-ranked 100 documents for each request was 40.8; and the average number of relevant documents retrieved in the top-ranked 200 documents for each request was 67.9. The comparable three figures for the medians of all systems submitting results for ad hoc retrieval were .157, 39.5, and 62.5 respectively. Comparing the Berkeley system's scores against the median scores, it can be seen that for the number of relevant documents retrieved among the top 200, the Berkeley system exceeded the median by 5.4 documents, an amount that can be shown (in a paired comparison t-test over the mean of the differences for the 50 topics) to be statistically significant at the 0.05 level. The other two scores do not differ from the corresponding median scores by what are customarily regarded as statistically significant amounts.
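A paired-comparison test of this kind is routine in any statistical package. The sketch below (Python with SciPy, our illustration only) indicates the form of the computation over the 50 topics; the per-topic counts shown are purely synthetic stand-ins for the actual evaluation data.

import numpy as np
from scipy import stats

# Synthetic per-topic counts of relevant documents retrieved in the top 200,
# one entry per topic for the system under evaluation and for the median.
rng = np.random.default_rng(0)
median_counts = rng.integers(20, 120, size=50)
system_counts = median_counts + rng.integers(-10, 20, size=50)

# Paired t-test over the per-topic differences.
res = stats.ttest_rel(system_counts, median_counts)
print(res.statistic, res.pvalue, res.pvalue < 0.05)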
Thus by one of the three measures the SLR experimental system was significantly more effective than the median system, and by the other two it was not significantly different from the median.

To put these results in perspective it should be remembered that, unlike other systems included in the comparison group, the SLR system was primitive in every respect except for its statistical logic. It involved no human reformulation of topics or other manual intervention, it used no relevance feedback, and it employed no special linguistic or other devices such as parsing systems, thesauri, disambiguation, phrase identification, or global/local combinations of evidence. It even forwent the use of training data from four of the five collections. Yet by a statistically careful use of simple frequency information alone, it was able to hold its own against typical systems that used much more elaborate forms of evidence.

Since the SLR methodology is hospitable to the introduction of additional clue-types, and indeed might be expected to wring a maximum amount of leverage out of them, the prospect for future, less primitive, SLR systems that combine many types of evidence seems promising.

Future Possibilities

Though the prototype system described here used only a few simple statistical clues, the SLR approach is general and in principle flexible enough to accommodate most of the clue-types that researchers have been interested in as predictors of relevance. Broadly speaking, retrieval evidence having to do with particular index terms lends itself to exploitation in the form of variables in the first-stage regression equation, while other kinds -- properties of the entire query, the entire document, or their relationship -- can be accommodated as variables in the second stage.

As an example of possible new evidence at the first stage, suppose that by virtue of parsing, suffix analysis, or dictionary lookup some information is available about the parts of speech of the match stems in the query and document. Then an additional categorical variable might be introduced into the first-level regression analysis to represent the match stem's part of speech in the document, on the hunch that some parts of speech (e.g. nouns) should be more heavily weighted than others. The general two-level form of the analysis would remain the same. A further possibility would be to introduce a variable to represent the event that the part of speech of the stem as it occurs in the query is the same as its part of speech in the context in which it occurs in the document.

Further clues could be introduced at the second stage. In the present experiment the only retrieval evidence introduced at the second stage that was not already present in the first was the document length L, which was intended more as an antidote to a bias in Z than as an independent predictor of relevance in its own right. But nothing prevents any helpful relationship between the query and document from being brought to bear. As an example, suppose a measure of the mutual closeness of the query's match stems in the document is to be introduced, on the hypothesis that the closer together the query stems tend to occur in the document, the likelier it is (other things being equal) that the document is relevant (Keen 1992). Such a measure of proximity could be added as a new variable in the second-level equation, with no other change being needed in the underlying statistical framework.
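To make concrete the shape such an extension might take, the second-stage equation could be enlarged along the following lines (an illustration of ours, not a fitted result; P and the coefficients c0, ..., c3 are hypothetical):

log O(R | A1, ..., AN) = c0 + c1 log max(Z, 1) + c2 log L + c3 P

where P is the proximity measure for the query-document pair, and the coefficients would be determined, as before, by a weighted logistic regression on the learning sample.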
Conclusions

1. The TREC results indicate that the SLR methodology is capable of achieving a respectable degree of retrieval effectiveness even when the retrieval evidence is confined to a few simple frequency clues. (`Respectable' in this context means competitive with the median performance of other systems, most of which use more elaborate evidence.) Since nothing prevents the incorporation of additional clue types into future SLR systems, and the regression procedure should help to combine them with existing clues in an optimal way, the outlook for the retrieval effectiveness of the SLR approach seems promising.

2. The prototype SLR system demonstrates that a probabilistic initial ranking can be achieved with a run-time efficiency approximately equivalent to that of a vector processing system utilizing similar kinds of evidence.

3. Standard statistical program packages can be used for an SLR analysis even in large collections, circumventing much of the need for retrieval trials of conventional design and allowing the choice of variables and the determination of optimal numerical parameters to be carried out more conveniently.

4. The TREC results suggest that use of the SLR approach need not necessarily be ruled out because of a lack of training data for the particular document collection to which it must be applied. A respectable level of effectiveness can (under at least some conditions) be achieved through the extrapolation into the new collection of regression equations derived from a collection for which training data already exists.

5. The present experiment failed to demonstrate that the SLR method is capable of producing, in large collections, probability of relevance estimates sufficiently well-calibrated to be presented to the users as part of the output display. It is suspected that this failure is associated with the peculiar limitations of the training data supplied for this initial TREC conference. If so, use of the more extensive training data to be made available for future conferences may be sufficient to resolve this problem.

Acknowledgements

The theory of staged logistic regression, developed in collaboration with Dan Dabney of U.C.L.A., was originally stimulated by discussions with James Allen and Gerard Salton of Cornell University and Stephen Robertson of City University, London. The computer science department at Cornell University provided a hospitable environment for the early stages of the theoretical development. Ray Larson at U.C. Berkeley contributed experienced advice on the conversion of SMART and on general systems problems. We are indebted to Chris Buckley for supporting our efforts to make SMART regressive, and to all past contributors to SMART for making this valuable research tool available. The work stations used for the experiment were DEC 5000's supplied by the Sequoia 2000 project at the University of California, a project principally funded by the Digital Equipment Corporation. A DARPA grant supported the programming effort.

References

Bookstein, A. Probability and fuzzy set applications to information retrieval. In M. Williams (ed.), Annual Review of Information Science and Technology, 20. White Plains, NY: Knowledge Industry Publications; 1985.

Collett, D. Modelling Binary Data. London: Chapman & Hall; 1991.

Cooper, W. S. Exploiting the maximum entropy principle to increase retrieval effectiveness. Journal of the American Society for Information Science, 34(1): 31-39; 1983.
Cooper, W. S. Inconsistencies and misnomers in probabilistic IR. Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Chicago: 57-62; October 1991.

Cooper, W. S.; Dabney, D.; Gey, F. Probabilistic retrieval based on staged logistic regression. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Copenhagen: 198-210; June 1992.

Cooper, W. S.; Huizinga, P. The maximum entropy principle and its application to the design of probabilistic retrieval systems. Information Technology: Research and Development, 1(2): 99-112; 1982.

Fox, E. A. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Ph.D. Dissertation, Computer Science, Cornell University; 1983.

Fuhr, N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3): 183-204; 1989.

Fuhr, N.; Buckley, C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3): 223-248; 1991.

Harper, D. J.; van Rijsbergen, C. J. An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34(3): 189-216; 1978.

Hosmer, D. W.; Lemeshow, S. Applied Logistic Regression. New York: Wiley; 1989.

Kantor, P. Maximum entropy and the optimal design of automated information retrieval systems. Information Technology: Research and Development, 3(2): 88-94; 1984.

Keen, E. M. Term position ranking: some new test results. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Copenhagen: 66-76; June 1992.

Lee, J. J.; Kantor, P. A study of probabilistic information retrieval systems in the case of inconsistent expert judgement. Journal of the American Society for Information Science, 42(3); 1990.

Maron, M. E. Probabilistic retrieval models. In B. Dervin and M. Voigt (eds.), Progress in Communication Sciences, Vol. V. Ablex; 1984: 145-176.

Maron, M. E.; Kuhns, J. L. On relevance, probabilistic indexing, and information retrieval. Journal of the Association for Computing Machinery, 7(3): 216-244; 1960.

Robertson, S. E.; Bovey, J. D. Statistical problems in the application of probabilistic models to information retrieval. British Library Research and Development Department, Report No. 5739; November 1982.

Robertson, S. E.; Sparck Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146; 1976.

van Rijsbergen, C. J. A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, 33(2): 106-119; 1977.

van Rijsbergen, C. J. Information Retrieval (2nd ed.). London: Butterworth & Co. Ltd; 1979.

Yu, C. T.; Buckley, C.; Lam, H.; Salton, G. A generalized term dependence model in information retrieval. Information Technology: Research and Development, 2: 129-154; 1983.

Yu, C. T.; Salton, G. Precision weighting -- an effective automatic indexing method. Journal of the Association for Computing Machinery, 23(1): 76-88; 1976.