Probabilistic Retrieval in the TIPSTER Collections: An Application of Staged Logistic Regression

Wm. S. Cooper, Fredric C. Gey, Aitao Chen
S.L.I.S., University of California, Berkeley, CA 94720

ABSTRACT: In this experiment the TIPSTER test collections were used as a vehicle for evaluating an approach to probabilistic retrieval for full-text documents. The methodology in question, called `staged logistic regression,' involves two or more stages of logistic regression analysis of a learning sample of relevance judgements. The aim is to produce effective initial probability rankings of documents, without undue computational complexity at run time, by applying regression equations derived with the help of standard statistical software packages. In addition, the experiment explored the feasibility of using equations derived from training data for one document collection in a different document collection for which no training data happens to be available, and of calculating document relevance probabilities accurately enough so that they can be displayed as part of the output seen by the user. The regression equations were implemented as retrieval rules in an experimental prototype system obtained by modifying the SMART retrieval system.

Introduction

The Berkeley group's interest in participating in the NIST/TREC Conference was stimulated by the opportunity it offered to gain experience with a methodology called `Staged Logistic Regression.' This technique (hereinafter abbreviated `SLR') is a systematic approach to retrieval system design based on probabilistic and statistical principles. It has been under study at Berkeley as a possible means of achieving effective probabilistic retrieval, including acceptably accurate estimates of relevance probability, without undue computational complexity.

In order to test out the SLR methodology on the TIPSTER data base in as straightforward a fashion as possible, attention was restricted to the problem of how to form the initial document ranking -- that is, the output ranking first offered by the system to the user in response to the user's original query. This problem should not be confused with the subsequent task of exploiting any relevance judgements that may be obtainable from the user once he or she has started down the output ranking and is actively examining documents. The latter problem -- the matter of how to exploit intra-search `relevance feedback' -- has been the subject of some fruitful probabilistic investigations, but the question of how to create the initial ranking seems equally significant.

Objectives of the Experiment

A principal goal of the experiment was to investigate the retrieval effectiveness of the SLR methodology in large collections. It was hoped especially that something could be learned of the soundness and retrieval power inherent in the statistical logic that underlies the SLR method. In view of this emphasis on logical foundations, and also because of the limited time and resources at the researchers' disposal, only a few simple and commonly employed frequency statistics were used as retrieval clues. No attempt was made to exploit more elaborate types of linguistic or locational evidence that would have required the incorporation of a parser, a conflator, a disambiguator, a thesaurus, a phrase identifier, etc. It is not that the latter kinds of evidence could not be exploited effectively under the SLR approach.
Rather, in this particular experiment the idea was to see how far one could get on the basis of careful statistical logic alone.

Because of this `mainly logic' approach, the results of the experiment must be interpreted with special care. It is not the absolute retrieval effectiveness of the system that is of most interest, but rather its retrieval effectiveness relative to the amount of evidence used. If an SLR-based system can achieve with fewer clues the same level of effectiveness that under other design approaches requires more clues, the objective of creating a powerful underlying retrieval logic will have been demonstrated. Because regression methods are hospitable to the use of almost any kind of predictive evidence that can be expressed in statistical form, there is little doubt that the performance of the system could have been improved through the use of additional clue-types. In considering the present experiment, however, the reader is asked to bear in mind the philosophy of building a generalizable logical platform capable of extracting a maximum of retrieval power from whatever clues do happen to be available.

Another objective of the SLR methodology is to keep the computational aspects of the retrieval reasonably simple and efficient. Some probabilistic schemes call for elaborate programming and are not impressively efficient at run time. As a reasonable desideratum, a truly practical probabilistic method should be no more trouble to program, and should not run substantially slower, than, say, a vector-processing IR system using comparable types of evidence.

Still another goal of SLR is to partially replace traditional IR research procedures with more convenient and powerful standard statistical methods. For many years the customary experimental research paradigm in IR has been to conduct retrieval trials and apply specially invented IR effectiveness measures such as precision and recall to compare the trial results. It is reasonable to hope that much of this trial-and-error experimentation could be replaced by more efficient statistical regression analyses that use standard statistical software packages and standard measures of goodness-of-fit, thus bringing IR research more into the realm of mainstream statistical analysis.

The SLR design methodology requires the use of `training data' (`learning trials', etc.) in the form of human relevance judgements for a sample of query-document pairs representative of the collection in which the proposed IR system is to operate. It is sometimes objected that methods requiring training data are useless in situations where a system must be designed for a collection for which no training data is available, and as a practical matter none can be gathered. This objection has considerable force, but it overlooks the possibility of extrapolating the results of a regression analysis in one collection for which training data exists into another for which it does not. As the experiment developed, it cast some light on the question of whether such an extrapolation can be effected without unacceptable loss of retrieval power.

A final objective of the SLR methodology is to produce estimates of relevance probability that are reliable enough to present to the system users as part of the ranked output they receive.
Some IR research would appear to be premised on the notion that the output ordering of the collection is all that matters -- that the only purpose of generating retrieval status values (`similarity coefficients', `ranking scores', etc.) is to achieve as effective a ranking as possible. We agree that imposing an effective order of presentation on the documents is the most essential single role of the retrieval status values, but feel that in addition the numeric scores are themselves a potentially important part of the output. Their significance lies in their ability to provide the user at each point in the search with information about whether it is likely to be worthwhile to continue the search down the ranking. Clearly, such scores will be most helpful if presented in a form that most users find readily interpretable, and interpretable moreover in a sense that bears as directly as possible on the decision of whether they should stop searching. Probability-of-relevance estimates would appear to fit this prescription admirably. For reasons that will become apparent, this final objective was not attained in the present experiment. However, experiments in small collections indicate that SLR is capable of producing well-calibrated probability estimates, and doing so remains one of the general objectives of the methodology.

The SLR Methodology

The theoretical foundations of the SLR approach are presented in a recent paper by Cooper, Dabney, & Gey (1992). A synthesis and extension of earlier approaches to probabilistic retrieval, the SLR method combines the commonplace theoretical stratagem of invoking statistical simplifying assumptions with the empirical technique of applying statistical regression analysis to a learning sample. The use of statistical simplifying assumptions in IR has been explored by Maron & Kuhns (1960), Robertson & Sparck Jones (1976), Yu & Salton (1976), van Rijsbergen (1979), and others (surveyed by Maron (1984) and Bookstein (1985)). Examples of the use of regression analysis are to be found, for example, in the work of Fox (1983), Fuhr (1989), and Fuhr & Buckley (1991).

A distinguishing characteristic of SLR is that it breaks the analysis of the retrieval process down into two or more distinct steps or stages. For the present experiment a simple two-stage procedure was adopted. In the first stage a learning sample was used to develop a regression equation that combines elementary retrieval clues into composite clues. In the second stage, the same empirical data is used to derive another regression equation that combines these composite clues into an estimate of the desired relevance probability for each query-document pair. Thus the evidence bearing on the retrieval decision is organized first into sets of simple properties of particular descriptors, and then into combinations of such sets as determined by the particular descriptors common to the query and document under consideration.

This split-level approach allows for a natural separation of the available retrieval clues into two kinds. Statistical inferences based on properties of particular terms may be drawn first, while other kinds of evidence not confined to particular terms (e.g. document length, citedness, etc.) are saved for the second stage of statistical inference. An important virtue of this split-level approach is that the second-stage regression tends to correct for biases introduced by the statistical simplifying assumptions used to consolidate the results of the first stage.
A `Bare-Bones' SLR Methodology

Because the sole focus of interest in the experiment was the logic of the SLR approach, it seemed appropriate to keep all other design complications to a minimum. Except for the capacity to perform the two-level probabilistic computations requisite for SLR, therefore, the experimental system was kept as simple and automatic as possible. Thus no phrase discovery, part-of-speech tagging, disambiguation, or other linguistically sophisticated operations were incorporated, nor was a thesaurus included for the conflation of synonyms or other purposes, nor was the descriptor vocabulary structured in any way. There was no clustering, no knowledge base, no set of implicative rules, no network, nor anything else `AI-like.' All indexing was performed extractively without benefit of human intervention. No use was made of the manually assigned descriptors in the document collections that had them.

The experimental retrieval system was implemented by modifying the SMART system (Version 10), a suite of IR programs generously provided to the IR research community by researchers at Cornell University. Since the new model to be implemented was probabilistic, all features of SMART motivated by the vector space retrieval model were left unused or replaced by corresponding probabilistic operations. The SMART stop list was used as-is except for the addition of a few common words from the query vocabulary thought unlikely to be content-indicative (e.g. `document', `contains'). The SMART system's stemmer was used without modification to strip suffixes and endings off all query and document words that survived the stop list. The search algorithm used a slightly extended form of SMART's inverted file. The retrieval speed and general programming complexity of the SMART system were left substantially unaffected by this conversion to SLR-based retrieval, which meant that the objective of run-time efficiency and complexity roughly comparable to that of a vector processing system had been achieved.

The Design Equations

At the heart of the design are four statistical equations. Taken together they are capable of supplying an estimate, for any query-document pair of interest, of the probability that the document in question is relevant to the query in question. We shall take them up in their natural order of application.

Some preliminary vocabulary: a `logodds' may be regarded simply as a probability re-expressed on a special scale. The odds O(E) of an event E is by definition P(E) / (1 - P(E)). The conditional odds O(E1 | E2) is P(E1 | E2) / (1 - P(E1 | E2)). The `logodds' of E, sometimes also called a `logit', is log O(E), or in the case of conditional odds, log O(E1 | E2). Natural logarithms will be used throughout the sequel.

The first equation estimates the logodds that a document is relevant to a query, given just one composite clue relating query to document. For the TREC experiment a composite clue was defined to be a set of six frequency properties of a particular stem match, where a `stem match' is understood to be the event that some word stem has been found to occur at least once in both the query and document. Consider a composite clue Ai consisting of the six elementary properties X1, ..., X6 that describe some stem match. Let R denote the possible event that the document under consideration is in fact relevant to the query under consideration.
The first equation, derived by logistic regression from a learning sample for the Wall Street Journal (WSJ) collection, is

log O(R | Ai) = log O(R | X1, X2, X3, X4, X5, X6)
             = -7.08 + .38 X1 + .04 X2 + .77 X3 - .07 X4 + 1.05 X5 + .23 X6   (1)

where

X1 = the log of the absolute frequency of occurrence of the stem in the query; i.e. a simple count of its occurrences in the query, logged.

X2 = the log of the relative frequency of occurrence of the stem in the query; i.e. the count used in X1 divided by the query length, logged, with query length defined as the total number of occurrences of all word stems in the query.

X3 = the log of the absolute frequency of occurrence of the stem in the document.

X4 = the log of the relative frequency of occurrence of the stem in the document.

X5 = the log of the inverse document frequency of the stem in the collection; i.e. the proportion of documents containing at least one occurrence of the stem, inverted and logged.

X6 = the log of the global relative frequency of the stem in the collection; i.e. the fraction of word occurrences in the entire collection that are occurrences of the stem in question, logged.

The equation says roughly that if a TIPSTER query and a WSJ document were chosen at random, and a certain stem were found to occur in them both with frequency statistics X1, ..., X6, then in the absence of other knowledge it would be appropriate to apply this formula to estimate the logodds that the document is relevant to the query.

Next, suppose a query and document have N terms in common, leading via Eq. (1) to N different logodds estimates, one for each of the term matches A1 through AN. The second equation allows these estimates to be combined into a preliminary estimate of the logodds that the document is relevant to the query. It has the form

log O(R | A1, ..., AN) = log O(R) + Σ_{i=1,...,N} [ log O(R | Ai) - log O(R) ]   (2)

The only quantity on the right side that cannot be estimated using Eq. (1) is the prior logodds log O(R). The value -6.725 was used for this parameter.

Eq. (2) follows from the `Assumption of Linked Dependence.' Considering for simplicity only the special case of two properties, this assumption can be expressed as the equality

P(A1, A2 | R) / P(A1, A2 | not-R) = [ P(A1 | R) / P(A1 | not-R) ] x [ P(A2 | R) / P(A2 | not-R) ]

where not-R denotes the event that the document is not relevant. Intuitively this asserts that the degree of dependency that exists between properties A1 and A2 in the set of relevance-related query-document pairs is linked in a certain way with the degree of dependency that exists among the nonrelevance-related pairs. It is a weaker assumption than the independence postulates commonly encountered in the literature on probabilistic IR. For discussion and a derivation of Eq. (2) from the Linked Dependence Assumption see Cooper (1991) and Cooper, Dabney & Gey (1992, op. cit.).

The role of the third equation, as developed for this experiment at least, is to correct for deficiencies in the second equation. There are two major sources of distortion to contend with. One is that the validity of Eq. (2) depends on the Linked Dependence Assumption, a simplifying assumption that is at best only approximately true. Especially when the number N of term matches is large, Eq. (2) (if uncorrected) is capable of grossly overestimating the logodds of relevance. The other source of distortion is that Eq. (2) as it stands fails to take into account the fact that longer documents will tend to produce more term matches than shorter ones simply by chance.
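As a concrete illustration of how Eqs. (1) and (2) operate, the following short sketch applies them in Python. It is our own illustration rather than the original implementation (which was obtained by modifying SMART); the function names are hypothetical, and the inputs are assumed to be the already-logged statistics X1 through X6 defined above.

# Coefficients of Eq. (1), as fitted on the WSJ learning sample.
EQ1_INTERCEPT = -7.08
EQ1_COEFFS = (0.38, 0.04, 0.77, -0.07, 1.05, 0.23)   # multiply X1 .. X6

PRIOR_LOGODDS = -6.725   # estimated prior logodds log O(R)

def match_logodds(x):
    # Eq. (1): logodds of relevance given one stem match, where x is the
    # sequence (X1, ..., X6) of logged frequency statistics for that match.
    return EQ1_INTERCEPT + sum(b * xi for b, xi in zip(EQ1_COEFFS, x))

def preliminary_logodds(matches):
    # Eq. (2): combine the per-match estimates for the N stems shared by a
    # query and a document, under the Linked Dependence Assumption.
    z = sum(match_logodds(x) - PRIOR_LOGODDS for x in matches)
    return PRIOR_LOGODDS + z

The inner sum computed here is the quantity Z that the corrective third equation, taken up next, adjusts; Eq. (2)'s preliminary estimate is simply the prior logodds plus Z.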
If nothing were done to correct this second failing, longer documents could receive much higher relevance probability estimates than shorter ones merely by virtue of their length. This latter failing is related to a subtle criticism that can be raised against the policy of using only term matches, never mismatches, as the composite clues. Actually, a term match (i.e. term present in both query and document) is only one of four conditions that might obtain for a term vis-a-vis a given query and document. The others are: ii) term present in query but absent from document; iii) term present in document but absent from query; and iv) term absent from both query and document. Had clues of the other types been recognized, accorded their own regression equations, and allowed to make their own contribution to Z, it might not have been necessary to make any correction for document length. But because we have taken the computationally convenient shortcut of ignoring all evidence of types ii) - iv) at the first stage of the analysis, we must compensate somehow at the second stage for the distortion caused by that oversimplification.

The corrective equation, as developed for the WSJ collection through another application of logistic regression to the learning sample, is

log O(R | A1, ..., AN) = 6.08 + 3.63 log max(Z, 1) - 1.45 log L   (3)

where

Z = the value of the summation expression in the right side of Eq. (2); max(Z, 1) is the larger of Z and 1; and

L = the length of the document under consideration, expressed as the total number of stem occurrences in the document, counting separate occurrences of the same stem separately.

This estimate of log O(R | A1, ..., AN) is understood to modify and supersede the estimate produced by Eq. (2).

Since logodds are monotonically related to probabilities, a retrieval system could in principle probability-rank the documents of the collection for the user simply by presenting them in descending order of the logodds values assigned to them by Eq. (3). However, since probabilities are easier for users to interpret than logodds, the conditional logodds estimates of form log O(R | A1, ..., AN) produced by Eq. (3) were translated into ordinary conditional probability estimates with the help of a fourth equation, the identity

P(R | A1, ..., AN) = 1 / (1 + exp(-log O(R | A1, ..., AN)))   (4)

The probability estimates so obtained for the TIPSTER data appear in the `score' column of the rankings submitted by the Berkeley group.

Computational Arrangements

The computations required for storage and retrieval under the SLR methodology follow roughly the order of the four equations. Documents are indexed as follows. First, an incoming document is put through preparatory operations that include the removal of markup and other unwanted information, the deletion of stop words, and the stemming of the remaining words. Then for each stem, all stem statistics that can be calculated before the query is known -- specifically, X3, X4, X5, and X6 -- are collected. Finally, these statistics are used to compute the stem's indexing weight in the document using the formula

Doc stem weight = -7.08 + .77 X3 - .07 X4 + 1.05 X5 + .23 X6 - (-6.725)

This formula is just the right side of Eq. (1) without the terms for X1 and X2, and with the value of the prior logodds log O(R) subtracted off in preparation for the application of Eq. (2). The result is stored in the inverted file as the weight of the term for this document.

Query indexing, described below, is similar.
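Before turning to query indexing, the second-stage correction of Eq. (3) and the probability conversion of Eq. (4) can be added to the earlier sketch. As before, this is illustrative Python of our own with hypothetical names, not the SMART-based implementation; it assumes the value Z and the document length L are already in hand.

import math

# Coefficients of the second-stage (corrective) Eq. (3), WSJ learning sample.
EQ3_INTERCEPT = 6.08
EQ3_SLOPE_Z = 3.63     # multiplies log max(Z, 1)
EQ3_SLOPE_L = -1.45    # multiplies log L (document length in stem occurrences)

def corrected_logodds(z, doc_length):
    # Eq. (3): corrected logodds of relevance for a query-document pair.
    return (EQ3_INTERCEPT
            + EQ3_SLOPE_Z * math.log(max(z, 1.0))
            + EQ3_SLOPE_L * math.log(doc_length))

def relevance_probability(z, doc_length):
    # Eq. (4): translate the corrected logodds into a probability.
    return 1.0 / (1.0 + math.exp(-corrected_logodds(z, doc_length)))

Documents would then be presented to the user in descending order of relevance_probability (equivalently, of the corrected logodds, since the transformation of Eq. (4) is monotonic).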
After an incoming query has been stop-listed and stemmed, for each stem the two statistics X1 and X2 that depend upon query properties are calculated. The stem is then assigned as its weight in the query the value

Query stem weight = .38 X1 + .04 X2

This formula comprises the rest of the right side of Eq. (1).

To effect retrieval, the query is compared against each document with which it shares at least one stem. For each stem contained in both query and document, the query stem weight and the document stem weight are added, resulting in an estimate of log O(R | Ai) - log O(R) for the stem. Summing these estimates over all the matching stems yields the value of the summation expression in the right side of Eq. (2). This computation is analogous to the calculation of a vector product in the vector space model, except that corresponding query and document weights are added instead of multiplied. Calling this sum Z, Eq. (3) is applied to obtain the estimate of the logodds that the document is relevant to the query. Using Eq. (4), this logodds is translated into a probability, and the documents are sorted for the user in descending order of their estimated probabilities.

Constructing the Sample

The SLR method calls for the use of a learning sample of relevance judgements that can be assumed accurate. Among other requirements the sample must (a) be sufficiently large and include a sufficient number of queries; and (b) have a definite, complete statistical structure of some kind (e.g. that of a random or a stratified sample) that can serve as a basis for making statistical inferences. Arguably the TIPSTER data supplied for TREC experimentation fulfilled condition (a) for one or two of the five collections (WSJ and perhaps ZIFF). However, so far as could be ascertained from information made available to participants, the second condition was not satisfied for any of the five collections. The samples of relevance data that were provided had no apparent statistical structure of the kind upon which statistical reasoning is ordinarily based. The omission is understandable but unfortunate from the point of view of probabilistic retrieval.

For the present experiment the lack of a known statistical sample structure was highly problematical and almost fatal. It meant that the method of interest could not be rigorously applied to the available data. The theory on which SLR is based consists of a careful chain of statistical reasoning. If one of the links is weak, scientifically sound estimates of relevance probability cannot be expected.

Rather than abandon the experiment entirely, however, the following stopgap measure was adopted. An arbitrary assumption would be made about the structure of the sample, and the analysis would be carried out on the basis of this fictitious assumption as though it were known to be true. It was thought that this desperation measure would at least allow the investigators to gain experience in applying the method to a large data set (albeit on a hypothetical basis), and to illustrate the methodology for other interested parties. Also, it was thought that the fictitious assumption could be chosen in such a way that the experimental evidence gathered about the comparative worth of various retrieval clues would probably have at least some value. Finally, it was hoped that the output orderings might be reasonably effective in spite of the likely miscalibration of the retrieval status values as probability estimates.
While this policy allowed participation in the venture to continue, the artificiality of the constructed sample diminished to some unknown extent the retrieval effectiveness of the prototype system. This should be remembered when interpreting the results of the experiment: they do not fairly represent the SLR methodology's capabilities. Had the investigators been in a position to design their own sampling procedure, the results would presumably have been better. In the same spirit it should be kept in mind that the level of accuracy of the probability estimates themselves cannot be taken as indicative of the reliability of SLR.

The arbitrary assumption that was settled upon for the sample construction was that for the WSJ collection the universe of all query-document pairs is separable into two sets: one a small set rich in relevance-related pairs, the other a large set containing no relevance-related pairs. The supplied judgements of relevance and irrelevance for the WSJ were to be treated as though they were a random sample of one out of every ten members of the first of these two sets. This implies among other things the supposition that the professional searchers had found 10% of the relevance-related pairs that could have been found had the entire collection been examined for every query. The 10% figure was an unsupported guess.

This working assumption made it possible to construct a two-part (stratified) sample for the WSJ collection. The first portion or stratum of the sample consisted of the set of 2194 query-document pairs for which human judgements of relevance and nonrelevance were available. In the subsequent regression analysis, each of these was weighted by a factor of 10, reflecting the assumption stated above. The second part of the constructed sample consisted of a set of 1687 query-document pairs drawn essentially randomly from among all query-document pairs that shared at least one stem in common. (The latter sample was obtained by considering all query-document pairs constructible from the 52 training queries and the documents in the WSJ training collection, ordering them by query number and within query number by document number, choosing every 2444th pair from this ordering, and discarding any pairs so chosen for which there was no overlap between query and document. The result may be considered a random sample, though technically it was slightly preferable to a random sample insofar as it was stratified by query.) The pairs in this second set were assumed to lie outside the first set and to be nonrelevance-related (though not verified, this assumption was probably approximately true). Each pair was accorded a weight of 2400 in the subsequent regression analysis.

It may be worth reiterating that the necessity of specifying the number of relevance-related pairs by sheer guesswork, together with the general artificiality of the constructed sample, dashed all hope of producing well-calibrated estimates to present to the users.

The Regression Analysis

As the first stage of the regression analysis, for each query-document pair in the sample the set of all stems shared in common by the query and the document was assembled. This resulted in an expanded sample of 30,234 query-document-stem triples. For each triple, the values of the six statistics X1, ..., X6 were calculated and recorded along with the relevance judgement associated with the query and document in question.
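The six statistics are simple functions of raw counts. The following small sketch (hypothetical Python of our own, with argument names of our choosing) indicates how they could be computed for one stem match; natural logarithms are used throughout, as in the equations above.

import math

def stem_statistics(qtf, query_len, dtf, doc_len, df, ctf, n_docs, n_tokens):
    # qtf, dtf           : occurrences of the stem in the query / document
    # query_len, doc_len : total stem occurrences in the query / document
    # df                 : number of documents containing the stem
    # ctf                : total occurrences of the stem in the collection
    # n_docs, n_tokens   : documents and word occurrences in the collection
    x1 = math.log(qtf)               # absolute frequency in query, logged
    x2 = math.log(qtf / query_len)   # relative frequency in query, logged
    x3 = math.log(dtf)               # absolute frequency in document, logged
    x4 = math.log(dtf / doc_len)     # relative frequency in document, logged
    x5 = math.log(n_docs / df)       # inverse document frequency, logged
    x6 = math.log(ctf / n_tokens)    # global relative frequency, logged
    return (x1, x2, x3, x4, x5, x6)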
A weighted logistic regression analysis was performed on this data, with X1, ..., X6 serving as the independent variables and the binary relevance judgements as the dependent variable. The result was the set of coefficients in Eq. (1). In other words, Eq. (1) is the regression equation that was found to be the logistic equation of maximum likelihood (the `best fit') for the data in the sample.

How were the six retrieval clue-types chosen? Actually the aforedescribed regression analysis was performed repeatedly for various types and combinations of independent variables before these six were settled upon. Of those tried out, these turned out in combination to exhibit the most predictive power -- that is, to offer the best prospects of yielding useful estimates of logodds of relevance.

`Predictive power' was judged according to several interrelated criteria. One was the extent to which a model based on a clue-combination under investigation was found to fit the data, as measured by the -2 Log Likelihood statistic or one of its minor variants such as the Akaike Information Criterion. Another was the extent to which the ordering imposed on the triples by the logodds estimates assigned to them by the model resembled the ideal ordering in which all relevant triples precede all nonrelevant triples. This was measured statistically using various rank correlation coefficients, including Kendall's Tau-a and the Goodman-Kruskal Gamma. Still another property that was taken into account was the shape of a variable's graph when the observed logodds was plotted against the variable values arranged along the X-axis in deciles. In such a graph, a shape that roughly resembles a straight line is considered desirable. In some cases it was found that pretransforming a variable by taking its logarithm helped to produce such a straight line. This was in fact one of the motivations for logging the variables. Another motive for doing so was the observation that it seemed to improve the general fit of the model as measured by the other indicators. Fortunately, the various criteria used to optimize the choice of variables were rarely in qualitative conflict. But there was insufficient time to explore all variables of potential interest, or to investigate more elaborate possibilities such as interaction terms.

All regression analyses were run using the SAS statistical package (Version 6.06) on an IBM 3090 mainframe computer with accelerated math capabilities. The SAS package automatically supplies most of the diagnostic statistics mentioned above. Detailed discussions of logistic model building can be found in works on logistic regression (Hosmer & Lemeshow 1989; Collett 1991).

It is worth noting that all the choices among possible predictor variables were made, and their weights in the equation determined, as part of the regression analysis. There was no recourse to traditional retrieval trials based on precision, recall, and the like. The regression procedures were found to be more convenient and efficient than experimentation of the usual kind, in which there is no way other than trial-and-error to converge on optimal numeric coefficients. Ideally one might think of combining the two techniques -- that is, of using traditional methods to confirm the findings of the regression studies. However, time pressures prevented the pursuit of that luxury.
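The analyses themselves were run in SAS; purely as an illustration, a weighted logistic fit of the same general form could be expressed in Python with statsmodels roughly as follows. The data here are synthetic placeholders standing in for the real triples, and the variable names are ours.

import numpy as np
import statsmodels.api as sm

# Placeholder data standing in for the expanded sample of query-document-stem
# triples (the real analysis used 30,234 triples with the six logged
# statistics X1..X6, a 0/1 relevance judgement per parent pair, and a stratum
# weight of 10 for judged pairs or 2400 for the randomly drawn pairs).
rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 6))                      # stands in for X1..X6
y = rng.integers(0, 2, size=n)                   # stands in for judgements
w = np.where(rng.random(n) < 0.5, 10.0, 2400.0)  # stands in for stratum weights

# Weighted logistic regression: maximum-likelihood fit of the form of Eq. (1).
model = sm.GLM(y, sm.add_constant(X),
               family=sm.families.Binomial(),
               freq_weights=w)
result = model.fit()
print(result.params)   # intercept and coefficients for X1..X6
print(result.aic)      # Akaike Information Criterion, one of the fit criteria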
Next, Eq. (1) was applied to each triple in the sample to calculate the estimate of the logodds of relevance log O(R | Ai) for the query and document in question, given the characteristics of the match stem. Then in accordance with Eq. (2) the estimated prior logodds of relevance log O(R) = -6.725 was subtracted from each of these estimates, and the differences summed within each query-document pair to obtain the value of Z for the pair.

For the second-stage regression, the value of Z calculated as just described, together with the length L of the document, were recorded for each query-document pair in the sample. From these the value of log(Z / L^0.4) was calculated for each pair. (To ensure that the logarithm would always be well defined, Z was first replaced by max(Z, 1).) Using this ratio as the independent variable and the binary relevance judgements as the dependent variable, a second weighted logistic regression was run. This produced two coefficients -- an intercept and a slope -- specifying a regression equation. Simple manipulations were used to transform it into the form displayed earlier as Eq. (3). (Since log(max(Z, 1) / L^0.4) = log max(Z, 1) - 0.4 log L, a fitted slope of 3.63 on this single variable yields the coefficient 3.63 for log max(Z, 1) and -0.4 x 3.63 ≈ -1.45 for log L in Eq. (3).)

Here are some of the considerations that led to the use of log(Z / L^0.4) as the variable on which to regress. If one could trust the Linked Dependence Assumption completely, and if nonmatch as well as match events had been taken into explicit account in the computation of Z, one might have tried letting Z stand as-is as the desired logodds estimate. But because such is not the situation, this would fail to correct for dependency distortion and would give longer documents an unfair advantage. Thus it seemed advisable to perform a corrective linear transformation on Z, and moreover to normalize Z first by dividing it by some simple function of document length. It was found by trial and error that dividing Z by L raised to a power of around 0.4 seemed to remove most of the visible bias toward either very short documents or very long documents in the five collections. Logging the entire expression was found to improve the fit to the sample data.

It would have been appropriate to include a correction for query length analogous to the one developed for document length. However, for lack of time the necessary analysis could not be carried out.

Extrapolation to Other Collections

The regression analysis was confined to the WSJ data because relevance judgements were not available for most of the other collections in sufficient quantities, or for enough of the training queries, to make regression feasible. This circumstance brought with it the problem of how to extrapolate the WSJ retrieval rules to the remaining four collections.

Speaking generally, the extension of retrieval formulae to other collections is a significant problem throughout the IR field. One would like to know how to transfer design parameters from one collection, for which there is enough relevance data, to another collection for which there may be too little data or none. If the transfer could be accomplished without too much loss of predictive power, the almost exclusive use by IR experimenters of special `test collections' could be justified more easily. We welcomed the dearth of TIPSTER relevance data for some of the collections as an opportunity to explore this problem. To simplify the challenge and confront it in its starkest form, we elected to ignore entirely even such data as were available for collections other than WSJ.
The method used for the extrapolation was based on the well-known statistical concept of standardization of variables. The standardized value of a variable in a population is obtained by subtracting from its observed value the variable's mean value in the population, then dividing this difference by its standard deviation in the population. The new standardized values have a mean of zero and a standard deviation of one. The working assumption that was made was that a regression equation such as Eq. (1) can be carried over and applied in another collection provided all variables involved have first been standardized in both collections.

Although no variable values were actually standardized, the coefficients in Eq. (1) were recalculated for each of the other four collections in such a way as to create the same effect. The values for the population means and standard deviations used in the recalculation of the coefficients were taken from random samples of triples taken from the five collections. The samples were comparable to those in the `random' subsample of WSJ query-document triples described earlier.

The algebraic details of the transformation process will not be presented here, but the resulting modifications of the right side of the earlier WSJ form of Eq. (1) may be of interest. They are

For AP:   -7.21 + .40 X1 + .04 X2 + .88 X3 - .10 X4 + 1.09 X5 + .25 X6
For DOE:  -7.51 + .44 X1 + .05 X2 + 1.18 X3 - .12 X4 + .94 X5 + .24 X6
For FR:   -6.83 + .44 X1 + .04 X2 + .57 X3 - .06 X4 + 1.13 X5 + .24 X6
For ZIFF: -6.95 + .38 X1 + .04 X2 + .68 X3 - .07 X4 + 1.03 X5 + .21 X6

As an illustration of what the transformation of coefficients has accomplished, one sees that the coefficient for X3, the absolute frequency of the match term in the document, is largest for the DOE collection and smallest for the FR collection. This serves to compensate for the fact that the average document length is smallest in DOE and largest in FR.

No modifications were made to the coefficients in the second-stage regression equation, Eq. (3), when applying it in other collections. Because adjustments had already been made for inter-collection differences in the first-stage equation, the investigators were not convinced that further adjustments would be profitable at the second stage. Indeed, we are not entirely confident that the adjustments are a good idea even for all the variables at the first stage, but thought it worth the experiment.

Effectiveness Scores

In the TREC evaluations the 11-point average recall calculated for ad hoc retrieval by the Berkeley system was 0.151; the average number of relevant documents retrieved in the top-ranked 100 documents for each request was 40.8; and the average number of relevant documents retrieved in the top-ranked 200 documents for each request was 67.9. The comparable three figures for the medians of all systems submitting results for ad hoc retrieval were .157, 39.5, and 62.5 respectively. Comparing the Berkeley system's scores against the median scores, it can be seen that for the number of relevant documents retrieved among the top 200, the Berkeley system exceeded the median by 5.4 documents, an amount that can be shown (in a paired comparison t-test over the mean of the differences for the 50 topics) to be statistically significant at the 0.05 level. The other two scores do not differ from the corresponding median scores by what are customarily regarded as statistically significant amounts.
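A paired-comparison test of this kind is routine in any statistical package. The sketch below (Python with SciPy, our illustration only) indicates the form of the computation over the 50 topics; the per-topic counts shown are purely synthetic stand-ins for the actual evaluation data.

import numpy as np
from scipy import stats

# Synthetic per-topic counts of relevant documents retrieved in the top 200,
# one entry per topic for the system under evaluation and for the median.
rng = np.random.default_rng(0)
median_counts = rng.integers(20, 120, size=50)
system_counts = median_counts + rng.integers(-10, 20, size=50)

# Paired t-test over the per-topic differences.
res = stats.ttest_rel(system_counts, median_counts)
print(res.statistic, res.pvalue, res.pvalue < 0.05)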
Thus by one of the three measures the SLR experimental system was significantly more effective than the median system, and by the other two it was not significantly different from the median.

To put these results in perspective it should be remembered that, unlike other systems included in the comparison group, the SLR system was primitive in every respect except for its statistical logic. It involved no human reformulation of topics or other manual intervention, it used no relevance feedback, and it employed no special linguistic or other devices such as parsing systems, thesauri, disambiguation, phrase identification, or global/local combinations of evidence. It even forwent the use of training data from four of the five collections. Yet by a statistically careful use of simple frequency information alone, it was able to hold its own against typical systems that used much more elaborate forms of evidence.

Since the SLR methodology is hospitable to the introduction of additional clue-types, and indeed might be expected to wring a maximum amount of leverage out of them, the prospect for future, less primitive, SLR systems that combine many types of evidence seems promising.

Future Possibilities

Though the prototype system described here used only a few simple statistical clues, the SLR approach is general and in principle flexible enough to accommodate most of the clue-types that researchers have been interested in as predictors of relevance. Broadly speaking, retrieval evidence having to do with particular index terms lends itself to exploitation in the form of variables in the first-stage regression equation, while other kinds -- properties of the entire query, the entire document, or their relationship -- can be accommodated as variables in the second stage.

As an example of possible new evidence at the first stage, suppose that by virtue of parsing, suffix analysis, or dictionary lookup some information is available about the parts of speech of the match stems in the query and document. Then an additional categorical variable might be introduced into the first-level regression analysis to represent the match stem's part of speech in the document, on the hunch that some parts of speech (e.g. nouns) should be more heavily weighted than others. The general two-level form of the analysis would remain the same. A further possibility would be to introduce a variable to represent the event that the part of speech of the stem as it occurs in the query is the same as its part of speech in the context in which it occurs in the document.

Further clues could be introduced at the second stage. In the present experiment the only retrieval evidence introduced at the second stage that was not already present in the first was the document length L, which was intended more as an antidote to a bias in Z than as an independent predictor of relevance in its own right. But nothing prevents any helpful relationship between the query and document from being brought to bear. As an example, suppose a measure of the mutual closeness of the query's match stems in the document is to be introduced, on the hypothesis that the closer together the query stems tend to occur in the document, the likelier it is (other things being equal) that the document is relevant (Keen 1992). Such a measure of proximity could be added as a new variable in the second-level equation, with no other change being needed in the underlying statistical framework.
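To make concrete the shape such an extension might take, the second-stage equation could be enlarged along the following lines (an illustration of ours, not a fitted result; P and the coefficients c0, ..., c3 are hypothetical):

log O(R | A1, ..., AN) = c0 + c1 log max(Z, 1) + c2 log L + c3 P

where P is the proximity measure for the query-document pair, and the coefficients would be determined, as before, by a weighted logistic regression on the learning sample.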
Conclusions

1. The TREC results indicate that the SLR methodology is capable of achieving a respectable degree of retrieval effectiveness even when the retrieval evidence is confined to a few simple frequency clues. (`Respectable' in this context means competitive with the median performance of other systems, most of which use more elaborate evidence.) Since nothing prevents the incorporation of additional clue types into future SLR systems, and the regression procedure should help to combine them with existing clues in an optimal way, the outlook for the retrieval effectiveness of the SLR approach seems promising.

2. The prototype SLR system demonstrates that a probabilistic initial ranking can be achieved with a run-time efficiency approximately equivalent to that of a vector processing system utilizing similar kinds of evidence.

3. Standard statistical program packages can be used for an SLR analysis even in large collections, circumventing much of the need for retrieval trials of conventional design and allowing the choice of variables and the determination of optimal numerical parameters to be carried out more conveniently.

4. The TREC results suggest that use of the SLR approach need not necessarily be ruled out because of a lack of training data for the particular document collection to which it must be applied. A respectable level of effectiveness can (under at least some conditions) be achieved through the extrapolation into the new collection of regression equations derived from a collection for which training data already exists.

5. The present experiment failed to demonstrate that the SLR method is capable of producing, in large collections, probability of relevance estimates sufficiently well-calibrated to be presented to the users as part of the output display. It is suspected that this failure is associated with the peculiar limitations of the training data supplied for this initial TREC conference. If so, use of the more extensive training data to be made available for future conferences may be sufficient to resolve this problem.

Acknowledgements

The theory of staged logistic regression, developed in collaboration with Dan Dabney of U.C.L.A., was originally stimulated by discussions with James Allen and Gerard Salton of Cornell University and Stephen Robertson of City University, London. The computer science department at Cornell University provided a hospitable environment for the early stages of the theoretical development. Ray Larson at U.C. Berkeley contributed experienced advice on the conversion of SMART and on general systems problems. We are indebted to Chris Buckley for supporting our efforts to make SMART regressive, and to all past contributors to SMART for making this valuable research tool available. The work stations used for the experiment were DEC 5000's supplied by the Sequoia 2000 project at the University of California, a project principally funded by the Digital Equipment Corporation. A DARPA grant supported the programming effort.

References

Bookstein, A. Probability and fuzzy set applications to information retrieval. In M. Williams (ed.), Annual Review of Information Science and Technology, 20. White Plains, NY: Knowledge Industry Publications; 1985.

Collett, D. Modelling Binary Data. London: Chapman & Hall; 1991.

Cooper, W. S. Exploiting the maximum entropy principle to increase retrieval effectiveness. Journal of the American Society for Information Science, 34(1): 31-39; 1983.
Cooper, W. S. Inconsistencies and misnomers in probabilistic IR. Proceedings of the Fourteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Chicago: 57-62; October 1991.

Cooper, W. S.; Dabney, D.; Gey, F. Probabilistic retrieval based on staged logistic regression. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Copenhagen: 198-210; June 1992.

Cooper, W. S.; Huizinga, P. The maximum entropy principle and its application to the design of probabilistic retrieval systems. Information Technology: Research and Development, 1(2): 99-112; 1982.

Fox, E. A. Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Ph.D. Dissertation, Computer Science, Cornell University; 1983.

Fuhr, N. Optimal polynomial retrieval functions based on the probability ranking principle. ACM Transactions on Information Systems, 7(3): 183-204; 1989.

Fuhr, N.; Buckley, C. A probabilistic learning approach for document indexing. ACM Transactions on Information Systems, 9(3): 223-248; 1991.

Harper, D. J.; van Rijsbergen, C. J. An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34(3): 189-216; 1978.

Hosmer, D. W.; Lemeshow, S. Applied Logistic Regression. New York: Wiley; 1989.

Kantor, P. Maximum entropy and the optimal design of automated information retrieval systems. Information Technology: Research and Development, 3(2): 88-94; 1984.

Keen, E. M. Term position ranking: some new test results. Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. Copenhagen: 66-76; June 1992.

Lee, J. J.; Kantor, P. A study of probabilistic information retrieval systems in the case of inconsistent expert judgement. Journal of the American Society for Information Science, 42(3); 1990.

Maron, M. E. Probabilistic retrieval models. In B. Dervin and M. Voigt (eds.), Progress in Communication Sciences, Vol. V. Ablex; 1984: 145-176.

Maron, M. E.; Kuhns, J. L. On relevance, probabilistic indexing, and information retrieval. Journal of the Association for Computing Machinery, 7(3): 216-244; 1960.

Robertson, S. E.; Bovey, J. D. Statistical problems in the application of probabilistic models to information retrieval. British Library Research and Development Department, Report No. 5739; November 1982.

Robertson, S. E.; Sparck Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3): 129-146; 1976.

van Rijsbergen, C. J. A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, 33(2): 106-119; 1977.

van Rijsbergen, C. J. Information Retrieval (2nd ed.). London: Butterworth & Co. Ltd; 1979.

Yu, C. T.; Buckley, C.; Lam, H.; Salton, G. A generalized term dependence model in information retrieval. Information Technology: Research and Development, 2: 129-154; 1983.

Yu, C. T.; Salton, G. Precision weighting -- an effective automatic indexing method. Journal of the Association for Computing Machinery, 23(1): 76-88; 1976.