Full Text Retrieval based on Probabilistic Equations
with Coefficients fitted by Logistic Regression
Wm. S. Cooper
Aitao Chen
Fredric C. Gey
S.L.I.S., University of California
Berkeley, CA 94720
ABSTRACT
The experiments described here are part of a research
program whose objective is to develop a full-text
retrieval methodology that is statistically sound and
powerful, yet reasonably simple. The methodology is
based on the use of a probabilistic model whose
parameters are fitted empirically to a learning set of
relevance judgements by logistic regression. The
method was applied to the TIPSTER data with opti-
mally relativized frequencies of occurrence of match
stems as the regression variables. In a routing
retrieval experiment, these were supplemented by
other variables corresponding to sums of logodds asso-
ciated with particular match stems.
Introduction
The full-text retrieval design problem is largely a
problem in the combination of statistical evidence. With
this as its premise, the Berkeley group has concentrated on
the challenge of finding a statistical methodology for com-
bining retrieval clues in as powerful a way as possible,
consistent with reasonable analytic and computational
simplicity. Thus our research focus has been on the gen-
eral logic of how to combine clues, with no attempt made
at this stage to exploit as many clues as possible. We feel
that if a straightforward statistical methodology can be
found that extracts a maximum of retrieval power from a
few good clues, and the methodology is clearly hospitable
to the introduction of further clues in future, progress will
have been made.
We join Fuhr and Buckley (1991, 1993) in thinking
that an especially promising path to such a methodology is
to combine a probabilistic retrieval model with the tech-
niques of statistical regression. Under this approach a
probabilistic model is used to deduce the general form that
the document-ranking equation should take, after which
regression analysis is applied to obtain empirically-based
values for the constants that appear in the equation. In
this way the probabilistic theory is made to constrain the
universe of logically possible retrieval rules that could be
chosen, and the regression techniques complete the choice
by optimizing the model's fit to the learning data.
The probabilistic model adopted by the Berkeley
group is derived from a statistical assumption of `linked
dependence'. This assumption is weaker than the historic
independence assumptions usually discussed. In its sim-
plest form the Berkeley model also differs from most tra-
ditional models in that it is of `Type 0'-- meaning that the
analysis is carried out w.r.t. sets of query-document pairs
rather than w.r.t. particular queries or particular docu-
ments. (For a fuller explanation of this typology see
Robertson, Maron & Cooper 1982.) But when relevance
judgement data specialized to the currently submitted
search query is available, say in the form of relevance
feedback or routing history data, the model is flexible
enough to accommodate it (resulting in `Type 2' retrieval.)
Logistic regression (see e.g. Hosmer & Lemeshow
(1989)) is the most appropriate type of regression for this
kind of IR prediction. Although standard multiple regres-
sion analysis has been used successfully by others in com-
parable circumstances (Fuhr & Buckley op. cit.), we
believe logistic regression to be logically more appropri-
ate for reasons set forth elsewhere (Cooper, Dabney &
Gey 1992). Logistic regression, which accepts binary
training data and yields probability estimates in the form
of logodds values, goes hand-in-glove with a probabilistic
IR model that is to be fitted to binary relevance judgement
data and whose predictor variables are themselves
logodds.
The Fundamental Equation
As the starting point of our probabilistic retrieval
model we adopt (with reservations to be explained
presently) a statistical simplifying assumption called the
`Assumption of Linked Dependence' (Cooper 1991).
Intuitively, it states that in the universe of all query-
document pairs, the degree of statistical dependency that
exists between properties of pairs in the subset of all rele-
vance-related pairs is linked in a certain way with the
degree of dependency that exists between the same pair-
properties in the subset of all nonrelevance-related pairs.
Under at least one crude measure of degree of dependence
(specifically, the ratio of the joint probability to the prod-
uct of individual probabilities), the linkage in question is
one of identity. That is, the claim made by the assumption
is that the degree of the dependency is the same in both
sets.
For N pair-properties $A_1, \ldots, A_N$, the mathematical
statement of the assumption is as follows.

ASSUMPTION OF LINKED DEPENDENCE:

$$\frac{P(A_1, \ldots, A_N \mid R)}{P(A_1, \ldots, A_N \mid \bar{R})} = \frac{P(A_1 \mid R)}{P(A_1 \mid \bar{R})} \times \cdots \times \frac{P(A_N \mid R)}{P(A_N \mid \bar{R})}$$
If one thinks of a query and document drawn at random, R
is the event that the document is relevant to the query,
while each clue $A_i$ is some respect in which the document
is similar or dissimilar to, or somehow related to, the
query.
For most clue-types likely to be of interest this sim-
plifying assumption is at best only approximately true.
Though not as implausible as the strong conditional inde-
pendence assumptions traditionally discussed in proba-
bilistic IR, still it should be regarded with skepticism. In
practice, when the number N of clues to be combined
becomes large the assumption can become highly unreal-
istic, with a distorting tendency that often causes the prob-
ability of relevance estimates produced by the resulting
retrieval model to be grossly overstated for documents
near the top of the output ranking. But it will be expedi-
ent here to adopt the assumption anyway, on the under-
standing that later on we shall have to find a way to curb
its probability-inflating tendencies.
From the Assumption of Linked Dependence it is
possible to derive the basic equation underlying the proba-
bilistic model:
Equation (1):

$$\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \left[ \log O(R \mid A_i) - \log O(R) \right]$$

where for any events E and E' the odds $O(E \mid E')$ is by
definition $P(E \mid E') / P(\bar{E} \mid E')$. Using this equation, the evidence of
N separate clues can be combined as shown in the right
side to yield the logodds of relevance based on all clues,
shown on the left. Query-document pairs can be ranked
by this logodds estimate, and a ranking of documents for a
particular query by logodds is of course a probability
ranking in the IR sense. For further discussion of the
linked dependence assumption and a formal derivation of
Eq. (1) from it, see Cooper (1991) and Cooper, Dabney &
Gey (1992). An empirical investigation of independence
and dependence assumptions in IR is reported by Gey
(1993).
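To make the combination rule concrete, here is a minimal
sketch (ours, purely illustrative; the odds values are
hypothetical, not taken from the TIPSTER data) that evaluates
the right side of Eq. (1) and converts the result back to a
probability:

```python
import math

def combined_logodds(prior_logodds, clue_logodds):
    # Eq. (1): log O(R | A_1,...,A_N) =
    #   log O(R) + sum over i of [log O(R | A_i) - log O(R)]
    return prior_logodds + sum(lo - prior_logodds for lo in clue_logodds)

def logodds_to_probability(logodds):
    # Invert the logodds transform: p = odds / (1 + odds).
    odds = math.exp(logodds)
    return odds / (1.0 + odds)

# Hypothetical numbers: prior odds of relevance 1:99, and three clues
# whose individual odds estimates are 1:9, 1:4, and 1:19.
prior = math.log(1 / 99)
clues = [math.log(1 / 9), math.log(1 / 4), math.log(1 / 19)]
print(logodds_to_probability(combined_logodds(prior, clues)))  # about 0.93
```

Note how three individually weak clues drive the estimate far
above the prior; this is precisely the inflationary behavior,
discussed above, that linked dependence can produce as N grows.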
Application to Term Properties
We shall consider as a first application of the model-
ing equation the problem of how to exploit the properties
of match terms. By a `match' term we mean a stem (or
word, phrase, etc.) that occurs in the query and also, in
identical or related form, in the document to which the
query is being compared. The retrieval properties of a
match term could include its frequency characteristics in
the query, in the document, or in the collection as a whole,
its grammatical characteristics, the type or degree of
`match' involved, etc. If there are M match terms that
relate a certain query to a certain document, Eq. (1)
becomes applicable with N set to M. Each of the proper-
ties $A_1, \ldots, A_M$ then represents a composite clue or set of
properties concerning one of the query-document pair's
match terms.
Suppose a `learning set' of human judgements of
relevance or nonrelevance is available for a sample of rep-
resentative query-document pairs. (However, for the time
being we assume no such learning data is available for the
very queries on which the retrieval is to be performed.)
Logistic regression can be applied to the data to develop
an equation capable of using the match term clues to esti-
mate, for any query-document pair, values for each of the
expressions in the right side of Eq. (1). Eq. (1) then yields
a probability estimate for the pair.
TABLE I: Equations to Approximate the Components of Eq. (1)

$\log O(R) \approx [b_0 + b_1 M]$

$\log O(R \mid A_1) - \log O(R) \approx [a_0 + a_1 X_{1,1} + \cdots + a_K X_{1,K} + a_{K+1} M] - [b_0 + b_1 M]$

 ⋮

$\log O(R \mid A_M) - \log O(R) \approx [a_0 + a_1 X_{M,1} + \cdots + a_K X_{M,K} + a_{K+1} M] - [b_0 + b_1 M]$

$\log O(R \mid A_1, \ldots, A_M) \approx \left[ a_0 M + a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + a_{K+1} M^2 \right] - (M-1)[b_0 + b_1 M]$
What must be done, then, is to find approximating
expressions for the various components in the right side of
Eq. (1). Table I (above) shows the interrelationships
among the quantities involved. In this array, the material
to the left of the column of approximation signs just
restates Eq. (1), for Eq. (1) asserts that the expressions
above the horizontal line sum to the expression below the
line. On the right of each approximation sign is a simple
formula that might be used to approximate the expression
on the left. (More complex formulae could of course also
be considered.) Each right-side formula involves a linear
combination of K variables Xm.i,... , Xm.K corresponding
to the K retrieval properties (frequency values, etc.) used
to characterize the term in question. The idea is to apply
logistic regression to find the values for the coefficients
(the a's and b's) in the right side that produce the best
regression fit to the learning sample. Once these constants
have been determined, retrieval can be carried out by sum-
ming the right sides to obtain the desired estimate of
$\log O(R \mid A_1, \ldots, A_M)$ for any given query-document
pair.
It may not be immediately apparent why terms in M
have been included in the approximating formulas on the
right. In Eq. (1), the number N of available retrieval clues
is actually part of the conditionalizing information, a fact
that could have been stressed by writing $O(R \mid N)$ in place
of $O(R)$ and $O(R \mid A_i, N)$ in place of $O(R \mid A_i)$. So the
approximating formula for $\log O(R)$ is really an approxi-
mation for $\log O(R \mid M)$ and must involve a term in M.
(Simple functions of M such as $\sqrt{M}$ or $\log(M)$ might also
be considered.) Similar remarks apply to the formulas for
approximating $\log O(R \mid A_m)$.
There are two approaches to the problem of finding
coefficients to use in the approximating expressions. The
first is what might be called the `triples-then-pairs'
approach. It starts out with a logistic regression analysis
performed directly on a sample of query-document-term
triples constructed from the learning set. In the sample,
each triple is accompanied by the clue-values associated
with the match term in the query and document, and by
the human relevance judgement for the query and docu-
ment. The clue-values are used as the values of the inde-
pendent variables in the regression, and the relevance
judgements as the values of the dependent variable.
After the desired coefficients have been obtained, the
resulting regression equation can be applied to evaluate
the left-side expressions, which can in turn be summed
down the M terms to obtain a preliminary estimate of
$\log O(R \mid A_1, \ldots, A_M)$.
A second regression analysis is then performed on a
sample of query-document pairs, using the preliminary
logodds prediction from the first (triples) analysis as an
independent variable. This second analysis serves the pur-
pose of correcting for distortions introduced by the
Assumption of Linked Dependence. It also allows
retrieval properties not associated with particular match
terms, such as document length, age, etc., to be intro-
duced. The triples-then-pairs approach is the one used by
the Berkeley group in its Trec-1 research (Cooper, Gey &
Chen 1993). The theory behind it is presented in more
detail in (Cooper, Dabney & Gey 1992).
The second way of determining the coefficients --
the one used in Trec-2 -- is the `pairs-only' approach. It
requires only one regression analysis, performed on a
sample of query-document pairs. It is based on the trivial
observation that in the right side of the above array
instead of adding across rows and then down the resulting
column of sums, one can equivalently add down columns
and across the resulting row of sums. Under either proce-
dure the grand total value for $\log O(R \mid A_1, \ldots, A_M)$ will
be the same.
Summing down the columns and then across the
totals gives the expression shown in the bottom line of the
array. It simplifies to
$$\log O(R \mid A_1, \ldots, A_M) \approx b_0 + a_1 \sum_{m=1}^{M} X_{m,1} + \cdots + a_K \sum_{m=1}^{M} X_{m,K} + (a_0 - b_0 + b_1)\,M + (a_{K+1} - b_1)\,M^2$$
Since there is no need to keep the $a_i$ coefficients segre-
gated from the $b_i$ coefficients to get a predictive equation,
this suggests the adoption of a regression equation of form
$$\log O(R \mid A_1, \ldots, A_M) \approx C_0 + C_1 \sum_{m=1}^{M} X_{m,1} + \cdots + C_K \sum_{m=1}^{M} X_{m,K} + C_{K+1}\,M + C_{K+2}\,M^2$$
The coefficients $C_0, \ldots, C_{K+2}$ may be found by a logistic
regression on a sample of query-document pairs con-
structed from the learning sample. In the sample each pair
is accompanied by its K different $X_{m,k}$-values, each
already summed over all match terms for the pair, the val-
ues of M and $M^2$, and (to serve as the dependent variable
in the regression) the human relevance judgement for the
pair.
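As an illustration of the pairs-only fit, the sketch below sets
the regression up with scikit-learn, a modern library used here
purely for exposition (the packages actually used are described
under Computational Experience below); the data are synthetic
placeholders:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical learning sample: one row per query-document pair.
rng = np.random.default_rng(0)
n_pairs, K = 1000, 3
X_sums = rng.normal(size=(n_pairs, K))  # k-th property summed over match terms
M = rng.integers(1, 30, size=n_pairs).astype(float)  # number of match terms
rel = rng.integers(0, 2, size=n_pairs)  # placeholder binary judgements

# Design matrix matching C_0 + C_1*S_1 + ... + C_K*S_K + C_{K+1}*M + C_{K+2}*M^2.
design = np.column_stack([X_sums, M, M ** 2])

fit = LogisticRegression().fit(design, rel)
C0, Cs = fit.intercept_[0], fit.coef_[0]  # C_0 and C_1,...,C_{K+2}

# The fitted linear predictor is the estimated logodds of relevance.
logodds = C0 + design @ Cs
```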
But if only one level of regression analysis is to be
performed, where is the correction for the Assumption of
Linked Dependence to take place? That assumption
causes mischief because it creates a tendency for the pre-
dicted logodds of relevance to increase roughly linearly
with the number of match terms, whereas the true increase
is less than linear. This can be corrected by modifying the
variables in such a way that their values rise less rapidly
than the number of match terms as the number of match
terms increases. The variables can, for instance, be multi-
plied by some function f(M) that drops gently with
increasing M, say $1/\sqrt{M}$ or $1/(1 + \log M)$. The exact form of
the function can be decided during the course of the
regression analysis.
With such a damping factor included, the foregoing
regression equation becomes
Equation (2):
$$\log O(R \mid A_1, \ldots, A_M) \approx C_0 + C_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + C_K f(M) \sum_{m=1}^{M} X_{m,K} + C_{K+1}\,M + C_{K+2}\,M^2$$
In our experiments, this simple modification was found to
improve the regression fit and the precision/recall perfor-
mance. It would appear therefore to be a worthwhile
refinement of the basic model. Note, however, that this
adjustment only removes a general bias. It does nothing
to exploit the possibility of measuring dependencies
between particular stems to improve retrieval effective-
ness. To exploit individual dependencies would be desir-
able in principle, but would require a substantial elabora-
tion of the model for what might well turn out to be an
insignificant improvement in effectiveness (for discussion
see Cooper (1991)).
Optimally Relativized Frequencies
The philosophy of the project called for the use of a
few well-chosen retrieval clues. The most obvious clues
to be exploited in connection with match terms are their
frequencies of occurrence in the query and document.
What is not so obvious is the exact form the frequencies
should take. For instance, should they be absolute or rela-
tive frequencies, or something in between?
The IR literature mentions two ways in which fre-
quencies might be relativized for use in retrieval. The first
is to divide the absolute frequency of occurrence of the
term in the query or document by the length of the query
or document, or some parameter closely associated with
length. The second is to divide the relative frequency so
obtained by the relative frequency of occurrence of the
term in the entire collection considered as one long run-
ning text. Both kinds of relativization seem potentially
beneficial, but the question remains whether these rela-
tivizations, if they are indeed helpful, should be carried
out in full strength, or whether some sort of blend of abso-
lute and relative frequencies might serve better.
To answer this question, regression techniques were
used in a side investigation to discover the optimal extent
to which each of the two kinds of relativization might best
be introduced. To investigate the first kind of relativiza-
tion -- relativization with respect to query or document
length -- a semi-relativized stem frequency variable of the
form
$$\frac{\text{absolute frequency}}{\text{document length} + C}$$
was adopted. If the constant C is chosen to be zero, one
has full relativization, whereas a value for C much greater
than the document length will cause the variable to behave
in the regression analysis as though it were an absolute
frequency. Several logistic regressions were run to dis-
cover by trial-and-error the value of C that produced the
best regression fit to the TIPSTER learning data and the
highest precision and recall. It was found that a value of
around C = 35 for queries, and C = 80 for documents,
optimized performance.
For the relativization with respect to the global rela-
tive frequency in the collection, the relativizing effect of
the denominator was weakened by a slighfly different
method -- by raising it to a power less than 1.0. For the
document frequencies, for instance, a variable of form
$$\frac{\dfrac{\text{absolute frequency in doc}}{\text{document length} + 80}}{\left[\text{relative frequency in collection}\right]^{D}}$$
with D < 1 was in effect used. Actually, it was the loga-
rithm of this expression that was ultimately adopted,
which allowed the variable to be broken up into a differ-
ence of two logarithmic expressions. The optimal value
of the power D was therefore obtainable in a single
regression as the coefficient of the logged denominator.
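In other words, writing $v$ for the semi-relativized document
frequency and $w$ for the relative collection frequency,

$$\log \frac{v}{w^{D}} = \log v - D \log w,$$

so the logged denominator can enter the regression as a variable
in its own right, and its fitted coefficient determines the
optimal value of $D$ directly.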
Thus the variables ultimately adopted consisted in
essence of sums over two optimally relativized frequen-
cies (`ORF's) -- one for match stem frequency in the
query and one for match stem frequency in the document.
Because of the logging this breaks up mathematically into
three variables. A final logistic regression using sums
over these variables as indicated in Eq. (2) produced the
ranking equation
Equation (3):

$$\log O(R \mid A_1, \ldots, A_M) \approx -3.51 + \frac{1}{\sqrt{M}}\,\Phi + 0.0929\,M$$

where $\Phi$ is the expression

$$37.4 \sum_{m=1}^{M} X_{m,1} + 0.330 \sum_{m=1}^{M} X_{m,2} - 0.1937 \sum_{m=1}^{M} X_{m,3}$$
Here
$X_{m,1}$ = number of times the m'th stem occurs in the
query, divided by (total number of all stem occur-
rences in query + 35);
$X_{m,2}$ = number of times the m'th stem occurs in the
document, divided by (total number of all stem
occurrences in document + 80), quotient logged;
$X_{m,3}$ = number of times the m'th stem occurs in the
collection, divided by the total number of all stem
occurrences in the collection, quotient logged;
M = number of distinct stems common to both
query and document.
Although Eq. (2) calls for an $M^2$ term as well, such a term
was found not to make a statistically significant contribu-
tion and so was eliminated.
Eq. (3) provided the ranking rule used in the ad hoc
run (labeled `Brkly3') and the first routing run (`Brkly4')
submitted for Trec-2. The equation is notable for the spar-
sity of the information it uses. Essentially it exploits only
two ORF values for each match stem, one for the stem's
frequency in the query and the other for its frequency in
the document. Other variables were tried including the
inverse document frequency (both logged and unlogged),
a variable consisting of a count of all two-stem phrase
matches, and several variables for measuring the tendency
of the match stems to bunch together in the document. All
of these exhibited predictive power when used in isola-
tion, but were discarded because in the presence of the
ORF's none produced any detectable improvement in the
regression fit or the precision/recall performance. Some
attempts at query expansion using the WordNet thesaurus
also failed to produce noticeable improvement, even when
care was taken to create separate variables with separate
coefficients for the synonym-match counts as opposed to
the exact-match counts.
The quality of a retrieval output ranking matters
most near the top of the ranking where it is likely to affect
the most users, and matters hardly at all far down the
ranking where hardly any users are apt to search. Because
of this it is desirable to adopt a sampling methodology that
produces an especially good regression fit to the sample
data for documents with relatively high predicted proba-
bilities of relevance. In an attempt to achieve this empha-
sis, the regression analysis was carried out in two steps. A
logistic regression was first performed on a stratified sam-
ple of about one-third of the then-available TIPSTER rele-
vance judgements to obtain a preliminary ranking rule.
This preliminary equation was then applied as a screening
rule to the entire available body of judged pairs, and for
each query all but the highest-ranked 500 documents were
discarded. The resulting set of 50,000 judged query-
document pairs (100 topics x 500 top-ranked docs per
topic) served as the learning sample data for the final
regression equation displayed above as Eq. (3).
Application to Query-Specific Learning Data
To this point it has been assumed that relevance
judgement data is available for a sample of queries typical
of the future queries for which retrieval is to be per-
formed, but not for those very queries themselves. We
consider next the contrasting situation in which the learn-
ing data that is available is a set of past relevance judge-
ments for the very queries for which new documents are
to be retrieved. Such data is often available for a routing
or SDI request for which relevance feedback has been col-
lected in the past, or for a retrospective feedback search
that has already been under way long enough for some
feedback to have been obtained.
For a query so equipped with its very own learning
sample, it is possible to gather separate data about each
individual term in the query. Such data reflects the term's
special retrieval characteristics in the context of that
query. For instance, for a query term $T_i$ we may count the
number $n(T_i, R)$ of documents in the sample that con-
tain the term and are also relevant to the query, the num-
ber $n(T_i, \bar{R})$ of documents that contain the term but are not
relevant to the query, and so forth.
The odds that a future document will turn out to
be relevant to the query, given that it contains the term,
can be estimated crudely as
$$O(R \mid T_i) \approx \frac{n(T_i, R)}{n(T_i, \bar{R})}$$
To refine this estimate, a Bayesian trick adapted from
Good (1965) is useful. One simply adds the two figures
used in expressing the prior odds of relevance (i.e. $n(R)$
and $n(\bar{R})$) into the numerator and denominator respec-
tively with an arbitrary weight p as follows:

$$O(R \mid T_i) \approx \frac{n(T_i, R) + p\,n(R)}{n(T_i, \bar{R}) + p\,n(\bar{R})}$$
The smaller the value assigned to p, the more influ-
ence the sample evidence will have in moving the esti-
mate away from the prior odds. The adjustment causes
large samples to have a greater effect than small, as seems
natural. It also forestalls certain absurdities implicit in the
unadjusted formula, for instance the possibility of calcu-
lating infinite odds estimates.
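A worked example with made-up counts shows the effect of the
adjustment:

```python
def smoothed_odds(n_t_rel, n_t_nonrel, n_rel, n_nonrel, p):
    # Adjusted odds O(R | T): blend the term's feedback counts with the
    # prior counts n(R) and n(R-bar) at weight p (adapted from Good 1965).
    return (n_t_rel + p * n_rel) / (n_t_nonrel + p * n_nonrel)

# Made-up feedback sample: 40 relevant and 360 nonrelevant judged documents;
# the term occurred in 12 relevant documents and in no nonrelevant ones.
# Unadjusted, the estimate would be 12/0, i.e. infinite odds.
print(smoothed_odds(12, 0, 40, 360, p=0.05))  # (12 + 2) / (0 + 18), about 0.78
```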
Suppose now that a query contains Q distinct terms
$T_1, \ldots, T_M, T_{M+1}, \ldots, T_Q$, numbered in such a way that
the first M of them are match terms that occur in the docu-
ment against which the query is to be compared. The fun-
damental equation can be applied by taking N to be Q,
and interpreting the retrieval clues $A_1, \ldots, A_N$ as the pres-
ence or absence of particular query terms in the document.
The pertinent data can be organized as shown in Table II
(below).

TABLE II: Reinterpretation of the Components of Eq. (1) for Routing Retrieval

$\log O(R) \approx \log \dfrac{n(R)}{n(\bar{R})}$

$\log O(R \mid A_1) - \log O(R) \approx \log \dfrac{n(T_1, R) + p\,n(R)}{n(T_1, \bar{R}) + p\,n(\bar{R})} - \log \dfrac{n(R)}{n(\bar{R})}$

 ⋮

$\log O(R \mid A_M) - \log O(R) \approx \log \dfrac{n(T_M, R) + p\,n(R)}{n(T_M, \bar{R}) + p\,n(\bar{R})} - \log \dfrac{n(R)}{n(\bar{R})}$

$\log O(R \mid A_{M+1}) - \log O(R) \approx \log \dfrac{n(\bar{T}_{M+1}, R) + p\,n(R)}{n(\bar{T}_{M+1}, \bar{R}) + p\,n(\bar{R})} - \log \dfrac{n(R)}{n(\bar{R})}$

 ⋮

$\log O(R \mid A_Q) - \log O(R) \approx \log \dfrac{n(\bar{T}_Q, R) + p\,n(R)}{n(\bar{T}_Q, \bar{R}) + p\,n(\bar{R})} - \log \dfrac{n(R)}{n(\bar{R})}$

$\log O(R \mid A_1, \ldots, A_Q) \approx \sum_{m=1}^{M} \log \dfrac{n(T_m, R) + p\,n(R)}{n(T_m, \bar{R}) + p\,n(\bar{R})} + \sum_{m=M+1}^{Q} \log \dfrac{n(\bar{T}_m, R) + p\,n(R)}{n(\bar{T}_m, \bar{R}) + p\,n(\bar{R})} - (Q-1) \log \dfrac{n(R)}{n(\bar{R})}$

(Here $\bar{T}_m$ denotes the absence of term $T_m$ from a document.)
Thus we are led to postulate, for query-specific
learning data, a retrieval equation of form
Equation (4):
$$\log O(R \mid A_1, \ldots, A_Q) \approx C_0 + C_1 f(Q) \left[ \Phi_1 + \Phi_2 - \Phi_3 \right]$$

where

$$\Phi_1 = \sum_{m=1}^{M} \log \frac{n(T_m, R) + p\,n(R)}{n(T_m, \bar{R}) + p\,n(\bar{R})}$$

$$\Phi_2 = \sum_{m=M+1}^{Q} \log \frac{n(\bar{T}_m, R) + p\,n(R)}{n(\bar{T}_m, \bar{R}) + p\,n(\bar{R})}$$

$$\Phi_3 = (Q-1) \log \frac{n(R)}{n(\bar{R})}$$
and where f(Q) is some restraining function intended as
before to subdue the inflationary effect of the Assumption
of Linked Dependence. The coefficients C0, C1 are found
by a logistic regression over a sample of query-document
pairs involving many queries.
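Computing the query-specific sums from the feedback counts is
straightforward; a sketch under the definitions above, with
names of our own choosing:

```python
import math

def eq4_phis(match_counts, nonmatch_counts, n_rel, n_nonrel, p):
    """Return (Phi_1, Phi_2, Phi_3) of Eq. (4).

    match_counts    -- (n(T_m, R), n(T_m, R-bar)) for the M match terms
    nonmatch_counts -- (n(T-bar_m, R), n(T-bar_m, R-bar)) for the Q - M
                       query terms absent from the document
    """
    def adjusted_logodds(a, b):
        return math.log((a + p * n_rel) / (b + p * n_nonrel))

    phi1 = sum(adjusted_logodds(a, b) for a, b in match_counts)
    phi2 = sum(adjusted_logodds(a, b) for a, b in nonmatch_counts)
    Q = len(match_counts) + len(nonmatch_counts)
    phi3 = (Q - 1) * math.log(n_rel / n_nonrel)
    return phi1, phi2, phi3
```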
Combining Query-Specific and Query-Nonspecific
Data
If a query-specific set of relevance judgements is
available for the query to be processed, a larger
nonspecific set may well be available at the same time. If
so the theory developed in the foregoing section can be
applied in conjunction with the earlier theory to capture
the benefits of both kinds of learning sets. The retrieval
equation will then contain variables not only of the kind
occurring in Eq. (4) but also of the Eq. (2) kind.
It is convenient to formulate this equation in such a
way that it contains as one of its terms the entire ranking
expression developed earlier for the nonspecific learning
data. For the TIPSTER data the combined equation takes
the form:
Equation (5):

$$\log O(R \mid A_1, \ldots, A_M, A'_1, \ldots, A'_Q) \approx 0.688\,\Phi_4 + 0.344\,[\Phi_1 + \Phi_2 - \Phi_3] + 0.0623$$

where $\Phi_4$ is the entire right side of Eq. (3) and $\Phi_1$, $\Phi_2$, $\Phi_3$
are as defined in Eq. (4). This form for the equation is
computationally convenient if Eq. (3) is to be used as a
preliminary screening rule to eliminate unpromising
documents, with Eq. (5) in its entirety applied only to rank
those that survive the screening.
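A sketch of that screening arrangement, assuming scoring
functions built along the lines of the earlier sketches (the
helper names are ours):

```python
def route_documents(candidates, eq3_score, phi_terms, keep=500):
    """Two-stage ranking: screen by Eq. (3), then rerank by Eq. (5).

    candidates -- iterable of document ids
    eq3_score  -- doc id -> right side of Eq. (3), i.e. Phi_4
    phi_terms  -- doc id -> (Phi_1, Phi_2, Phi_3) of Eq. (4)
    """
    # Stage 1: cheap screening by the nonspecific equation.
    screened = sorted(candidates, key=eq3_score, reverse=True)[:keep]

    # Stage 2: combined query-specific / nonspecific score of Eq. (5).
    def eq5_score(doc_id):
        phi4 = eq3_score(doc_id)
        phi1, phi2, phi3 = phi_terms(doc_id)
        return 0.688 * phi4 + 0.344 * (phi1 + phi2 - phi3) + 0.0623

    return sorted(screened, key=eq5_score, reverse=True)
```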
Eq. (5) was used to produce the Trec-2 routing run
`Brkly5'. Its coefficients were determined by a logistic
regression constrained in such a way as to make the
query-specific variables contribute about twice as heavily
as the nonspecific, when contribution is measured by stan-
dardized coefficient size. This emphasis was largely arbi-
trary; finding the optimal balance between the query-
specific and the general contributions remains a topic for
future research. A value of $20/n(R)$ was used for p. This
choice too was arbitrary, and it would be interesting to try
to optimize it experimentally for some typical collection
(trying out, perhaps, numbers larger than 20, divided by
the total number of documents in the query's learning set).
No restraining function f(Q) was used in the final form of
Eq. (5) because none that were tried out produced any dis-
cernible improvement in fit or retrieval effectiveness in
this context.
Calibration of Probability Estimates
The most important role of the relevance probability
estimates calculated by a probabilistic IR system is to rank
the output documents in as effective a search order as pos-
sible. For this ranking function it is only the relative sizes
of the probability estimates that matter, not their absolute
magnitudes. However, it is also desirable that the absolute
sizes of these estimates be at least somewhat realistic. If
they are, they can be displayed to provide guidance to the
users in their decisions as to when to stop searching down
the ranking. This capability is a potentially important side
benefit of the probabilistic approach.
One way of testing the realism of the probability
estimates is to see whether they are `well-calibrated'.
Good calibration means that when a large number of prob-
ability estimates whose magnitudes happen to fall in a cer-
tain small range are examined, the proportion of the trials
in question with positive outcomes also falls in or close to
that range. To test the calibration of the probability pre-
dictions produced by the Berkeley approach, the 50,000
query-document pairs in the ad hoc entry Brkly3 along
with their accompanying relevance probability estimates
were sorted in descending order of magnitude of estimate.
Pairs for which human judgements of relevance-
relatedness were unavailable were discarded; this left
22,352 sorted pairs for which both the system's probabil-
ity estimates of relevance and the `correct' binary judge-
ments of relevance were available. This shorter list was
divided into blocks of 1,000 pairs each -- the thousand
pairs with the highest probability estimates, the thousand
with the next highest, and so forth. Within each block the
`actual' probability was estimated as the proportion of the
1,000 pairs that had been judged to be relevance-related
by humans. This was compared against the mean of all
the system probability estimates in the block. For a well-
calibrated system these figures should be approximately
equal.
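The blockwise comparison is easy to reproduce for any ranked
output; a sketch of the bookkeeping:

```python
import numpy as np

def calibration_table(estimates, judgements, block=1000):
    # Sort pairs by descending probability estimate, then compare the mean
    # estimate in each block to the proportion judged relevant (cf. Table III).
    order = np.argsort(estimates)[::-1]
    est = np.asarray(estimates, dtype=float)[order]
    rel = np.asarray(judgements, dtype=float)[order]
    rows = []
    for start in range(0, len(est), block):
        e, r = est[start:start + block], rel[start:start + block]
        rows.append((start + 1, start + len(e), e.mean(), r.mean()))
    return rows  # (first rank, last rank, mean estimate, proportion relevant)
```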
The results of the comparison are displayed in Table
III. It can be seen that the system's probability predic-
tions, while not wildly inaccurate, are generally somewhat
higher than the actual proportions of relevant pairs. The
same phenomenon of mild overestimation was observed
when the runs Brkly4 and BrklyS were tested for well-
calibratedness in a similar way.
Since no systematic overestimation was observed
when the calibration of the formula was originally tested
against the learning data, it seems likely that the
TABLE III: Calibration of Ad Hoc Relevance-Probability Estimates

Query-doc Pair Ranks     Mean System Probability Estimate     Proportion Actually Relevant
1 to 1,000                           0.66                                0.60
1,001 to 2,000                       0.63                                0.47
2,001 to 3,000                       0.61                                0.44
3,001 to 4,000                       0.58                                0.41
4,001 to 5,000                       0.55                                0.38
5,001 to 6,000                       0.53                                0.34
6,001 to 7,000                       0.50                                0.36
7,001 to 8,000                       0.48                                0.36
8,001 to 9,000                       0.46                                0.36
9,001 to 10,000                      0.44                                0.38
10,001 to 11,000                     0.42                                0.39
11,001 to 12,000                     0.41                                0.36
12,001 to 13,000                     0.39                                0.37
13,001 to 14,000                     0.37                                0.36
14,001 to 15,000                     0.36                                0.35
15,001 to 16,000                     0.34                                0.31
16,001 to 17,000                     0.32                                0.29
17,001 to 18,000                     0.31                                0.28
18,001 to 19,000                     0.29                                0.23
19,001 to 20,000                     0.28                                0.22
20,001 to 21,000                     0.25                                0.21
21,001 to 22,000                     0.23                                0.23
22,001 to 22,352                     0.18                                0.19
overestimation seen in the table is due mainly to the shift
from learning data to test data. Naturally, predictive for-
mulae that have been fine-tuned to a certain set of learning
data will perform less well when applied to a new set of
data to which they have not been fine-tuned. If this is
indeed the root cause of the observed overestimation, it
could perhaps be compensated for (at least to an extent
sufficient for practical purposes) by the crude expedient of
lowering all predicted probabilities to, say, around four
fifths of their originally calculated values before display-
ing them to the users.
Computational Experience
The statistical program packages used in the course
of the analysis included SAS, S, and BLSS. Of these,
SAS provides the most complete built-in diagnostics for
logistic regression. BLSS was found to be especially con-
venient for interactive use in a UNIX environment, and
ended up being the most heavily used.
The prototype retrieval system itself was imple-
mented as a modification of the SMART system with
SMART's vector-similarity subroutines replaced by the
probabilistic computations of Eqs. (3) and (5). For the
runs Brkly3 and Brkly4, which used only Eq. (3), only
minimal modifications of SMART were needed, and at
run time the retrieval efficiency remained essentially the
same as for the unmodified SMART system. This demon-
strates that probabilistic retrieval need be no more cum-
bersome computationally than the vector processing alter-
natives. For Brkly5, which used Eq. (5), the modifications
were somewhat more extensive and retrieval took about
20% longer.
Retrieval Effectiveness
The Berkeley system achieved an average precision
over all documents (an `11-point average') of 32.7% for
the ad hoc retrieval run Brkly3, and 29.0% and 35.4%
respectively for the routing runs Brkly4 and Brkly5. The
distinct improvement in effectiveness of Brkly5 over
Brkly4 suggests that in routing retrieval the use of fre-
quency information about individual query stems is worth
while.
At the `0 recall level' a precision of 84.7%, the
highest recorded at the conference, was achieved in the ad
hoc run. The high effectiveness of the Berkeley system
for the first few retrieved documents may be explainable
in terms of the practice, mentioned earlier, of redoing the
regression analysis for the highest-ranked 500 documents
for each query. This technique ensures an especially good
regression fit for the query~document pairs that are espe-
cially likely to be relevant, thus emphasizing good perfor-
mance near the top of the ranking where it is most impor-
tant.
The generally high retrieval effectiveness of the
Berkeley system should be interpreted in the light of the
fact that the system probably uses less evidence -- that is,
fewer retrieval clues -- than any of the other high-
performing TREC-2 systems. In fact, the only clues used
were the frequency characteristics of single stems (not
even phrases were included). What this suggests is that
the underlying probabilistic logic may have the capacity to
exploit exceptionally fully whatever clues may be
available.
Summary and Conclusions
The Berkeley design approach is based on a proba-
bilistic model derived from the linked dependence
assumption. The variables of the probability-ranking
retrieval equation and their coefficients are determined by
logistic regression on a judgement sample. Though the
model is hospitable to the utilization of other kinds of evi-
dence, in this particular investigation the only variables
used were optimally relativized frequencies (ORF's) of
match stems.
The approach was found to have the following
advantages:
1. Experimental Efficiency. Since the numeric coeffi-
cients in a regression equation are determined
simultaneously in one computation, trial-and-error
experimentation involving the evaluation of
retrieval output to optimize parameters is largely
avoidable.
2. Computational Simplicity. For ad hoc retrieval and
routing retrieval that does not involve individual
stem statistics, the computational simplicity and
efficiency achieved by the model at run time are
comparable to that of simple vector processing
retrieval models. For routing retrieval that exploits
individual stem frequencies the programming is
somewhat more complicated and runs slightly
slower.
3. Effective Retrieval. The level of retrieval effective-
ness as measured by precision and recall is high rel-
ative to the simple clue-types used.
4. Potential for Well-Calibrated Probability Estimates.
In-the-ballpark estimates of document relevance
probabilities suitable for output display would
appear to be within reach.
Acknowledgements
We are indebted to Ray Larson and Chris Plaunt for
helpful systems advice, as well as to the several col-
leagues already acknowledged in our Trec-1 report. The
work stations used for the experiment were supplied by
the Sequoia 2000 project at the University of California, a
project principally funded by the Digital Equipment Cor-
poration. A DARPA grant supported the programming
effort.
References
Cooper, W. S. (1991). Inconsistencies and Mis-
nomers in Probabilistic IR. Proceedings of the Fourteenth
Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval.
Chicago, 57-62.
Cooper, W. S.; Dabney, D.; Gey, F. (1992). Proba-
bilistic retrieval based on staged logistic regression. Pro-
ceedings of the Fifteenth Annual International ACM
SIGIR Conference on Research and Development in Infor-
mation Retrieval. Copenhagen, 198-210.
Cooper, W. S.; Gey, F. C.; Chen, A. (1993). Proba-
bilistic retrieval in the TIPSTER Collections: An Applica-
tion of Staged Logistic Regression. The First Text
REtrieval Conference (TREC-1). NIST Special Publica-
tion 500-207. Washington, 73-88.
Fuhr, N.; Buckley, C. (1993). Optimizing Docu-
ment Indexing and Search Term Weighting Based on
Probabilistic Models. The First Text REtrieval Confer-
ence (TREC-1). NIST Special Publication 500-207.
Washington, 89-100.
Fuhr, N.; Buckley, C. (1991). A probabilistic learn-
ing approach for document indexing. ACM Transactions
on Information Systems, 9, 223-248.
Gey, F. C. (1993). Probabilistic Dependence and
Logistic Inference in Information Retrieval. Ph.D. Disser-
tation, University of California, Berkeley, California.
Good, I. J. (1965). The estimation of probabilities:
An essay on modern Bayesian methods. Cambridge,
Mass.: M.I.T. Press.
Hosmer, D. W.; Lemeshow, S. (1989). Applied
Logistic Regression. New York: Wiley.
Robertson, S. E.; Maron, M. E.; Cooper, W. S.
(1982). Probability of relevance: A unification of two
competing models for document retrieval. Information
Technology: Research and Development 1, 1-21.