Optimizing Document Indexing and Search Term Weighting
Based on Probabilistic Models
Norbert Fuhr*        Chris Buckley†
Abstract
We describe the application of probabilistic indexing and retrieval methods to the TREC
material. For document indexing, we apply a description-oriented approach which uses relevance
feedback information from previous queries run on the same collection. This method is also very
flexible w.r.t. the underlying document representation. In our experiments, we consider single
words and phrases and use polynomial functions for mapping the statistical parameters of these
terms onto probabilistic indexing weights. Based on these weights, a linear (utility-theoretic)
retrieval function is applied when no relevance feedback data is available for the specific query.
Otherwise, the retrieval-with-probabilistic-indexing model can be used. The experimental results
show excellent performance in both cases, but also indicate possible improvements.
1 Learning in IR
[Figure 1: Learning approaches in IR. Along the dimensions terms, documents and queries, the figure contrasts the learning and application samples of search term weighting from relevance feedback (routing queries) with those of description-oriented indexing (ad-hoc queries).]
Figure 1 shows two major learning approaches that are used in IR, both of which are applicable to
the tasks to be performed within TREC. For the routing queries, we have relevance feedback data for
some documents w.r.t. a specific query, and then the system has to rank further documents for the
same query. As indicated by the third dimension, our knowledge is restricted to the terms we have
*University of Dortmund, Informatik VI, P.O. Box 500500, W-4600 Dortmund 50, Germany, fuhr@ls6.informatik.uni-dortmund.de
†Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853, USA, chrisb@cs.cornell.edu
seen in the learning sample. If the new documents contained totally different terms, the learning
sample would be of no use for coping with these documents. For dealing with the routing queries,
mostly search term weighting methods are applied, which are well established in IR.
For the ad-hoc queries, the task is more difficult: Given relevance information for some query-document
pairs, this information has to be exploited in order to rank new documents w.r.t. new queries; further-
more, most of these new query-document pairs will involve new terms. As a method for dealing with
this type of task, description-oriented indexing has been developed ([Fuhr & Buckley 91]). The major
concept of this approach is abstraction. For the routing queries, we abstract from specific documents
by regarding the presence or absence of terms. In description-oriented indexing, we have to abstract
in addition from specific queries and terms. This can be done by regarding features of these objects
instead of the objects themselves. Similar to pattern recognition methods, documents, queries and terms
are described by sets of features here. This way, new documents, queries and terms can be mapped
onto sets of features which we have already seen in the learning sample.
In principle, description-oriented learning could be combined with most IR models. Only the proba-
bilistic model, however, relates directly to retrieval quality. The probability ranking principle described
in [Robertson 77] states that optimum retrieval performance is achieved when documents are ranked
according to descending values of their probability of relevance. So our probabilistic approach uses
the learning data in order to optimize retrieval quality for the test sample. This statement also holds
for our work with the routing queries, where we apply the retrieval-with-probabilistic-indexing (RPI)
model for estimating the query term weights.
In the following section, we describe the description-oriented document indexing method which yields
probabilistic weights for terms w.r.t. documents. In order to use these weights in retrieval, two methods
are applied here. When no relevance information for the specific query is available, a utility-theoretic
retrieval function can be used, where the utility weights for the terms from the query are derived by
some heuristics. With relevance feedback data, however, the RPI model can be applied. Experiments
with the ad-hoc queries are described in section 3.1, followed by the presentation of our work with
the routing queries in section 3.2.
2 Probabilistic document indexing
2.1 General approach
We first give a brief outline of the description-oriented indexing approach, which is presented in full
detail in [Fuhr & Buckley 91]. Based on the binary independence indexing model ([Maron & Kuhns 60],
[Fuhr 89]), one can define probabilistic document indexing weights as follows: let d_m denote a
document, t_i a term and R the fact that a query-document pair is judged relevant; then P(R|t_i, d_m)
denotes the probability that document d_m will be judged relevant w.r.t. an arbitrary query that
contains term t_i. These weights can hardly be estimated directly, since there will not be enough
relevance information available for a specific document. Instead, the description-oriented indexing
approach is applied, where the indexing task is divided into two steps, namely a description step
and a decision step.
In the description step, term-document pairs (t_i, d_m) are mapped onto so-called relevance descriptions
x(t_i, d_m). The elements x_i of the relevance description contain values of features of t_i, d_m and
their relationship, e.g.

  tf          = within-document frequency (wdf) of t_i in d_m,
  logidf      = log(inverse document frequency),
  lognumterms = log(number of different terms in d_m),
  imaxtf      = 1/(maximum wdf of a term in d_m).
In the experiments described in the following, we only use these simple parameters. However, it should
be emphasized that the concept of relevance description is a means for coping with rather complex
forms of document representations, e.g. when advanced natural language processing techniques are
applied.
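For illustration, the four features above can be computed from simple document and collection statistics. The following Python sketch is ours, not the authors' implementation; in particular it assumes idf is computed as log(N/df), which the paper does not spell out:

```python
import math

def relevance_description(wdf, df, num_docs, num_terms, max_wdf):
    """Compute the four features of a relevance description x(t_i, d_m).

    wdf: within-document frequency of t_i in d_m
    df: document frequency of t_i; num_docs: collection size N
    num_terms: number of different terms in d_m
    max_wdf: maximum wdf of any term in d_m
    """
    return {
        "tf": wdf,
        "logidf": math.log(num_docs / df),   # assumed idf definition: N/df
        "lognumterms": math.log(num_terms),
        "imaxtf": 1.0 / max_wdf,
    }
```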
In the decision step, probabilistic indexing weights of the form P(R|x(t_i, d_m)) are estimated. Instead
of the probability P(R|t_i, d_m), which relates to a specific document and a specific term, P(R|x(t_i, d_m))
is the probability that an arbitrary term-document pair having relevance description x will be involved
in a relevant query-document relationship. This probability is estimated by a so-called indexing
function u(x). Different regression methods or probabilistic classification algorithms can serve as
indexing function. Here, we use polynomial regression. For this purpose, a polynomial structure
v(x) = (v_1, ..., v_L) has to be defined as

  v(x) = (1, x_1, x_2, ..., x_N, x_1^2, x_1 x_2, ...),

where N is the number of dimensions of x. The class of polynomials is given by the components
x_i^l · x_j^m · x_k^n · ... (i, j, k, ... ∈ [1, N]; l, m, n, ... > 0) which are to be included in the
polynomial. The indexing function now yields u(x) = a^T · v(x), where a = (a_1, ..., a_L)^T is the
coefficient vector which has to be estimated in a separate training phase preceding the application
of the indexing function. So P(R|x) is approximated by the polynomial

  u(x) = a_1 + a_2 x_1 + a_3 x_2 + ... + a_{N+1} x_N + a_{N+2} x_1^2 + a_{N+3} x_1 x_2 + ...
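A minimal Python sketch of this scheme (using NumPy; the expansion below takes the constant, all linear terms, all squares and all pairwise products as one possible choice of polynomial structure, whereas the paper selects a hand-picked subset of components):

```python
import numpy as np

def poly_structure(x):
    """Expand a feature vector x into the polynomial structure v(x):
    constant, linear terms, squares, and pairwise products."""
    x = np.asarray(x, dtype=float)
    comps = [1.0]
    comps.extend(x)                        # x_1 ... x_N
    comps.extend(x * x)                    # x_1^2 ... x_N^2
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            comps.append(x[i] * x[j])      # x_i * x_j
    return np.array(comps)

def indexing_weight(a, x):
    """u(x) = a^T * v(x): inner product of coefficient vector and components."""
    return float(np.asarray(a, dtype=float) @ poly_structure(x))
```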
In the training phase, a learning sample is derived from relevance feedback data in the following way:
for every query-document pair (q_k, d_m), a learning sample element (x(t_i, d_m), y) is generated for each
term t_i common to q_k and d_m, with y = 1 if d_m is relevant w.r.t. q_k, and y = 0 otherwise. Given this
set of learning sample elements, a linear regression method is applied in order to minimize the squared
error (y - u(x))^2 (see [Fuhr & Buckley 91] for the details). The result of the regression method is the
coefficient vector a which defines the indexing function.
In the application phase, for each term occurring in a document, a relevance description is constructed
first and then the indexing function gives the indexing weight of the term in the document.
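The training phase thus amounts to an ordinary linear least-squares fit. A sketch, assuming the learning sample elements have already been mapped onto their polynomial structures v (the function name and data layout are ours):

```python
import numpy as np

def train_indexing_function(samples):
    """Fit the coefficient vector a by linear least squares.

    samples: list of (v, y) pairs, where v is the polynomial structure
    v(x(t_i, d_m)) of a term common to query and document, and y is the
    relevance decision (1 = relevant, 0 = nonrelevant)."""
    V = np.array([v for v, _ in samples], dtype=float)
    y = np.array([y for _, y in samples], dtype=float)
    a, *_ = np.linalg.lstsq(V, y, rcond=None)
    return a
```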
2.2 Experiments
We developed two types of document indexing, one with single terms only and one with both single
words and phrases.
For the single words, we initially used a linear indexing function, which gave us the function

  u(x(t_i, d_m)) = -0.0019844
                 + 0.0000296 · tf
                 + 0.0001079 · imaxtf
                 + 0.0003519 · logidf
                 + 0.0003068 · lognumterms.
Retrieval experiments with this type of document indexing, however, revealed that the few extremely
long documents (Federal Register documents with up to 2.6 MB) were always retrieved in the top
ranks. This effect was due to the positive coefficient of lognumterms, which gave indexing weights
larger than 1 for the terms in these documents, thus yielding high retrieval status values for these
documents w.r.t. most queries.
Obviously a linear indexing function is not well suited for coping with such extreme distributions of
the elements of the relevance description. Besides the effect on the few very long documents, other
documents are also affected by this distribution, for the following reason: Many terms in the long
documents will be assigned indexing weights larger than 1. For the computation of the squared error
that is to be minimized, the quadratic difference (y - u(x))^2 between the indexing weight u(x) and
the actual relevance decision y (0 or 1) in the learning sample is considered. Even if y = 1, there is
still the difference (u(x) - 1)^2 for u(x) > 1. In order to reduce this error and thus the overall error,
a larger error for other indexing weights is taken into account. There are three possible strategies for
coping with this problem:
* Some experiments performed with other material have shown that overall indexing quality can
be improved by excluding outliers from the learning sample ([Pfeifer 91]). As outliers, not only
those pairs with weights lying on the "correct side" of the interval [0,1] (i.e. either y = 0 and
u(x) < 0 or y = 1 and u(x) > 1) should be regarded. Also the exclusion of those pairs with
weights lying on the "wrong side" (i.e. either y = 0 and u(x) > 1 or y = 1 and u(x) < 0) yields
better results.
* Using logistic functions instead of linear functions overcomes the problem of indexing weights
outside the interval [0,1]. However, we have some experimental evidence ([Pfeifer 90]) that even
in this case the removal of outliers from the learning sample (where the outliers are defined w.r.t.
a linear indexing function) improves the indexing quality.
* For the experiments described here, we switched from linear to polynomial functions, which also
gave fairly good results.
The function finally used for indexing with single terms only is

  u(x) =   0.00042293
         + 0.00150083 · tf · logidf · imaxtf
         - 0.00150665 · tf · imaxtf
         + 0.00010465 · logidf
         - 0.00122627 · lognumterms · imaxtf.
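This function can be transcribed directly for illustration (the function name is ours; the coefficients are those published above):

```python
def u_single(tf, logidf, lognumterms, imaxtf):
    """Single-word indexing weight with the published coefficients."""
    return (0.00042293
            + 0.00150083 * tf * logidf * imaxtf
            - 0.00150665 * tf * imaxtf
            + 0.00010465 * logidf
            - 0.00122627 * lognumterms * imaxtf)
```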
For indexing with phrases, we first had to derive the set of phrases to be considered. We took all
adjacent non-stopwords that occurred at least 25 times in the D1 (training) document set. Then
an indexing function common to single words and phrases was developed by introducing additional
binary factors is_single (is_phrase) having the value 1 if the term is a single word (a phrase) and 0
otherwise. The parameters tf and logidf were defined for phrases in the same way as for single words,
and in the computation of imaxtf and lognumterms for a document both phrases and single words
were considered. This procedure produced the indexing function
  u(x) =   0.00034104
         + 0.00141097 · is_single · tf · logidf · imaxtf
         - 0.00119826 · is_single · tf · imaxtf
         + 0.00014122 · is_single · logidf
         - 0.00120413 · lognumterms · imaxtf
         + 0.00820515 · is_phrase · tf · logidf · imaxtf
         - 0.04188466 · is_phrase · tf · imaxtf
         + 0.00114585 · is_phrase · logidf.
For each phrase occurring in a document, indexing weights for the phrase as well as for its two
components (as single words) were computed.
In the following, we will refer to the indexing with single words only as "word indexing" and to the
indexing using both single words and phrases as "phrase indexing".
3 Retrieval
For the probabilistic document indexing weights as described above, there are specific retrieval
functions which yield a probabilistic or utility-theoretic ranking of documents w.r.t. a query. These
functions differ in whether or not they consider relevance feedback information for the specific
query. For this reason different retrieval functions were applied for the ad-hoc queries and the routing
queries.
3.1 Ad-hoc queries
For the ad-hoc queries, we used a linear utility-theoretic retrieval function described in [Wong & Yao 89].
Let q_k^T denote the set of terms occurring in the query, d_m^T the set of document terms, and u_im the
indexing weight u(x(t_i, d_m)). If c_ik gives the utility of term t_i for the actual query q_k, then the utility
of document d_m w.r.t. query q_k can be computed by the retrieval function

  ρ(q_k, d_m) = Σ_{t_i ∈ q_k^T ∩ d_m^T} c_ik · u_im.    (1)

The only problem here is the estimation of the utility weights c_ik, for which we do not have a
theoretically justified method. As a heuristic approach, we used the SMART tf·idf weights, where tf
here denotes the number of occurrences of the term t_i in the query q_k [Salton & Buckley 88]. We assume that
there are other choices for this parameter which could significantly improve retrieval quality.
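Retrieval function (1) reduces to an inner product over the terms common to query and document. A sketch in Python, with dictionaries as a stand-in for the actual inverted-file implementation:

```python
def linear_retrieval_score(query_weights, doc_weights):
    """Retrieval function (1): sum of c_ik * u_im over the terms
    common to query and document.

    query_weights: {term: c_ik}, the heuristic utility weights
    doc_weights:   {term: u_im}, the probabilistic indexing weights"""
    common = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[t] * doc_weights[t] for t in common)
```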
  run                                 tf·idf    fuhra1      fuhrp1
                                                (words)     (phrases)
  average precision for recall points:
    11-pt Avg.                        0.1750    0.1943      0.2054
    3-pt Avg.                         0.1159    0.1399      0.1664
  query-wise comparison with median:
    11-pt Avg.                                  32:14       32:17
    Prec. @ 100 docs                            25:21       35:13
  best/worst results:
    11-pt Avg.                                  6(2)/1(1)   4(1)/2
    Prec. @ 100 docs                            2(2)/2      8(3)/1(2)
  single words vs. phrases:
    11-pt Avg.                                              25:25
    Prec. @ 100 docs                                        21:28

Table 1: Results for ad-hoc queries
The retrieval function (1) was applied for both kinds of document indexing, where words only were
considered for run fuhra1 and both words and phrases in the run fuhrp1. Figure 2 shows the recall-
precision curves for both runs in comparison to a standard SMART tf·idf run. It can be seen that
both probabilistic indexing methods clearly outperform the tf·idf run, with the exception of the
low-recall end for the phrase run (see below). Comparing word indexing with phrase indexing shows
mixed results (see also table 1): For the recall-precision averages, phrases perform better, again with
the exception of the low-recall end. More information is given by the recall and precision values at
different numbers of documents retrieved (see appendix of this volume). Precision for phrases is worse
than or about equal to that of words, but recall is always better. This means that phrases perform better
for narrow queries (with a small number of relevant documents).
[Figure 2: Recall-precision curves for probabilistic indexing (runs fuhra1 (words) and fuhrp1 (phrases)) in comparison to the tf·idf run.]
In general, one would expect that using phrases in addition to single words would never decrease
precision. We blame the current definition of the retrieval function (1) for the observed behaviour.
This function assumes that all the terms considered are independent, which is obviously not true when
we consider a phrase in addition to its two components as terms. Since the size of the error made
here depends on the indexing weights of the components, this fact may explain the precision decrease
for the highest ranked documents (i.e. at low recall levels). A better strategy would be to ignore the
components in case the phrase occurs in the document.
The query-wise comparison with the median (see table 1) shows a large scattering of the results for
the single queries. A preliminary analysis has indicated that the relative performance for a specific
query depends significantly on whether or not the query statement contains negated terms.
For example, in topic 86 we have... it is a bank, as opposed to another financial institution such as a
savings and loan or credit union ... and topic 87 reads ... Civil actions, such as creditor claims filed
for recovery of losses, are not relevant ... . In both cases, our system yields one of the worst results
(for word indexing). Since our query indexing procedure extracts only the terms from the query and is
not able to recognize negation, the system explicitly searches for documents containing these negated
terms. Theoretically, it is obvious that negated terms should be given a negative utility weight.
However, the examples cited here show that it is a difficult task to recognize negation automatically.
3.2 Routing queries
For the routing queries, the retrieval-with-probabilistic-indexing (RPI) model described in [Fuhr 89]
and [Fuhr 92] was applied. This model combines query-specific relevance feedback information with
probabilistic document indexing in order to estimate query term weights. Let p_ik denote the average
indexing weight of term t_i in the documents judged relevant w.r.t. query q_k, and r_ik the average
indexing weight of t_i in the nonrelevant documents. Now the query term weight is computed by the
formula

  c_ik = p_ik (1 - r_ik) / (r_ik (1 - p_ik)) - 1

and the RPI retrieval function yields

  ρ(q_k, d_m) = Σ_{t_i ∈ q_k^T ∩ d_m^T} log(c_ik · u_im + 1).    (2)
In our experiments, due to lack of time, we used the standard SMART tf·idf document indexing
here [Salton & Buckley 88] (single words only) instead of the probabilistic indexing described above.
After an initial retrieval run with tf·idf weights for both queries and documents, relevance feedback
information was used for computing the feedback query term weights c_ik. Only the terms occurring in
the query were considered here, so no query expansion took place. Theoretically, the RPI formula can
also be applied in case of query expansion. However, the additional terms should be treated differently
when estimating their query term weights. This problem has not been investigated yet for the RPI
model.
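The RPI weighting and retrieval function (2) can be sketched as follows (dictionary-based and purely illustrative; the function names are ours, and the real system works on inverted files):

```python
import math

def rpi_query_term_weight(p_ik, r_ik):
    """c_ik = p_ik(1 - r_ik) / (r_ik(1 - p_ik)) - 1, where p_ik (r_ik) is
    the average indexing weight of t_i in the (non)relevant documents."""
    return p_ik * (1 - r_ik) / (r_ik * (1 - p_ik)) - 1

def rpi_score(query_weights, doc_weights):
    """RPI retrieval function (2): sum of log(c_ik * u_im + 1)
    over the terms common to query and document."""
    common = query_weights.keys() & doc_weights.keys()
    return sum(math.log(query_weights[t] * doc_weights[t] + 1)
               for t in common)
```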
  # queries: 49
  query-wise comparison with median:
    11-pt Avg.          43:5
    Prec. @ 100 docs    43:5
  best/worst results:
    11-pt Avg.          12/1
    Prec. @ 100 docs    16/1

Table 2: Results for routing queries
Table 2 shows the results for the run fuhra2 with routing queries. It can be seen that this approach
works very well for almost every query. The single worst result is for topic #50, where we did
not retrieve any relevant document; this outcome is due to the fact that there was only one relevant
document in the training sample, which is obviously not sufficient for probabilistic parameter
estimation.
A Operational details of runs
A.1 Overall
All runs were done completely automatically, treating the text portions of both documents and queries
as flat text without structure. This made everything much simpler, but was not ideal given the
complexity in both form and meaning of the queries. Recall-precision definitely suffered.
All actual indexing (as opposed to weighting) and retrieval was done with the Cornell SMART Version
11.0 system, using the standard SMART procedures (e.g. stopwords, stemming, inverted-file retrieval).
All runs were made on a Sun Sparc 2 with 64 Mbytes of memory. All times reported are CPU time.
A.2 Ad-hoc runs
Two automatic ad-hoc runs were done; one in which the documents and queries were indexed with
single terms only, and one in which they were indexed with both single terms and adjacency phrases.
The overall procedure for both runs was:
1. Index D1 (the learning document set) and Q1 (the learning query set).
2. For each document d ∈ D1
   2.1 For each query q ∈ Q1
       2.1.1 Determine the relevance value r of d to q
       2.1.2 For each term t in common between q^T (set of query terms)
             and d^T (set of document terms)
             2.1.2.1 Find the values of the elements of the relevance description
                     involved in this run and add the values plus the relevance
                     information to the least-squares matrix being constructed
3. Solve the least-squares matrix to find the coefficient vector a
4. Index D1 ∪ D2 (both sets of documents together) with term-frequency weights.
5. For each document d ∈ D1 ∪ D2 (both sets of documents together)
   5.1 For each term t ∈ d^T
       5.1.1 Find the values of the relevance description x(t, d) involved in the run.
       5.1.2 Give t the weight a^T · v(x(t, d)),
             where a is the value determined in step 3.
   5.2 Add d to the inverted file.
6. Weight Q2 (the test query set) with tf·idf weights (ntc variant).
7. Run an inner-product inverted-file similarity match of Q2 against
   the inverted file formed in step 5, retrieving the top 200 documents.
In an operational environment, the learning document set and test document set would be the same,
so step 1 (or step 4) would be omitted. Once coefficients have been found, they should remain valid
unless the character of the collection changes. So new documents can be added to a dynamic collection
by just going through step 5 for each new document.
The algorithm above is different from that implemented in the past for this learning approach. Earlier
versions iterated over queries (instead of documents) in step 2 and used inverted files for speed (with
the document vectors still being needed). However, the size of TREC required a reformulation since
document vectors and inverted files could not be kept in memory.
The coefficient determination is valid only if the relevance judgements used are representative of the
entire set of relevance judgements. In this first TREC, the initial judgements are fragmentary and
not very representative; it is impossible to tell how much this affects things.
A.3 Single term automatic ad-hoc run
The single term ad-hoc run used 5 factors (described in 2.1):
* constant
* tf · logidf · imaxtf
* tf · imaxtf
* logidf
* lognumterms · imaxtf
Determining the coefficients for the factors in the single term run took about 1.7 hours (steps 2 + 3
above). Construction of the inverted file (steps 4 + 5) containing the factor-weighted terms took about
6.4 hours from scratch. Note that this is only about 1.0 hours longer than construction of a normal
(for SMART) tf·idf weighted inverted file.
Query indexing and weighting used the normal SMART procedures and took about 1.5 seconds. The
fields Topic, Nationality, Narrative, Concepts, Factors, Description were used to index the query, with
no distinction made between fields.
Retrieval plus ranking took 383 seconds.
A.4 Phrase automatic ad-hoc run
The phrases being used were two-term SMART adjacency phrases. Phrases were adjacent non-
stopwords, term components stemmed, that occurred at least 25 times in the D1 document set.
The term components were put into alphabetical order, thus the text phrases "information retrieval"
and "retrieving information" both mapped to the same phrase concept. The phrases were treated as
a separate ctype within an indexed vector, and had their own dictionary and inverted file separate
from those of the single terms.
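The mapping onto phrase concepts described above can be sketched as follows (the stemmer is left as a parameter; the identity stemmer used as a default here is only a placeholder, not the SMART stemmer):

```python
def phrase_concept(word1, word2, stem=lambda w: w):
    """Map an adjacent non-stopword pair onto its phrase concept:
    stem both components and put them into alphabetical order, so that
    'information retrieval' and 'retrieving information' coincide
    (provided the stemmer maps both components to the same stems)."""
    return tuple(sorted((stem(word1), stem(word2))))
```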
Determination of phrases took 5.8 hours, finding 4,700,000 phrases occurring in D1 at least once. Of
those phrases 158,000 occurred at least 25 times. These phrases were then put into a dictionary and
used as controlled vocabulary for phrases when doing the indexing of D1 (step 1) and D1 U D2 (step
4). The single term indexing remained exactly as it was in the single term run (terms occurring in
phrases were not removed from the vector).
The phrase term ad-hoc run used 8 factors (described in 2.2):
* constant
* is_single · tf · logidf · imaxtf
* is_single · tf · imaxtf
* is_single · logidf
* lognumterms · imaxtf
* is_phrase · tf · logidf · imaxtf
* is_phrase · tf · imaxtf
* is_phrase · logidf
Determining the coefficients for the factors in the phrase run took about 2.4 hours (steps 2 + 3 above).
Construction of the inverted file (steps 4 + 5) containing the factor-weighted terms took about 12.6
hours from scratch. Note that this is only about 1.7 hours longer than construction of a normal (for
SMART) tf·idf weighted inverted file with phrases.
Query indexing and weighting used the normal SMART procedures and took about 2.7 seconds. The
fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query,
with no distinction made between fields.
Retrieval plus ranking took 374 CPU seconds. (This was less than the single term run, but for no
apparent reason. Perhaps the machine was less loaded.)
A.5 Automatic routing run
There was one automatic routing run done. It was totally unconnected to the factor weighting
approach described above. Basically, it was very easy to implement and run so we decided we might
as well submit it. The actual weighting function had to be programmed, but even so less than 3
person-days in total was spent on routing.
The routing experiment format was treated almost exactly as a normal relevance feedback experimental
run. The overall procedure was:
1. Index query set Q1 and document set D1 with tf·idf weights
2. For each query q ∈ Q1
   2.1 For each term t ∈ q^T (set of query terms)
       2.1.1 Reweight term t using the RPI relevance weighting formula and
             the (fragmentary) relevance information supplied.
3. Index document set D2 with tf·idf weights. Note that the collection
   frequency information used in the idf weight was derived from occurrences
   in D1 only (in actual routing the collection frequencies within D2 would
   not be known).
4. Run the reweighted queries of Q1 (step 2) against the inverted file (step 3),
   returning the top 200 documents for each query.
This approach differs from true routing in that
A. All documents of D2 were indexed at once instead of indexing them individually and comparing
   against each query sequentially. Thus the indexing/retrieval times obtained for the above algorithm
   are pretty meaningless.
B. True routing is really a binary decision most often implemented as a similarity threshold, and
retrieving the top 200 documents (in ranked order) wouldn't normally be done. However, this
difference from true routing is required for evaluation purposes, and was thus required for TREC.
Note that the approach was completely automatic, with the query and documents treated as flat text
(no structure). Differing from the ad-hoc runs above, the queries were indexed including the words
from all topic sections.
It's unknown what effect the fragmentary relevance information had on the query reformulation. The
strength of the effect will depend on whether the top-similarity documents according to the tf·idf
weights used had been judged and included in the fragmentary judgements.
Step 1 took 3.0 hours, step 2 about 1304 seconds, step 3 about 1.9 hours, and step 4 about 312 seconds.
References
Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM
Transactions on Information Systems 9(3), pages 223-248.
Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and
Management 25(1), pages 55-72.
Fuhr, N. (1992). Integration of Probabilistic Fact and Text Retrieval. In: Belkin, N.; Ingwersen, P.;
Pejtersen, M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on
Research and Development in Information Retrieval, pages 211-222. ACM, New York.
Maron, M.; Kuhns, J. (1960). On Relevance, Probabilistic Indexing, and Information Retrieval.
Journal of the ACM 7, pages 216-244.
Pfeifer, U. (1990). Development of Log-Linear and Linear-Iterative Indexing Functions (in German).
Diploma thesis, TH Darmstadt, FB Informatik, Datenverwaltungssysteme II.
Pfeifer, U. (1991). Entwicklung linear iterativer und logistischer Indexierungsfunktionen. In: Fuhr,
N. (ed.): Information Retrieval, pages 23-37. Springer, Berlin et al.
Robertson, S. (1977). The Probability Ranking Principle in IR. Journal of Documentation 33,
pages 294-304.
Salton, G.; Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information
Processing and Management 24(5), pages 513-523.
Wong, S.; Yao, Y. (1989). A Probability Distribution Model for Information Retrieval. Information
Processing and Management 25(1), pages 39-53.