Optimizing Document Indexing and Search Term Weighting Based on Probabilistic Models

Norbert Fuhr*    Chris Buckley†

*University of Dortmund, Informatik VI, P.O. Box 500500, W-4600 Dortmund 50, Germany, fuhr@ls6.informatik.uni-dortmund.de
†Department of Computer Science, Upson Hall, Cornell University, Ithaca, NY 14853, USA, chrisb@cs.cornell.edu

Abstract

We describe the application of probabilistic indexing and retrieval methods to the TREC material. For document indexing, we apply a description-oriented approach which uses relevance feedback information from previous queries run on the same collection. This method is also very flexible w.r.t. the underlying document representation. In our experiments, we consider single words and phrases and use polynomial functions for mapping the statistical parameters of these terms onto probabilistic indexing weights. Based on these weights, a linear (utility-theoretic) retrieval function is applied when no relevance feedback data is available for the specific query. Otherwise, the retrieval-with-probabilistic-indexing model can be used. The experimental results show excellent performance in both cases, but also indicate possible improvements.

1 Learning in IR

[Figure 1: Learning approaches in IR. Dimensions: documents, queries, terms; routing queries: search term weighting learned from relevance feedback; ad-hoc queries: description-oriented indexing.]

Figure 1 shows two major learning approaches that are used in IR, both of which are applicable to the tasks to be performed within TREC. For the routing queries, we have relevance feedback data for some documents w.r.t. a specific query, and then the system has to rank further documents for the same query. As indicated by the third dimension, our knowledge is restricted to the terms we have seen in the learning sample. If the new documents contained totally different terms, the learning sample would be of no use for coping with these documents. For dealing with the routing queries, mostly search term weighting methods are applied, which are well established in IR.

For the ad-hoc queries, the task is more difficult: given relevance information for some query-document pairs, this information has to be exploited in order to rank new documents w.r.t. new queries; furthermore, most of these new query-document pairs will involve new terms. As a method for dealing with this type of task, description-oriented indexing has been developed ([Fuhr & Buckley 91]). The major concept of this approach is abstraction. For the routing queries, we abstract from specific documents by regarding the presence or absence of terms. In description-oriented indexing, we have to abstract in addition from specific queries and terms. This can be done by regarding features of these objects instead of the objects themselves. Similar to pattern recognition methods, documents, queries and terms are described by sets of features here. This way, new documents, queries and terms can be mapped onto sets of features which we have already seen in the learning sample.

In principle, description-oriented learning could be combined with most IR models. Only the probabilistic model, however, relates directly to retrieval quality. The probability ranking principle described in [Robertson 77] states that optimum retrieval performance is achieved when documents are ranked according to descending values of their probability of relevance.
So our probabilistic approach uses the learning data in order to optimize retrieval quality for the test sample. This statement also holds for our work with the routing queries, where we apply the retrieval-with-probabilistic-indexing (RPI) model for estimating the query term weights.

In the following section, we describe the description-oriented document indexing method which yields probabilistic weights for terms w.r.t. documents. In order to use these weights in retrieval, two methods are applied here. When no relevance information for the specific query is available, a utility-theoretic retrieval function can be used, where the utility weights for the terms from the query are derived by some heuristics. With relevance feedback data, however, the RPI model can be applied. Experiments with the ad-hoc queries are described in section 3.1, followed by the presentation of our work with the routing queries in section 3.2.

2 Probabilistic document indexing

2.1 General approach

We first give a brief outline of the description-oriented indexing approach, which is presented in full detail in [Fuhr & Buckley 91]. Based on the binary independence indexing model ([Maron & Kuhns 60], [Fuhr 89]), one can define probabilistic document indexing weights as follows: let d_m denote a document, t_i a term and R the fact that a query-document pair is judged relevant; then P(R|t_i, d_m) denotes the probability that document d_m will be judged relevant w.r.t. an arbitrary query that contains term t_i. These weights can hardly be estimated directly, since there will not be enough relevance information available for a specific document. Instead, the description-oriented indexing approach is applied, where the indexing task is divided into two steps, namely a description step and a decision step.

In the description step, term-document pairs (t_i, d_m) are mapped onto so-called relevance descriptions x(t_i, d_m). The elements x_i of the relevance description contain values of features of t_i, d_m and their relationship, e.g.

    tf          = within-document frequency (wdf) of t_i in d_m,
    logidf      = log(inverse document frequency of t_i),
    lognumterms = log(number of different terms in d_m),
    imaxtf      = 1/(maximum wdf of a term in d_m).

In the experiments described in the following, we only use these simple parameters. However, it should be emphasized that the concept of relevance description is a means for coping with rather complex forms of document representations, e.g. when advanced natural language processing techniques are applied.

In the decision step, probabilistic indexing weights of the form P(R|x(t_i, d_m)) are estimated. Instead of the probability P(R|t_i, d_m), which relates to a specific document and a specific term, P(R|x(t_i, d_m)) is the probability that an arbitrary term-document pair having relevance description x will be involved in a relevant query-document relationship. This probability is estimated by a so-called indexing function u(x). Different regression methods or probabilistic classification algorithms can serve as indexing function. Here, we use polynomial regression. For this purpose, a polynomial structure v(x) = (v_1, ..., v_L) has to be defined as

    v(x) = (1, x_1, x_2, ..., x_N, x_1^2, x_1*x_2, ...),

where N is the number of dimensions of x. The class of polynomials is given by the components x_i^l * x_j^m * x_k^n * ... (i, j, k, ... ∈ [1, N]; l, m, n, ... > 0) which are to be included in the polynomial.
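To make the description step concrete, the following sketch (in Python, with hypothetical helper names; not the SMART implementation) computes the four features above for one term-document pair and maps them onto a polynomial vector v(x). The particular set of polynomial components chosen here is only one possible choice of polynomial structure.

    import math
    from collections import Counter

    def relevance_description(term, doc_terms, num_docs, doc_freq):
        """The four features of the relevance description x(t_i, d_m) for one term-document
        pair. doc_terms is the list of (stemmed) terms of the document, doc_freq the
        document frequency of `term` in the collection."""
        counts = Counter(doc_terms)
        tf = counts[term]                         # within-document frequency of t_i
        logidf = math.log(num_docs / doc_freq)    # log of the inverse document frequency
        lognumterms = math.log(len(counts))       # log of the number of different terms in d_m
        imaxtf = 1.0 / max(counts.values())       # 1 / maximum wdf of a term in d_m
        return [tf, logidf, lognumterms, imaxtf]

    def polynomial_structure(x):
        """One possible polynomial structure v(x): the constant 1, the linear components,
        and all products of two components (including the squares)."""
        v = [1.0] + list(x)
        v += [x[i] * x[j] for i in range(len(x)) for j in range(i, len(x))]
        return v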
The indexing function now yields u(x) = a^T * v(x), where a = (a_1, ..., a_L)^T is the coefficient vector which has to be estimated in a separate training phase preceding the application of the indexing function. So P(R|x) is approximated by the polynomial

    u(x) = a_1 + a_2*x_1 + a_3*x_2 + ... + a_{N+1}*x_N + a_{N+2}*x_1^2 + a_{N+3}*x_1*x_2 + ...

In the training phase, a learning sample is derived from relevance feedback data in the following way: for every query-document pair (q_k, d_m), a learning sample element (x(t_i, d_m), y) is generated for each term t_i common to q_k and d_m, with y = 1 if d_m is relevant w.r.t. q_k, and y = 0 otherwise. Given this set of learning sample elements, a linear regression method is applied in order to minimize the squared error (y - u(x))^2 (see [Fuhr & Buckley 91] for the details). The result of the regression method is the coefficient vector a which defines the indexing function. In the application phase, for each term occurring in a document, a relevance description is constructed first and then the indexing function gives the indexing weight of the term in the document.
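The training and application phases might be sketched as follows, assuming numpy and the hypothetical polynomial_structure helper above; this is only an illustration of the least-squares step, not the implementation used for the experiments.

    import numpy as np

    def train_indexing_function(sample, poly):
        """sample: list of (x, y) learning sample elements, where x = x(t_i, d_m) for a term
        common to query and document, and y is the relevance decision (1 relevant, 0 not).
        poly: function mapping x onto the polynomial vector v(x).
        Returns the coefficient vector a minimizing the squared error sum of (y - a.v(x))^2."""
        V = np.array([poly(x) for x, _ in sample])          # one row v(x) per sample element
        y = np.array([label for _, label in sample], float)
        a, *_ = np.linalg.lstsq(V, y, rcond=None)
        return a

    def indexing_weight(a, x, poly):
        """Application phase: u(x) = a^T * v(x), the estimate of P(R | x(t_i, d_m))."""
        return float(np.dot(a, poly(x)))

With only the constant and the linear components included in v(x), this setup yields linear indexing functions of the kind quoted in section 2.2 below.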
2.2 Experiments

We developed two types of document indexing, one with single terms only and one with both single words and phrases. For the single words, we initially used a linear indexing function, which gave us the function

    u(x(t_i, d_m)) = -0.0019844 + 0.0000296*tf + 0.0001079*imaxtf + 0.0003519*logidf + 0.0003068*lognumterms.

Retrieval experiments with this type of document indexing, however, revealed that the few extremely long documents (Federal Register documents with up to 2.6 MB) were always retrieved in the top ranks. This effect was due to the positive coefficient of lognumterms, which gave indexing weights larger than 1 for the terms in these documents, thus yielding high retrieval status values for these documents w.r.t. most queries. Obviously a linear indexing function is not well suited for coping with such extreme distributions of the elements of the relevance description. Besides the effect on the few very long documents, other documents are also affected by this distribution, for the following reason: many terms in the long documents will be assigned indexing weights larger than 1. For the computation of the squared error that is to be minimized, the quadratic difference (y - u(x))^2 between the indexing weight u(x) and the actual relevance decision y (0 or 1) in the learning sample is considered. Even if y = 1, there is still the difference (u(x) - 1)^2 for u(x) > 1. In order to reduce this error and thus the overall error, a larger error for other indexing weights is accepted. There are three possible strategies for coping with this problem:

* Some experiments performed with other material have shown that overall indexing quality can be improved by excluding outliers from the learning sample ([Pfeifer 91]). As outliers, not only those pairs with weights lying on the "correct side" of the interval [0,1] (i.e. either y = 0 and u(x) < 0 or y = 1 and u(x) > 1) should be regarded; also the exclusion of those pairs with weights lying on the "wrong side" (i.e. either y = 0 and u(x) > 1 or y = 1 and u(x) < 0) yields better results.

* Using logistic functions instead of linear functions overcomes the problem of indexing weights outside the interval [0,1]. However, we have some experimental evidence ([Pfeifer 90]) that even in this case the removal of outliers from the learning sample (where the outliers are defined w.r.t. a linear indexing function) improves the indexing quality.

* For the experiments described here, we switched from linear to polynomial functions, which also gave fairly good results.

The function finally used for indexing with single terms only is

    u(x) = 0.00042293 + 0.00150083*tf*logidf*imaxtf - 0.00150665*tf*imaxtf + 0.00010465*logidf - 0.00122627*lognumterms*imaxtf.

For indexing with phrases, we first had to derive the set of phrases to be considered. We took all adjacent non-stopwords that occurred at least 25 times in the D1 (training) document set. Then an indexing function common to single words and phrases was developed by introducing additional binary factors is_single (is_phrase) having the value 1 if the term is a single word (a phrase) and 0 otherwise. The parameters tf and logidf were defined for phrases in the same way as for single words, and in the computation of imaxtf and lognumterms for a document both phrases and single words were considered. This procedure produced the indexing function

    u(x) = 0.00034104 + 0.00141097*is_single*tf*logidf*imaxtf - 0.00119826*is_single*tf*imaxtf + 0.00014122*is_single*logidf - 0.00120413*lognumterms*imaxtf + 0.00820515*is_phrase*tf*logidf*imaxtf - 0.04188466*is_phrase*tf*imaxtf + 0.00114585*is_phrase*logidf.

For each phrase occurring in a document, indexing weights for the phrase as well as for its two components (as single words) were computed. In the following, we will refer to the indexing with single words only as "word indexing" and to the indexing using both single words and phrases as "phrase indexing".

3 Retrieval

For the probabilistic document indexing weights as described above, there are specific retrieval functions which yield a probabilistic or utility-theoretic ranking of documents w.r.t. a query. These functions differ in whether or not they consider relevance feedback information for the specific query. For this reason different retrieval functions were applied for the ad-hoc queries and the routing queries.

3.1 Ad-hoc queries

For the ad-hoc queries, we used a linear utility-theoretic retrieval function described in [Wong & Yao 89]. Let q_k^T denote the set of terms occurring in the query, d_m^T the set of document terms and u_im the indexing weight u(x(t_i, d_m)). If c_ik gives the utility of term t_i for the actual query q_k, then the utility of document d_m w.r.t. query q_k can be computed by the retrieval function

    ρ(q_k, d_m) = Σ_{t_i ∈ q_k^T ∩ d_m^T} c_ik * u_im.     (1)

The only problem here is the estimation of the utility weights c_ik, for which we do not have a theoretically justified method. As a heuristic approach, we used the SMART tf.idf weights, where tf here denotes the number of occurrences of the term t_i in the query q_k [Salton & Buckley 88]. We assume that there are other choices for this parameter which could significantly improve retrieval quality.
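As an illustration of retrieval function (1), the following sketch scores one document, assuming the probabilistic indexing weights u_im are already available per document and using unnormalized tf.idf weights as the heuristic c_ik (SMART's ntc cosine normalization is omitted; all names are illustrative).

    import math

    def query_term_utilities(query_terms, num_docs, doc_freq):
        """Heuristic utility weights c_ik: tf.idf of each query term, where tf is the number
        of occurrences of the term in the query (normalization omitted in this sketch)."""
        c = {}
        for t in query_terms:                     # query_terms: query tokens with repetitions
            c[t] = c.get(t, 0) + 1
        return {t: n * math.log(num_docs / doc_freq[t]) for t, n in c.items()}

    def utility_rsv(c, doc_weights):
        """Retrieval function (1): sum of c_ik * u_im over the terms common to query and
        document; doc_weights maps document terms to their probabilistic indexing weights u_im."""
        return sum(c[t] * doc_weights[t] for t in c.keys() & doc_weights.keys())

Ranking then simply means computing utility_rsv for every document and sorting by descending retrieval status value.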
The retrieval function (1) was applied for both kinds of document indexing, where words only were considered for run fuhra1 and both words and phrases in the run fuhrp1. Figure 2 shows the recall-precision curves for both runs in comparison to a standard SMART tf.idf run.

[Figure 2: Recall-precision curves for probabilistic indexing in comparison to the tf.idf run; curves for tf.idf, fuhra1 (words) and fuhrp1 (phrases); precision plotted against recall.]

It can be seen that both probabilistic indexing methods clearly outperform the tf.idf run, with the exception of the low-recall end for the phrase run (see below). Comparing word indexing with phrase indexing shows mixed results (see also Table 1): for the recall-precision averages, phrases perform better, again with the exception of the low-recall end. More information is given by the recall and precision values at different numbers of documents retrieved (see appendix of this volume). Precision for phrases is worse than or about equal to that of words, but recall is always better. This means that phrases perform better for narrow queries (with a small number of relevant documents).

    run                                    tf.idf    fuhra1 (words)    fuhrp1 (phrases)
    average precision for recall points:
      11-pt Avg.                           0.1750    0.1943            0.2054
      3-pt Avg.                            0.1159    0.1399            0.1664
    query-wise comparison with median:
      11-pt Avg.                                     32:14             32:17
      Prec. @ 100 docs                               25:21             35:13
    best/worst results:
      11-pt Avg.                                     6(2)/1(1)         4(1)/2
      Prec. @ 100 docs                               2(2)/2            8(3)/1(2)
    single words vs. phrases:
      11-pt Avg.                                                       25:25
      Prec. @ 100 docs                                                 21:28

    Table 1: Results for ad-hoc queries

In general, one would expect that using phrases in addition to single words would never decrease precision. We blame the current definition of the retrieval function (1) for the observed behaviour. This function assumes that all the terms considered are independent, which is obviously not true when we consider a phrase in addition to its two components as terms. Since the size of the error made here depends on the indexing weights of the components, this fact may explain the precision decrease for the highest ranked documents (i.e. at low recall levels). A better strategy would be to ignore the components in case the phrase occurs in the document.

The query-wise comparison with the median (see Table 1) shows a large scattering of the results for the single queries. A preliminary analysis has indicated that the relative performance for a specific query depends significantly on whether or not the query statement contains negated terms. For example, topic 86 contains "... it is a bank, as opposed to another financial institution such as a savings and loan or credit union ..." and topic 87 reads "... Civil actions, such as creditor claims filed for recovery of losses, are not relevant ...". In both cases, our system yields one of the worst results (for word indexing). Since our query indexing procedure extracts only the terms from the query and is not able to recognize negation, the system explicitly searches for documents containing these negated terms. Theoretically, it is obvious that negated terms should be given a negative utility weight. However, the examples cited here show that it is a difficult task to recognize negation automatically.

3.2 Routing queries

For the routing queries, the retrieval-with-probabilistic-indexing (RPI) model described in [Fuhr 89] and [Fuhr 92] was applied. This model combines query-specific relevance feedback information with probabilistic document indexing in order to estimate query term weights. Let p_ik denote the average indexing weight of term t_i in the documents judged relevant w.r.t. query q_k, and r_ik the average indexing weight of t_i in the nonrelevant documents. The query term weight is then computed by the formula

    c_ik = p_ik * (1 - r_ik) / (r_ik * (1 - p_ik)) - 1

and the RPI retrieval function yields

    ρ(q_k, d_m) = Σ_{t_i ∈ q_k^T ∩ d_m^T} log(c_ik * u_im + 1).     (2)
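A sketch of the RPI weighting and of retrieval function (2), assuming that the indexing weights of the judged documents are available and that every query term has at least one relevant and one nonrelevant judged document with nonzero average weight (names are illustrative, not those of the actual implementation):

    import math

    def rpi_query_term_weight(rel_weights, nonrel_weights):
        """c_ik = p_ik(1 - r_ik) / (r_ik(1 - p_ik)) - 1, where p_ik and r_ik are the average
        indexing weights of t_i in the judged relevant and nonrelevant documents."""
        p = sum(rel_weights) / len(rel_weights)
        r = sum(nonrel_weights) / len(nonrel_weights)
        return p * (1 - r) / (r * (1 - p)) - 1

    def rpi_rsv(c, doc_weights):
        """RPI retrieval function (2): sum of log(c_ik * u_im + 1) over the common terms."""
        return sum(math.log(c[t] * doc_weights[t] + 1) for t in c.keys() & doc_weights.keys())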
In our experiments, due to lack of time, we used the standard SMART tf.idf document indexing [Salton & Buckley 88] (single words only) instead of the probabilistic indexing described above. After an initial retrieval run with tf.idf weights for both queries and documents, relevance feedback information was used for computing the feedback query term weights c_ik. Only the terms occurring in the query were considered here, so no query expansion took place. Theoretically, the RPI formula can also be applied in case of query expansion. However, the additional terms should be treated differently when estimating their query term weights. This problem has not been investigated yet for the RPI model.

    # queries: 49
    query-wise comparison with median:
      11-pt Avg.            43:5
      Prec. @ 100 docs      43:5
    best/worst results:
      11-pt Avg.            12/1
      Prec. @ 100 docs      16/1

    Table 2: Results for routing queries

Table 2 shows the results for the run fuhra2 with routing queries. It can be seen that this approach works very well for almost every query. The single worst result is for topic 50, where we did not retrieve any relevant document; this outcome is due to the fact that there was only one relevant document in the training sample, which is obviously not sufficient for probabilistic parameter estimation.

A Operational details of runs

A.1 Overall

All runs were done completely automatically, treating the text portions of both documents and queries as flat text without structure. This made everything much simpler, but was not ideal given the complexity in both form and meaning of the queries. Recall-precision definitely suffered. All actual indexing (as opposed to weighting) and retrieval was done with the Cornell SMART Version 11.0 system, using the standard SMART procedures (e.g., stopwords, stemming, inverted file retrieval). All runs were made on a Sun Sparc 2 with 64 Mbytes of memory. All times reported are CPU time.

A.2 Ad-hoc runs

Two automatic ad-hoc runs were done: one in which the documents and queries were indexed with single terms only, and one in which they were indexed with both single terms and adjacency phrases. The overall procedure for both runs was:

1. Index D1 (the learning document set) and Q1 (the learning query set).
2. For each document d ∈ D1:
   2.1 For each query q ∈ Q1:
       2.1.1 Determine the relevance value r of d to q.
       2.1.2 For each term t in common between q^T (set of query terms) and d^T (set of document terms):
             2.1.2.1 Find the values of the elements of the relevance description involved in this run and add these values plus the relevance information to the least squares matrix being constructed.
3. Solve the least squares matrix to find the coefficient vector a.
4. Index D1 ∪ D2 (both sets of documents together) with term-frequency weights.
5. For each document d ∈ D1 ∪ D2:
   5.1 For each term t ∈ d^T:
       5.1.1 Find the values of the relevance description x(t, d) involved in the run.
       5.1.2 Give t the weight a^T * v(x(t, d)), where a is the coefficient vector determined in step 3.
   5.2 Add d to the inverted file.
6. Weight Q2 (test query set) with tf.idf weights (ntc variant).
7. Run an inner product inverted file similarity match of Q2 against the inverted file formed in step 5, retrieving the top 200 documents.

In an operational environment, the learning document set and the test document set would be the same, so step 1 (or step 4) would be omitted. Once coefficients have been found, they should remain valid unless the character of the collection changes. So new documents can be added to a dynamic collection by just going through step 5 for each new document, as in the sketch below.
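A minimal sketch of step 5 for one new document, building on the hypothetical relevance_description and indexing_weight helpers sketched in section 2; the actual runs use SMART's inverted file structures rather than a Python dictionary.

    def add_document(doc_id, doc_terms, a, poly, num_docs, doc_freq, inverted_file):
        """Step 5 for one new document: compute the relevance description x(t, d) of every
        document term, weight the term with a^T * v(x(t, d)), and append the posting to the
        inverted file (here simply a dict mapping terms to lists of (doc_id, weight))."""
        for term in set(doc_terms):
            x = relevance_description(term, doc_terms, num_docs, doc_freq[term])
            inverted_file.setdefault(term, []).append((doc_id, indexing_weight(a, x, poly)))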
The algorithm above is different from the one implemented in the past for this learning approach. Earlier versions iterated over queries (instead of documents) in step 2 and used inverted files for speed (with the document vectors still being needed). However, the size of TREC required a reformulation, since the document vectors and inverted files could not be kept in memory.

The coefficient determination is valid only if the relevance judgements used are representative of the entire set of relevance judgements. In this first TREC, the initial judgements are fragmentary and not very representative; it is impossible to tell how much this affects things.

A.3 Single term automatic ad-hoc run

The single term ad-hoc run used 5 factors (described in section 2.1):

* constant
* tf * logidf * imaxtf
* tf * imaxtf
* logidf
* lognumterms * imaxtf

Determining the coefficients for the factors in the single term run took about 1.7 hours (steps 2 + 3 above). Construction of the inverted file (steps 4 + 5) containing the factor-weighted terms took about 6.4 hours from scratch. Note that this is only about 1.0 hours longer than construction of a normal (for SMART) tf.idf weighted inverted file. Query indexing and weighting used the normal SMART procedures and took about 1.5 seconds. The fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query, with no distinction made between fields. Retrieval plus ranking took 383 seconds.

A.4 Phrase automatic ad-hoc run

The phrases being used were two-term SMART adjacency phrases: adjacent non-stopwords, term components stemmed, that occurred at least 25 times in the D1 document set. The term components were put into alphabetical order, so the text phrases "information retrieval" and "retrieving information" both mapped to the same phrase concept. The phrases were treated as a separate ctype within an indexed vector, and had their own dictionary and inverted file separate from those of the single terms.

Determination of phrases took 5.8 hours, finding 4,700,000 phrases occurring in D1 at least once. Of those phrases, 158,000 occurred at least 25 times. These phrases were then put into a dictionary and used as a controlled vocabulary for phrases when doing the indexing of D1 (step 1) and D1 ∪ D2 (step 4). The single term indexing remained exactly as it was in the single term run (terms occurring in phrases were not removed from the vector). The phrase ad-hoc run used 8 factors (described in section 2.2):

* constant
* is_single * tf * logidf * imaxtf
* is_single * tf * imaxtf
* is_single * logidf
* lognumterms * imaxtf
* is_phrase * tf * logidf * imaxtf
* is_phrase * tf * imaxtf
* is_phrase * logidf

Determining the coefficients for the factors in the phrase run took about 2.4 hours (steps 2 + 3 above). Construction of the inverted file (steps 4 + 5) containing the factor-weighted terms took about 12.6 hours from scratch. Note that this is only about 1.7 hours longer than construction of a normal (for SMART) tf.idf weighted inverted file with phrases. Query indexing and weighting used the normal SMART procedures and took about 2.7 seconds. The fields Topic, Nationality, Narrative, Concepts, Factors and Description were used to index the query, with no distinction made between fields. Retrieval plus ranking took 374 CPU seconds. (This was less than the single term run, but for no apparent reason. Perhaps the machine was less loaded.)
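The phrase construction described at the beginning of this subsection might be sketched as follows (stemmer and stopword list assumed to be given; this is not the actual SMART adjacency phrase code):

    from collections import Counter

    def collect_phrases(documents, stopwords, stem, min_freq=25):
        """Derive the controlled vocabulary of two-term adjacency phrases: adjacent
        non-stopwords, components stemmed and put into alphabetical order, kept only if
        they occur at least min_freq times in the document set."""
        counts = Counter()
        for tokens in documents:                  # each document given as a list of tokens
            # adjacency is taken here after stopword removal; whether a stopword should
            # instead break adjacency is an implementation detail not specified in the text
            words = [stem(t) for t in tokens if t not in stopwords]
            for first, second in zip(words, words[1:]):
                counts[tuple(sorted((first, second)))] += 1
        return {phrase for phrase, n in counts.items() if n >= min_freq}

With the components sorted alphabetically, "information retrieval" and "retrieving information" indeed map to the same phrase concept, as described above.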
A.5 Automatic routing run

There was one automatic routing run done. It was totally unconnected to the factor weighting approach described above. Basically, it was very easy to implement and run, so we decided we might as well submit it. The actual weighting function had to be programmed, but even so less than 3 person-days in total was spent on routing.

The routing experiment was treated almost exactly like a normal relevance feedback experimental run. The overall procedure was:

1. Index query set Q1 and document set D1 with tf.idf weights.
2. For each query q ∈ Q1:
   2.1 For each term t ∈ q^T (set of query terms):
       2.1.1 Reweight term t using the RPI relevance weighting formula and the (fragmentary) relevance information supplied.
3. Index document set D2 with tf.idf weights. Note that the collection frequency information used in the idf weight was derived from occurrences in D1 only (in actual routing the collection frequencies within D2 would not be known).
4. Run the reweighted queries of Q1 (step 2) against the inverted file (step 3), returning the top 200 documents for each query.

This approach differs from true routing in that:

A. All documents of D2 were indexed at once instead of indexing them individually and comparing them against each query sequentially. Thus the indexing/retrieval times obtained for the above algorithm are pretty meaningless.

B. True routing is really a binary decision, most often implemented as a similarity threshold, and retrieving the top 200 documents (in ranked order) would not normally be done. However, this difference from true routing is required for evaluation purposes, and was thus required for TREC.

Note that the approach was completely automatic, with the queries and documents treated as flat text (no structure). Differing from the ad-hoc runs above, the queries were indexed including the words from all topic sections. It is unknown what effect the fragmentary relevance information had on the query reformulation. The strength of the effect will depend on whether the top similarity documents according to the tf.idf weights used had been judged and included in the fragmentary judgements.

Step 1 took 3.0 hours, step 2 about 1304 seconds, step 3 about 1.9 hours, and step 4 about 312 seconds.

References

Fuhr, N.; Buckley, C. (1991). A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems 9(3), pages 223-248.

Fuhr, N. (1989). Models for Retrieval with Probabilistic Indexing. Information Processing and Management 25(1), pages 55-72.

Fuhr, N. (1992). Integration of Probabilistic Fact and Text Retrieval. In: Belkin, N.; Ingwersen, P.; Pejtersen, M. (eds.): Proceedings of the Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 211-222. ACM, New York.

Maron, M.; Kuhns, J. (1960). On Relevance, Probabilistic Indexing, and Information Retrieval. Journal of the ACM 7, pages 216-244.

Pfeifer, U. (1990). Development of Log-Linear and Linear-Iterative Indexing Functions (in German). Diploma thesis, TH Darmstadt, FB Informatik, Datenverwaltungssysteme II.

Pfeifer, U. (1991). Entwicklung linear iterativer und logistischer Indexierungsfunktionen. In: Fuhr, N. (ed.): Information Retrieval, pages 23-37. Springer, Berlin et al.

Robertson, S. (1977). The Probability Ranking Principle in IR. Journal of Documentation 33, pages 294-304.

Salton, G.; Buckley, C. (1988). Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5), pages 513-523.

Wong, S.; Yao, Y. (1989). A Probability Distribution Model for Information Retrieval. Information Processing and Management 25(1), pages 39-53.