Effective and Efficient Retrieval from Large and Dynamic Document Collections

Daniel Knaus, Peter Schäuble
{knaus, schauble}@inf.ethz.ch
Swiss Federal Institute of Technology (ETH)
CH-8092 Zürich, Switzerland

Abstract

A new retrieval method together with a new access structure is presented that is aimed at a high update efficiency, a high retrieval efficiency, and a high retrieval effectiveness. The access structure consists of signatures and non-inverted descriptions. This access structure can be updated efficiently because the description of a single document is stored in a compact form. The signatures are used to compute approximate retrieval status values first, and the non-inverted descriptions are then used to determine the final list of documents ranked by the exact retrieval status values. Our basic approach based on the standard tf*idf weighting scheme has been improved in both retrieval effectiveness and retrieval efficiency. On average, the time for retrieving the top ranked document is clearly below two seconds, while the document collection can be updated in 10 msec (inserting, deleting, or modifying a document description).

1 Introduction

Information retrieval systems are more and more used to retrieve information from large and dynamic data collections, e.g. in office automation or in newscasting companies. Since in such environments the data collections may have to be updated many times every second, the update efficiency is an important evaluation criterion that should be taken into account in addition to the traditional evaluation criteria (retrieval effectiveness, retrieval efficiency, etc.) [10, pp.161], [12, pp.144]. Before introducing our approach, we show that the update efficiency conflicts with the retrieval effectiveness and with the retrieval efficiency.
In the case of static data collections, the retrieval effectiveness criterion is usually met by means of weighted retrieval including relevance feedback [4], [9], whereas the retrieval efficiency criterion is met by precomputing the document descriptions and by organizing these descriptions as an inverted file [5]. In the case of dynamic data collections, however, a transaction updating the inverted file may block other retrieval and update transactions for an unacceptably long time because adding the description of a new document to an inverted file is time consuming, particularly if the document is long. A possible remedy is to use signatures instead of an inverted file [2]. Signatures can be updated efficiently; however, they do not allow document feature weighting and hence, the retrieval effectiveness of the signature based method is inferior to a fully weighted retrieval method [8]. Thus, it is difficult to achieve simultaneously high retrieval effectiveness, high retrieval efficiency, and high update efficiency.

In our approach, we focus on retrieval from large and dynamic data collections, where we face the problem of achieving simultaneously high retrieval effectiveness, high retrieval efficiency, and high update efficiency. Addressing this problem is justified by the need for appropriate retrieval capabilities in dynamic environments such as office automation or newscasting companies. Within the SPIDER project [11] we have developed a retrieval method and an access structure which supports fully weighted retrieval and which achieves short response times even when the collection is updated several times every second. In Section 2, we describe our basic approach; in Section 3, we focus on refinements of this basic approach; in Section 4, we present the results; and in Section 5, we draw some conclusions.
2 SPIDER's Query Evaluation Algorithm

In this section, we present a retrieval method together with a new access structure facilitating both fast weighted retrieval and efficient updates of the access structure. A preliminary version was presented in [11]. Our access structure consists of signatures and non-inverted descriptions of the documents. The signatures are used to compute approximate retrieval status values RSV0(q, d_j) first, and the non-inverted document descriptions are then used to determine the exact retrieval status values RSV(q, d_j). The trick is that the approximate retrieval status values provide fairly tight upper bounds for the exact retrieval status values. We will see how we can take advantage of these upper bounds such that only a few exact retrieval status values have to be computed.

The retrieval method determining the approximate retrieval status values and the retrieval method determining the exact retrieval status values are given by the functions RSV0 and RSV respectively. Let

    Φ := {φ_0, ..., φ_{m-1}}

be the indexing vocabulary (e.g. a set of terms) and let

    D := {d_0, ..., d_{n-1}}

be the current document collection. We assume that the signatures ξ(d_j) consist of w bits, where the bit at position p has the value ξ(d_j)[p]. Every indexing feature φ_i is assigned a bit position p = h(φ_i) by means of a hash function h. The function h specifies a signature ξ(d_j) for every document d_j by setting the bit at position p iff d_j contains a feature φ_i which is hashed to this position:

    ξ(d_j)[p] := 1 if there exists φ_i ∈ d_j with h(φ_i) = p, and 0 otherwise.

We now define the approximate retrieval status value RSV0(q, d_j) and the exact retrieval status value RSV(q, d_j) as follows:

    RSV(q, d_j)  := (1/|d_j|) Σ_i b_i * a_ij      (1)
    RSV0(q, d_j) := (1/|d_j|) Σ_i b_i * a0_ij     (2)

The nidf(φ_i)'s are determined by the document frequencies df(φ_i).

Figure 1: Computation of the exact retrieval status values in decreasing order of the approximate retrieval status values.
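The signature construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hash function, the signature width w, and the function names are our assumptions.

```python
# Sketch of signature construction: every feature is hashed to one of W
# bit positions, and a document's signature is the union of those bits.
import hashlib

W = 4096  # signature width w in bits (illustrative choice)

def h(feature: str) -> int:
    """Hash an indexing feature to a bit position in [0, W)."""
    digest = hashlib.md5(feature.encode()).digest()
    return int.from_bytes(digest[:4], "big") % W

def signature(features: set[str]) -> int:
    """Build xi(d_j): the bit at position p is set iff some feature of
    the document hashes to p."""
    sig = 0
    for phi in features:
        sig |= 1 << h(phi)
    return sig

doc = {"retrieval", "signature", "update"}
query = {"signature", "index"}
# A query feature can only occur in d_j if its bit is set in xi(d_j);
# the converse may fail because of hash collisions (false matches).
matches = [phi for phi in query if signature(doc) >> h(phi) & 1]
assert "signature" in matches
```

Because the signature is a fixed-size bit vector per document, inserting or deleting a document touches only that document's own signature, which is what makes updates cheap.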
The normalizations of the feature frequency component and the normalizations of the document frequency component do not affect the retrieval status values because they cancel out when using the cosine measure. Because of these normalizations, the approximate document weights a0_ij are always upper bounds of the exact document weights a_ij, but they do not depend on the feature frequencies ff(φ_i, d_j).

Figure 1 shows an example: the exact retrieval status values are computed in decreasing order of the approximate retrieval status values. As soon as the best exact value found so far, RSV(q, d_13), is greater than or equal to the next approximate value RSV0(q, d_j), we can conclude that d_13 has the highest exact retrieval status value of all documents. In this example, only two exact retrieval status values had to be computed to determine the top ranked document. In practice, more exact retrieval status values have to be computed.

The evaluation algorithm is based on an access structure that specifies the following functions:

    d_j ↦ ξ(d_j)                                  (signature)
    d_j ↦ {(φ_i, ff(φ_i, d_j)) : φ_i ∈ d_j}        (non-inverted description)
    nidf: φ_i ↦ nidf(φ_i)
    size: d_j ↦ |d_j|

The first two functions determine the signatures and the non-inverted document descriptions respectively. The functions nidf and size provide scaling and normalization factors. The function values ξ(d_j) and the non-inverted description of d_j are determined completely by the document d_j itself. The function values nidf(φ_i) and size(d_j), however, depend on the entire document collection. Fortunately, the variation of nidf(φ_i) is small if the data collection is sufficiently large. Thus, nidf(φ_i) has to be recomputed only if the domain of the data collection is shifting, and size(d_j) has to be recomputed if either nidf was recalculated or d_j was modified. In the next section, we will see how this basic query evaluation scheme can be refined.

3 Refinements

The basic evaluation algorithm described in Section 2 achieves an excellent update efficiency because the description of a single document is stored in a compact form which can be updated efficiently.
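The early-termination idea behind Figure 1 can be sketched as follows. This is a hedged reconstruction, not the authors' code; the function and variable names are invented, and the RSV values below are made up to mirror the d_13 example.

```python
# Two-stage evaluation: visit documents in decreasing order of their
# approximate RSVs (upper bounds); stop as soon as the best exact RSV
# found so far is >= the next approximate value.

def top_ranked(doc_ids, rsv0, exact_rsv):
    """rsv0 maps doc id -> approximate RSV; exact_rsv(d) computes the
    exact RSV, with rsv0[d] >= exact_rsv(d) (upper-bound property)."""
    order = sorted(doc_ids, key=lambda d: rsv0[d], reverse=True)
    best_doc, best_val, n_exact = None, float("-inf"), 0
    for d in order:
        if rsv0[d] <= best_val:       # no remaining document can win
            break
        v = exact_rsv(d)
        n_exact += 1
        if v > best_val:
            best_doc, best_val = d, v
    return best_doc, n_exact

rsv0 = {"d5": 0.9, "d13": 0.8, "d7": 0.5}    # approximate RSVs (invented)
exact = {"d5": 0.4, "d13": 0.7, "d7": 0.3}   # exact RSVs, each <= its bound
doc, n_exact = top_ranked(list(rsv0), rsv0, exact.__getitem__)
# doc == "d13" after only two exact computations: d7 is never evaluated
# exactly because RSV0(q, d7) = 0.5 <= 0.7 = RSV(q, d13).
```

The tighter the upper bounds, the earlier the loop terminates, which is exactly why the refinements in Section 3 aim to reduce the approximation error.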
However, the retrieval effectiveness as well as the retrieval efficiency need further refinements in the case of the TREC experiment, where extremely large queries are matched against a large document collection containing some large documents.

To improve the retrieval effectiveness, we adapted the basic evaluation algorithm to the best global weighting scheme achieved by the SMART system at TREC-1 [1]. The query features are "ltn" weighted, i.e. the query weights depend on the logarithm of the feature frequencies and on the inverse document frequencies, and they are not cosine normalized. The "lnc" weighted document features depend on the logarithm of the feature frequencies. They are cosine normalized, but they do not have a document frequency component. The detailed definitions of the feature weights are given in Table 1.

In order to achieve a good retrieval efficiency for the TREC collection, we have reduced the indexing vocabulary in such a way that on one hand the retrieval efficiency is improved and on the other hand, the retrieval effectiveness is not deteriorated very much. Our indexing vocabulary was not only reduced by applying a stop word list and a word reduction algorithm, but also by eliminating further indexing terms. From another project [3] and from the experiences of the SMART system at TREC-1 [1], we knew that the removal of the indexing features having a very high document frequency does not deteriorate the retrieval effectiveness very much. The reason why this restriction has a small effect on the retrieval effectiveness is the following. The elimination of the high document frequency features changes the retrieval status values very little because these features have a small inverse document frequency and hence, they contribute little to the retrieval status values.
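The high-document-frequency cut just described can be sketched as a simple filter over the vocabulary. This is an illustration under our own assumptions; the function name and the toy document frequencies are invented (the experiments below use a 15% limit).

```python
# Collection-dependent vocabulary restriction: drop every indexing
# feature whose document frequency exceeds a fraction of the collection.

def restrict_vocabulary(df, n_docs, limit=0.15):
    """Keep features phi with df(phi) <= limit * n_docs."""
    cutoff = limit * n_docs
    return {phi for phi, f in df.items() if f <= cutoff}

df = {"the": 95, "retrieval": 12, "signature": 3}  # invented frequencies
vocab = restrict_vocabulary(df, n_docs=100)
# "the" occurs in 95 of 100 documents and is eliminated.
assert vocab == {"retrieval", "signature"}
```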
Such an elimination of the high document frequency features is similar to the application of a collection dependent stop word list in addition to a general stop word list. Thus, this restriction has only a small effect on the retrieval status values.

In what follows, we focus on the retrieval efficiency and why it is improved by this restriction. With a reduced indexing vocabulary,

1. fewer documents have a positive approximate retrieval status value,

2. fewer features φ_i contribute an error (a0_ij - a_ij) * b_i to the approximate retrieval status values, and

3. fewer false matches (caused by collisions) occur when computing the approximate retrieval status values.

Both scanning the signatures and sorting the approximate retrieval status values become faster when fewer approximate retrieval status values are positive. Furthermore, the approximate retrieval status values become tighter upper bounds when fewer false matches occur and when the sum of the errors (a0_ij - a_ij) * b_i is reduced. Having tighter upper bounds means that fewer exact retrieval status values have to be computed until the top ranked document is determined. The retrieval effectiveness and the retrieval efficiency achieved by means of the reduced indexing vocabulary are presented in the next section.

4 Experiments

In this section, we present the evaluation of the method described above and we compare it to methods based on other weighting schemes. We focus on the efficiency of modifying documents and on the correlation between the retrieval efficiency and the retrieval effectiveness. We will also see what the influence of the vocabulary restriction is on the retrieval effectiveness and on the retrieval efficiency. For the final evaluations we concentrated on the adhoc queries.

Before discussing the results, we define what we mean by a partition, a run, and an experiment. The document collection has been split up into several partitions, each consisting of at most 100,000 documents.
Thus, the large collections DOE1 (Department of Energy, Disk 1) and ZIFF3 (Ziff-Davis Publishing, Disk 3) were divided into three and two partitions respectively. A run consists of the evaluation of 50 queries (either the set of routing queries or the set of adhoc queries) against all documents of one partition. For each query, the 1000 top ranked documents have been retrieved. An experiment consists of several runs and the merging of the lists of ranked documents for each query. For TREC-2, the two sets of experiments "Topics 51-100 versus Disk 3" and "Topics 101-150 versus Disks 1 and 2" have been evaluated.

All efficiency evaluations are based on CPU time rather than on real time in order to eliminate side effects from other jobs running on the same machine. In these experiments, we used a SUN SPARCserver MP690 with 128 MBytes RAM. We derived the document descriptions directly from the CDs. The indexing process included the elimination of stop words (van Rijsbergen's stop list [12, pp.18]) and Porter's word reduction algorithm [6]. The normalized inverse document frequencies have been derived from the documents of Disks 1 and 2 only. Uncompressing and indexing a single document needs around 100 msec on average, depending on the length of the document. The computation of the inverse document frequencies from the descriptions took about 1.5 hours of CPU time. The average time for inserting a document description into the access structure is on a scale of 10 msec, again depending on the number of features per document. Inserting a document description into an inverted file would need more time because the postings would have to be inserted into the different lists associated with each feature.

The restriction of the vocabulary was accomplished by omitting features occurring in more than 15% of all documents (from Disks 1 and 2), i.e. in more than 111,337 documents.
We have chosen 15% of the collection, although even a stronger limit of 10% should not affect the retrieval effectiveness [1]. In our experiments we compare the 15% limit ("df15") to a non restricted vocabulary ("all").

We now have three parameters which can be combined to specify eight different retrieval methods. Each method can be identified by a string built from the labels for the document feature weighting, the query feature weighting, and the vocabulary restriction:

    (doc_feat_weight).(query_feat_weight).(vocab)

In what follows, we present the results of the following nine methods:

    M0  ntc.ntn.all
    M1  lnc.ltn.all
    M2  lnc.ltn.df15
    M3  lnc.ntn.all
    M4  lnc.ntn.df15
    M5  ltc.ltn.all
    M6  ltc.ltn.df15
    M7  ltc.ntn.all
    M8  ltc.ntn.df15

First, we compare the retrieval effectiveness of our method (M1) described in Section 2 to the standard tf*idf method (M0) by means of the precision-recall graph in Figure 2. As expected, the method M1 is more effective than M0 and achieves a retrieval effectiveness among the best methods presented at TREC-2. In order to find out the reason for this difference in the retrieval effectiveness, we must have a closer look at the influences of each parameter (document and query feature weighting).

In Figure 3, the 11-pts average precisions of each method (M0 to M8) are plotted on the left axis, and they are connected to the median response times (for the top ranked document) plotted on the right axis. The most obvious conclusion from this graph is the following: the higher the precision, the slower the response, and vice versa. The method M0 performs clearly worse than the methods M1 to M8 with respect to both retrieval effectiveness and retrieval efficiency.
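The merging step that turns several runs into one experiment, as defined above, can be sketched as follows. This is our own illustration: the document identifiers and scores are invented, and the merge is just a k-way merge of descending-sorted (RSV, document) lists.

```python
# Merge per-partition ranked lists into one experiment result per query:
# each run's list is already sorted by descending RSV, so a k-way merge
# followed by truncation to the top 1000 documents suffices.
import heapq

def merge_runs(runs, k=1000):
    """runs: lists of (rsv, doc_id) pairs, each sorted by descending rsv.
    Returns the k best pairs over all partitions, still descending."""
    return list(heapq.merge(*runs, reverse=True))[:k]

run1 = [(0.9, "WSJ1-17"), (0.4, "WSJ1-3")]   # invented partition results
run2 = [(0.7, "FR1-8"), (0.2, "FR1-1")]
top = merge_runs([run1, run2], k=3)
assert [d for _, d in top] == ["WSJ1-17", "FR1-8", "WSJ1-3"]
```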
We concentrate on the response times of the top ranked document because the response times of all further ranked documents are of secondary interest: a user is supposed to read the top ranked document before looking at the other documents, and the retrieval system can retrieve further documents while the user is reading the top ranked document.

Figure 2: Precision recall graphs of the most effective method (M1) and of the least effective method (M0).

We can also see from Figure 3 what the influences of the different parameters on the average precision are. Regarding the weighting of the document features, the "lnc" weighting achieves a 4-10% higher precision than the "ltc" weighting. The "ntc" weighting loses 5% of precision compared to "ltc". In the case of query feature weighting, again the logarithmic "ltn" weighting is more effective than the "ntn" weighting (10-15%). Restricting the vocabulary results in a 2-7% lower precision. We can summarize the experiences as follows:

* The nidf(φ_i)'s in the document feature weights have a bad influence on the retrieval effectiveness. It could be that for long documents the estimation of the lengths |d_j| is inappropriate when the nidf(φ_i)'s are taken into account.

* Logarithmic feature weighting is more appropriate for the long TREC documents and queries than linear feature weighting. Logarithmic feature weighting avoids an overweighting of features occurring very frequently within a document.

* Restricting the indexing vocabulary by omitting features with a high document frequency df(φ_i) has a noticeable influence on the average precision.

In what follows, we discuss the influences of the different parameters on the response time (as shown in Figure 3).
Restricting the vocabulary accelerates the query evaluation (by 9-14%) for the reasons described in Section 2. The "ntn" weighting of the query features is also 9-14% faster than the "ltn" weighting. For document feature weighting, the "lnc" weighting is 5-10% slower than the "ltc" weighting. The "ntc" weighting is even slower. These results can be explained in terms of the approximation error:

* For the "ltc" weighting, the approximation error (a0_ij - a_ij) * b_i is smaller than for the "lnc" weighting because of the scaling with the nidf(φ_i). On the other hand, the approximation error for the "ntc" weighting is very large compared to both the "ltc" and the "lnc" weighting because of the normalized linear weighting instead of the normalized logarithmic weighting.

Figure 3: Average precisions and response times of the first ranked document.

Figure 4: Response times of the first ranked document per run (adhoc queries versus the listed partitions) for the fastest method (M8 ltc.ntn.df15).

The box plots [7, pp.336] presented in Figure 4 show the distribution of the response times required to determine the top ranked document. A box plot visualizes the median (the line within the box), half of all samples (the box), and outliers (the dots) of a sample collection. In our case we have 50 samples: the response times of the 50 adhoc queries run against a partition.
For most queries, the top ranked document was retrieved in less than two seconds. The few outliers were all produced by the same queries (#103, #136, #138, #144). In general, the response times become shorter if a partition contains fewer documents and if the document descriptions consist of fewer postings on average (see Figure 5).

Figure 5: Number of postings per partition.

5 Conclusions

In our approach we stressed the update efficiency. We have shown that the retrieval effectiveness does not have to be sacrificed to achieve a high update efficiency when coping with highly dynamic document collections. Our approach could probably be further improved by finding a weighting scheme that, on the one hand, achieves a very good retrieval effectiveness and that, on the other hand, can be approximated by frequency independent weights with only little variation from the exact weights. The retrieval efficiency could be improved by better partitioning the document collections according to the lengths of the documents. Our approach seems to be very amenable to parallel processing. We may think of several configurations (partitioning the query, partitioning the document collection, etc.). At this moment, it is not clear which configuration is appropriate for which requirements. Furthermore, we do not know yet how dynamic the document collection must be such that our access structure outperforms inverted files.

References

[1] C. Buckley, G. Salton, and J. Allan. Automatic Retrieval With Locality Information Using SMART. In TREC-1 Proceedings, pages 59-72, 1992.

[2] W. B. Croft and P. Savino. Implementing Ranking Strategies Using Text Signatures. ACM Transactions on Information Systems, 6(1):42-62, 1988.

[3] U. Glavitsch and P. Schäuble. A System for Retrieving Speech Documents. In N. Belkin, P. Ingwersen, and A. M. Pejtersen, editors, ACM SIGIR Conference on R&D in Information Retrieval, pages 168-176, 1992.

[4] D. Harman.
Relevance Feedback Revisited. In N. Belkin, P. Ingwersen, and A. M. Pejtersen, editors, ACM SIGIR Conference on R&D in Information Retrieval, pages 1-10, 1992.

[5] D. Harman, E. Fox, R. A. Baeza-Yates, and W. Lee. Inverted Files. In W. B. Frakes and R. A. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 28-43. Prentice-Hall, Englewood Cliffs, NJ, 1992.

[6] M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3):130-137, 1980.

[7] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth & Brooks, Pacific Grove, CA, 1988.

[8] G. Salton and C. Buckley. Parallel Text Search Methods. Communications of the ACM, 31(2):202-215, 1988.

[9] G. Salton and C. Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the ASIS, 41(4):288-297, 1990.

[10] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[11] P. Schäuble. SPIDER: A Multiuser Information Retrieval System for Semistructured and Dynamic Data. In ACM SIGIR Conference on R&D in Information Retrieval, pages 318-327, 1993.

[12] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

1. Exact and approximate retrieval status values:

    exact:       RSV(q, d_j)  := (1/|d_j|) Σ_i b_i * a_ij
    approximate: RSV0(q, d_j) := (1/|d_j|) Σ_i b_i * a0_ij

2. Document length:

    |d_j| := sqrt(Σ_i a_ij^2)

3. Weights of query features:

    "ntn": b_i := ff(φ_i, q) * nidf(φ_i)
    "ltn": b_i := (1 + log(ff(φ_i, q))) * nidf(φ_i)

4. Exact and approximate weights of document features:

    "ntc": a_ij  := (ff(φ_i, d_j) / max{ff(φ_h, d_j) : φ_h ∈ d_j}) * nidf(φ_i)
           a0_ij := 1 * nidf(φ_i)
    "lnc": a_ij  := (1 + log(ff(φ_i, d_j))) / (1 + log(max{ff(φ_h, d_j) : φ_h ∈ d_j}))
           a0_ij := 1
    "ltc": a_ij  := ((1 + log(ff(φ_i, d_j))) / (1 + log(max{ff(φ_h, d_j) : φ_h ∈ d_j}))) * nidf(φ_i)
           a0_ij := 1 * nidf(φ_i)

5. Normalized inverse document frequency:

    nidf(φ_i) := 1 - log(1 + df(φ_i)) / log(1 + n)

Table 1: Classification of weighting schemes.
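The weighting schemes classified in Table 1 can be sketched numerically. This is a hedged reconstruction under our assumptions about the definitions (nidf(φ) = 1 - log(1 + df(φ))/log(1 + n), "ltn" query weights, "lnc" document weights); the function names are ours, not the authors'.

```python
# Sketch of the Table 1 weighting schemes and their key properties.
import math

def nidf(df: int, n: int) -> float:
    """Normalized inverse document frequency: 1 for df = 0, 0 for df = n."""
    return 1.0 - math.log(1 + df) / math.log(1 + n)

def ltn(ff_q: int, nidf_i: float) -> float:
    """'ltn' query weight b_i: logarithmic tf times nidf, not normalized."""
    return (1.0 + math.log(ff_q)) * nidf_i

def lnc(ff_d: int, max_ff_d: int) -> float:
    """'lnc' document weight a_ij before division by |d_j| (cosine norm)."""
    return (1.0 + math.log(ff_d)) / (1.0 + math.log(max_ff_d))

# Properties the paper relies on: nidf shrinks as df grows, and the
# approximate weight a0_ij = 1 is an upper bound of every 'lnc' weight.
assert nidf(0, 1000) == 1.0
assert nidf(10, 1000) > nidf(100, 1000)
assert lnc(3, 7) <= 1.0
```

The last assertion is the upper-bound property that makes the approximate retrieval status values of Section 2 valid bounds for the exact ones.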