Effective and Efficient Retrieval from Large and Dynamic Document Collections

Daniel Knaus, Peter Schäuble
{knaus, schauble}@inf.ethz.ch
Swiss Federal Institute of Technology (ETH)
CH-8092 Zürich, Switzerland

Abstract

A new retrieval method together with a new access structure is presented that is aimed at a high update efficiency, a high retrieval efficiency, and a high retrieval effectiveness. The access structure consists of signatures and non-inverted descriptions. This access structure can be updated efficiently because the description of a single document is stored in a compact form. The signatures are used to compute approximate retrieval status values first, and the non-inverted descriptions are then used to determine the final list of documents ranked by the exact retrieval status values. Our basic approach based on the standard tf*idf weighting scheme has been improved in both retrieval effectiveness and retrieval efficiency. On average, the time for retrieving the top ranked document is clearly below two seconds, while the document collection can be updated in 10 msec (inserting, deleting, or modifying a document description).

1 Introduction

Information retrieval systems are more and more used to retrieve information from large and dynamic data collections, e.g. in office automation or in newscasting companies. Since in such environments the data collections may have to be updated many times every second, the update efficiency is an important evaluation criterion that should be taken into account in addition to the traditional evaluation criteria (retrieval effectiveness, retrieval efficiency, etc.) [10, pp.161], [12, pp.144]. Before introducing our approach, we show that the update efficiency conflicts with the retrieval effectiveness and with the retrieval efficiency.
In the case of static data collections, the retrieval effectiveness criterion is usually met by means of weighted retrieval including relevance feedback [4], [9], whereas the retrieval efficiency criterion is met by precomputing the document descriptions and by organizing these descriptions as an inverted file [5]. In the case of dynamic data collections, however, a transaction updating the inverted file may block other retrieval and update transactions for an unacceptably long time because adding the description of a new document to an inverted file is time consuming, particularly if the document is long. A possible remedy is to use signatures instead of an inverted file [2]. Signatures can be updated efficiently; however, they do not allow document feature weighting and hence, the retrieval effectiveness of the signature based method is inferior to a fully weighted retrieval method [8]. Thus, it is difficult to achieve simultaneously high retrieval effectiveness, high retrieval efficiency, and high update efficiency.

In our approach, we focus on retrieval from large and dynamic data collections, where we face the problem of achieving simultaneously high retrieval effectiveness, high retrieval efficiency, and high update efficiency. Addressing this problem is justified by the need for appropriate retrieval capabilities in dynamic environments such as office automation or newscasting companies. Within the SPIDER project [11] we have developed a retrieval method and an access structure which supports fully weighted retrieval and which achieves short response times even when the collection is updated several times every second. In Section 2, we describe our basic approach; in Section 3, we focus on refinements of this basic approach; in Section 4, we present the results; and in Section 5, we draw some conclusions.
2 SPIDER's Query Evaluation Algorithm

In this section, we present a retrieval method together with a new access structure facilitating both fast weighted retrieval and efficient updates of the access structure. A preliminary version was presented in [11]. Our access structure consists of signatures and non-inverted descriptions of the documents. The signatures are used to compute approximate retrieval status values RSV0(q, d_j) first, and the non-inverted document descriptions are then used to determine the exact retrieval status values RSV(q, d_j). The trick is that the approximate retrieval status values provide fairly tight upper bounds for the exact retrieval status values. We will see how we can take advantage of these upper bounds such that only a few exact retrieval status values have to be computed.

The retrieval method determining the approximate retrieval status values and the retrieval method determining the exact retrieval status values are given by the functions RSV0 and RSV respectively. Let

    Φ := {φ_0, ..., φ_{m-1}}

be the indexing vocabulary (e.g. a set of terms) and let

    D := {d_0, ..., d_{n-1}}

be the current document collection. We assume that the signatures ξ(d_j) consist of w bits, where the bit at position p has the value ξ(d_j)[p]. Every indexing feature φ_i is assigned a bit position p = h(φ_i) by means of a hash function h. The function h specifies a signature ξ(d_j) for every document d_j by setting the bit at position p iff d_j contains a feature φ_i which is hashed to this position:

    ξ(d_j)[p] := 1 if there exists φ_i ∈ d_j with h(φ_i) = p, and 0 otherwise.

We now define the approximate retrieval status value RSV0(q, d_j) and the exact retrieval status value RSV(q, d_j) as follows:

    RSV(q, d_j)  := (1/|d_j|) Σ_i b_i * a_ij      (1)
    RSV0(q, d_j) := (1/|d_j|) Σ_i b_i * a0_ij     (2)

The nidf(φ_i)'s are determined by the document frequencies df(φ_i).

Figure 1: Computation of the exact retrieval status values in decreasing order of the approximate retrieval status values.
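The signature construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the hash function, the signature width w, and the function names are our assumptions.

```python
# Sketch of signature construction: every feature is hashed to one of W
# bit positions, and a document's signature is the union of those bits.
import hashlib

W = 4096  # signature width w in bits (illustrative choice)

def h(feature: str) -> int:
    """Hash an indexing feature to a bit position in [0, W)."""
    digest = hashlib.md5(feature.encode()).digest()
    return int.from_bytes(digest[:4], "big") % W

def signature(features: set[str]) -> int:
    """Build xi(d_j): the bit at position p is set iff some feature of
    the document hashes to p."""
    sig = 0
    for phi in features:
        sig |= 1 << h(phi)
    return sig

doc = {"retrieval", "signature", "update"}
query = {"signature", "index"}
# A query feature can only occur in d_j if its bit is set in xi(d_j);
# the converse may fail because of hash collisions (false matches).
matches = [phi for phi in query if signature(doc) >> h(phi) & 1]
assert "signature" in matches
```

Because the signature is a fixed-size bit vector per document, inserting or deleting a document touches only that document's own signature, which is what makes updates cheap.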
The normalizations of the feature frequency component and the normalizations of the document frequency component do not affect the retrieval status values because they cancel out when using the cosine measure. Because of these normalizations, the approximate document weights a0_ij are always upper bounds of the exact document weights a_ij, but they do not depend on the feature frequencies ff(φ_i, d_j).

Figure 1 shows an example: the exact retrieval status values are computed in decreasing order of the approximate retrieval status values. As soon as the best exact value found so far, RSV(q, d_13), is greater than or equal to the next approximate value RSV0(q, d_j), we can conclude that d_13 has the highest exact retrieval status value of all documents. In this example, only two exact retrieval status values had to be computed to determine the top ranked document. In practice, more exact retrieval status values have to be computed.

The evaluation algorithm is based on an access structure that specifies the following functions:

    d_j ↦ ξ(d_j)                                  (signature)
    d_j ↦ {(φ_i, ff(φ_i, d_j)) : φ_i ∈ d_j}        (non-inverted description)
    nidf: φ_i ↦ nidf(φ_i)
    size: d_j ↦ |d_j|

The first two functions determine the signatures and the non-inverted document descriptions respectively. The functions nidf and size provide scaling and normalization factors. The function values ξ(d_j) and the non-inverted description of d_j are determined completely by the document d_j itself. The function values nidf(φ_i) and size(d_j), however, depend on the entire document collection. Fortunately, the variation of nidf(φ_i) is small if the data collection is sufficiently large. Thus, nidf(φ_i) has to be recomputed only if the domain of the data collection is shifting, and size(d_j) has to be recomputed if either nidf was recalculated or d_j was modified. In the next section, we will see how this basic query evaluation scheme can be refined.

3 Refinements

The basic evaluation algorithm described in Section 2 achieves an excellent update efficiency because the description of a single document is stored in a compact form which can be updated efficiently.
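The early-termination idea behind Figure 1 can be sketched as follows. This is a hedged reconstruction, not the authors' code; the function and variable names are invented, and the RSV values below are made up to mirror the d_13 example.

```python
# Two-stage evaluation: visit documents in decreasing order of their
# approximate RSVs (upper bounds); stop as soon as the best exact RSV
# found so far is >= the next approximate value.

def top_ranked(doc_ids, rsv0, exact_rsv):
    """rsv0 maps doc id -> approximate RSV; exact_rsv(d) computes the
    exact RSV, with rsv0[d] >= exact_rsv(d) (upper-bound property)."""
    order = sorted(doc_ids, key=lambda d: rsv0[d], reverse=True)
    best_doc, best_val, n_exact = None, float("-inf"), 0
    for d in order:
        if rsv0[d] <= best_val:       # no remaining document can win
            break
        v = exact_rsv(d)
        n_exact += 1
        if v > best_val:
            best_doc, best_val = d, v
    return best_doc, n_exact

rsv0 = {"d5": 0.9, "d13": 0.8, "d7": 0.5}    # approximate RSVs (invented)
exact = {"d5": 0.4, "d13": 0.7, "d7": 0.3}   # exact RSVs, each <= its bound
doc, n_exact = top_ranked(list(rsv0), rsv0, exact.__getitem__)
# doc == "d13" after only two exact computations: d7 is never evaluated
# exactly because RSV0(q, d7) = 0.5 <= 0.7 = RSV(q, d13).
```

The tighter the upper bounds, the earlier the loop terminates, which is exactly why the refinements in Section 3 aim to reduce the approximation error.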
However, the retrieval effectiveness as well as the retrieval efficiency need further refinements in the case of the TREC experiment, where extremely large queries are matched against a large document collection containing some large documents.

To improve the retrieval effectiveness, we adapted the basic evaluation algorithm to the best global weighting scheme achieved by the SMART system at TREC-1 [1]. The query features are "ltn" weighted, i.e. the query weights depend on the logarithm of the feature frequencies and on the inverse document frequencies, and they are not cosine normalized. The "lnc" weighted document features depend on the logarithm of the feature frequencies. They are cosine normalized, but they do not have a document frequency component. The detailed definitions of the feature weights are given in Table 1.

In order to achieve a good retrieval efficiency for the TREC collection, we have reduced the indexing vocabulary in such a way that on one hand the retrieval efficiency is improved and on the other hand, the retrieval effectiveness is not deteriorated very much. Our indexing vocabulary was not only reduced by applying a stop word list and a word reduction algorithm, but also by eliminating further indexing terms. From another project [3] and from the experiences of the SMART system at TREC-1 [1], we knew that the removal of the indexing features having a very high document frequency does not deteriorate the retrieval effectiveness very much. The reason why this restriction has a small effect on the retrieval effectiveness is the following. The elimination of the high document frequency features changes the retrieval status values very little because these features have a small inverse document frequency and hence, they contribute little to the retrieval status values.
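The high-document-frequency cut just described can be sketched as a simple filter over the vocabulary. This is an illustration under our own assumptions; the function name and the toy document frequencies are invented (the experiments below use a 15% limit).

```python
# Collection-dependent vocabulary restriction: drop every indexing
# feature whose document frequency exceeds a fraction of the collection.

def restrict_vocabulary(df, n_docs, limit=0.15):
    """Keep features phi with df(phi) <= limit * n_docs."""
    cutoff = limit * n_docs
    return {phi for phi, f in df.items() if f <= cutoff}

df = {"the": 95, "retrieval": 12, "signature": 3}  # invented frequencies
vocab = restrict_vocabulary(df, n_docs=100)
# "the" occurs in 95 of 100 documents and is eliminated.
assert vocab == {"retrieval", "signature"}
```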
Such an elimination of the high document frequency features is similar to the application of a collection dependent stop word list in addition to a general stop word list. Thus, this restriction has only a small effect on the retrieval status values.

In what follows, we focus on the retrieval efficiency and why it is improved by this restriction. With a reduced indexing vocabulary,

1. fewer documents have a positive approximate retrieval status value,

2. fewer features φ_i contribute an error (a0_ij - a_ij) * b_i to the approximate retrieval status values, and

3. fewer false matches (caused by collisions) occur when computing the approximate retrieval status values.

Both scanning the signatures and sorting the approximate retrieval status values become faster when fewer approximate retrieval status values are positive. Furthermore, the approximate retrieval status values become tighter upper bounds when fewer false matches occur and when the sum of the errors (a0_ij - a_ij) * b_i is reduced. Having tighter upper bounds means that fewer exact retrieval status values have to be computed until the top ranked document is determined. The retrieval effectiveness and the retrieval efficiency achieved by means of the reduced indexing vocabulary are presented in the next section.

4 Experiments

In this section, we present the evaluation of the method described above and we compare it to methods based on other weighting schemes. We focus on the efficiency of modifying documents and on the correlation between the retrieval efficiency and the retrieval effectiveness. We will also see what the influence of the vocabulary restriction is on the retrieval effectiveness and on the retrieval efficiency. For the final evaluations we concentrated on the adhoc queries.

Before discussing the results, we define what we mean by a partition, a run, and an experiment. The document collection has been split up into several partitions, each consisting of at most 100,000 documents.
Thus, the large collections DOE1 (Department of Energy, Disk 1) and ZIFF3 (Ziff-Davis Publishing, Disk 3) were divided into three and two partitions respectively. A run consists of the evaluation of 50 queries (either the set of routing queries or the set of adhoc queries) against all documents of one partition. For each query, the 1000 top ranked documents have been retrieved. An experiment consists of several runs and the merging of the lists of ranked documents for each query. For TREC-2, the two sets of experiments "Topics 51-100 versus Disk 3" and "Topics 101-150 versus Disks 1 and 2" have been evaluated.

All efficiency evaluations are based on CPU time rather than on real time in order to eliminate side effects from other jobs running on the same machine. In these experiments, we used a SUN SPARCserver MP690 with 128 MBytes RAM. We derived the document descriptions directly from the CDs. The indexing process included the elimination of stop words (van Rijsbergen's stop list [12, pp.18]) and Porter's word reduction algorithm [6]. The normalized inverse document frequencies have been derived from the documents of Disks 1 and 2 only. Uncompressing and indexing a single document needs around 100 msec on average, depending on the length of the document. The computation of the inverse document frequencies from the descriptions took about 1.5 hours of CPU time. The average time for inserting a document description into the access structure is on a scale of 10 msec, again depending on the number of features per document. Inserting a document description into an inverted file would need more time because the postings would have to be inserted into the different lists associated with each feature.

The restriction of the vocabulary was accomplished by omitting features occurring in more than 15% of all documents (from Disks 1 and 2), i.e. in more than 111,337 documents.
We have chosen 15% of the collection, although even a stronger limit of 10% should not affect the retrieval effectiveness [1]. In our experiments we compare the 15% limit ("df15") to a non restricted vocabulary ("all").

We now have three parameters which can be combined to specify eight different retrieval methods. Each method can be identified by a string built from the labels for the document feature weighting, the query feature weighting, and the vocabulary restriction:

    (doc_feat_weight).(query_feat_weight).(vocab)

In what follows, we present the results of the following nine methods:

    M0  ntc.ntn.all
    M1  lnc.ltn.all
    M2  lnc.ltn.df15
    M3  lnc.ntn.all
    M4  lnc.ntn.df15
    M5  ltc.ltn.all
    M6  ltc.ltn.df15
    M7  ltc.ntn.all
    M8  ltc.ntn.df15

First, we compare the retrieval effectiveness of our method (M1) described in Section 2 to the standard tf*idf method (M0) by means of the precision-recall graph in Figure 2. As expected, the method M1 is more effective than M0 and achieves a retrieval effectiveness among the best methods presented at TREC-2. In order to find out the reason for this difference in the retrieval effectiveness, we must have a closer look at the influences of each parameter (document and query feature weighting).

In Figure 3, the 11-pts average precisions of each method (M0 to M8) are plotted on the left axis, and they are connected to the median response times (for the top ranked document) plotted on the right axis. The most obvious conclusion from this graph is the following: the higher the precision, the slower the response, and vice versa. The method M0 performs clearly worse than the methods M1 to M8 with respect to both retrieval effectiveness and retrieval efficiency.
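The merging step that turns several runs into one experiment, as defined above, can be sketched as follows. This is our own illustration: the document identifiers and scores are invented, and the merge is just a k-way merge of descending-sorted (RSV, document) lists.

```python
# Merge per-partition ranked lists into one experiment result per query:
# each run's list is already sorted by descending RSV, so a k-way merge
# followed by truncation to the top 1000 documents suffices.
import heapq

def merge_runs(runs, k=1000):
    """runs: lists of (rsv, doc_id) pairs, each sorted by descending rsv.
    Returns the k best pairs over all partitions, still descending."""
    return list(heapq.merge(*runs, reverse=True))[:k]

run1 = [(0.9, "WSJ1-17"), (0.4, "WSJ1-3")]   # invented partition results
run2 = [(0.7, "FR1-8"), (0.2, "FR1-1")]
top = merge_runs([run1, run2], k=3)
assert [d for _, d in top] == ["WSJ1-17", "FR1-8", "WSJ1-3"]
```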
We concentrate on the response times of the top ranked document because the response times of all further ranked documents are of secondary interest: a user is supposed to read the top ranked document before looking at the other documents, and the retrieval system can retrieve further documents while the user is reading the top ranked document.

Figure 2: Precision recall graphs of the most effective method (M1) and of the least effective method (M0).

We can also see from Figure 3 what the influences of the different parameters on the average precision are. Regarding the weighting of the document features, the "lnc" weighting achieves a 4-10% higher precision than the "ltc" weighting. The "ntc" weighting loses 5% of precision compared to "ltc". In the case of query feature weighting, again the logarithmic "ltn" weighting is more effective than the "ntn" weighting (10-15%). Restricting the vocabulary results in a 2-7% lower precision. We can summarize the experiences as follows:

* The nidf(φ_i)'s in the document feature weights have a bad influence on the retrieval effectiveness. It could be that for long documents the estimation of the lengths |d_j| is inappropriate when the nidf(φ_i)'s are taken into account.

* Logarithmic feature weighting is more appropriate for the long TREC documents and queries than linear feature weighting. Logarithmic feature weighting avoids an overweighting of features occurring very frequently within a document.

* Restricting the indexing vocabulary by omitting features with a high document frequency df(φ_i) has a noticeable influence on the average precision.

In what follows, we discuss the influences of the different parameters on the response time (as shown in Figure 3).
Restricting the vocabulary accelerates the query evaluation (by 9-14%) for the reasons described in Section 2. The "ntn" weighting of the query features is also 9-14% faster than the "ltn" weighting. For document feature weighting, the "lnc" weighting is 5-10% slower than the "ltc" weighting. The "ntc" weighting is even slower. These results can be explained in terms of the approximation error:

* For the "ltc" weighting, the approximation error (a0_ij - a_ij) * b_i is smaller than for the "lnc" weighting because of the scaling with the nidf(φ_i). On the other hand, the approximation error for the "ntc" weighting is very large compared to both the "ltc" and the "lnc" weighting because of the normalized linear weighting instead of the normalized logarithmic weighting.

Figure 3: Average precisions and response times of the first ranked document.

Figure 4: Response times of the first ranked document per run (adhoc queries versus the listed partitions) for the fastest method (M8 ltc.ntn.df15).

The box plots [7, pp.336] presented in Figure 4 show the distribution of the response times required to determine the top ranked document. A box plot visualizes the median (the line within the box), half of all samples (the box), and outliers (the dots) of a sample collection. In our case we have 50 samples: the response times of the 50 adhoc queries run against a partition.
For most queries, the top ranked document was retrieved in less than two seconds. The few outliers were all produced by the same queries (#103, #136, #138, #144). In general, the response times become shorter if a partition contains fewer documents and if the document descriptions consist of fewer postings on average (see Figure 5).

Figure 5: Number of postings per partition.

5 Conclusions

In our approach we stressed the update efficiency. We have shown that the retrieval effectiveness does not have to be sacrificed to achieve a high update efficiency when coping with highly dynamic document collections. Our approach could probably be further improved by finding a weighting scheme that, on the one hand, achieves a very good retrieval effectiveness and that, on the other hand, can be approximated by frequency independent weights with only little variation from the exact weights. The retrieval efficiency could be improved by better partitioning the document collections according to the lengths of the documents. Our approach seems to be very amenable to parallel processing. We may think of several configurations (partitioning the query, partitioning the document collection, etc.). At this moment, it is not clear which configuration is appropriate for which requirements. Furthermore, we do not know yet how dynamic the document collection must be such that our access structure outperforms inverted files.

References

[1] C. Buckley, G. Salton, and J. Allan. Automatic Retrieval With Locality Information Using SMART. In TREC-1 Proceedings, pages 59-72, 1992.

[2] W. B. Croft and P. Savino. Implementing Ranking Strategies Using Text Signatures. ACM Transactions on Information Systems, 6(1):42-62, 1988.

[3] U. Glavitsch and P. Schäuble. A System for Retrieving Speech Documents. In N. Belkin, P. Ingwersen, and A. M. Pejtersen, editors, ACM SIGIR Conference on R&D in Information Retrieval, pages 168-176, 1992.

[4] D. Harman.
Relevance Feedback Revisited. In N. Belkin, P. Ingwersen, and A. M. Pejtersen, editors, ACM SIGIR Conference on R&D in Information Retrieval, pages 1-10, 1992.

[5] D. Harman, E. Fox, R. A. Baeza-Yates, and W. Lee. Inverted Files. In W. B. Frakes and R. A. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 28-43. Prentice-Hall, Englewood Cliffs, NJ, 1992.

[6] M. F. Porter. An Algorithm for Suffix Stripping. Program, 14(3):130-137, 1980.

[7] J. A. Rice. Mathematical Statistics and Data Analysis. Wadsworth & Brooks, Pacific Grove, CA, 1988.

[8] G. Salton and C. Buckley. Parallel Text Search Methods. Communications of the ACM, 31(2):202-215, 1988.

[9] G. Salton and C. Buckley. Improving Retrieval Performance by Relevance Feedback. Journal of the ASIS, 41(4):288-297, 1990.

[10] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[11] P. Schäuble. SPIDER: A Multiuser Information Retrieval System for Semistructured and Dynamic Data. In ACM SIGIR Conference on R&D in Information Retrieval, pages 318-327, 1993.

[12] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, second edition, 1979.

1. Exact and approximate retrieval status values:

    exact:       RSV(q, d_j)  := (1/|d_j|) Σ_i b_i * a_ij
    approximate: RSV0(q, d_j) := (1/|d_j|) Σ_i b_i * a0_ij

2. Document length:

    |d_j| := sqrt(Σ_i a_ij^2)

3. Weights of query features:

    "ntn": b_i := ff(φ_i, q) * nidf(φ_i)
    "ltn": b_i := (1 + log(ff(φ_i, q))) * nidf(φ_i)

4. Exact and approximate weights of document features:

    "ntc": a_ij  := (ff(φ_i, d_j) / max{ff(φ_h, d_j) : φ_h ∈ d_j}) * nidf(φ_i)
           a0_ij := 1 * nidf(φ_i)
    "lnc": a_ij  := (1 + log(ff(φ_i, d_j))) / (1 + log(max{ff(φ_h, d_j) : φ_h ∈ d_j}))
           a0_ij := 1
    "ltc": a_ij  := ((1 + log(ff(φ_i, d_j))) / (1 + log(max{ff(φ_h, d_j) : φ_h ∈ d_j}))) * nidf(φ_i)
           a0_ij := 1 * nidf(φ_i)

5. Normalized inverse document frequency:

    nidf(φ_i) := 1 - log(1 + df(φ_i)) / log(1 + n)

Table 1: Classification of weighting schemes.
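The weighting schemes classified in Table 1 can be sketched numerically. This is a hedged reconstruction under our assumptions about the definitions (nidf(φ) = 1 - log(1 + df(φ))/log(1 + n), "ltn" query weights, "lnc" document weights); the function names are ours, not the authors'.

```python
# Sketch of the Table 1 weighting schemes and their key properties.
import math

def nidf(df: int, n: int) -> float:
    """Normalized inverse document frequency: 1 for df = 0, 0 for df = n."""
    return 1.0 - math.log(1 + df) / math.log(1 + n)

def ltn(ff_q: int, nidf_i: float) -> float:
    """'ltn' query weight b_i: logarithmic tf times nidf, not normalized."""
    return (1.0 + math.log(ff_q)) * nidf_i

def lnc(ff_d: int, max_ff_d: int) -> float:
    """'lnc' document weight a_ij before division by |d_j| (cosine norm)."""
    return (1.0 + math.log(ff_d)) / (1.0 + math.log(max_ff_d))

# Properties the paper relies on: nidf shrinks as df grows, and the
# approximate weight a0_ij = 1 is an upper bound of every 'lnc' weight.
assert nidf(0, 1000) == 1.0
assert nidf(10, 1000) > nidf(100, 1000)
assert lnc(3, 7) <= 1.0
```

The last assertion is the upper-bound property that makes the approximate retrieval status values of Section 2 valid bounds for the exact ones.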