Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

Alan Kent, Alistair Moffat, Ron Sacks-Davis, Ross Wilkinson, Justin Zobel

Collaborative Information Technology Research Institute
Departments of Computer Science, RMIT and The University of Melbourne
723 Swanston St, Carlton, Melbourne 3053, Australia
{ajk, alistair, rsd, ross, jz}@kbs.citri.edu.au

1 Introduction

Large document collections present many problems for practical information retrieval. To avoid unnecessary accesses to the text of the collection during query evaluation, comprehensive indexes are required. It must be possible to create and access these indexes in a reasonable amount of time, and to store them, and the data itself, in a reasonable amount of space. Moreover, although advanced indexes permit efficient evaluation of conventional ranking measures, they make only limited use of any structure inherent in the query. Here we describe two separate streams of research that address these problems.

In the first stream we have over the last two years developed indexing and compression techniques that allow the text to be stored compressed [2, 8]; provide fast access to document collections via compressed indexes [2, 9, 16]; create indexes with a fast inversion technique [7]; and permit fast ranking of large document collections [17]. On the small document collections initially available to us (the largest was 132 Mb), compressed text and index typically required less than 35% of the space required for the source text; query response times were a few seconds; and peak memory requirements were less than a megabyte during query processing. In Section 2 we describe the application of these techniques to the TREC data.
In the second stream of research we used the compression and indexing techniques discussed above, as well as signature file indexing techniques developed previously by the group [5, 11], to investigate the extent to which the structure of queries aids the retrieval process. In Section 3 we report on a series of experiments that examine how this structure may be used in both generation of Boolean queries and in ranking. As part of these experiments we are able to test the advantage of using adjacent pairs in retrieval.

2 Compression and Inverted File Indexing

We first describe the structure of text databases with inverted file indexes, and how documents are ranked with regard to a query. Then, in Section 2.3, we describe the compression of the text of the documents, and in Section 2.4 describe the representation of the inverted file and associated information. The result of applying these techniques to the TREC data is described in Section 2.5.

2.1 Text databases

Text databases provide access to text documents on the basis of their content. We assume that access is via the inverted file indexing scheme we have described elsewhere [16]. In this scheme, each distinct word in the database is held in a vocabulary, which may be an array, or may be a search structure such as a B-tree. For each word in the vocabulary the inverted index stores an index entry, a list of the identifiers of documents containing the word. Interleaved with the identifiers are the number of occurrences of each word in each document. Since the index entries contain ordinal document identifiers rather than disk addresses, a document mapping is needed to convert identifiers into addresses. This will require either a non-trivial amount of memory space, or must be stored on disk in an auxiliary file. Similarly, an index mapping is required to turn an ordinal word number into the address of the corresponding index entry.
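As a concrete illustration, the vocabulary and its index entries of `(ordinal document identifier, within-document frequency)' pairs can be modelled as below. This is a minimal sketch, not the authors' implementation; the function name and the whitespace tokenisation are our own assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct word to its index entry: a list of
    (ordinal document identifier, within-document frequency) pairs.
    Documents are numbered from 1, as in the scheme described above."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, freq in sorted(counts.items()):
            index[word].append((doc_id, freq))
    return dict(index)

docs = ["the cat sat", "the cat and the dog", "a dog barked"]
index = build_inverted_index(docs)
```

A real system would store the index entries compressed on disk (Section 2.4) and keep only the vocabulary in memory.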
Figure 1: Organisation of text databases

Other components of the database are the compression model, used for text compression, and the document lengths, used for ranking. These are discussed later. In our experiments, the vocabulary and compression model were held in memory, together with an array of approximate lengths, also described below; all of the other structures were stored on disk. Figure 1 shows the relationship between these various components.

2.2 Query evaluation

Given an inverted file index of the format described, it is straightforward to rank the documents in a database with respect to an informally phrased query, returning, say, the top ten documents as answers. As an example, consider the cosine measure, one of the most effective ranking techniques [1, 13, 14]. The cosine measure evaluates the relevance of a document to a query via the function

    cosine(q, d) = (sum_t w_{q,t} w_{d,t}) / (sqrt(sum_t w_{q,t}^2) sqrt(sum_t w_{d,t}^2))

where q is the query, d is the document, and w_{x,t} is the weight of word t in document or query x. A common function for assigning weights to words in document or query x is to use the frequency-modified inverse document frequency, described by

    w_{x,t} = tf_{x,t} log_e(N / f_t)

where tf_{x,t} is the number of occurrences of word t in x, N is the number of documents in the collection, and f_t is the number of documents containing t. This is commonly called the tf.idf weighting. The information required to compute the sum sum_t w_{q,t} w_{d,t} is held in the inverted index.
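The cosine computation can be sketched as follows. This is our own illustrative code, not the system's implementation; dictionaries of term frequencies stand in for the inverted index, and `df` is assumed to cover every term that appears.

```python
import math

def weight(tf, N, ft):
    # w_{x,t} = tf_{x,t} * log_e(N / f_t)
    return tf * math.log(N / ft)

def cosine(q_tf, d_tf, N, df):
    """q_tf and d_tf map words to within-query/document frequencies;
    df maps each word t to f_t, the number of documents containing t."""
    dot = sum(weight(q_tf[t], N, df[t]) * weight(d_tf[t], N, df[t])
              for t in q_tf if t in d_tf)
    q_len = math.sqrt(sum(weight(f, N, df[t]) ** 2 for t, f in q_tf.items()))
    d_len = math.sqrt(sum(weight(f, N, df[t]) ** 2 for t, f in d_tf.items()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0
```

By construction the measure is 1 for a document identical to the query and 0 when query and document share no terms.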
To answer the query, an accumulator with initial value zero is created for each document in the database; the index entry for each word in the query is retrieved; and, for each `word-number, document-number, frequency-in-document' triple, a cosine contribution is added into the corresponding accumulator. This technique is also used by Harman and Candela [4].

For each document in a static database the document length, sqrt(sum_t w_{d,t}^2), is a constant, and can be pre-computed and stored at the time the database is created. Thus, to compute the final ranking after the accumulators have been calculated, the document length for each document with a non-zero accumulator value must be fetched, and the accumulator value divided by this length. For the queries supplied with TREC, over 90% of the documents contained one or more of the query terms, meaning that over 90% of the document lengths needed to be accessed. Moreover, at over 2 Mb, the table of document lengths is large, and in practice space limitations might prevent it being held in memory.

We have shown, however, that keeping limited precision approximate lengths in memory can largely eliminate the cost of fetching document lengths from disk [17]. Suppose we are prepared to allow b bits of main memory storage for each document length x, where L <= x <= U. Then if B = ((U + epsilon)/L)^(2^-b), where epsilon is a very small number, the function f(x) = floor(log_B(x/L)) yields integers c between 0 and 2^b - 1 for all legal values of x, and g(c) = L B^c can be used to compute approximations to x. The c values stored can then be used in either of two ways. First, the approximate lengths can be used to determine which exact document lengths should be retrieved from disk. Second, the cosine measure can be computed with g(c), yielding an approximate ranking. With as few as two bits per length, there was only a small decline in retrieval performance in the five small document collections for which we had relevance judgments [17].
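The quantisation scheme just described can be sketched directly from its definitions. The function and parameter names are our own; the construction of B, f, and g follows the text.

```python
import math

def make_quantizer(L, U, b, eps=1e-9):
    """b-bit approximate document lengths for x in [L, U], following the
    construction above: B = ((U + eps) / L) ** (2 ** -b)."""
    B = ((U + eps) / L) ** (2.0 ** -b)
    def f(x):
        # exact length -> b-bit code c in 0 .. 2**b - 1
        return int(math.floor(math.log(x / L, B)))
    def g(c):
        # code -> approximate length, g(c) = L * B ** c
        return L * B ** c
    return f, g

f, g = make_quantizer(L=1.0, U=1000.0, b=8)
```

Since g(f(x)) under-approximates x by at most a factor of B, a ranking computed from g(c) differs only slightly from the exact one.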
The effects of these techniques on TREC are discussed in Section 2.5.

2.3 Text compression

Using a word-based model of the text, the space required to store the documents comprising a database can be reduced to less than 30% of the original size [2, 8, 15]. Each word occurrence in the text is replaced by a canonical Huffman code, the length of which is dependent on the frequency of the word, and the intervening `non-words' are similarly coded against a vocabulary of non-words. Thus, by alternately decoding a word, then a non-word, and so on down the compressed text, the exact form of the input document can be reconstructed. For further details of the scheme used the reader is referred to our previous work [15].

The decompression process is very fast. The canonical Huffman code used is particularly amenable to a table-based `look up and copy' implementation, and each such decoding step generates several output bytes. As a result, the compression regime has only limited impact on retrieval time. Moreover, the small amounts of time needed for decompression are partially (and sometimes fully) offset by reduced disk traffic.

Because every word and non-word decompressed is retrieved from the compression model, it is crucial that this model be held in main memory. If the compression model were stored on disk, two or more disk accesses would be required for each word in each document retrieved. On the smaller databases, the compression model did not exceed about 500 Kb. We were very interested to see how much space would be required to hold the model for the TREC collection.

2.4 Inverted file compression

The inverted file index entries should also be compressed, since without compression the inverted file might take as much space as the original text itself [2, 9, 16].
Good compression of index entries can be achieved by numbering documents sequentially from 1, sorting index entries, representing each sequence of identifiers as a sequence of differences or gaps, and then using compact representations of the generally small integers that result. For example, a word might appear in the first, fourth, seventeenth, ..., documents, with an index entry of

    1, 4, 17, 91, 113, ...

This can be represented by the sequence of differences

    1, 3, 13, 74, 22, ...

which can then be compressed.

We have experimented with several different techniques for compressing these runlengths, in two broad classes [8, 9]. Global methods use the same encoding for all entries, and so have the advantage of being more general, but are insensitive to the frequency of each term. Local methods code each entry using a code that is adjusted to take into account the frequency of the term. Use of a variable length prefix code allows the frequent small runlengths to be represented more succinctly than the less frequent long runlengths, and is a marked improvement over the standard binary encoding. The further advantage of using a local code arises because of the high variability of word frequency, and the effect this has on the average runlength within the entry. For example, the word `the'1 can be expected to have a sequence of runlengths where each difference is very small, usually one. On the other hand, a rare word such as `palimpsest' will have gaps that are often much larger, but also sometimes small, since rare words will tend to occur in clusters of related documents within the collection. Hence, a local compression method that adjusts its codes according to word frequency can be expected to perform better than a global method, and this is indeed the case [9]. Term `frequency-within-document' values are efficiently represented by use of the gamma code [16]. After compression, the index entries for a text database will typically occupy less than 10% of the space required for the text itself.
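The d-gap transformation, together with one standard variable-length prefix code (the Elias gamma code), can be sketched as follows. This is illustrative only; the specific local and global codes used in the system are those described in [9, 16].

```python
def gaps(doc_ids):
    """Represent an ascending index entry as d-gaps: 1, 4, 17 -> 1, 3, 13."""
    return [d - p for p, d in zip([0] + doc_ids, doc_ids)]

def gamma_encode(n):
    """Elias gamma code for n >= 1: a unary length prefix, then the
    binary representation of n (whose leading 1 bit ends the prefix)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

entry = [1, 4, 17, 91, 113]
encoded = "".join(gamma_encode(g) for g in gaps(entry))
```

Small gaps get short codes (the gap 1 costs a single bit), which is why entries for frequent words compress so well.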
Decompression is fast, with about 100,000 document identifiers decoded per second on a 25 MIP Sun SPARC 2. None of these compression methods require any significant memory resources once the index entry has been retrieved.

1 We did not `stop' any words, and also indexed all numbers. We did, however, stem all words before indexing, using Lovins's algorithm [6].

We have also used these inverted file encodings in the inversion process required during database construction, again capitalising on the large main memory offered by current workstations and the ability to perform two passes over a static database [7]. An alternative approach is the disk-based method described by Harman and Candela [4]; they report that the inversion of an 806 Mb database required 313 hours and 163 Mb of disk work space to produce an inverted index of 112 Mb. We were thus very interested to see whether the in-memory approach would be successful for the TREC collection.

2.5 Experimental results

This section describes the performance of our techniques on the 2055 Mb TREC collection. All results are for a 25 MIP Sun SPARC 2 with 256 Mb of main memory and no other active user processes.

Compression

The two components of the compression process took about three hours each, generating a compression model and a compressed database of 4.2 Mb and 605.1 Mb respectively (Table 1). The second pass also generated a file containing the document mapping, a little under 3 Mb. That is, including both of these auxiliary files, the document collection was reduced from 2055 Mb to 612 Mb, or 29.8%, a disk saving of 1400 Mb. The need to store the coding model meant that the compression passes were expensive on memory. With more than 900,000 unique symbols, and an additional storage requirement of about 12 bytes per word for the search structure, each pass required about 25 Mb of data structures.
These figures reinforce our contention that database creation needs to be performed on a reasonably well configured machine, even if query processing is performed elsewhere.

    Pass            Output files             Size (Mb)     %    CPU Time   Memory (Mb, peak)
    First pass:
      Compression   Compression model              4.2   0.2        2:37        25.6
      Inversion     Vocabulary                     6.4   0.3        3:02        18.7
      Overhead                                    10.6   0.5        0:19         2.5
      Total                                                         5:58        46.8
    Second pass:
      Compression   Compressed text              605.1  29.4        3:27        25.6
                    Document mapping               2.8   0.1
      Inversion     Inverted index               132.2   6.4        5:25       162.1
                    Inverted index mapping         2.1   0.1
                    Document lengths               2.8   0.1
                    Approximate lengths            0.7   0.0
      Overhead                                                      0:23         2.5
      Total                                      745.8  36.3        9:15       190.2
    Overall total                                756.4  36.8       15:13       190.2

Table 1: Files in compressed TREC database (b = 8). Percentages are relative to the size of the original document collection (2055.3 Mb)

Inversion

The inversion process was performed as two passes in tandem with the two compression passes. The first stage, counting term frequencies, had similar time and memory requirements to the first component of the compression process, but could not actually share data structures because of differing definitions of `words' and the need, in the index, for stemming.

The second pass of the inversion process was where we expected to encounter problems. The in-memory technique requires the allocation of memory a little larger than the final size of the compressed inverted file, and our previous experiments had indicated that this would be roughly 10% of the size of the input text [7]. That is, we expected to use more than 200 Mb of the 256 Mb memory available. However, the long TREC documents (averaging 470 words, rather than the 90 for our previous largest collection) meant that the inverted file was smaller than anticipated, and, as it turned out, the inversion process ran smoothly to completion in 162 Mb, and contributed five and a half hours to the time of the combined second pass.
One would not want to do this every day, but these resource requirements are well within the capabilities of a typical $50,000 workstation.

Using the `local VG' code [8, 9], the final inverted file required 132 Mb, or 6.4% of the input text, of which about 40 Mb was occupied by gamma-coded `frequency-within-document' values. Hence, if only Boolean queries were to be supported, the inverted file could be reduced to under 100 Mb. The second pass of the inversion process also produced three other files: the index mapping for the inverted file, giving the bit address for each entry; the file of exact document lengths (later merged with the document mapping to form the document information file); and a file of approximate document lengths. The sizes of these, plus all of the other components of the compressed database, are shown in Table 1.

Performance on ranked queries

The 50 test queries were transformed, by removal of SGML tags, stop words, and punctuation, into lists of words to which the cosine measure could be applied. The top 200 ranked documents were then retrieved and decompressed for each query.

    Operation     File                   Disk accesses   Data (Kb)   CPU time (sec)   Elapsed (sec)
    Searching     Index mapping                     43         170
    Ranking       Inverted index                    43        1708
                  Document information             219           7       16.6             18.2
    Compression   Compressed text                  200         203
                  Answers                            -         729        3.7              7.2
    Total                                          505        2817       20.3             25.4

Table 2: Ranked queries on TREC (b = 8)

Processing of these queries was remarkably quick. Table 2 shows, for the three stages of processing each query, the average amount of data involved and the average time required. (We did not separate the time taken by the searching algorithm and the ranking algorithm.) As can be seen, on average the queries were processed in about 20 seconds of CPU time.
Given that each query involved approximately 2.1 million document numbers and 600,000 non-zero accumulators, we were very pleased with this result. Note that the times listed in Table 2 include the cost of writing the retrieved documents to disk for later analysis by a post-processing program, but do not include the time required by the post-processing.

On average the queries required about 500 disk accesses and the transfer of 2.8 Mb of data. At (say) 100 accesses per second on the disk2, and a transfer rate of 1 Mb/sec, the input/output operations contribute about 5-8 seconds to the elapsed query time, accounting for the observed difference between CPU times and elapsed times.

For small queries response was even more impressive. For a query involving four terms (`Australia wheat tariff subsidy') and retrieval of the top ten documents, 2.6 seconds of CPU time were required; and, if output was directed into a file, the total elapsed time for the query was under 4 seconds.

The use of approximate lengths in memory (using b = 8) meant that on average only 219 exact weights were required to identify the top 200 documents for the 50 TREC queries. Depending on whether the exact lengths would have been stored in memory or on disk, with the use of approximate lengths we have either reduced the memory required by over 2 Mb, at the expense of 219 disk operations; or, alternatively, at a cost of 725 Kb of memory, we have avoided the need to sequentially read the entire exact lengths file, a saving of several seconds per query.

We also tested using the limited precision lengths to provide the final ranking, ignoring the exact lengths altogether. This means that the cosine measure is approximate, and some variation in recall effectiveness can be expected. In our previous experiments on smaller collections this degradation was minimal for even very low values of b.
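The disk-cost arithmetic above (about 500 accesses at 100 per second, plus 2.8 Mb at 1 Mb/sec) amounts to a simple back-of-envelope model. The function and default rates below are illustrative only, not a measured system model.

```python
def io_time(accesses, data_mb, seeks_per_sec=100.0, mb_per_sec=1.0):
    """Estimated I/O contribution to elapsed query time: seek cost
    plus transfer cost, using the rates assumed in the text."""
    return accesses / seeks_per_sec + data_mb / mb_per_sec
```

With the paper's averages this gives roughly 8 seconds, consistent with the observed gap between CPU and elapsed times.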
We used the 47 partial judgments supplied, and compared the number of relevant answers using the exact ranking and the approximate ranking technique. These results are shown in Table 3. Note that relevance judgments were not provided for all documents, hence the range for precision: most queries returned some documents that were un-judged. If we pessimistically assume that these documents were not relevant, the precision should be taken to be the lower value. Clearly, although the approximate ranking is finding different documents, the use of approximate ranking has, down to b = 4, little effect on precision.

                          Number of answers
    b             5           15           30          100          200
    exact 55.7-56.2    52.1-52.3    48.8-49.6    40.7-44.1    34.2-43.3
    12    55.3-55.7    51.9-52.2    48.8-49.6    40.7-44.1    34.1-43.3
    10    55.3-55.7    51.8-52.1    48.9-49.7    40.6-44.1    34.1-43.2
    8     54.9-55.7    51.2-51.5    48.8-49.5    40.6-44.1    34.2-43.3
    6     55.7-56.6    50.8-51.3    47.5-48.4    40.6-44.5    34.1-44.0
    4     57.9-59.1    50.2-50.9    46.5-48.6    38.0-45.6    31.2-48.0
    2     42.6-46.4    32.6-42.4    29.9-46.6    21.3-57.7    15.7-65.8

Table 3: Effect of varying b: percentage precision, 47 queries

2 One hundred accesses per second for the file containing the compressed text is optimistic for purely random accesses, since a transfer of several Kb from a random point in a large file in the Unix file system will typically require three or more seeks, each costing perhaps 10 to 20 milliseconds. However, the accesses to the document information file are in sequential order, and the accesses into the compressed text were batched into groups that were performed in sequential order.

Memory requirements

In total, the peak memory requirement during query processing is about 10.9 Mb (not all components are required to be co-resident, and, in particular, the accumulators and answer buffer are never simultaneously present). If necessary, this could be further reduced by writing the output text directly to disk rather than into an answer buffer.
However, we do not feel that this requirement is excessive given the large size of the document collection, and to date the use of approximate weights is the only area where we have concentrated on reducing memory usage.

3 Structured Queries

We focused in this experiment on structured queries. They would be transformed in two ways: into Boolean queries that would return a given number of documents, and into a vector of weights for the purposes of ranking. The system in which these experiments were carried out was a nested relational database, Atlas, that uses the multi-organisation signature file scheme [10]. The bulk of these experiments were carried out on the second part of the Tipster database. At the end of this section, we will report on some of the experiments that we have carried out on the whole database.

3.1 Boolean Query Generation

There are good reasons for preferring ranked retrieval over Boolean retrieval. Nevertheless, there are two key reasons that Boolean retrieval might be used. The first is that Boolean retrieval is computationally simpler, and the second is that the retrieval system may not support ranked retrieval. Since both factors were relevant to the set of experiments described here (we had been working with a Boolean retrieval system rather than the system described in Section 2), it was desirable to form a Boolean query and rank the answers it returned.

There were several requirements that a Boolean query generation method needed to satisfy. The first is that it should return a fixed number of documents. This allowed control over query time by fixing the number of documents to be subsequently ranked. Secondly, the query should be in disjunctive normal form, and have relatively few disjuncts. This is because query time is almost linear in the number of disjuncts in signature file retrieval systems. Thirdly, the query generation mechanism should be automatic.
These are the requirements that have previously been used as criteria for Boolean query generation algorithms [12]. However, the queries that we are considering in the TREC experiments have additional structure that may be used to advantage. Despite the limited amount of text to be processed, no natural language processing was performed; all processing was statistically and structurally based.

There were many possible approaches to Boolean query generation. A starting point might be to use one of the Boolean query generation algorithms previously developed [12] on the full text of the query. However, this would not adequately use the other available structures. We considered three approaches: base the query on the concepts field; base the query on the description field, augmented by the narrative; and base the query on the title, description, and narrative fields, using adjacent pairs in the disjuncts as well.

The first approach used a list of concepts. The words within the concept were ordered with the most frequent word in the full text of the query first, and the least frequent last. Starting at the first concept, a disjunct was formed by creating a conjunction of terms from the concept, starting at the first word, until the conjunct returned fewer documents than the desired limit. If the conjunct had fewer words than was required to meet this termination condition, other words were added to the concept by finding words in the remaining text that were adjacent to the first word in the concept. Finally, words that occurred rarely in the database were added as simple disjuncts if the query was still estimated to return fewer than the required number of documents. For example, to return 500 documents for query 82 on genetic engineering, we generated the Boolean query:

    (genetic AND engineering) OR (manipulation AND molecular) OR
    (biotechnology AND genetic AND manipulation) OR (animal AND drug AND plant)

The second approach started by forming a simple disjunction of rare words, up to 10% of the required number of documents.
Next, two key descriptors were identified, using the text of the description primarily, but also using the narrative. The key descriptors were created as a set of three words each, where each key descriptor is intended to capture a principal characteristic of the query. The frequency of the terms in the query and the adjacency of words were used to generate the key descriptors. Sometimes the algorithm would form two variations of the one concept, sometimes two distinct concepts. For instance, query 82 formed the key descriptors (genetic, engineering, developed) and (genetic, engineering, manipulation), to generate the query:

    (genetic AND engineering) OR (genetic AND manipulation)

The third approach was quite similar to the second, except that more emphasis was placed on the title in generating the concepts, and, by using pairs of words in the query, finer gradation of the Boolean queries was possible. The concepts for queries 82 and 87 were the same. The Boolean query for query 82 was the same as for the second algorithm; however, the query generated for 87 was:

    institutions-failed OR (actions-criminal AND officers)

When evaluating these approaches, the key consideration is whether or not relevant documents are identified. For a fair comparison, each algorithm should return the same average number of documents. In Table 4, each algorithm has returned an average of about 400 documents. The recall and precision results are all based on the same ranking formula. Note that since these algorithms identified documents that had not been assessed as relevant or irrelevant, we made the (pessimistic) assumption that all such documents were irrelevant.

    Algorithm          Concepts   Description   Description with Pairs
    Docs. Requested         200           700                      600
    Docs. Returned          453           354                      360
    Disjuncts               3.5           1.8                      4.4
    Recall                0.155         0.170                    0.146
    Precision             0.150         0.144                    0.121

Table 4: Comparison of Boolean query generation algorithms

To show how increasing the number of documents requested affects these algorithms, in Table 5 we show how many documents were retrieved and how many disjuncts were used when each of the algorithms was requested to return 1,000 documents.

    Algorithm          Concepts   Description   Description with Pairs
    Docs. Returned         1221          1063                        -
    Disjuncts              4.01          1.94                        -

Table 5: Further comparison of Boolean query algorithms

From these tables we are able to see that the least costly algorithm is based on finding two key descriptors. If pairs are added, there are more rare terms, which each contribute a single disjunct.

3.2 Ranking

Ranking documents using the vector space model has usually treated both documents and queries as flat structures: lists of words. However, it is very simple to combine different representations of a query in ranking. If each form of a query is represented as a unit vector over the same vector space, we may combine the representations by vector addition. We are thus able to combine structural information and statistical information in the same measure.

The first series of ranking experiments was used to determine the relative usefulness of some of the query fields. The results are given in Table 6. The third Boolean query generation algorithm was used to generate a set of approximately 1,000 documents for each query. These were ranked using each field successively, and then combined, ignoring structure. In all these experiments we give first the results in terms of recall and precision, and then in terms of precision at fixed numbers of viewed documents.
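The combination rule (each representation normalised to a unit vector, then added with chosen weights) can be sketched as follows. The helper names are our own; vectors are modelled as word-to-weight dictionaries.

```python
import math

def unit(v):
    """Normalise a word -> weight vector to unit length."""
    n = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / n for w, x in v.items()} if n else dict(v)

def combine(weighted_vectors):
    """Combine several query representations over the same vector space
    by weighted vector addition, e.g. [(1, Vt), (2, Vd), (3, Vn)]."""
    out = {}
    for weight, v in weighted_vectors:
        for w, x in unit(v).items():
            out[w] = out.get(w, 0.0) + weight * x
    return out
```

Because each field is normalised before addition, a verbose field such as the narrative cannot swamp a short one such as the title unless it is explicitly given a larger weight.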
Let Vt stand for the title vector, Vd stand for the description vector, Vn stand for the narrative vector, Vc stand for the concepts vector, Vf stand for the definition vector, Va stand for the vector for all of the text, and Vp stand for the vector of all adjacent pairs in the query.

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    Vd      0.197  0.114  0.032  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    Vn      0.192  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    Va      0.191  0.114  0.034  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.038

Table 6: Ranking using individual fields

Table 7 shows what happens if each of the five fields is added with the same weight, and then what happens if adjacent pairs are added as well. By way of comparison, we show the effect of adding pairs to the full text vector, and of adding pairs to the narrative vector.

    V1 = Vt + Vd + Vn + Vf + Vc
    V2 = Vt + Vd + Vn + Vf + Vc + Vp
    V3 = Va
    V4 = Va + Vp
    V5 = Vn
    V6 = Vn + Vp

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    V1      0.209  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.040
    V2      0.209  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.040
    V3      0.191  0.114  0.034  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    V4      0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039
    V5      0.192  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    V6      0.198  0.111  0.045  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 7: All field ranking

Using this information, we tried to combine the various fields in "creative" ways. In Table 8 we show the results of creating combined vectors described by

    V7 = Vt + 2Vd + 3Vn + 0.5Vf + 3Vc + 2Vp
    V8 = Vt + Vd + Vn + 0.5Vf + 3Vc + 3Vp

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    V7      0.208  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039
    V8      0.205  0.112  0.033  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 8: Combining the fields

The first thing to note is how similar the results are.
Since a limited number of documents are being examined using an imperfect Boolean algorithm, many relevant documents are being missed altogether, so the ranking formulas have no possibility of giving them high scores. An alternative technique for evaluating ranking formulas is to determine precision after fixed numbers of documents have been examined. This has the advantage that large numbers of relevant documents not identified by the Boolean algorithm do not flatten out recall/precision results. Table 9 gives results at fixed document cutoffs for all previous experiments.

    Documents      5     15     30    100    200    Av.
    Vd         0.374  0.321  0.298  0.223  0.158  0.275
    Vn         0.319  0.295  0.279  0.225  0.159  0.255
    Va         0.349  0.316  0.305  0.224  0.169  0.273
    V1         0.387  0.352  0.336  0.237  0.166  0.296
    V2         0.391  0.352  0.336  0.237  0.166  0.296
    V3         0.349  0.316  0.305  0.224  0.169  0.273
    V4         0.349  0.305  0.304  0.230  0.171  0.272
    V5         0.319  0.295  0.279  0.225  0.159  0.255
    V6         0.336  0.304  0.284  0.229  0.164  0.263
    V7         0.404  0.367  0.334  0.236  0.166  0.302
    V8         0.370  0.350  0.327  0.238  0.165  0.290

Table 9: Comparison of ranking formulas for fixed numbers of documents returned

There are two observations that one might make after examining both forms of evaluation. The first is that there appears to be a small gain obtained by treating the text in the query in a structured way using the methodology proposed here. The gain is less than 10%. The second result is both surprising and has significant ramifications. The use of adjacent pairs in previous experiments on small collections with relatively short queries provided substantial improvement in recall/precision [3, 12]. We see no such improvement in these experiments.

Finally, we may compare the three Boolean algorithms in their ability to identify appropriate subsets for ranking. In this comparison, we use V4 for ranking each of the subsets.
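The evaluation method just described (precision after a fixed number of documents has been examined, with un-judged documents pessimistically counted as irrelevant) can be sketched as follows; the function name and cutoffs are our own.

```python
def precision_at(ranked_ids, relevant, cutoffs=(5, 15, 30, 100, 200)):
    """Precision after examining the first k documents of a ranking,
    for each cutoff k; documents not in `relevant` count as irrelevant."""
    return {k: sum(1 for d in ranked_ids[:k] if d in relevant) / k
            for k in cutoffs}
```

Unlike recall/precision averages, this measure is unaffected by relevant documents that the Boolean filter never returned, which is why it is the fairer comparison here.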
Let Ac represent the algorithm based on concepts, Ad represent the algorithm based on the description, and Ap represent the algorithm that uses adjacent pairs. Table 10 shows recall/precision figures and Table 11 shows precision for fixed numbers of documents returned.

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    Ac      0.185  0.089  0.054  0.031  0.028  0.012  0.000  0.000  0.000  0.000  0.040
    Ad      0.220  0.133  0.048  0.017  0.018  0.017  0.000  0.000  0.000  0.000  0.045
    Ap      0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 10: Comparison of Boolean algorithms

    Documents      5     15     30    100    200    Av.
    Ac         0.328  0.319  0.306      -  0.169  0.269
    Ad         0.349  0.332  0.311  0.222  0.149      -
    Ap         0.349  0.305  0.304  0.230  0.171  0.272

Table 11: Comparison of Boolean algorithms for fixed numbers of documents returned

Using both methods of evaluation, the algorithm that uses key descriptors without adjacent pairs outperformed both the algorithm augmented with pairs and the algorithm using concepts. This is despite the fact that Ac returned an average of 1190 documents per query, Ap returned an average of 667 documents, and Ad returned only an average of 530 documents.

3.3 The Atlas System

The Boolean query generation experiments were performed using the Atlas database system [5]. This system is a nested relational database system designed for text applications. The system was not adapted or tuned in any way to support the TREC experiments. There are a number of signature file organisations that are supported by Atlas: bit-slice, two-level schemes, and multi-organisation schemes. We implemented an algorithm that analyses the data and selects the most appropriate scheme and optimal parameters for this scheme [11]. For the TREC collection, the multi-organisation scheme was selected. It took 23 hours to load the database and build the index, which required 313 Mb of disk space. This indexed all stopped and stemmed terms, including adjacent pairs.
The text stored in the database was compressed using the mechanism described in Section 2.3.

3.4 Combined Database Results

Due to our inability to hold the whole of TREC in both the form used for the compression experiments and the form used for the Atlas system, we ran a further set of experiments using the compressed database system, but using the Boolean algorithm based on concepts, to test some of the ranking formulas. We repeated the experiments using Vd, the description vector; Vn, the narrative vector; Va, the vector of all text with structure ignored; V1, where all text was used but structure was taken into consideration; and V2, a modification of V1 where pairs were added as an extra vector. The results are given using recall and precision in Table 12, and then at various cutoffs based on the number of documents retrieved in Table 13.

Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
Vd      0.324  0.196  0.136  0.090  0.083  0.034  0.022  0.005  0.000  0.000  0.089
Vn      0.325  0.218  0.128  0.084  0.080  0.034  0.022  0.005  0.000  0.000  0.090
Va      0.298  0.170  0.116  0.074  0.067  0.040  0.026  0.005  0.000  0.000  0.080
V1      0.372  0.235  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099
V2      0.372  0.236  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099

Table 12: Ranking all data

Documents      5     15     30    100    200    Av.
Vd         0.417  0.431  0.404  0.349  0.294  0.379
Vn         0.370  0.408  0.407  0.353  0.301  0.368
Va         0.383  0.391  0.379  0.329  0.279  0.352
V1         0.447  0.472  0.438  0.377  0.312  0.409
V2         0.447  0.475  0.438  0.375  0.312  0.409

Table 13: Ranking all data

The results we obtained for the smaller collection are again observed with the full 2 Gb of text. However, the advantage of using the structure of the queries is slightly more pronounced.

4 Summary
In the first line of investigation, we built a compressed retrieval system for around 2 Gb of text that required under 37% of the size of the input text, and was created in just over 15 CPU-hours on a typical well-configured workstation. Retrieval performance for simple techniques such as the cosine measure is fast, and queries can be processed effectively on low-end workstations in seconds. The techniques developed during this first thread of investigation have thus altered the status of 2 Gb information retrieval systems from being `unpleasantly expensive' to being `eminently practical'.

In our second thread of research we have seen that the use of the structure of queries has enabled Boolean queries to be generated that have few disjuncts and perform better than more complicated ones. Given the large number of documents that are relevant to many of the queries that have been examined, such algorithms must be subject to reduced performance when compared to techniques that rank the entire collection. In the light of the parallel experiments carried out using the full collection, it is questionable whether such techniques will survive. We have seen a minor improvement in ranking as a result of taking the structure of queries into account. A more surprising result, however, is that pairs do not provide the performance gains that we have seen with much smaller collections. As a result, the vocabulary for a collection would appear to be more manageable than if pairs needed to be explicitly represented.

Acknowledgements

We are grateful to Lachlan Andrew, Daniel Lam, and Neil Sharman for assisting with the implementation. This work was supported by the Australian Research Council.

References

[1] S. Al-Hawamdeh and P. Willett. Comparison of index term weighting schemes for the ranking of paragraphs in full-text documents. International Journal of Information and Library Research, pages 116-130, 1990.

[2] T.C. Bell, A. Moffat, C.G. Nevill, I.H. Witten, and J.
Zobel. Data compression in full-text retrieval systems. Journal of the American Society for Information Science. To appear.

[3] J. Fagan. Automatic phrase indexing for document retrieval: An examination of syntactic and non-syntactic methods. In Proc. 10th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 91-108, 1987.

[4] D. Harman and G. Candela. Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41(8):581-589, 1990.

[5] A.J. Kent, R. Sacks-Davis, and K. Ramamohanarao. A signature file scheme based on multiple organisations for indexing very large text databases. Journal of the American Society for Information Science, 41(7):508-534, 1990.

[6] J.B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computation, 11(1-2):22-31, 1968.

[7] A. Moffat. Economical inversion of large text files. Computing Systems, 5(2):125-139, 1992.

[8] A. Moffat and J. Zobel. Coding for compression in full-text retrieval systems. In Proc. 2nd IEEE Data Compression Conference, pages 72-81, Snowbird, Utah, March 1992. IEEE Computer Society Press.

[9] A. Moffat and J. Zobel. Parameterised compression for sparse bitmaps. In Proc. 15th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 274-285, Copenhagen, Denmark, June 1992. ACM Press.

[10] R. Sacks-Davis, A. Kent, R. Kotagiri, J. Thom, and J. Zobel. A nested relational database for text applications. Technical Report 92-52, Collaborative Information Technology Research Institute, Melbourne, Australia, 1992.

[11] R. Sacks-Davis, A.J. Kent, and K. Ramamohanarao. Multi-key access methods based on superimposed coding techniques. ACM Trans. on Database Systems, 12(4):655-696, 1987.

[12] R. Sacks-Davis, P. Wallis, and R. Wilkinson. Using syntactic analysis in a document retrieval system that uses signature files. In Proc.
13th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 179-192, September 1990.

[13] G. Salton. Automatic Text Processing. Addison Wesley, Massachusetts, 1989.

[14] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[15] J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. In Proc. 15th Australian Computer Science Conference, pages 1077-1089, Hobart, Australia, January 1992.

[16] J. Zobel, A. Moffat, and R. Sacks-Davis. An efficient indexing technique for full-text database systems. In Proc. 18th Conference on Very Large Databases, pages 352-362, Vancouver, Canada, August 1992.

[17] J. Zobel, A. Moffat, and R. Sacks-Davis. Memory-efficient ranking of document collections. Technical Report TR-92-53, Collaborative Information Technology Research Institute, Melbourne, Australia, August 1992.