Compression, Fast Indexing, and Structured Queries on a Gigabyte of Text

Alan Kent, Alistair Moffat, Ron Sacks-Davis, Ross Wilkinson, Justin Zobel

Collaborative Information Technology Research Institute
Departments of Computer Science, RMIT and The University of Melbourne
723 Swanston St, Carlton, Melbourne 3053, Australia
{ajk, alistair, rsd, ross, jz}@kbs.citri.edu.au

1 Introduction

Large document collections present many problems for practical information retrieval. To avoid unnecessary accesses to the text of the collection during query evaluation, comprehensive indexes are required. It must be possible to create and access these indexes in a reasonable amount of time, and to store them, and the data itself, in a reasonable amount of space. Moreover, although advanced indexes permit efficient evaluation of conventional ranking measures, they make only limited use of any structure inherent in the query. Here we describe two separate streams of research that address these problems.

In the first stream we have over the last two years developed indexing and compression techniques that allow the text to be stored compressed [2, 8]; provide fast access to document collections via compressed indexes [2, 9, 16]; create indexes with a fast inversion technique [7]; and permit fast ranking of large document collections [17]. On the small document collections initially available to us (the largest was 132 Mb), compressed text and index typically required less than 35% of the space required for the source text; query response times were a few seconds; and peak memory requirements were less than a megabyte during query processing. In Section 2 we describe the application of these techniques to the TREC data.
In the second stream of research we used the compression and indexing techniques discussed above, as well as signature file indexing techniques developed previously by the group [5, 11], to investigate the extent to which the structure of queries aids the retrieval process. In Section 3 we report on a series of experiments that examine how this structure may be used in both generation of Boolean queries and in ranking. As part of these experiments we are able to test the advantage of using adjacent pairs in retrieval.

2 Compression and Inverted File Indexing

We first describe the structure of text databases with inverted file indexes, and how documents are ranked with regard to a query. Then, in Section 2.3, we describe the compression of the text of the documents, and in Section 2.4 describe the representation of the inverted file and associated information. The result of applying these techniques to the TREC data is described in Section 2.5.

2.1 Text databases

Text databases provide access to text documents on the basis of their content. We assume that access is via the inverted file indexing scheme we have described elsewhere [16]. In this scheme, each distinct word in the database is held in a vocabulary, which may be an array, or may be a search structure such as a B-tree. For each word in the vocabulary the inverted index stores an index entry, a list of the identifiers of documents containing the word. Interleaved with the identifiers are the number of occurrences of each word in each document. Since the index entries contain ordinal document identifiers rather than disk addresses, a document mapping is needed to convert identifiers into addresses. This will require either a non-trivial amount of memory space, or must be stored on disk in an auxiliary file. Similarly, an index mapping is required to turn an ordinal word number into the address of the corresponding index entry.
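As a concrete illustration, the vocabulary and its index entries of `(ordinal document identifier, within-document frequency)' pairs can be modelled as below. This is a minimal sketch, not the authors' implementation; the function name and the whitespace tokenisation are our own assumptions.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each distinct word to its index entry: a list of
    (ordinal document identifier, within-document frequency) pairs.
    Documents are numbered from 1, as in the scheme described above."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs, start=1):
        counts = defaultdict(int)
        for word in text.lower().split():
            counts[word] += 1
        for word, freq in sorted(counts.items()):
            index[word].append((doc_id, freq))
    return dict(index)

docs = ["the cat sat", "the cat and the dog", "a dog barked"]
index = build_inverted_index(docs)
```

A real system would store the index entries compressed on disk (Section 2.4) and keep only the vocabulary in memory.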
Figure 1: Organisation of text databases

Other components of the database are the compression model, used for text compression, and the document lengths, used for ranking. These are discussed later. In our experiments, the vocabulary and compression model were held in memory, together with an array of approximate lengths, also described below; all of the other structures were stored on disk. Figure 1 shows the relationship between these various components.

2.2 Query evaluation

Given an inverted file index of the format described, it is straightforward to rank the documents in a database with respect to an informally phrased query, returning, say, the top ten documents as answers. As an example, consider the cosine measure, one of the most effective ranking techniques [1, 13, 14]. The cosine measure evaluates the relevance of a document to a query via the function

    cosine(q, d) = (sum_t w_{q,t} w_{d,t}) / (sqrt(sum_t w_{q,t}^2) sqrt(sum_t w_{d,t}^2))

where q is the query, d is the document, and w_{x,t} is the weight of word t in document or query x. A common function for assigning weights to words in document or query x is to use the frequency-modified inverse document frequency, described by

    w_{x,t} = tf_{x,t} log_e(N / f_t)

where tf_{x,t} is the number of occurrences of word t in x, N is the number of documents in the collection, and f_t is the number of documents containing t. This is commonly called the tf.idf weighting. The information required to compute the sum sum_t w_{q,t} w_{d,t} is held in the inverted index.
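The cosine computation can be sketched as follows. This is our own illustrative code, not the system's implementation; dictionaries of term frequencies stand in for the inverted index, and `df` is assumed to cover every term that appears.

```python
import math

def weight(tf, N, ft):
    # w_{x,t} = tf_{x,t} * log_e(N / f_t)
    return tf * math.log(N / ft)

def cosine(q_tf, d_tf, N, df):
    """q_tf and d_tf map words to within-query/document frequencies;
    df maps each word t to f_t, the number of documents containing t."""
    dot = sum(weight(q_tf[t], N, df[t]) * weight(d_tf[t], N, df[t])
              for t in q_tf if t in d_tf)
    q_len = math.sqrt(sum(weight(f, N, df[t]) ** 2 for t, f in q_tf.items()))
    d_len = math.sqrt(sum(weight(f, N, df[t]) ** 2 for t, f in d_tf.items()))
    return dot / (q_len * d_len) if q_len and d_len else 0.0
```

By construction the measure is 1 for a document identical to the query and 0 when query and document share no terms.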
To answer the query, an accumulator with initial value zero is created for each document in the database; the index entry for each word in the query is retrieved; and, for each `word-number, document-number, frequency-in-document' triple, a cosine contribution is added into the corresponding accumulator. This technique is also used by Harman and Candela [4].

For each document in a static database the document length, sqrt(sum_t w_{d,t}^2), is a constant, and can be pre-computed and stored at the time the database is created. Thus, to compute the final ranking after the accumulators have been calculated, the document length for each document with a non-zero accumulator value must be fetched, and the accumulator value divided by this length. For the queries supplied with TREC, over 90% of the documents contained one or more of the query terms, meaning that over 90% of the document lengths needed to be accessed. Moreover, at over 2 Mb, the table of document lengths is large, and in practice space limitations might prevent it being held in memory.

We have shown, however, that keeping limited precision approximate lengths in memory can largely eliminate the cost of fetching document lengths from disk [17]. Suppose we are prepared to allow b bits of main memory storage for each document length x, where L <= x <= U. Then if B = ((U + epsilon)/L)^(2^-b), where epsilon is a very small number, the function f(x) = floor(log_B(x/L)) yields integers c between 0 and 2^b - 1 for all legal values of x, and g(c) = L B^c can be used to compute approximations to x. The c values stored can then be used in either of two ways. First, the approximate lengths can be used to determine which exact document lengths should be retrieved from disk. Second, the cosine measure can be computed with g(c), yielding an approximate ranking. With as few as two bits per length, there was only a small decline in retrieval performance in the five small document collections for which we had relevance judgments [17].
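The quantisation scheme just described can be sketched directly from its definitions. The function and parameter names are our own; the construction of B, f, and g follows the text.

```python
import math

def make_quantizer(L, U, b, eps=1e-9):
    """b-bit approximate document lengths for x in [L, U], following the
    construction above: B = ((U + eps) / L) ** (2 ** -b)."""
    B = ((U + eps) / L) ** (2.0 ** -b)
    def f(x):
        # exact length -> b-bit code c in 0 .. 2**b - 1
        return int(math.floor(math.log(x / L, B)))
    def g(c):
        # code -> approximate length, g(c) = L * B ** c
        return L * B ** c
    return f, g

f, g = make_quantizer(L=1.0, U=1000.0, b=8)
```

Since g(f(x)) under-approximates x by at most a factor of B, a ranking computed from g(c) differs only slightly from the exact one.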
The effects of these techniques on TREC are discussed in Section 2.5.

2.3 Text compression

Using a word-based model of the text, the space required to store the documents comprising a database can be reduced to less than 30% of the original size [2, 8, 15]. Each word occurrence in the text is replaced by a canonical Huffman code, the length of which is dependent on the frequency of the word, and the intervening `non-words' are similarly coded against a vocabulary of non-words. Thus, by alternately decoding a word, then a non-word, and so on down the compressed text, the exact form of the input document can be reconstructed. For further details of the scheme used the reader is referred to our previous work [15].

The decompression process is very fast. The canonical Huffman code used is particularly amenable to a table-based `look up and copy' implementation, and each such decoding step generates several output bytes. As a result, the compression regime has only limited impact on retrieval time. Moreover, the small amounts of time needed for decompression are partially (and sometimes fully) offset by reduced disk traffic.

Because every word and non-word decompressed is retrieved from the compression model, it is crucial that this model be held in main memory. If the compression model were stored on disk, two or more disk accesses would be required for each word in each document retrieved. On the smaller databases, the compression model did not exceed about 500 Kb. We were very interested to see how much space would be required to hold the model for the TREC collection.

2.4 Inverted file compression

The inverted file index entries should also be compressed, since without compression the inverted file might take as much space as the original text itself [2, 9, 16].
Good compression of index entries can be achieved by numbering documents sequentially from 1, sorting index entries, representing each sequence of identifiers as a sequence of differences or gaps, and then using compact representations of the generally small integers that result. For example, a word might appear in the first, fourth, seventeenth, ..., documents, with an index entry of

    1, 4, 17, 91, 113, ...

This can be represented by the sequence of differences

    1, 3, 13, 74, 22, ...

which can then be compressed.

We have experimented with several different techniques for compressing these runlengths, in two broad classes [8, 9]. Global methods use the same encoding for all entries, and so have the advantage of being more general, but are insensitive to the frequency of each term. Local methods code each entry using a code that is adjusted to take into account the frequency of the term. Use of a variable length prefix code allows the frequent small runlengths to be represented more succinctly than the less frequent long runlengths, and is a marked improvement over the standard binary encoding. The further advantage of using a local code arises because of the high variability of word frequency, and the effect this has on the average runlength within the entry. For example, the word `the'1 can be expected to have a sequence of runlengths where each difference is very small, usually one. On the other hand, a rare word such as `palimpsest' will have gaps that are often much larger, but also sometimes small, since rare words will tend to occur in clusters of related documents within the collection. Hence, a local compression method that adjusts its codes according to word frequency can be expected to perform better than a global method, and this is indeed the case [9]. Term `frequency-within-document' values are efficiently represented by use of the gamma code [16]. After compression, the index entries for a text database will typically occupy less than 10% of the space required for the text itself.
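The d-gap transformation, together with one standard variable-length prefix code (the Elias gamma code), can be sketched as follows. This is illustrative only; the specific local and global codes used in the system are those described in [9, 16].

```python
def gaps(doc_ids):
    """Represent an ascending index entry as d-gaps: 1, 4, 17 -> 1, 3, 13."""
    return [d - p for p, d in zip([0] + doc_ids, doc_ids)]

def gamma_encode(n):
    """Elias gamma code for n >= 1: a unary length prefix, then the
    binary representation of n (whose leading 1 bit ends the prefix)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def gamma_decode_all(bits):
    """Decode a concatenation of gamma codes back into integers."""
    out, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":
            zeros += 1
            i += 1
        out.append(int(bits[i:i + zeros + 1], 2))
        i += zeros + 1
    return out

entry = [1, 4, 17, 91, 113]
encoded = "".join(gamma_encode(g) for g in gaps(entry))
```

Small gaps get short codes (the gap 1 costs a single bit), which is why entries for frequent words compress so well.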
Decompression is fast, with about 100,000 document identifiers decoded per second on a 25 MIP Sun SPARC 2. None of these compression methods require any significant memory resources once the index entry has been retrieved.

1 We did not `stop' any words, and also indexed all numbers. We did, however, stem all words before indexing, using Lovins's algorithm [6].

We have also used these inverted file encodings in the inversion process required during database construction, again capitalising on the large main memory offered by current workstations and the ability to perform two passes over a static database [7]. An alternative approach is the disk-based method described by Harman and Candela [4]; they report that the inversion of an 806 Mb database required 313 hours and 163 Mb of disk work space to produce an inverted index of 112 Mb. We were thus very interested to see whether the in-memory approach would be successful for the TREC collection.

2.5 Experimental results

This section describes the performance of our techniques on the 2055 Mb TREC collection. All results are for a 25 MIP Sun SPARC 2 with 256 Mb of main memory and no other active user processes.

Compression

The two components of the compression process took about three hours each, generating a compression model and a compressed database of 4.2 Mb and 605.1 Mb respectively (Table 1). The second pass also generated a file containing the document mapping, a little under 3 Mb. That is, including both of these auxiliary files, the document collection was reduced from 2055 Mb to 612 Mb, or 29.8%, a disk saving of 1400 Mb. The need to store the coding model meant that the compression passes were expensive on memory. With more than 900,000 unique symbols, and an additional storage requirement of about 12 bytes per word for the search structure, each pass required about 25 Mb of data structures.
These figures reinforce our contention that database creation needs to be performed on a reasonably well configured machine, even if query processing is performed elsewhere.

    Pass            Output files             Size (Mb)     %    CPU Time   Memory (Mb, peak)
    First pass:
      Compression   Compression model              4.2   0.2        2:37        25.6
      Inversion     Vocabulary                     6.4   0.3        3:02        18.7
      Overhead                                    10.6   0.5        0:19         2.5
      Total                                                         5:58        46.8
    Second pass:
      Compression   Compressed text              605.1  29.4        3:27        25.6
                    Document mapping               2.8   0.1
      Inversion     Inverted index               132.2   6.4        5:25       162.1
                    Inverted index mapping         2.1   0.1
                    Document lengths               2.8   0.1
                    Approximate lengths            0.7   0.0
      Overhead                                                      0:23         2.5
      Total                                      745.8  36.3        9:15       190.2
    Overall total                                756.4  36.8       15:13       190.2

Table 1: Files in compressed TREC database (b = 8). Percentages are relative to the size of the original document collection (2055.3 Mb)

Inversion

The inversion process was performed as two passes in tandem with the two compression passes. The first stage, counting term frequencies, had similar time and memory requirements to the first component of the compression process, but could not actually share data structures because of differing definitions of `words' and the need, in the index, for stemming.

The second pass of the inversion process was where we expected to encounter problems. The in-memory technique requires the allocation of memory a little larger than the final size of the compressed inverted file, and our previous experiments had indicated that this would be roughly 10% of the size of the input text [7]. That is, we expected to use more than 200 Mb of the 256 Mb memory available. However, the long TREC documents (averaging 470 words, rather than the 90 for our previous largest collection) meant that the inverted file was smaller than anticipated, and, as it turned out, the inversion process ran smoothly to completion in 162 Mb, and contributed five and a half hours to the time of the combined second pass.
One would not want to do this every day, but these resource requirements are well within the capabilities of a typical $50,000 workstation.

Using the `local VG' code [8, 9], the final inverted file required 132 Mb, or 6.4% of the input text, of which about 40 Mb was occupied by gamma-coded `frequency-within-document' values. Hence, if only Boolean queries were to be supported, the inverted file could be reduced to under 100 Mb. The second pass of the inversion process also produced three other files: the index mapping for the inverted file, giving the bit address for each entry; the file of exact document lengths (later merged with the document mapping to form the document information file); and a file of approximate document lengths. The sizes of these, plus all of the other components of the compressed database, are shown in Table 1.

Performance on ranked queries

The 50 test queries were transformed, by removal of SGML tags, stop words, and punctuation, into lists of words to which the cosine measure could be applied. The top 200 ranked documents were then retrieved and decompressed for each query.

    Operation     File                   Disk accesses   Data (Kb)   CPU time (sec)   Elapsed (sec)
    Searching     Index mapping                     43         170
    Ranking       Inverted index                    43        1708
                  Document information             219           7       16.6             18.2
    Compression   Compressed text                  200         203
                  Answers                            -         729        3.7              7.2
    Total                                          505        2817       20.3             25.4

Table 2: Ranked queries on TREC (b = 8)

Processing of these queries was remarkably quick. Table 2 shows, for the three stages of processing each query, the average amount of data involved and the average time required. (We did not separate the time taken by the searching algorithm and the ranking algorithm.) As can be seen, on average the queries were processed in about 20 seconds of CPU time.
Given that each query involved approximately 2.1 million document numbers and 600,000 non-zero accumulators, we were very pleased with this result. Note that the times listed in Table 2 include the cost of writing the retrieved documents to disk for later analysis by a post-processing program, but do not include the time required by the post-processing.

On average the queries required about 500 disk accesses and the transfer of 2.8 Mb of data. At (say) 100 accesses per second on the disk2, and a transfer rate of 1 Mb/sec, the input/output operations contribute about 5-8 seconds to the elapsed query time, accounting for the observed difference between CPU times and elapsed times.

For small queries response was even more impressive. For a query involving four terms (`Australia wheat tariff subsidy') and retrieval of the top ten documents, 2.6 seconds of CPU time were required; and, if output was directed into a file, the total elapsed time for the query was under 4 seconds.

The use of approximate lengths in memory (using b = 8) meant that on average only 219 exact weights were required to identify the top 200 documents for the 50 TREC queries. Depending on whether the exact lengths would have been stored in memory or on disk, with the use of approximate lengths we have either reduced the memory required by over 2 Mb, at the expense of 219 disk operations; or, alternatively, at a cost of 725 Kb of memory, we have avoided the need to sequentially read the entire exact lengths file, a saving of several seconds per query.

We also tested using the limited precision lengths to provide the final ranking, ignoring the exact lengths altogether. This means that the cosine measure is approximate, and some variation in recall effectiveness can be expected. In our previous experiments on smaller collections this degradation was minimal for even very low values of b.
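The disk-cost arithmetic above (about 500 accesses at 100 per second, plus 2.8 Mb at 1 Mb/sec) amounts to a simple back-of-envelope model. The function and default rates below are illustrative only, not a measured system model.

```python
def io_time(accesses, data_mb, seeks_per_sec=100.0, mb_per_sec=1.0):
    """Estimated I/O contribution to elapsed query time: seek cost
    plus transfer cost, using the rates assumed in the text."""
    return accesses / seeks_per_sec + data_mb / mb_per_sec
```

With the paper's averages this gives roughly 8 seconds, consistent with the observed gap between CPU and elapsed times.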
We used the 47 partial judgments supplied, and compared the number of relevant answers using the exact ranking and the approximate ranking technique. These results are shown in Table 3. Note that relevance judgments were not provided for all documents, hence the range for precision: most queries returned some documents that were un-judged. If we pessimistically assume that these documents were not relevant, the precision should be taken to be the lower value. Clearly, although the approximate ranking is finding different documents, the use of approximate ranking has, down to b = 4, little effect on precision.

                          Number of answers
    b             5           15           30          100          200
    exact 55.7-56.2    52.1-52.3    48.8-49.6    40.7-44.1    34.2-43.3
    12    55.3-55.7    51.9-52.2    48.8-49.6    40.7-44.1    34.1-43.3
    10    55.3-55.7    51.8-52.1    48.9-49.7    40.6-44.1    34.1-43.2
    8     54.9-55.7    51.2-51.5    48.8-49.5    40.6-44.1    34.2-43.3
    6     55.7-56.6    50.8-51.3    47.5-48.4    40.6-44.5    34.1-44.0
    4     57.9-59.1    50.2-50.9    46.5-48.6    38.0-45.6    31.2-48.0
    2     42.6-46.4    32.6-42.4    29.9-46.6    21.3-57.7    15.7-65.8

Table 3: Effect of varying b: percentage precision, 47 queries

2 One hundred accesses per second for the file containing the compressed text is optimistic for purely random accesses, since a transfer of several Kb from a random point in a large file in the Unix file system will typically require three or more seeks, each costing perhaps 10 to 20 milliseconds. However, the accesses to the document information file are in sequential order, and the accesses into the compressed text were batched into groups that were performed in sequential order.

Memory requirements

In total, the peak memory requirement during query processing is about 10.9 Mb (not all components are required to be co-resident, and, in particular, the accumulators and answer buffer are never simultaneously present). If necessary, this could be further reduced by writing the output text directly to disk rather than into an answer buffer.
However, we do not feel that this requirement is excessive given the large size of the document collection, and to date the use of approximate weights is the only area where we have concentrated on reducing memory usage.

3 Structured Queries

We focused in this experiment on structured queries. They would be transformed in two ways: into Boolean queries that would return a given number of documents, and into a vector of weights for the purposes of ranking. The system in which these experiments were carried out was a nested relational database, Atlas, that uses the multi-organisation signature file scheme [10]. The bulk of these experiments were carried out on the second part of the Tipster database. At the end of this section, we will report on some of the experiments that we have carried out on the whole database.

3.1 Boolean Query Generation

There are good reasons for preferring ranked retrieval over Boolean retrieval. Nevertheless, there are two key reasons that Boolean retrieval might be used. The first is that Boolean retrieval is computationally simpler, and the second is that the retrieval system may not support ranked retrieval. Since both factors were relevant to the set of experiments described here (we had been working with a Boolean retrieval system rather than the system described in Section 2), it was desirable to form a Boolean query and rank the answers it returned.

There were several requirements that a Boolean query generation method needed to satisfy. The first is that it should return a fixed number of documents. This allowed control over query time by fixing the number of documents to be subsequently ranked. Secondly, the query should be in disjunctive normal form, and have relatively few disjuncts. This is because query time is almost linear in the number of disjuncts in signature file retrieval systems. Thirdly, the query generation mechanism should be automatic.
These are the requirements that have previously been used as criteria for Boolean query generation algorithms [12]. However, the queries that we are considering in the TREC experiments have additional structure that may be used to advantage. Despite the limited amount of text to be processed, no natural language processing was performed; all processing was statistically and structurally based.

There were many possible approaches to Boolean query generation. A starting point might be to use one of the Boolean query generation algorithms previously developed [12] on the full text of the query. However, this would not adequately use the other available structures. We considered three approaches: base the query on the concepts field; base the query on the description field, augmented by the narrative; and base the query on the title, description, and narrative fields, using adjacent pairs in the disjuncts as well.

The first approach used a list of concepts. The words within the concept were ordered with the most frequent word in the full text of the query first, and the least frequent last. Starting at the first concept, a disjunct was formed by creating a conjunction of terms from the concept, starting at the first word, until the conjunct returned fewer documents than the desired limit. If the conjunct had fewer words than was required to meet this termination condition, other words were added to the concept by finding words in the remaining text that were adjacent to the first word in the concept. Finally, words that occurred rarely in the database were added as simple disjuncts if the query was still estimated to return fewer than the required number of documents. For example, to return 500 documents for query 82 on genetic engineering, we generated the Boolean query:

    (genetic AND engineering) OR (manipulation AND molecular) OR
    (biotechnology AND genetic AND manipulation) OR (animal AND drug AND plant)

The second approach started by forming a simple disjunction of rare words, up to 10% of the required number of documents.
Next, two key descriptors were identified, using the text of the description primarily, but also using the narrative. The key descriptors were created as a set of three words each, where each key descriptor is intended to capture a principal characteristic of the query. The frequency of the terms in the query and the adjacency of words were used to generate the key descriptors. Sometimes the algorithm would form two variations of the one concept, sometimes two distinct concepts. For instance, query 82 formed the key descriptors (genetic, engineering, developed) and (genetic, engineering, manipulation), to generate the query:

    (genetic AND engineering) OR (genetic AND manipulation)

The third approach was quite similar to the second, except that more emphasis was placed on the title in generating the concepts, and, by using pairs of words in the query, finer gradation of the Boolean queries was possible. The concepts for queries 82 and 87 were the same. The Boolean query for query 82 was the same as for the second algorithm; however, the query generated for 87 was:

    institutions-failed OR (actions-criminal AND officers)

When evaluating these approaches, the key consideration is whether or not relevant documents are identified. For a fair comparison, each algorithm should return the same average number of documents. In Table 4, each algorithm has returned an average of about 400 documents. The recall and precision results are all based on the same ranking formula. Note that since these algorithms identified documents that had not been assessed as relevant or irrelevant, we made the (pessimistic) assumption that all such documents were irrelevant.

    Algorithm          Concepts   Description   Description with Pairs
    Docs. Requested         200           700                      600
    Docs. Returned          453           354                      360
    Disjuncts               3.5           1.8                      4.4
    Recall                0.155         0.170                    0.146
    Precision             0.150         0.144                    0.121

Table 4: Comparison of Boolean query generation algorithms

To show how increasing the number of documents requested affects these algorithms, in Table 5 we show how many documents were retrieved and how many disjuncts were used when each of the algorithms was requested to return 1,000 documents.

    Algorithm          Concepts   Description   Description with Pairs
    Docs. Returned         1221          1063                        -
    Disjuncts              4.01          1.94                        -

Table 5: Further comparison of Boolean query algorithms

From these tables we are able to see that the least costly algorithm is based on finding two key descriptors. If pairs are added, there are more rare terms, which each contribute a single disjunct.

3.2 Ranking

Ranking documents using the vector space model has usually treated both documents and queries as flat structures: lists of words. However, it is very simple to combine different representations of a query in ranking. If each form of a query is represented as a unit vector over the same vector space, we may combine the representations by vector addition. We are thus able to combine structural information and statistical information in the same measure.

The first series of ranking experiments was used to determine the relative usefulness of some of the query fields. The results are given in Table 6. The third Boolean query generation algorithm was used to generate a set of approximately 1,000 documents for each query. These were ranked using each field successively, and then combined, ignoring structure. In all these experiments we give first the results in terms of recall and precision, and then in terms of precision at fixed numbers of viewed documents.
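The combination rule (each representation normalised to a unit vector, then added with chosen weights) can be sketched as follows. The helper names are our own; vectors are modelled as word-to-weight dictionaries.

```python
import math

def unit(v):
    """Normalise a word -> weight vector to unit length."""
    n = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / n for w, x in v.items()} if n else dict(v)

def combine(weighted_vectors):
    """Combine several query representations over the same vector space
    by weighted vector addition, e.g. [(1, Vt), (2, Vd), (3, Vn)]."""
    out = {}
    for weight, v in weighted_vectors:
        for w, x in unit(v).items():
            out[w] = out.get(w, 0.0) + weight * x
    return out
```

Because each field is normalised before addition, a verbose field such as the narrative cannot swamp a short one such as the title unless it is explicitly given a larger weight.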
Let Vt stand for the title vector, Vd stand for the description vector, Vn stand for the narrative vector, Vc stand for the concepts vector, Vf stand for the definition vector, Va stand for the vector for all of the text, and Vp stand for the vector of all adjacent pairs in the query.

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    Vd      0.197  0.114  0.032  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    Vn      0.192  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    Va      0.191  0.114  0.034  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.038

Table 6: Ranking using individual fields

Table 7 shows what happens if each of the five fields is added with the same weight, and then what happens if adjacent pairs are added as well. By way of comparison, we show the effect of adding pairs to the full text vector, and of adding pairs to the narrative vector.

    V1 = Vt + Vd + Vn + Vf + Vc
    V2 = Vt + Vd + Vn + Vf + Vc + Vp
    V3 = Va
    V4 = Va + Vp
    V5 = Vn
    V6 = Vn + Vp

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    V1      0.209  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.040
    V2      0.209  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.040
    V3      0.191  0.114  0.034  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    V4      0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039
    V5      0.192  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.038
    V6      0.198  0.111  0.045  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 7: All field ranking

Using this information, we tried to combine the various fields in "creative" ways. In Table 8 we show the results of creating combined vectors described by

    V7 = Vt + 2Vd + 3Vn + 0.5Vf + 3Vc + 2Vp
    V8 = Vt + Vd + Vn + 0.5Vf + 3Vc + 3Vp

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    V7      0.208  0.112  0.034  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039
    V8      0.205  0.112  0.033  0.014  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 8: Combining the fields

The first thing to note is how similar the results are.
Since a limited number of documents are being examined using an imperfect Boolean algorithm, many relevant documents are being missed altogether, so the ranking formulas have no possibility of giving them high scores. An alternative technique for evaluating ranking formulas is to determine precision after fixed numbers of documents have been examined. This has the advantage that large numbers of relevant documents not identified by the Boolean algorithm do not flatten out recall/precision results. Table 9 gives results at fixed document cutoffs for all previous experiments.

    Documents      5     15     30    100    200    Av.
    Vd         0.374  0.321  0.298  0.223  0.158  0.275
    Vn         0.319  0.295  0.279  0.225  0.159  0.255
    Va         0.349  0.316  0.305  0.224  0.169  0.273
    V1         0.387  0.352  0.336  0.237  0.166  0.296
    V2         0.391  0.352  0.336  0.237  0.166  0.296
    V3         0.349  0.316  0.305  0.224  0.169  0.273
    V4         0.349  0.305  0.304  0.230  0.171  0.272
    V5         0.319  0.295  0.279  0.225  0.159  0.255
    V6         0.336  0.304  0.284  0.229  0.164  0.263
    V7         0.404  0.367  0.334  0.236  0.166  0.302
    V8         0.370  0.350  0.327  0.238  0.165  0.290

Table 9: Comparison of ranking formulas for fixed numbers of documents returned

There are two observations that one might make after examining both forms of evaluation. The first is that there appears to be a small gain obtained by treating the text in the query in a structured way using the methodology proposed here. The gain is less than 10%. The second result is both surprising and has significant ramifications. The use of adjacent pairs in previous experiments on small collections with relatively short queries provided substantial improvement in recall/precision [3, 12]. We see no such improvement in these experiments.

Finally, we may compare the three Boolean algorithms in their ability to identify appropriate subsets for ranking. In this comparison, we use V4 for ranking each of the subsets.
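The evaluation method just described (precision after a fixed number of documents has been examined, with un-judged documents pessimistically counted as irrelevant) can be sketched as follows; the function name and cutoffs are our own.

```python
def precision_at(ranked_ids, relevant, cutoffs=(5, 15, 30, 100, 200)):
    """Precision after examining the first k documents of a ranking,
    for each cutoff k; documents not in `relevant` count as irrelevant."""
    return {k: sum(1 for d in ranked_ids[:k] if d in relevant) / k
            for k in cutoffs}
```

Unlike recall/precision averages, this measure is unaffected by relevant documents that the Boolean filter never returned, which is why it is the fairer comparison here.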
Let Ac represent the algorithm based on concepts, Ad represent the algorithm based on the description, and Ap represent the algorithm that uses adjacent pairs. Table 10 shows recall/precision figures and Table 11 shows precision for fixed numbers of documents returned.

    Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
    Ac      0.185  0.089  0.054  0.031  0.028  0.012  0.000  0.000  0.000  0.000  0.040
    Ad      0.220  0.133  0.048  0.017  0.018  0.017  0.000  0.000  0.000  0.000  0.045
    Ap      0.198  0.115  0.037  0.018  0.015  0.012  0.000  0.000  0.000  0.000  0.039

Table 10: Comparison of Boolean algorithms

    Documents      5     15     30    100    200    Av.
    Ac         0.328  0.319  0.306      -  0.169  0.269
    Ad         0.349  0.332  0.311  0.222  0.149      -
    Ap         0.349  0.305  0.304  0.230  0.171  0.272

Table 11: Comparison of Boolean algorithms for fixed numbers of documents returned

Using both methods of evaluation, the algorithm that uses key descriptors without adjacent pairs outperformed both the algorithm augmented with pairs and the algorithm using concepts. This is despite the fact that Ac returned an average of 1190 documents per query, Ap returned an average of 667 documents, and Ad returned only an average of 530 documents.

3.3 The Atlas System

The Boolean query generation experiments were performed using the Atlas database system [5]. This system is a nested relational database system designed for text applications. The system was not adapted or tuned in any way to support the TREC experiments. There are a number of signature file organisations that are supported by Atlas: bit-slice, two-level schemes, and multi-organisation schemes. We implemented an algorithm that analyses the data and selects the most appropriate scheme and optimal parameters for this scheme [11]. For the TREC collection, the multi-organisation scheme was selected. It took 23 hours to load the database and build the index, which required 313 Mb of disk space. This indexed all stopped and stemmed terms, including adjacent pairs.
The text stored in the database was compressed using the mechanism described in Section 2.3.

3.4 Combined Database Results

Due to our inability to hold the whole of TREC in both the form used for the compression experiments and the form used for the Atlas system, we ran a further set of experiments using the compressed database system, but using the Boolean algorithm based on concepts, to test some of the ranking formulas. We repeated the experiments using Vd, the description vector; Vn, the narrative vector; Va, the vector of all text with structure ignored; V1, where all text was used but structure was taken into consideration; and V2, a modification of V1 where pairs were added as an extra vector. The results are given using recall and precision in Table 12, and then at various cutoffs based on the number of documents retrieved in Table 13.

Recall    10%    20%    30%    40%    50%    60%    70%    80%    90%   100%    Av.
Vd      0.324  0.196  0.136  0.090  0.083  0.034  0.022  0.005  0.000  0.000  0.089
Vn      0.325  0.218  0.128  0.084  0.080  0.034  0.022  0.005  0.000  0.000  0.090
Va      0.298  0.170  0.116  0.074  0.067  0.040  0.026  0.005  0.000  0.000  0.080
V1      0.372  0.235  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099
V2      0.372  0.236  0.129  0.092  0.083  0.048  0.022  0.005  0.000  0.000  0.099

Table 12: Ranking all data

Documents      5     15     30    100    200    Av.
Vd         0.417  0.431  0.404  0.349  0.294  0.379
Vn         0.370  0.408  0.407  0.353  0.301  0.368
Va         0.383  0.391  0.379  0.329  0.279  0.352
V1         0.447  0.472  0.438  0.377  0.312  0.409
V2         0.447  0.475  0.438  0.375  0.312  0.409

Table 13: Ranking all data

The results we obtained for the smaller collection are again observed with the full 2 Gb of text. However, the advantage of using the structure of the queries is slightly more pronounced.

4 Summary
In the first line of investigation, we built a compressed retrieval system for around 2 Gb of text that required under 37% of the size of the input text, and was created in just over 15 CPU-hours on a typical well-configured workstation. Retrieval performance for simple techniques such as the cosine measure is fast, and queries can be processed effectively on low-end workstations in seconds. The techniques developed during this first thread of investigation have thus altered the status of 2 Gb information retrieval systems from being `unpleasantly expensive' to being `eminently practical'.

In our second thread of research we have seen that the use of the structure of queries has enabled Boolean queries to be generated that have few disjuncts and perform better than more complicated ones. Given the large number of documents that are relevant to many of the queries that have been examined, such algorithms must be subject to reduced performance when compared to techniques that rank the entire collection. In the light of the parallel experiments carried out using the full collection, it is questionable whether such techniques will survive. We have seen a minor improvement in ranking as a result of taking the structure of queries into account. A more surprising result, however, is that pairs do not provide the performance gains that we have seen with much smaller collections. As a result, the vocabulary for a collection would appear to be more manageable than if pairs needed to be explicitly represented.

Acknowledgements

We are grateful to Lachlan Andrew, Daniel Lam, and Neil Sharman for assisting with the implementation. This work was supported by the Australian Research Council.

References

[1] S. Al-Hawamdeh and P. Willett. Comparison of index term weighting schemes for the ranking of paragraphs in full-text documents. International Journal of Information and Library Research, pages 116-130, 1990.

[2] T.C. Bell, A. Moffat, C.G. Nevill, I.H. Witten, and J.
Zobel. Data compression in full-text retrieval systems. Journal of the American Society for Information Science. To appear.

[3] J. Fagan. Automatic phrase indexing for document retrieval: An examination of syntactic and non-syntactic methods. In Proc. 10th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 91-108, 1987.

[4] D. Harman and G. Candela. Retrieving records from a gigabyte of text on a minicomputer using statistical ranking. Journal of the American Society for Information Science, 41(8):581-589, 1990.

[5] A.J. Kent, R. Sacks-Davis, and K. Ramamohanarao. A signature file scheme based on multiple organisations for indexing very large text databases. Journal of the American Society for Information Science, 41(7):508-534, 1990.

[6] J.B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computation, 11(1-2):22-31, 1968.

[7] A. Moffat. Economical inversion of large text files. Computing Systems, 5(2):125-139, 1992.

[8] A. Moffat and J. Zobel. Coding for compression in full-text retrieval systems. In Proc. 2nd IEEE Data Compression Conference, pages 72-81, Snowbird, Utah, March 1992. IEEE Computer Society Press.

[9] A. Moffat and J. Zobel. Parameterised compression for sparse bitmaps. In Proc. 15th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 274-285, Copenhagen, Denmark, June 1992. ACM Press.

[10] R. Sacks-Davis, A. Kent, R. Kotagiri, J. Thom, and J. Zobel. A nested relational database for text applications. Technical Report 92-52, Collaborative Information Technology Research Institute, Melbourne, Australia, 1992.

[11] R. Sacks-Davis, A.J. Kent, and K. Ramamohanarao. Multi-key access methods based on superimposed coding techniques. ACM Trans. on Database Systems, 12(4):655-696, 1987.

[12] R. Sacks-Davis, P. Wallis, and R. Wilkinson. Using syntactic analysis in a document retrieval system that uses signature files. In Proc.
13th ACM-SIGIR International Conference on Research and Development in Information Retrieval, pages 179-192, September 1990.

[13] G. Salton. Automatic Text Processing. Addison Wesley, Massachusetts, 1989.

[14] G. Salton and M.J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.

[15] J. Zobel and A. Moffat. Adding compression to a full-text retrieval system. In Proc. 15th Australian Computer Science Conference, pages 1077-1089, Hobart, Australia, January 1992.

[16] J. Zobel, A. Moffat, and R. Sacks-Davis. An efficient indexing technique for full-text database systems. In Proc. 18th Conference on Very Large Databases, pages 352-362, Vancouver, Canada, August 1992.

[17] J. Zobel, A. Moffat, and R. Sacks-Davis. Memory-efficient ranking of document collections. Technical Report TR-92-53, Collaborative Information Technology Research Institute, Melbourne, Australia, August 1992.