Retrieval of Partial Documents

Alistair Moffat*, Ron Sacks-Davis†, Ross Wilkinson‡, Justin Zobel§

*Dept. of Computer Science, The University of Melbourne, Parkville, Victoria, Australia 3052; alistair@cs.mu.oz.au
†Collaborative Information Technology Research Institute, 723 Swanston St., Carlton, Victoria, Australia 3052; rsd@kbs.citri.edu.au
‡Dept. of Computer Science, RMIT, GPO Box 2476V, Melbourne, Victoria, Australia 3001; ross@cs.rmit.oz.au
§Dept. of Computer Science, RMIT, GPO Box 2476V, Melbourne, Victoria, Australia 3001; jz@cs.rmit.oz.au

Information systems usually retrieve whole documents as answers to queries. However, it may in some circumstances be more appropriate to retrieve parts of documents. These parts could be formed by arbitrary division of running text into pieces of similar length, or by considering the document's hierarchical structure. Here we consider how to break documents into parts, how to implement retrieval of parts, and the impact of division of documents on retrieval effectiveness.

1 Introduction

Provision of answers to informally phrased questions is a central part of information retrieval. These answers traditionally take the form of documents retrieved from a text database, but documents will often be unsatisfactory as answers. They may be large and unwieldy; the answer they represent may be diffuse, and therefore hard for the user to extract; and word-based retrieval systems may be misled by the breadth of vocabulary of a long document into believing it to be relevant.

Indexing and returning parts of documents addresses these problems. We have approached the problem of partial documents in two ways. The first approach is to regard documents as an unstructured series of "pages" of text of similar length, each of which can be returned as an answer to a query. We would expect, under this approach, that any bias in the retrieval mechanism towards documents of a particular length should be eliminated. By regarding an answer to be the document from which an answer page is drawn, paging can be used even in contexts where documents are required as answers, as is the case for the TREC experiments.

Breaking documents into pages, however, has implications for implementation: the growth in the number of candidate answers is such that current approaches for evaluating queries have unacceptable memory requirements and response times. We have developed new algorithms for implementing information retrieval methods on large collections, concentrating on the cosine measure with IDF term weights as a typical example. These include techniques for efficiently constructing and compressing large inverted files, and for restricting the amount of memory space and processing time required during query evaluation. The result is that we are able to identify answers in a fraction of the space and time required by previous methods. Using these techniques on a version of the TREC data in which the documents are broken into over 1.7 million pages, answers can be found more quickly than they could previously be found on the unpaged data, even though the latter has a smaller index and fewer records. Section 2 gives an overview of these techniques.

Our second approach to the problem of partial documents was to regard documents as hierarchical structures. Some of the documents in TREC are very large. It is not clear that it is desirable to return such documents as a whole, nor is it clear that these documents should be indexed as a whole.
Most of the long documents in TREC are from the Federal Register collection, and all have some degree of structure associated with them. We conducted a set of experiments that attempted to determine whether these documents should be indexed as single objects, and whether the documents' structure could be used in conjunction with the contents of their elements. We also experimented with retrieval of partial documents and investigated whether context, that is, the rank of the whole document, helped improve ranking of sections. The experiments with hierarchical structures are necessarily based on the small set of longer documents in the TREC collection. We would expect that, in a large database of such documents, the implementation techniques we developed for unstructured text could be used for structured documents almost without change.

2 Retrieval of paged text

The first issue that must be considered when accessing a set of documents as a collection of pages is the pagination strategy; one possible method for pagination is discussed in Section 2.1. Then, given paged text, answers must be found in response to queries. In an inverted file text database system of the kind we have been developing [6, 9], the implications of paging are a large increase both in the number of accumulators used to hold the intermediate cosine values and in the length of the inverted file entries. The former means that evaluating a ranking will require more memory, while the latter implies a longer time to resolve queries. Methods for avoiding these increases are outlined in Sections 2.2 and 2.3; a full description is given elsewhere [10]. Finally, there is the impact on effectiveness of the decision to paginate the documents. Experimental results comparing document retrieval with page retrieval are described in Section 2.4.

2.1 Breaking text into pages

One way to store text in a database is as a set of records, each containing a single document. Such an organisation implies that entire documents must be retrieved in response to queries. Moreover, a query can match long documents in which its words are widely separated and could well be unrelated, so this storage strategy can result in irrelevant material being retrieved.

An alternative way of presenting text is as a series of pages. There are several advantages to using pages:

* they are of a more manageable size than whole documents, and the size variation between pages can be constrained to be far less than the size variation between documents;

* if a user looks for answers matching a query, it is likely that in relevant material the query's words will occur close together in the retrieved text;

* retrieving a page of text to display may be considerably cheaper than retrieving an entire document;

* in many applications it is natural to regard documents as consisting of parts rather than as a whole, for example the nodes in a hypertext system or the sections of a book; and

* only parts of documents can, in general, fit on a screen, and people can only comprehend part of a document at a time.

1. Set prev undefined and curr ← w_1.
2. For each j from 2 to n,
   (a) If curr ≥ B, emit prev if it is defined, set prev ← curr, and set curr ← w_j.
   (b) Otherwise, if w_j > prev then set prev ← prev + curr and set curr ← w_j.
   (c) Otherwise, set curr ← curr + w_j.
3. If curr < B then set prev ← prev + curr and emit prev. Otherwise, emit prev followed by curr.

Figure 1: Paging algorithm
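As a concrete rendering of Figure 1, the C sketch below paginates a single document given its paragraph lengths in bytes. The emit() routine, the sentinel value -1 for an undefined prev, and the guard that skips rule (b) while prev is still undefined are assumptions made for this sketch, not part of the published algorithm.

```c
#include <stdio.h>

/* Sketch of the paging rule of Figure 1.  w[0..n-1] are paragraph lengths
 * in bytes and B is the target page length; emit() stands in for writing
 * out one page and simply reports its size here. */

static void emit(long bytes)
{
    printf("page of %ld bytes\n", bytes);
}

void paginate(const long *w, int n, long B)
{
    if (n <= 0)
        return;

    long prev = -1;            /* -1 encodes "undefined"             */
    long curr = w[0];          /* length of the page being assembled */

    for (int j = 1; j < n; j++) {
        if (curr >= B) {                        /* (a) current page is full */
            if (prev >= 0)
                emit(prev);
            prev = curr;
            curr = w[j];
        } else if (prev >= 0 && w[j] > prev) {  /* (b) start a fresh page   */
            prev += curr;
            curr = w[j];
        } else {                                /* (c) keep accumulating    */
            curr += w[j];
        }
    }
    if (curr < B && prev >= 0) {   /* step 3: fold a short tail into prev */
        emit(prev + curr);
    } else {
        if (prev >= 0)
            emit(prev);
        emit(curr);                /* a short document becomes a singleton page */
    }
}
```

With B = 1,000, as in the experiments below, every page produced from a multi-paragraph document is at least B bytes, a short trailing fragment is folded into the preceding page, and a document shorter than B bytes becomes a singleton page.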
For these reasons, and because of the challenge posed by the sheer size of a paged version of the TREC collection, experiments were undertaken with paged text.

The paging strategy adopted first breaks documents into minimal units likely to be useful for ranking. In most collections this unit would probably be the paragraph. The algorithm shown in Figure 1 is then used to gather paragraphs into pages, where w_i is the length of the i-th paragraph of the original document and B is the target length of each constructed page. Documents of fewer than B bytes must, of course, be allocated singleton pages, but all other pages are B bytes or more. Moreover, the aim is that only a small number of pages are significantly longer than B bytes. The algorithm requires time linear in the length of the text and a constant amount of space, and so is a small additional step during database construction. For the experiments described below, length was measured in bytes (the alternatives being words, or some more abstract length value based upon term frequencies), and B = 1,000 was used.

The pagination of the 2,055 Mb TREC collection resulted in the 742,358 documents being converted into 1,743,848 pages of average size 1,235 bytes each.

2.2 Management of accumulators

An effective way to rank documents against a query is with the cosine measure, defined by

    C_d = (Σ_t w_{q,t} · w_{d,t}) / (W_q · W_d),

where q is the query, d is the document, and w_{x,t} is the weight of word t in document or query x. We assume that W_d = sqrt(Σ_t w_{d,t}^2) is the length of document d, and that the top r answers are to be retrieved and returned to the user.

In our experiments term weights were calculated using the IDF rule w_{d,t} = f_{d,t} · log(N/f_t), where f_{d,t} is the number of appearances of term t in document d, N is the number of documents in the collection, and f_t is the number of documents that contain term t. Harman [3] gives a summary of ranking techniques and a discussion of the cosine measure and the IDF weighting rule.

To support the cosine measure in an inverted file text database system, each inverted file entry contains a sequence of (d, f_{d,t}) pairs for some term t. The value f_t, the number of documents containing t, is stored in the lexicon, and from these two the cosine contribution f_{d,t} · (log(N/f_t))^2 of term t in document d can be calculated. A structure of accumulators is used to collect these contributions. As the inverted file entry for each query term is processed, the accumulator value for each document number in the inverted file entry is updated by adding in each partial cosine value as it is calculated, so that by the time all inverted file entries have been processed the accumulator for document d contains the value Σ_t f_{d,t} · (log(N/f_t))^2 = Σ_t w_{q,t} · w_{d,t}, summed over the query terms t.

The number of accumulators required to process a query can be large. Queries were created from TREC topics 51-100 by simply extracting and stemming all of the alphanumeric strings, and stopping 601 closed-class words. The generated queries had an average of over 40 terms, and each resulted in about 75% of the pages of TREC having a non-zero accumulator value. With this ratio the most memory-efficient accumulator structure is an array. But at 32 or 64 bits per accumulator, this structure is formidably large, and dominates the retrieval-time memory requirements. Even initialising it is time-consuming.
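To put a number on this (our arithmetic, not the paper's, using the collection statistics given above and taking 1 Mb = 2^20 bytes), a dense array needs one accumulator per page, so

    1,743,848 pages × 4 bytes ≈ 6.6 Mb at 32 bits per accumulator, and
    1,743,848 pages × 8 bytes ≈ 13.3 Mb at 64 bits,

per query, before a single inverted file entry has been read. The 6.6 Mb figure reappears in the notes accompanying Table 1, where the all-accumulators page-retrieval run is reported as using a 6.6 Mb array.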
A simple strategy for restricting the number of accumulators is to order query terms by decreasing weight, and only process terms until some designated termination condition is met. One such condition is to impose an a priori bound L on the number of non-zero accumulators, and to stop processing terms when this bound is reached. Such a ranking process, which we call quit, is illustrated in Figure 2.

1. Order the words in the query from highest weight to lowest.
2. Set A ← ∅; A is the current set of accumulators.
3. For each word t in the query,
   (a) Retrieve I_t, the inverted file entry for t.
   (b) For each (d, f_{d,t}) pair in I_t,
       i. If the accumulator A_d ∈ A, calculate A_d ← A_d + w_{q,t} · w_{d,t}.
       ii. Otherwise, set A ← A + {A_d} and calculate A_d ← w_{q,t} · w_{d,t}.
   (c) If |A| ≥ L, go to step 4.
4. For each document d such that A_d ∈ A, calculate C_d ← A_d / W_d.
5. Identify the r highest values of C_d using a heap.

Figure 2: Quit algorithm for computing cosine using approximately L accumulators

Other possible termination conditions would be to place a limit on the number of terms considered or on the total number of pointers decoded; or to place an upper bound on the term frequency f_t, and only process terms that appear in fewer than x% of the documents, for some predetermined value x. Buckley and Lewit [1], Lucarella [8], and Wong and Lee [13] have also considered various stopping rules, and derive conditions under which inverted file entries can be discarded without affecting the r top documents (although their order within the ranking may be different). Here we are prepared to allow approximate as well as exact rules, acknowledging that the cosine measure is itself a heuristic.

Quitting has the advantage of only processing the inverted file entries of a subset of the query terms, and hence of faster ranking, but at the possible expense of poor retrieval performance, depending upon how discriminating the low weighted terms are.

An alternative is to continue the processing of inverted file entries after the bound on the number of accumulators is reached, but allow no new documents into the accumulator set. This continue algorithm is illustrated in Figure 3. The algorithm has two distinct phases. In the first phase, accumulators are added freely, as in the quit algorithm. In the second phase, existing accumulator values are updated but no new accumulators are added.

1. Order the words in the query from highest weight to lowest.
2. Set A ← ∅.
3. For each word t in the query,
   (a) Retrieve I_t.
   (b) For each (d, f_{d,t}) pair in I_t,
       i. If A_d ∈ A, calculate A_d ← A_d + w_{q,t} · w_{d,t}.
       ii. Otherwise, set A ← A + {A_d} and calculate A_d ← w_{q,t} · w_{d,t}.
   (c) If |A| ≥ L, go to step 4.
4. For each remaining word t in the query,
   (a) Retrieve I_t.
   (b) For each d such that A_d ∈ A, if (d, f_{d,t}) ∈ I_t, calculate A_d ← A_d + w_{q,t} · w_{d,t}.
5. For each document d such that A_d ∈ A, calculate C_d ← A_d / W_d.
6. Identify the r highest values of C_d.

Figure 3: Continue algorithm for using approximately L accumulators
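As a concrete rendering of Figure 3, the C sketch below computes the continue ranking. For brevity it keeps the accumulators in a dense array indexed by document number rather than in the hash table used in the experiments (a sketch of that structure follows at the end of Section 2.2), and the Posting and Term structures, like all identifiers, are illustrative assumptions rather than the system's actual code.

```c
#include <stdlib.h>

/* Sketch of the continue algorithm of Figure 3.  The inverted file entry
 * for each query term is assumed to be in memory already. */

typedef struct { int d; int fdt; } Posting;     /* one (d, f_dt) pair */

typedef struct {
    double   wqt;     /* query-term weight w_qt = f_qt * log(N / f_t) */
    double   idf;     /* log(N / f_t), shared by all w_dt for this t  */
    int      np;      /* number of postings in the entry              */
    Posting *post;    /* the inverted file entry I_t                  */
} Term;

static int by_weight_desc(const void *a, const void *b)
{
    double wa = ((const Term *)a)->wqt, wb = ((const Term *)b)->wqt;
    return (wa < wb) - (wa > wb);
}

/* Compute C_d for every document; terms[] describes the query, N is the
 * number of documents (or pages), L is the accumulator target, and W[]
 * holds the document lengths W_d.  Selection of the top r values of C_d
 * with a heap (step 6) is omitted. */
void rank_continue(Term *terms, int nterms, int N, int L,
                   const double *W, double *C)
{
    double *A = malloc((size_t)N * sizeof *A);
    int used = 0, i, j;

    for (i = 0; i < N; i++)
        A[i] = -1.0;                            /* -1 means "A_d not in A" */

    qsort(terms, nterms, sizeof *terms, by_weight_desc);         /* step 1 */

    /* Phase one: add accumulators freely until about L are in use.  Whole
     * inverted file entries are processed, so |A| can overshoot L a little. */
    for (i = 0; i < nterms && used < L; i++) {
        for (j = 0; j < terms[i].np; j++) {
            int    d   = terms[i].post[j].d;
            double wdt = terms[i].post[j].fdt * terms[i].idf;
            if (A[d] >= 0.0) {
                A[d] += terms[i].wqt * wdt;
            } else {
                A[d] = terms[i].wqt * wdt;
                used++;
            }
        }
    }

    /* Phase two: update existing accumulators only; no new documents. */
    for (; i < nterms; i++) {
        for (j = 0; j < terms[i].np; j++) {
            int d = terms[i].post[j].d;
            if (A[d] >= 0.0)
                A[d] += terms[i].wqt * terms[i].post[j].fdt * terms[i].idf;
        }
    }

    for (i = 0; i < N; i++)                     /* step 5: C_d = A_d / W_d */
        C[i] = (A[i] >= 0.0) ? A[i] / W[i] : 0.0;

    free(A);
}
```

Dropping phase two, and stopping as soon as used reaches L, turns the same routine into the quit algorithm of Figure 2.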
Both quit and continue generate the same set of approximately L candidate answers, but in a different permutation, so when the top r documents are extracted from this set and returned, different retrieval effectiveness can be expected, particularly if r ≪ L.

A similar strategy to that of continue is described by Harman and Candela [3, 4], although their motivation is somewhat different: they desire a small number of accumulators to reduce the sorting time required for a total ranking of the collection, whereas we assume that only a small fraction of the collection is to be presented. In this latter case, to find the top r documents a heap is the appropriate data structure, and O(N + r log N) time is required, a small fraction of query processing time.

Figure 4 shows retrieval effectiveness, measured as an 11pt average precision, as a function of k = |A|, the number of accumulators actually used. The small numbers beside the continue curve show the average number of terms processed in phase one of each query. For example, only 8.2 terms are needed to generate 27,000 accumulators. The difference between quit and continue is marked, and, perhaps surprisingly, even the mid to low weight terms appear to contribute to the effectiveness of the cosine rule. Ignoring them leads to significantly poorer retrieval.

Figure 4: Effectiveness of quit and continue (11pt average precision against the number of accumulators)

In these experiments direct assessment of the relevance of pages was not possible because the relevance judgements were for whole documents. To measure effectiveness, it was assumed that if a document was relevant then each page from it was relevant, but only the most highly ranked page from each document was returned and counted. We believe this method to be fair, since one way in which pages might be used would be to rank on pages but return whole documents, with perhaps some highlighting of the top-ranked pages. Other strategies for using parts of documents to select whole documents are discussed in Section 3.

Also surprising is that the continue strategy, with restricted numbers of accumulators, is capable of better retrieval performance than the original cosine method, in which all pages are permitted accumulators. It would appear that the mid to low weight terms, while contributing to retrieval effectiveness, should not be permitted to collaborate and select documents that contain none of the more highly weighted terms.

Whilst the results for retrieval effectiveness are encouraging, the results for time are not. Figure 5 shows the cost of these two strategies in terms of cpu time, with the set A of accumulators stored as a hash table. The times shown include all processing required to identify the top r = 200 ranked documents, but do not include the cost of actually retrieving those documents, which, in a system such as ours, in which the text is also compressed, involves further decoding effort. All timings are for a lightly loaded Sun SPARCstation Model 512 using a single processor; programs were written in C and compiled using gcc. While for values of L less than about 100,000 the continue method takes no longer than the simple cosine measure when implemented using an array of accumulators, it is clearly unsatisfactory compared with the quit technique. In the next section we discuss methods by which this situation can be improved.

Figure 5: CPU time of quit and continue (cpu time in seconds against the number of accumulators)
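The set A of accumulators was stored as a hash table in these experiments, at an average cost (reported with Table 1 below) of about 14 bytes per accumulator. The following C sketch shows one plausible chained layout; the table size, the choice of chaining rather than open addressing, and all identifiers are assumptions made for illustration, not the structure actually used.

```c
#include <stdlib.h>

/* Sketch of a chained hash table of accumulators for the set A. */
typedef struct Acc {
    int         doc;    /* document (or page) number */
    float       val;    /* partial cosine, A_d       */
    struct Acc *next;   /* collision chain           */
} Acc;

typedef struct {
    Acc **slot;
    int   nslots;
    int   used;         /* |A|, compared against L   */
} AccTable;

AccTable *acc_new(int nslots)
{
    AccTable *t = malloc(sizeof *t);
    t->slot = calloc((size_t)nslots, sizeof *t->slot);
    t->nslots = nslots;
    t->used = 0;
    return t;
}

/* Return the accumulator for document d, creating it (initialised to zero)
 * only when create is non-zero; returns NULL when d is not in A and
 * creation is not permitted. */
Acc *acc_lookup(AccTable *t, int d, int create)
{
    int  h = (int)((unsigned)d % (unsigned)t->nslots);
    Acc *a;

    for (a = t->slot[h]; a != NULL; a = a->next)
        if (a->doc == d)
            return a;
    if (!create)
        return NULL;
    a = malloc(sizeof *a);
    a->doc = d;
    a->val = 0.0f;
    a->next = t->slot[h];
    t->slot[h] = a;
    t->used++;
    return a;
}
```

Phase two of the continue algorithm would call acc_lookup(table, d, 0) and simply skip any document for which NULL is returned, so that no new accumulators are created.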
2.3 Processing long inverted file entries

Compression can reduce index size by a factor of six; for example, for the paged form of TREC the size reduces from 1,120 Mb to 184 Mb, an irresistible saving. Compression, however, prohibits random access into inverted file entries, so that the whole of each inverted file entry must be decoded, even though not every (d, f_{d,t}) pair is required. This is the reason that the continue strategy is slow.

The need to decompress an inverted file entry in full can be avoided by including a series of synchronisation points at which decoding can commence [10]. These can be arranged as a series of pointers, or skips, as illustrated in Figure 6.

Figure 6: Adding skips (an inverted file entry divided into blocks by skip pointers)

The skips divide the inverted file entry into a series of blocks, and to access a number in the entry, only the skips and the block containing the number need to be decoded. Appropriately coded, such skips increase the size of the compressed inverted file by only a few percent, but can drastically reduce the amount of decoding required. Decode time for a skipped inverted file entry is given by T = t_d · (2S + Lp/(2S)), where t_d is the time needed to decode one (d, f_{d,t}) pair, p is the number of such pairs in the inverted file entry, L is the number of accumulators, and S is the number of skips. This time is minimised at S = sqrt(Lp)/2. For 10,000 accumulators and an inverted file entry of length 100,000 (a common figure for queries to the paged TREC), total time including reading from disk on a typical system drops from 0.30 seconds to 0.22 seconds [10]. Moreover, under the same assumptions an uncompressed inverted file entry of this length would take 0.30 seconds to read, and so the skipped compressed inverted file provides faster ranking than even an uncompressed inverted file.

To test skipping, inverted files were constructed for several different fixed values of L and then, for each inverted file, the cpu time for inverted file processing was measured for a range of actual L values. To ensure that inverted files did not become too large, a minimum block size was imposed, requiring that every skip cover at least four (d, f_{d,t}) pairs. The results of these experiments are shown in Figure 7. As can be seen, substantially faster query processing is the result when the number of accumulators is less than about 100,000. As predicted by the analysis, the index constructed assuming that L = 1,000 gives the best performance when the number of accumulators (the variable L in Figure 3) is small.

Figure 7: Time required by skipping (cpu time against the number of accumulators, for indexes constructed with L = 1,000, L = 10,000, and L = 100,000)
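The stated optimum follows directly from this cost model; as a check (our derivation, using only the notation above):

    dT/dS = t_d · (2 - Lp/(2S^2)) = 0   implies   S = sqrt(Lp)/2,

at which point the two components of T balance and T = 2 · t_d · sqrt(Lp). For the example quoted above, L = 10,000 and p = 100,000 give S ≈ 15,800 skips and about 63,000 decode operations, rather than the 100,000 needed to decode the entry in full.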
2.4 Retrieval experiments

The implementation techniques described above, skipping and limitation of the number of accumulators, can be applied to document retrieval as well as to page retrieval. Table 1 summarises the various resources required and performance achieved by both document and page retrieval, for various values of L, the nominal number of accumulators.

(a) Resources required

                 Inverted file (Mb)    Retrieval time (cpu sec)    Non-zero accumulators
                 Doc.       Page       Doc.       Page             Doc.        Page
  L = 1,000      145.8      205.4      2.36       2.95             2,330       2,701
  L = 10,000     156.7      220.3      4.46       5.57             12,855      13,238
  L = 100,000    161.6      230.2      9.80       12.00            107,737     109,370
  All (array)    128.6      184.4      7.66       12.48            582,783     1,307,115

(b) Retrieval effectiveness

                 Terms processed    Precision at 200              Pessimal 11pt average
                 Doc.      Page     Doc.           Page           Doc.        Page
  L = 1,000      3.08      2.84     0.271-0.612    0.260-0.639    0.164       0.166
  L = 10,000     6.80      6.06     0.346-0.530    0.319-0.554    0.185       0.173
  L = 100,000    16.84     14.26    0.333-0.446    0.326-0.504    0.175       0.175
  All (array)    42.44     42.44    0.331-0.444    0.321-0.502    0.174       0.173

Table 1: Document vs. paragraph retrieval: (a) resources required, and (b) retrieval effectiveness

The following points should be noted in connection with the data in Table 1.

File sizes: Stemming was applied during inverted file construction, but no words were stopped. All alphanumeric strings were indexed, totalling 333,000,000 term occurrences of 538,244 distinct terms. The document index contained 136,000,000 (d, f_{d,t}) pointers, the paged index 196,000,000 such pairs. Each is compressed to occupy less than one byte. The variation in inverted file size is due to the insertion of skips, or, in the case of the "All" row, their absence. The text itself was also stored compressed using a word-based model [9], allowing the 2,055 Mb to be reduced to 605 Mb. A further 27 Mb of smaller auxiliary files were also required. In total, the complete retrieval system for the paged TREC occupied about 817 Mb, or 40% of the initial unindexed text.

Retrieval time: The SPARCstation 512 used for these experiments is approximately 2.2 times faster than the SPARCstation 2 used for the first round of experimentation [6, 9]. The times listed cover all activity from issue of query until a ranked list of 200 document numbers is calculated. Fetching of the 200 answers, which occupy 226.9 Kb compressed, adds a uniform 1.9 sec per query including the cost of decompression. Decompressed, there is an average of 829.9 Kb of output text per query. Elapsed times during the ranking process are generally about 1.6 sec greater than cpu time, in the course of which an average of 42 disk accesses are made to the lexicon; 42 disk accesses are made to the inverted file itself (fetching a total of 1.65 Mb of data, containing about two million compressed (d, f_{d,t}) pairs); and 275 accesses are made to the combined file of document weights and addresses. All of these files except the inverted file are small, and likely to have been buffered into main memory, hence the small overhead. Elapsed time for the presentation of documents was about 5.0 sec per query greater than cpu time, caused by the need to perform a further 200 seeks into the file containing the compressed text. This file is too large for there to have been any buffering effect.

Memory space: When skipping is employed (the first three rows in each section of the table) memory space during query processing is proportional to the number of non-zero accumulators. In these experiments a hash table was used, and an average of 14 bytes per accumulator was required. For the "All" experiments an array of accumulators was used, 2.8 Mb for document retrieval and 6.6 Mb for page retrieval. An array of 6-bit approximate document lengths was used to guide the retrieval process [11], requiring 0.5 Mb for document retrieval and 1.2 Mb for page retrieval.

Terms processed: The values for "Terms processed" indicate the average number of terms processed before processing switched from the first to the second phase of continue. For each query a whole number of inverted file entries was processed, and this is why the "non-zero accumulators" average is greater than the target value L. Not surprisingly, with page retrieval fewer terms were required, on average, to consume the available accumulators.

Precision at 200: The range of values in this column shows the fraction of unjudged documents that were accessed: the lower number assumes that all unjudged documents are irrelevant; the upper is calculated assuming that all are relevant.
11pt effectiveness: The "11pt average" values are calculated based solely upon the top 200 ranked documents, assuming that all remaining relevant documents are ranked last, and that all unjudged documents are not relevant.

Index construction: New algorithms have been developed for index construction. Using these algorithms, the index was built in under 4 hours, using a peak of 40 Mb of main memory and less than 50 Mb of temporary disk space above and beyond the final size of the inverted file. The compression of the documents took a further 4 hours, for a total database build time of under 8 hours.

Based upon these experiments we conclude that:

* use of a limited number of accumulators does not appear to impact retrieval effectiveness, and so is an extremely attractive heuristic for ranking large collections because of the dramatic savings in retrieval-time memory usage that result;

* introduction of skips to the compressed inverted file entries significantly reduces processing time in this restricted-accumulators environment;

* index compression can, in this way, become "free": if only partial decoding is required, the input time saved by compression can be more than enough to pay the cpu cost of fractional decoding;

* all non-stopped terms should be allowed to contribute to the ranking, and hence that the quit strategy is inferior to continue; and

* even relatively simple pagination gives retrieval not measurably inferior to document retrieval.

When dealing with pages, the retrieval system was implemented to display the entire document, but highlight the page or pages that had caused the document to be ranked highly. This decision made the paged collection particularly easy to use, and when we browsed the answers to the queries it was very satisfying to be able to find the exact text that had triggered the match. This advantage of pagination became clear even in the early stages of the project, while we were attempting to debug the implementation of the cosine method.

On the other hand, a problem that was brought out by these experiments was the difficulty of comparing retrieval effectiveness in the face of non-exhaustive relevance judgements. When precision rates are around 30%, and a further (in the L = 1,000 case) 30% of documents are unjudged, there can be no significance whatsoever attached to the difference between even 30% precision and 40% precision. Indeed, assuming that 27.1% of the unjudged documents are relevant for the "L = 1,000; Doc." combination gives a final precision of 0.411; the corresponding number for the "All; Doc." pairing is only 0.373. Thus, the precision figures of Table 1 are sufficiently imprecise that no conclusion can be drawn about the appropriate value of L that should be used, or about the merits of document vs. paged retrieval. There is clearly scope for research into other methodologies for comparing retrieval mechanisms.

3 Structured documents

Many of the documents in the TREC collection are very large and have explicit structure, and it may be possible to use this structure, rather than the statistically based pagination methods described above, to break documents into parts. In particular, many documents can be broken up into a set of sections, each section having a type. There has been relatively little work done on retrieving or ranking partial documents. However, Salton et al. [12] have demonstrated that document structure can be valuable.
Sometimes this structure is explicitly available [2], and sometimes it has to be discovered [5], but knowledge of this structure has been shown to help determine the relevance of sub-documents. In this part of the work we used a small database to investigate whether retrieval of sections helped document retrieval, and whether retrieval of documents helped section retrieval. By way of a benchmark, the paged retrieval techniques described earlier were applied to the same database.

3.1 The database

Since we needed information about the relevance of sections to queries it was not possible to use the full TREC database. Instead, we used a database consisting of 4,000 documents extracted from the Federal Register collection. These documents were selected as being the 2,000 largest documents which were relevant to at least one of topics 51-100 provided for the first TREC experiment. Another 2,000 documents were randomly selected from the Federal Register collection to provide both smaller documents and non-relevant documents. The average number of words in these documents was 3,260.

These documents were then split into sections based on their internal markup. The documents had a number of tags inserted that defined an internal structure. It appeared that only the T2 and T3 tags could be reliably used to indicate a new internal fragment. Section breaks were defined to be a blank line, or a line containing only markup, followed by a T2 or a T3 tag. This led to a database of 32,737 sections. Each of these sections had a type based on its tag. The types were (purpose), (abstract), (start), (summary), (title), (supplementary), and a general category (misc) that included all remaining categories.

Having made the document selections, only 19 of the queries 51-100 had a relevant document in the collection. Each of the sections for documents that had been judged as relevant was judged for relevance against these queries so that finer grained retrieval experiments were possible. One difficulty that arose was that quite a few documents that had been judged relevant appeared to have no relevant sections: there were relevant key terms in the documents but the documents themselves did not appear to address the information requirement. There were 145 such (query, document) pairs. To be consistent, we took these documents to be irrelevant. After these alterations, only 14 queries had a relevant section, and there were an average of 23 relevant sections per query.

3.2 Structured ranking

We carried out a set of experiments on ranking documents using the retrieval of sections. We first compared simple ranking of documents against ranking sections to find relevant documents. Next, a set of formulae was devised that attempted to use the fact that one document has several sections that might be more or less highly ranked. These took into consideration the rank of the section, the number of ranked sections, and the number of sections in the document. Experiment 3 describes one of the more successful formulas. Further trials were then performed using the type of the section. First, a set of experiments was run to determine which types were better predictors of relevance. These results were then used to devise a measure that used a weight for each type. Finally, we tried to combine these results to see if it was helpful to use the rank of the whole document along with the ranks of its component parts.
Experiment 1: Rank full documents against the queries using the standard cosine measure.

Experiment 2: Split documents into sections. Measure the similarity of each section against the queries using the standard cosine measure. Order documents based on the highest ranked section.

Experiment 3: Split documents into sections. Measure the similarity of each section against the queries using the standard cosine measure. Order documents based on

    Σ_n 0.5^(n-1) · wt(s_n),

where s_n is the nth highest ranked section in the document. The effect of this formula is that a document's weight is determined by a decay formula using the weights of all of a document's sections. (Values other than 0.5 were tried, but gave poorer performance. A code sketch of this combination is given below, after the section retrieval experiments.)

Experiment 4: Split documents into sections. Measure the similarity of each section against the queries using the standard cosine measure. Then weight each section using its type, for example (introduction) or (address). Order documents based on

    Σ_n tw(type(s_n)) · 0.5^(n-1) · wt(s_n),

where s_n is the nth highest ranked section in the document, type(s_n) is the type of the section, and tw(y) is the weight of the type y. The weights of the types of sections were obtained by conducting a set of experiments where all types but one were given a weight of 1 and, in turn, each type was given a weight of 2. Using these experiments it was determined that (purpose) and (summary) were each more helpful, and that (misc) was less helpful. As a result, in this experiment tw((purpose)) = tw((summary)) = 2, tw((misc)) = 0.5, and other weights were set to 1.

Experiment 5: Rank the documents, and rank the sections. Form a new rank based on the average of these two ranks.

Other experiments were carried out using the best two sections, and formulas that more closely approximated the cosine measure. None of these experiments achieved better results than the ones displayed here. The obvious conclusion is that if documents are available for ranking as whole documents, then for this collection it is preferable to do so.

3.3 Section retrieval

For very long documents it may be desirable to return relevant sections rather than relevant documents. We were interested to see whether it might be useful to know about the rank of the containing document. In the first experiment documents were ranked, and sections shown in document order. This produced very poor results. Next, we still ranked sections in higher ranked documents ahead of lower ranked documents, but used section ranking for sections in the same document. This was reasonable but there were still many irrelevant sections being examined. Finally, we attempted to delete these irrelevant sections by using document ranking, and then section ranking, but this time discarding sections that had a section rank of greater than 200.

Experiment 6: Rank sections against the queries using the standard cosine measure.

Experiment 7: Rank full documents against the queries using the standard cosine measure. Order the sections by their appearance within documents.

Experiment 8: Rank full documents against the queries using the standard cosine measure. Order sections, first by document, then by rank within documents.

Experiment 9: As in experiment 3, but then delete all but the 200 highest ranked sections.

These experiments show that ranking both documents and sections does help to find more relevant sections. This result is in contrast to the earlier investigation of finding relevant documents. We are able to find relevant sections much more easily if the ranks of both the sections and the containing documents are taken into account.
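As promised in Experiment 3, here is a minimal C sketch of the decay combination used in Experiments 3 and 4. The section weights are the cosine similarities already computed for each section; the Section structure, the type-weight table tw[], and all identifiers are illustrative assumptions.

```c
#include <stdlib.h>

/* One section of a document: its cosine weight against the query and its
 * type, encoded as an index into a caller-supplied type-weight table. */
typedef struct {
    double wt;      /* wt(s): cosine similarity of the section */
    int    type;    /* index into tw[]                         */
} Section;

static int by_wt_desc(const void *a, const void *b)
{
    double x = ((const Section *)a)->wt, y = ((const Section *)b)->wt;
    return (x < y) - (x > y);
}

/* Experiment 3: score = sum_n 0.5^(n-1) * wt(s_n), where s_n is the nth
 * highest ranked section of the document.  Experiment 4 multiplies each
 * term by tw[type(s_n)]; passing tw == NULL reproduces Experiment 3. */
double document_score(Section *s, int nsections, const double *tw)
{
    double score = 0.0, decay = 1.0;   /* decay holds 0.5^(n-1) */

    qsort(s, nsections, sizeof *s, by_wt_desc);
    for (int n = 0; n < nsections; n++) {
        double contrib = s[n].wt * (tw != NULL ? tw[s[n].type] : 1.0);
        score += decay * contrib;
        decay *= 0.5;
    }
    return score;
}
```

Documents are then ordered by decreasing document_score(); for Experiment 4 the table tw[] would hold 2 for (purpose) and (summary), 0.5 for (misc), and 1 for every other type.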
  Experiment      Documents returned
                  5       10      15      20      25      30      50      200
  1               0.286   0.271   0.248   0.236   0.234   0.229   0.206   0.102
  2               0.243   0.221   0.214   0.204   0.191   0.178   0.170   0.092
  3               0.271   0.250   0.229   0.221   0.206   0.202   0.184   0.094
  4               0.329   0.257   –       0.229   0.206   0.202   0.184   0.085
  5               0.343   0.264   0.238   0.236   0.234   0.231   0.209   0.099

Table 2: Comparison of ranking formulas for a fixed number of documents returned

  Experiment      Sections returned
                  5       10      15      20      25      30      50      200
  6               0.186   0.164   0.181   0.161   0.160   0.152   0.140   0.120
  7               0.100   0.121   0.105   0.100   0.094   0.088   0.090   0.083
  8               0.171   0.121   0.114   0.100   0.091   0.083   0.100   0.082
  9               0.214   0.171   0.186   0.189   0.189   0.193   0.179   NA

Table 3: Comparison of ranking formulas for a fixed number of sections returned

3.4 Paged versus section retrieval

Our results showing that retrieving documents based on section ranking was not as useful as document ranking were perhaps surprising: other studies have indicated that considering smaller fragments was helpful, although in combination with larger contexts. We wondered whether the section boundaries that had been imposed were inappropriate. We thus applied the pagination techniques described in Section 2.1 to the 4,000 document database used here. These results are shown in Table 4 as Experiment 10.

These three experiments are a little difficult to interpret. Dividing documents into sections leads to poorer retrieval performance, but dividing further into pages leads to retrieval performance comparable to ranking whole documents. It may be that the manually supplied divisions are poorer than the divisions generated by automatic techniques [5]. Experiments 2-4 show that some of this performance degradation can be ameliorated by taking documents' structure into account. However, these experiments indicate that there is no retrieval advantage in breaking the document up, should the desired unit of retrieval be whole documents.

4 Conclusions

The combination of a restricted-accumulators policy and the introduction of skips to the compressed inverted file entries allows fast query evaluation on large text collections. Moreover, the ranking can be carried out within modest amounts of main memory. For example, the paged TREC collection contains 1.7 million pages, but ranked queries of 50 or more terms can be resolved within seconds using just a few megabytes of main memory. These two techniques mean that large collections can be searched on small machines without measurable degradation in retrieval effectiveness.

In the second part of the experiment we have concentrated on large documents, breaking them into smaller units for the purposes of indexing. It is not clear that users are interested in retrieving 3 Mb documents, and these experiments were designed to allow users the option of retrieving smaller parts of such documents. The results were mixed. It appears that indexing both sections and documents is helpful in ranking sections. However, it is not clear what an appropriate indexing strategy is if only full documents are to be returned. We are continuing our investigation of partial document retrieval.

Acknowledgements

We would like to thank Daniel Lam and Neil Sharman for their assistance with various components of the experiments described here.
This work was supported by the Australian Research Council, the Collaborative Information Technology Research Institute, and the Centre for Intelligent Decision Systems.