The Efficiency Issues Workshop Report

James P. Callan
Computer Science Department
University of Massachusetts
Box 34610
Amherst, MA 01003-4610, USA
(callan@cs.umass.edu)

David D. Lewis
AT&T Bell Laboratories
600 Mountain Avenue, Room 2C409
Murray Hill, NJ 07974, USA
(lewis@research.att.com)

Although most groups participating in TREC-2 emphasized precision and recall, the conference was also an appropriate forum in which to discuss the efficiency of document indexing and retrieval. For some participants, running out of disk space was the worst problem; for others, it was running out of time. Some groups were unable to run on the entire collection, and several remarked that they were happy to have gotten anything running at all. No group reported finding TREC trivial.

Two discussion groups were organized to address efficiency issues, one focusing on document indexing, the other on document retrieval. Since efficiency has not been emphasized by the research IR community, participants in both groups felt that current algorithms have considerable room for improvement. TREC provides one motivation for such improvements, as do ever-growing real-world databases. However, there was concern in both groups that the TREC format, which encourages participation in both the ad hoc (retrospective) and routing (filtering) tasks, might discourage research on efficient task-specific architectures.

The following sections provide more detail on each of the two discussion groups.

1 Document Indexing

The raw text for the TREC collection (routing and ad hoc) required approximately 3 GB of space. Index structures required additional space, ranging upward from roughly 7.5% of the raw text size. The index structures were generally not themselves sufficient to recreate the original text, so they would be additional overhead in an operational system. (A research system might be able to discard the original text, reporting just document ids for evaluation.)

Several groups, including CITRI, Thinking Machines, and UMass, stored inverted lists in compressed form. There was general agreement that, for sites willing to invest the programming effort, substantial space savings could be achieved in this fashion. (CITRI demonstrated a factor-of-six reduction in index file size; one common scheme is sketched below.) There was more debate on whether index compression could speed up query processing as well: some participants said their query processing was I/O bound, while others said theirs was CPU bound. The peak amount of space used during indexing (for text, indices, and auxiliary files) varied across sites, from roughly 110% of the raw text size upward, suggesting that this may be worth more attention.

Efficiency improvements will not come immediately, and some may require significant expense in programming time. Sharing of software between groups, which increased from TREC-1 to TREC-2, helps limit this expense. In addition, TREC research groups primarily interested in, say, query analysis may in the future want to team up with groups that have addressed, or are addressing, issues of scale. As the size of test collections increases, it makes less sense to have them replicated at dozens of sites, particularly when interactive access across networks is usually easily available.

Time to build index structures was tolerable for most participants, though it was mentioned that indexing never seems to go as easily or automatically as one might hope. It was a serious issue for groups doing some form of natural language processing. Indexing rates of 2 to 4 MB/hr were mentioned by at least two of the NL groups; at 3 MB/hr, the 3 GB collection would take roughly 1,000 hours (about six weeks) of continuous processing, making TREC a very daunting task.
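Returning to index compression: the report does not say which schemes the participating sites used, so the following is only a minimal sketch of one standard technique, delta-gap encoding of sorted document ids followed by variable-byte coding of the gaps. The function names and the sample postings list are illustrative assumptions, not details from any TREC system.

    def vbyte_encode(numbers):
        # Variable-byte code: 7 data bits per byte; the high bit
        # marks the final byte of each number.
        out = bytearray()
        for n in numbers:
            while n >= 128:
                out.append(n & 0x7F)
                n >>= 7
            out.append(n | 0x80)
        return bytes(out)

    def vbyte_decode(data):
        numbers, n, shift = [], 0, 0
        for b in data:
            if b & 0x80:                           # final byte of this number
                numbers.append(n | ((b & 0x7F) << shift))
                n, shift = 0, 0
            else:
                n |= b << shift
                shift += 7
        return numbers

    def compress_postings(doc_ids):
        # Sorted ids become small gaps, so frequent terms compress best.
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        return vbyte_encode(gaps)

    def decompress_postings(data):
        ids, total = [], 0
        for gap in vbyte_decode(data):
            total += gap
            ids.append(total)
        return ids

    postings = [5, 18, 23, 900, 905]
    assert decompress_postings(compress_postings(postings)) == postings

Gaps for frequent terms usually fit in a single byte instead of a fixed-width word, which makes reductions of the magnitude CITRI reported plausible. The sketch also shows the trade-off behind the I/O-bound versus CPU-bound debate above: compressed lists mean less I/O, but every list must be decoded before use.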
Some participants expressed the opinion that an NL technique would have to provide as-yet-undemonstrated improvements in effectiveness to be worth the slowdown in indexing and query processing.

Complex text representations, such as those produced by NL, require additional information in the index structures. The different ways of dealing with this problem are most noticeable in the handling of phrases in TREC: some groups indexed phrases just as they did words, others relied on word position information stored in an inverted file, and still others reparsed the raw text of a subset of documents at retrieval time to find phrase occurrences.

2 Document Retrieval

The second discussion group was organized around general issues in document retrieval. Participants were encouraged to use their experience with the one and two gigabyte TREC collections to forecast the issues that will arise when collections are larger and more distributed. Two issues dominated the discussion.

2.1 Will Existing Methods Scale?

Recent trends in information retrieval are towards gigabyte and terabyte document collections, retrieval by subsections/paragraphs, and multiple representations of document content. The first focus questions were whether existing storage and retrieval methods can cope with this explosion of information and, if new methods are needed, what those methods might be. Participants identified two approaches as currently dominating IR:

1. Search all available subcollections (e.g. the TIPSTER/TREC task), or

2. have a user specify which subcollection to search (e.g. commercial systems).

Both approaches require modification if they are to scale up.

One problem with the "search everything" approach is that the growth rate of online information was felt to exceed the growth rate in computer performance. Even if a collection is distributed across multiple processors, detailed consideration of every document may be too expensive when an IR system faces a terabyte of data. One solution is a fast first-pass retrieval that produces a reduced set of documents for more detailed consideration. This first pass might involve generating approximate scores for each document (e.g. ETH in TREC-2), scoring documents based on a smaller amount of text (e.g. abstracts or introductions), or scoring cluster centroids rather than individual documents.

One problem with the "user chooses subcollections" approach is that the task becomes overwhelming when there are many subcollections. A significant portion of the users of one commercial service already choose to search everything rather than select subcollections, so the system will have to provide assistance if user selection is to be viable. If subcollections can be characterized succinctly, perhaps by centroid vectors, automatically generated thesauri, or controlled vocabulary terms (assigned manually or automatically), then the query itself could be used to rank the subcollections, and only the top-ranked ones searched (see the sketch below). Other approaches include assistance from an expert system, or browsing interfaces for hyperlinked subcollections.
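The report names centroid vectors as one way to characterize subcollections but specifies no algorithm, so the sketch below is only an illustration under stated assumptions: centroids are mean term-frequency vectors, similarity is cosine, and the helper names and toy data are invented.

    import math
    from collections import Counter

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def centroid(texts):
        # Mean term-frequency vector over a subcollection's documents.
        total = Counter()
        for text in texts:
            total.update(text.lower().split())
        return {t: c / len(texts) for t, c in total.items()}

    def choose_subcollections(query, centroids, k=2):
        # Rank subcollections by query-centroid similarity; only the
        # top k would then actually be searched.
        q = Counter(query.lower().split())
        ranked = sorted(centroids,
                        key=lambda name: cosine(q, centroids[name]),
                        reverse=True)
        return ranked[:k]

    centroids = {
        "news": centroid(["stock markets fell", "markets rally on trade news"]),
        "patents": centroid(["a method for encoding signals",
                             "signal encoding device"]),
    }
    print(choose_subcollections("trade news report", centroids, k=1))  # ['news']

The same machinery covers the cluster-centroid variant of first-pass retrieval mentioned above: score centroids first, then examine individual documents only in the best-scoring clusters.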
Global statistics that summarize some aspect of a collection (e.g. idf) were expected to be a problem for searching multiple subcollections and distributed document collections. If a collection is formed at indexing time, statistics can be gathered and saved when the indices are built. If a single processor performs retrieval, statistics can be gathered during retrieval. However, if multiple subcollections or processors are involved, it is less clear how to compute global statistics. Methods that rely only on local statistics summarizing a single document (e.g. Berkeley in TREC-2) offer a computational advantage in this environment.

2.2 Specialized Hardware and Software

A related question posed by increasingly large and distributed text databases is whether existing hardware and software platforms will be up to the challenge. The consensus among participants was that conventional architectures will suffice, because IR is a data-parallel task that lends itself to distributed computation (see the sketch at the end of this report). Participants also felt that they had generally ignored issues of efficiency, and could increase their speeds if necessary. There was little support for supercomputers, massively parallel computers, or specialized architectures.

One could argue that the participants were biased towards conventional hardware and software by their own need for flexibility, their small budgets, their insulation from the time constraints of real users, and their desire not to think about 'systems' issues not directly relevant to their research. However, the recent fielding of ranked-retrieval systems on conventional mainframes by some of the largest online vendors lends additional support to the view that conventional architectures will suffice.
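To make the data-parallel claim concrete, and to show one resolution of the global-statistics problem from Section 2.1, here is a minimal sketch under stated assumptions: a simple tf-idf scorer, document frequencies merged across partitions before scoring, each partition scored independently (on its own processor, in a real system), and the partial results merged into a global ranking. None of these details are drawn from any TREC system.

    import heapq
    import math
    from collections import Counter

    def partition_df(docs, terms):
        # Per-partition document frequencies for the query terms.
        return Counter({t: sum(1 for text in docs.values()
                               if t in text.split())
                        for t in terms})

    def score_partition(docs, terms, global_df, n_total):
        # Score one partition with tf-idf, using the merged *global*
        # statistics rather than partition-local ones.
        results = []
        for doc_id, text in docs.items():
            tf = Counter(text.split())
            score = sum(tf[t] * math.log(n_total / global_df[t])
                        for t in terms if global_df[t])
            results.append((score, doc_id))
        return results

    def distributed_search(partitions, query, k=10):
        terms = query.split()
        # Phase 1: gather and merge collection-wide statistics.
        global_df = Counter()
        for p in partitions:
            global_df.update(partition_df(p, terms))
        n_total = sum(len(p) for p in partitions)
        # Phase 2: score partitions independently (the data-parallel
        # step), then merge the partial lists into a global top k.
        partial = []
        for p in partitions:
            partial.extend(score_partition(p, terms, global_df, n_total))
        return heapq.nlargest(k, partial)

The statistics-merging phase is exactly the step that local-statistics methods (such as Berkeley's in TREC-2) avoid, which is the computational advantage noted in Section 2.1.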