The Efficiency Issues Workshop Report

James P. Callan
Computer Science Department
University of Massachusetts
Box 34610
Amherst, MA 01003-4610, USA
(callan@cs.umass.edu)

David D. Lewis
AT&T Bell Laboratories
600 Mountain Avenue, Room 2C409
Murray Hill, NJ 07974, USA
(lewis@research.att.com)

Although most groups participating in TREC-2 emphasized precision and recall, the conference was also an appropriate forum in which to discuss the efficiency of document indexing and retrieval. For some participants, running out of disk space was the worst problem; for others, it was running out of time. Some groups were unable to run on the entire collection, and several remarked that they were happy to have gotten anything running at all. No group reported finding TREC trivial.

Two discussion groups were organized to address efficiency issues, one focusing on document indexing, the other on document retrieval. Since efficiency has not been emphasized by the research IR community, participants in both groups felt that current algorithms have considerable room for improvement. TREC provides one motivation for such improvements, as do ever-growing real-world databases. However, there was concern in both groups that the TREC format, which encourages participation in both the ad hoc (retrospective) and routing (filtering) tasks, might discourage research on efficient task-specific architectures.

The following sections provide more detail on each of the two discussion groups.

1 Document Indexing

The raw text for the TREC collection (routing and ad hoc) required approximately 3 GB of space. Index structures required additional space, ranging upward from roughly 7.5% of the raw text size. The index structures were generally not themselves sufficient to recreate the original text, so they would be additional overhead in an operational system. (A research system might be able to discard the original text, reporting just document ids for evaluation.)

Several groups, including CITRI, Thinking Machines, and UMass, stored inverted lists in compressed form. There was general agreement that, for sites willing to invest the programming effort, substantial space savings could be achieved in this fashion. (CITRI demonstrated a factor-of-six reduction in index file size; one common scheme is sketched below.) There was more debate on whether index compression could speed up query processing as well: some participants said their query processing was I/O bound, while others said theirs was CPU bound. The peak amount of space used during indexing (for text, indices, and auxiliary files) varied across sites, from roughly 110% of the raw text size upward, suggesting that this may be worth more attention.

Efficiency improvements will not come immediately, and some may require significant expense in programming time. Sharing of software between groups, which increased from TREC-1 to TREC-2, helps limit this expense. In addition, TREC research groups primarily interested in, say, query analysis may in the future want to team up with groups that have addressed, or are addressing, issues of scale. As the size of test collections increases, it makes less sense to have them replicated at dozens of sites, particularly when interactive access across networks is usually easily available.

Time to build index structures was tolerable for most participants, though it was mentioned that indexing never seems to go as easily or automatically as one might hope. It was a serious issue for groups doing some form of natural language processing. Indexing rates of 2 to 4 MB/hr were mentioned by at least two of the NL groups; at 3 MB/hr, the 3 GB collection would take roughly 1,000 hours (about six weeks) of continuous processing, making TREC a very daunting task.
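Returning to index compression: the report does not say which schemes the participating sites used, so the following is only a minimal sketch of one standard technique, delta-gap encoding of sorted document ids followed by variable-byte coding of the gaps. The function names and the sample postings list are illustrative assumptions, not details from any TREC system.

    def vbyte_encode(numbers):
        # Variable-byte code: 7 data bits per byte; the high bit
        # marks the final byte of each number.
        out = bytearray()
        for n in numbers:
            while n >= 128:
                out.append(n & 0x7F)
                n >>= 7
            out.append(n | 0x80)
        return bytes(out)

    def vbyte_decode(data):
        numbers, n, shift = [], 0, 0
        for b in data:
            if b & 0x80:                           # final byte of this number
                numbers.append(n | ((b & 0x7F) << shift))
                n, shift = 0, 0
            else:
                n |= b << shift
                shift += 7
        return numbers

    def compress_postings(doc_ids):
        # Sorted ids become small gaps, so frequent terms compress best.
        gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
        return vbyte_encode(gaps)

    def decompress_postings(data):
        ids, total = [], 0
        for gap in vbyte_decode(data):
            total += gap
            ids.append(total)
        return ids

    postings = [5, 18, 23, 900, 905]
    assert decompress_postings(compress_postings(postings)) == postings

Gaps for frequent terms usually fit in a single byte instead of a fixed-width word, which makes reductions of the magnitude CITRI reported plausible. The sketch also shows the trade-off behind the I/O-bound versus CPU-bound debate above: compressed lists mean less I/O, but every list must be decoded before use.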
Some participants expressed the opinion that an NL technique would have to provide as-yet-undemonstrated improvements in effectiveness to be worth the slowdown in indexing and query processing.

Complex text representations, such as those produced by NL, require additional information in the index structures. The different ways of dealing with this problem are most noticeable in the handling of phrases in TREC: some groups indexed phrases just as they did words, others relied on word position information stored in an inverted file, and still others reparsed the raw text of a subset of documents at retrieval time to find phrase occurrences.

2 Document Retrieval

The second discussion group was organized around general issues in document retrieval. Participants were encouraged to use their experience with the one and two gigabyte TREC collections to forecast the issues that will arise when collections are larger and more distributed. Two issues dominated the discussion.

2.1 Will Existing Methods Scale?

Recent trends in information retrieval are towards gigabyte and terabyte document collections, retrieval by subsections/paragraphs, and multiple representations of document content. The first focus questions were whether existing storage and retrieval methods can cope with this explosion of information and, if new methods are needed, what those methods might be. Participants identified two approaches as currently dominating IR:

1. Search all available subcollections (e.g. the TIPSTER/TREC task), or

2. have a user specify which subcollection to search (e.g. commercial systems).

Both approaches require modification if they are to scale up.

One problem with the "search everything" approach is that the growth rate of online information was felt to exceed the growth rate in computer performance. Even if a collection is distributed across multiple processors, detailed consideration of every document may be too expensive when an IR system faces a terabyte of data. One solution is a fast first-pass retrieval that produces a reduced set of documents for more detailed consideration. This first pass might involve generating approximate scores for each document (e.g. ETH in TREC-2), scoring documents based on a smaller amount of text (e.g. abstracts or introductions), or scoring cluster centroids rather than individual documents.

One problem with the "user chooses subcollections" approach is that the task becomes overwhelming when there are many subcollections. A significant portion of the users of one commercial service already choose to search everything rather than select subcollections, so the system will have to provide assistance if user selection is to be viable. If subcollections can be characterized succinctly, perhaps by centroid vectors, automatically generated thesauri, or controlled vocabulary terms (assigned manually or automatically), then the query itself could be used to rank the subcollections, and only the top-ranked ones searched (see the sketch below). Other approaches include assistance from an expert system, or browsing interfaces for hyperlinked subcollections.
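The report names centroid vectors as one way to characterize subcollections but specifies no algorithm, so the sketch below is only an illustration under stated assumptions: centroids are mean term-frequency vectors, similarity is cosine, and the helper names and toy data are invented.

    import math
    from collections import Counter

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    def centroid(texts):
        # Mean term-frequency vector over a subcollection's documents.
        total = Counter()
        for text in texts:
            total.update(text.lower().split())
        return {t: c / len(texts) for t, c in total.items()}

    def choose_subcollections(query, centroids, k=2):
        # Rank subcollections by query-centroid similarity; only the
        # top k would then actually be searched.
        q = Counter(query.lower().split())
        ranked = sorted(centroids,
                        key=lambda name: cosine(q, centroids[name]),
                        reverse=True)
        return ranked[:k]

    centroids = {
        "news": centroid(["stock markets fell", "markets rally on trade news"]),
        "patents": centroid(["a method for encoding signals",
                             "signal encoding device"]),
    }
    print(choose_subcollections("trade news report", centroids, k=1))  # ['news']

The same machinery covers the cluster-centroid variant of first-pass retrieval mentioned above: score centroids first, then examine individual documents only in the best-scoring clusters.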
Global statistics that summarize some aspect of a collection (e.g. idf) were expected to be a problem for searching multiple subcollections and distributed document collections. If a collection is formed at indexing time, statistics can be gathered and saved when the indices are built. If a single processor performs retrieval, statistics can be gathered during retrieval. However, if multiple subcollections or processors are involved, it is less clear how to compute global statistics. Methods that rely only on local statistics summarizing a single document (e.g. Berkeley in TREC-2) offer a computational advantage in this environment.

2.2 Specialized Hardware and Software

A related question posed by increasingly large and distributed text databases is whether existing hardware and software platforms will be up to the challenge. The consensus among participants was that conventional architectures will suffice, because IR is a data-parallel task that lends itself to distributed computation (see the sketch at the end of this report). Participants also felt that they had generally ignored issues of efficiency, and could increase their speeds if necessary. There was little support for supercomputers, massively parallel computers, or specialized architectures.

One could argue that the participants were biased towards conventional hardware and software by their own need for flexibility, their small budgets, their insulation from the time constraints of real users, and their desire not to think about 'systems' issues not directly relevant to their research. However, the recent fielding of ranked-retrieval systems on conventional mainframes by some of the largest online vendors lends additional support to the view that conventional architectures will suffice.
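To make the data-parallel claim concrete, and to show one resolution of the global-statistics problem from Section 2.1, here is a minimal sketch under stated assumptions: a simple tf-idf scorer, document frequencies merged across partitions before scoring, each partition scored independently (on its own processor, in a real system), and the partial results merged into a global ranking. None of these details are drawn from any TREC system.

    import heapq
    import math
    from collections import Counter

    def partition_df(docs, terms):
        # Per-partition document frequencies for the query terms.
        return Counter({t: sum(1 for text in docs.values()
                               if t in text.split())
                        for t in terms})

    def score_partition(docs, terms, global_df, n_total):
        # Score one partition with tf-idf, using the merged *global*
        # statistics rather than partition-local ones.
        results = []
        for doc_id, text in docs.items():
            tf = Counter(text.split())
            score = sum(tf[t] * math.log(n_total / global_df[t])
                        for t in terms if global_df[t])
            results.append((score, doc_id))
        return results

    def distributed_search(partitions, query, k=10):
        terms = query.split()
        # Phase 1: gather and merge collection-wide statistics.
        global_df = Counter()
        for p in partitions:
            global_df.update(partition_df(p, terms))
        n_total = sum(len(p) for p in partitions)
        # Phase 2: score partitions independently (the data-parallel
        # step), then merge the partial lists into a global top k.
        partial = []
        for p in partitions:
            partial.extend(score_partition(p, terms, global_df, n_total))
        return heapq.nlargest(k, partial)

The statistics-merging phase is exactly the step that local-statistics methods (such as Berkeley's in TREC-2) avoid, which is the computational advantage noted in Section 2.1.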