An Information Retrieval Test-bed on the CM-5

Brij Masand and Craig Stanfill
Thinking Machines Corporation
245 First Street
Cambridge MA 02142

*Connection Machine is a registered trademark of Thinking Machines Corporation. CM, CM-2, CM-5, and DataVault are trademarks of Thinking Machines Corporation. SPARC is a trademark of Sun Microsystems.

I. INTRODUCTION

For many years, research on information retrieval was mostly confined to a few relatively small test collections such as the Cranfield collection [1], the NPL collection [2], and the CACM collection [3]. Over the years, results on those collections accumulated, with the aim of determining which technique or combination of techniques resulted in the best precision/recall figures on those collections. Gradually, a "standard model" more-or-less emerged: for the test collections under study, consistently good results are obtained by vector-model retrieval using a cosine similarity measure, tf.idf weighting, and a stemming algorithm (e.g. Chapter 9 of [4], [5]).(1) Out-performing this model on the old test collections has proved extremely difficult. This has led to a danger of stagnation in the field of IR, and a feeling that the majority of what can be learned from precision-recall experiments on the old collections has been learned.

(1) Other approaches yield results which are nearly as good and, in rare cases, better. Approaches based on Bayesian inference networks are particularly notable in that respect. However, the above model has come to be accepted as a de-facto standard against which new methods are measured.

Fortunately, a number of recent developments have led to new challenges for the field. Among these:

* The Tipster project [6] has led to the construction of a test collection of unprecedented size [7]. This leads to challenges related to scaling up IR methods.

* IR methods are now being used on full documents (e.g. news stories), rather than on abstracts. This leads to new challenges relating to the structure of large documents, particularly effects relating to term proximity.

* The on-line database vendors have shown interest in full-text IR. This leads to challenges relating to the integration of IR methods with traditional boolean methods which, in the operational environment, must continue to be supported, if only for the benefit of the existing, entrenched user community.

* Network-oriented IR protocols have become increasingly popular as a basic tool for navigating across the network. This leads to the challenge of designing systems which give uniformly interpretable results across a distributed database.

The work reported in this paper represents some steps along the way to solving these problems. In essence, the challenge we are facing is to design a system which delivers high precision-recall figures on large databases of long, complex documents. Furthermore, the system must be compatible with existing boolean retrieval methods, and it must be possible to use this system in a distributed database in a manner which yields consistently interpretable results. The specific steps reported at this point consist of the following:

* A database architecture which is suitable for use on large collections of structured documents, and which supports both base-line IR and boolean retrieval methods.

* An in-memory implementation of this architecture on a massively parallel computer (the Connection Machine model CM-5), which is used as a test-bed.
* Precision-recall figures for this test-bed applied to the Tipster collection, as part of TREC.

It must be emphasized that this work is in an early stage and has not yet reached the point where these methods are demonstrable on extremely large databases. Furthermore, work relating to taking advantage of document structure has only barely begun.

II. DATABASE ARCHITECTURE

A. Choice of a Representation

There are three basic approaches to representing databases for retrieval applications:

* Text Files. In this method, the database is stored in full-text form and scanned sequentially. Such methods are often implemented by special-purpose hardware [8].

* Signature Files. In this family of methods, a compressed surrogate for the text file is searched instead of the text file itself. Overlap encoding is usually employed to construct the surrogate. There are many variants on signature files [9].

* Inverted Files. In this family of methods, an index containing the locations of every word in the database is constructed. The search is then accomplished by reference to the index. Again, there are many variants on inverted file representations [10].

Inverted files have proved the only technique which supports interactive access to very large databases; this is primarily due to the excessive I/O requirements of the other methods. Within the family of inverted file methods, several variants are possible:

* Simple Inverted Files. In a simple inverted file, the index consists of a list of the documents in which each word occurs.

* Inverted Files with Weights. In an inverted file with weights, the above information is supplemented with weights, in support of vector retrieval methods.

* Structured Inverted Files. In a structured inverted file, the index captures structural information (typically the paragraph/sentence/word coordinates of each occurrence of each term), in support of boolean retrieval methods which rely on proximity operators (e.g. word adjacency).

In this research, we have decided to explore the use of structured inverted files for the following reasons:

* They support the proximity operations required as part of boolean retrieval systems. This support makes architectures based on structured inverted files more likely to be adopted by the major on-line vendors.

* They provide a fundamentally richer representation of document structure than is available with the other methods.

* They are collection-independent and retrieval-method independent.

This last point requires some explanation. In a distributed environment, it would be useful to be able to search text files at multiple locations as if they were a single text file. Once weights - which are both collection-dependent and retrieval-method dependent - are incorporated into the index, transparent distributed access becomes impossible. The consequence is that the results of a single query applied to multiple databases cannot be meaningfully combined, since there is no way to compare the ranks or scores assigned to the documents returned.

One of the primary challenges associated with the use of structured inverted files is their bulk: there are approximately as many index entries in the inverted file as there are words in the database, and each entry must represent the document, paragraph, sentence, and word coordinates for that word. This problem has been substantially solved by use of a novel combination of compression techniques [11], which allows structured indexes with a bulk on the order of 1/3 the size of the full text to be constructed.
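The compression scheme actually used by the test-bed is the one described in [11] and is not detailed in this paper. Purely as a rough illustration of why sorted, structured postings compress well, the Python sketch below delta-encodes document coordinates and packs all fields with a variable-byte code; the function names and the encoding itself are illustrative assumptions, not the scheme of [11].

```python
# Illustrative only: delta + variable-byte encoding of sorted
# (document, paragraph, sentence, word) postings. The actual scheme
# used by the test-bed is the one described in [11], not this one.

def varint(n):
    """Encode a non-negative integer as a variable-length byte string."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(postings):
    """postings: list of (doc, para, sent, word) tuples, sorted ascending."""
    out = bytearray()
    prev_doc = 0
    for doc, para, sent, word in postings:
        out += varint(doc - prev_doc)   # document gap, usually small
        out += varint(para)
        out += varint(sent)
        out += varint(word)
        prev_doc = doc
    return bytes(out)
```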
The second challenge associated with the use of structured inverted files is to implement the standard information retrieval model using them. The results of this effort are reported in this paper.

The final challenge associated with structured inverted files is using them to implement methods which go beyond the standard model, taking into account the added richness of the structured representation to improve retrieval system performance for databases having long, structured documents. This final challenge remains a topic for future research, and there are no results to report at this time.

B. The Database Architecture

The database consists of the following structures:

* A Compressed Structured Posting File. The first component of the database is an array of compressed postings. Each posting gives the location of a word expressed as a document-paragraph-sentence-word 4-tuple. The postings are sorted by word ID, in ascending document order.

* A Lexicon/Index. The second component of the database is a lexicon/index. For each word in the database, it stores the number of times it occurs plus the location of its postings in the posting file. The lexicon is structured so that the information pertaining to a given word may be quickly located.

* Document Information. The final component of the database is an array of document-related information. Each entry contains a mapping from an internal document identifier (an integer) to an external document identifier (a string). Additional information, such as the length of the document or normalization information, may be added as necessary.

The following operations are supported:

* Adding Postings. A 5-tuple consisting of a word plus its four coordinates may be added to the database.

* Extracting Postings. A word is presented to the database. An array containing the decompressed coordinates of all occurrences of that word is returned. The array is sorted by ascending document coordinate, with paragraph, sentence, and word coordinates used as additional sort keys as needed.

* Extracting Lexicon Information. Lexicon information, such as the number of occurrences of a word, may be extracted.

* Setting Lexicon Information. Supplemental lexicon information (e.g. part-of-speech information) may be added to the lexicon. This feature is not used in the work reported herein.

* Defining a Document. Defines document-specific information such as the external document identifier, document location on disk, etc., and associates it with a document coordinate.

* Extracting Document Information. Extracts the document-specific information mentioned above, given a document coordinate.

We believe that this architecture suffices to implement boolean retrieval methods, standard IR methods, and novel methods making use of document structure, and that it can be scaled to very large databases and to massively parallel computers.
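As a concrete, simplified picture of these structures and operations, the following Python sketch holds the postings uncompressed in memory and keys them directly by word. The class and method names are illustrative assumptions; the real test-bed stores compressed postings distributed across the CM-5 processors.

```python
# Schematic, in-memory sketch of the structures in Section II.B.
# Plain Python lists/dicts are used for clarity; the test-bed stores
# the postings compressed and partitioned across processors.

class StructuredIndex:
    def __init__(self):
        self.postings = {}   # word -> list of (doc, para, sent, word_pos)
        self.lexicon = {}    # word -> number of occurrences
        self.doc_info = {}   # internal doc id -> external id, length, ...

    def add_posting(self, word, doc, para, sent, word_pos):
        """Adding Postings: a word plus its four coordinates."""
        self.postings.setdefault(word, []).append((doc, para, sent, word_pos))
        self.lexicon[word] = self.lexicon.get(word, 0) + 1

    def extract_postings(self, word):
        """Extracting Postings: all occurrences of a word, sorted by
        document, then paragraph, sentence, and word coordinates."""
        return sorted(self.postings.get(word, []))

    def define_document(self, doc, external_id, **extra):
        """Defining a Document: associate an external identifier and any
        additional information with a document coordinate."""
        self.doc_info[doc] = dict(external_id=external_id, **extra)

    def document_info(self, doc):
        """Extracting Document Information for a document coordinate."""
        return self.doc_info.get(doc)
```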
C. The Test-bed

The above architecture has been implemented in test-bed form on the Connection Machine model CM-5. The purpose of this implementation is to permit various retrieval methods to be evaluated, rather than to support on-line services on very large files. As such, the focus was on simplicity of implementation, speed of database construction, and speed of query execution, rather than on handling files larger than a few gigabytes.

The CM-5 [12] is a general-purpose parallel computer constructed from commodity microprocessors. Each processing node consists of a SPARC microprocessor, 32 megabytes of RAM, and a network interface. The CM-5 has two user-accessible networks: a Data Router, which is used for point-to-point transmission of data packets, and a Control Network, which is used to implement global operations such as barrier synchronization, broadcast, and global-maximum. The Data Router provides each processor with 5 MB/second (full duplex) of point-to-point bandwidth. The Control Network is constructed using a fan-in/fan-out tree, so that global operations complete within a few microseconds of their initiation.

The CM-5 I/O system consists of a high-performance mass storage system called the Scalable Disk Array (SDA). In this system, disk controllers are connected directly to the CM-5's Data Router. This allows all processors to have equal access to all disks in the system, providing the image of a scalable shared-disk environment. The file system implements a UNIX file system on top of this hardware, such that file systems may be striped across up to 256 disks [13]. The result is a file system which can sustain transfer rates exceeding 150 MB/second in large configurations.

The basic approach taken in this implementation is to begin by partitioning the document set among the processors. This is done by having each processor read a fixed-size contiguous chunk (1 MB) of data from the input file. In general, this will result in some documents spanning processor boundaries, so document fragments are then re-distributed. Once this is done, each processor parses and lexes its own documents, using conventional programming techniques. The postings are then inserted into the inverted file.

One novel implementation trick is used in this phase of processing: rather than sorting the postings, which would be very time consuming owing to the size of the uncompressed postings, the database is indexed in two passes. On the first pass, called the "dry run", the postings are generated, counted, and discarded. At the end of this pass, the system knows how much space is required to store the posting list for each word. Space is then allocated and carved up. The second, "production" phase then begins. The database is scanned, lexed, and indexed again from scratch, but this time the postings are compressed and stored into the space allocated at the end of the dry run. This strategy doubles indexing time but, by eliminating the expense of sorting the posting file, it ends up both simplifying the software and reducing overall database construction time.

At the end of this phase, the data structures noted above (posting file, lexicon/index, and document information) have been constructed in memory, and the database is ready for querying. Using these methods on a 64-processor CM-5, the TREC-2 training database (2.2 gigabytes) required 20 minutes to index. The speed of the indexing software permits the database to be re-indexed for every retrieval experiment, allowing both indexing methods and query methods to be conveniently explored. Ignoring stop words, the size of the compressed inverted file index for the TREC database is 24% of the raw text. Details of the compression algorithm can be found in [11].
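The single-processor Python sketch below illustrates the two-pass "dry run"/"production" strategy just described. In the test-bed, each CM-5 processor runs the equivalent over its own partition of the documents and stores compressed postings; the tokenize argument here stands in for the parser/lexer, so the names and details are assumptions made for illustration.

```python
# Minimal sketch of two-pass indexing: count postings per word, carve up
# exactly the space needed, then re-scan and drop postings into place,
# avoiding any sort of the full posting file.

def build_index(documents, tokenize):
    # Pass 1 ("dry run"): generate postings only to count them per word.
    counts = {}
    for posting in generate_postings(documents, tokenize):
        word = posting[0]
        counts[word] = counts.get(word, 0) + 1

    # Allocate the posting file and carve it up, one slice per word.
    offsets, next_free = {}, 0
    for word, n in counts.items():
        offsets[word] = next_free
        next_free += n
    posting_file = [None] * next_free

    # Pass 2 ("production"): re-scan and store postings into the
    # pre-allocated slots, in scan (ascending document) order.
    cursor = dict(offsets)
    for posting in generate_postings(documents, tokenize):
        word = posting[0]
        posting_file[cursor[word]] = posting[1:]   # (doc, para, sent, word_pos)
        cursor[word] += 1
    return posting_file, offsets, counts

def generate_postings(documents, tokenize):
    """Yield (word, doc, para, sent, word_pos) for every token."""
    for doc, text in enumerate(documents):
        for para, sent, word_pos, word in tokenize(text):
            yield (word, doc, para, sent, word_pos)
```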
The first step in query evaluation is to broadcast the query to all processors. In a boolean system, the query would then be processed locally by each processor, and the results produced by the processors concatenated to return the final answer. In an information retrieval system based on term weighting and document ranking, slightly more work is required.

First, query-term weights are generally based on some sort of term-frequency observations. These cannot be computed locally, but require the use of some simple global operations. For example, to determine the number of documents in which a word occurs, each processor would count document occurrences for itself, then the global-sum operation (supported in hardware by the Control Network) would be used to produce a machine-wide count. The second problem arises after the documents have been scored: an algorithm is required to extract the highest-ranking documents in the collection. Several parallel algorithms for this task have been described in [14]. The test-bed uses a variant of the iterative extraction algorithm: each processor locally determines its highest-ranking documents, then repeated applications of the global-maximum operation are used to find the best documents in the collection.

III. THE COSINE SIMILARITY MEASURE

Results of using the classical cosine similarity measure on the TREC collection have already been reported elsewhere [15], so those results have not been replicated. This section briefly describes how the cosine measure may be implemented within our architecture.

Cosine similarity measures generally involve constructing an inverted file incorporating document-term weights. The structured inverted file architecture does not provide this information, partly because it is inconsistent with the structured representation, and partly because of the difficulties noted above with regard to distributed databases. Except for the cosine measure's document normalization factor, we find that it is possible to implement document-term weighting based only on information in the structured posting file. In general, term weights (both in queries and documents) may be computed based on the following factors [5], all of which are available at run-time in our architecture:

* Term Frequency. This is the number of times a term occurs in the database. This information is retained in the lexicon.

* Document-Term Frequency. This is the number of documents in which a term occurs. It may be computed at run-time by traversing the posting list for the term, counting the first occurrence of the term in each document (i.e. distinct document coordinates).

* In-Document Frequency. This is the number of times a term occurs in a given document. It may be computed at run-time by traversing the posting list for the term, counting the number of occurrences of the term having the same document coordinate.

* Maximum In-Document Frequency. This is the maximum number of times a given term occurs in any document. It may be computed directly from the in-document frequencies.

Most conventional weighting schemes (e.g. tf.idf) may be computed from these run-time factors. The advantage of doing these computations at run-time is that it eliminates the need to incorporate collection-specific information into the database at indexing time. This is important for dynamic collections as well as for distributed databases, as described above.
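As an illustration of how such weights can be derived at query time from the structured posting file alone, the sketch below computes document-term frequency and in-document frequency from a decompressed posting list and combines them with one common tf.idf formula. The exact weighting formula used in the experiments is not specified here, so treat the arithmetic as an assumption.

```python
# Sketch of run-time weighting from a structured posting list.
# `postings` is the decompressed, document-sorted list of
# (doc, para, sent, word_pos) tuples for one term; `num_docs` is the
# number of documents in the collection.

import math

def document_term_frequency(postings):
    """Number of documents containing the term: count distinct
    document coordinates in the posting list."""
    return len({doc for doc, _, _, _ in postings})

def in_document_frequencies(postings):
    """Number of occurrences of the term in each document."""
    freqs = {}
    for doc, _, _, _ in postings:
        freqs[doc] = freqs.get(doc, 0) + 1
    return freqs

def tf_idf_scores(postings, num_docs):
    """Per-document tf.idf contribution of one query term (one common
    variant; the weighting used in the official runs may differ)."""
    df = document_term_frequency(postings)
    idf = math.log(num_docs / df) if df else 0.0
    return {doc: tf * idf
            for doc, tf in in_document_frequencies(postings).items()}
```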
The difficulty with these methods, when applied to the cosine norm, is that the total-document-weight term cannot be conveniently computed at run-time using an inverted file. It must, therefore, be computed at index time and stored separately (e.g. in the document-specific information data structure). There are two solutions to this difficulty. The first is to insist on using the cosine norm, accepting the difficulties that this implies. The second is to look for alternatives to the cosine norm. We believe that the latter is the more promising approach, but this is in the realm of work not yet completed. In any event, the value of the cosine norm for large structured documents has not been established at this time.

IV. RETRIEVAL EXPERIMENTS

Most of the time (about 3 to 4 person-months) for our TREC participation was spent on building the test-bed. However, we did finish some runs with automatically constructed routing and adhoc queries, for which we report the results here. The experiments were done using the entire collection. We compare words and phrases, mixed case vs. lower case, and also explore document length normalization and proximity measures.

A. Retrieval for Routing Topics

All queries were constructed and optimized automatically. The terms consisted of words (ignoring stop words) as well as all phrases consisting of adjacent words. Numbers were ignored and case was preserved. No stemming or query expansion with thesauri etc. was attempted.

Routing queries were constructed by looking at each word and (adjacent) phrase from the whole text of the topic templates and determining a weight based on the number of relevant documents present in the first 100 documents retrieved using just that term. An initial weighted query was constructed for each topic by the above process. Then each topic query was optimized by choosing thresholds (per topic) for the weights and rejecting all weighted query terms below the threshold. The optimum threshold for each topic was chosen by straightforward incremental search. Table 1 shows the results for the routing experiments.

Table 1: Routing Queries

Method                                            Precision at 100 docs   Average precision
tmc6-routing-words-phrases                        .3396                   .2553
tmc7-routing-words                                .2920                   .2045
tmc6-routing-words-phrases-ip                     .3750                   .2716
tmc6-routing-words-phrases-doc-length-sent-prox   .3782                   .2792
tmc6-routing-words-phrases-ip-sent-prox           .3856                   .3344

Routing queries with both weighted words and phrases (query id tmc6) did better than queries using just words (query id tmc7). Using the same (official) queries, but adding sentence-level proximity (sent-prox), document-length scaling (doc-length), and inverse weights based on which paragraph the term appears in (ip), seems to improve results (see the next section for more details about these techniques).

B. Retrieval for Adhoc Topics

Adhoc queries were automatically constructed by using words and phrases from different sections of the topic templates and using tf.idf weights (as derived from the training collection). The "best" sections for the new topics were chosen by experimenting with the training topics. Queries derived from the description and concept sections were used for most of the experiments. A threshold for weights was used to select terms for the final queries.
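The sketch below illustrates this kind of per-topic term selection: terms below a weight threshold are discarded, and the threshold is chosen by a simple incremental search over a training-set evaluation function. The step size, the evaluate callback, and the function names are assumptions; the paper states only that a straightforward incremental search was used.

```python
# Illustrative sketch of threshold-based query term selection with the
# threshold chosen by incremental search. evaluate(query) stands in for
# running the query against the training data and returning the figure
# being optimized (e.g. precision at 100 documents).

def prune_by_threshold(weighted_terms, threshold):
    """Keep only terms whose weight meets the threshold."""
    return {t: w for t, w in weighted_terms.items() if w >= threshold}

def optimize_threshold(weighted_terms, evaluate, step=0.05):
    """Sweep candidate thresholds from 0 up to the largest term weight
    and keep the one that scores best on the training data."""
    best_threshold, best_score = 0.0, evaluate(weighted_terms)
    threshold = step
    max_weight = max(weighted_terms.values(), default=0.0)
    while threshold <= max_weight:
        score = evaluate(prune_by_threshold(weighted_terms, threshold))
        if score > best_score:
            best_threshold, best_score = threshold, score
        threshold += step
    return best_threshold
```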
Table 2 shows the results for the adhoc queries.

Table 2: Adhoc Queries

Method                                              Precision at 100 docs   Average precision
Case:
tmc8-adhoc-dcwp-idf-caps-wt                         .2002                   .1276
tmc8-adhoc-dcwp-idf-lower-wt                        .1734                   .1157
Document-length and inverse-para-scaling:
tmc8-adhoc-dcwp-idf-lower-doc-length-wt             .3308                   .1904
tmc8-adhoc-dcwp-idf-caps-ip-wt                      .3432                   .1939
tmc8-adhoc-dcwp-idf-lower-ip-wt                     .3422                   .2027
tmc9-adhoc-etwp-idf-caps-ip-wt                      .3144                   .1736
Stemming:
tmc8-adhoc-dcwp-idf-lower-stem-wt                   .1670                   .1152
tmc8-adhoc-dcwp-idf-lower-stem-ip-wt                .3240                   .1980
Proximity:
tmc8-adhoc-dcwp-idf-caps-doc-length-sent-prox-wt    .3436                   .2012
tmc8-adhoc-dcwp-idf-lower-doc-length-sent-prox-wt   .3518                   .2146
tmc8-adhoc-dcwp-idf-caps-ip-para-prox-wt            .2892                   .1681
tmc8-adhoc-dcwp-idf-lower-ip-para-prox-wt           .3006                   .1772
tmc8-adhoc-dcwp-idf-caps-ip-sent-prox-wt            .3476                   .1988
tmc8-adhoc-dcwp-idf-lower-ip-sent-prox-wt           .3602                   .2164

The query tmc8 (dcwp) consisted of words and phrases from the description and concept sections of the topic templates. Query tmc9 (etwp) used words and adjacent phrases from the entire topic. The acronyms in the run names denote particular experimental conditions: case (caps and lower), sentence- and paragraph-level proximity (sent-prox and para-prox), document-length scaling (doc-length), inverse weights based on paragraph position (ip), and weight thresholds (wt). The queries were not changed for the different experiments.

Idf-weighted terms from the description and concept sections taken together (query tmc8) seem to do better than those derived from the entire topic (query tmc9).

C. Case

For the adhoc queries, we compared indexing with and without preserving case (with similar treatment for the queries). Except for the simplest experiment with weight thresholds, converting everything to lower case seems to yield comparable or better results than preserving case. A similar experiment for routing queries was not attempted because it would have required reformulating and re-optimizing the routing queries.

D. Stemming

Using the Porter algorithm software for stemming from [16], we experimented with stemming at index time (and stemming the queries). We found that stemming reduces performance when compared with similar experiments using lower case (the comparison is with the lower-case runs because the stemming software we had converted text to lower case). We are not sure yet why there is such a decrease.

E. Document Length and Term Position

Document-length scaling was used to explore the effect of emphasizing shorter documents. A linearly decreasing scaling for longer documents, with a tail, was used. An inverse weight based on the paragraph in which the term appears was also explored. Both the document-length scaling and the inverse paragraph scaling increase performance significantly and seem to be comparable to each other.

F. Proximity

The postings in the inverted file allow use of term position. Experiments are underway to define proximity scoring methods that enhance weights for terms appearing close together (clusters of terms) and can also be implemented efficiently within the current architecture. We have achieved good results with sentence-level proximity measures based on a bonus score for query terms that appear within the same sentence and within a certain distance of each other. The bonus is also proportional to the term weight itself. Experiments that used a bonus independent of term weight dramatically reduced performance (numbers not reported here), possibly due to noise introduced by clusters of unimportant terms. Similar experiments with paragraph-level proximity yielded significantly poorer results as compared to sentence-level proximity. Finally, combining either document-length or inverse-paragraph scaling with sentence-level proximity improved results.
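The exact proximity scoring function used in these runs is not given in the text; the sketch below shows one way a sentence-level bonus of the kind described could be computed, awarding query-term pairs that fall in the same sentence within a fixed word distance a bonus proportional to their term weights. The distance limit and bonus factor are placeholder assumptions.

```python
# Sketch of a sentence-level proximity bonus: query terms occurring in
# the same sentence and within max_distance words of each other earn a
# bonus proportional to their term weights.

def sentence_proximity_bonus(doc_postings, term_weights,
                             max_distance=5, bonus_factor=0.5):
    """doc_postings: {term: [(para, sent, word_pos), ...]} for one document.
    term_weights:  {term: weight} for the query terms.
    Returns the total proximity bonus for the document."""
    bonus = 0.0
    terms = list(doc_postings)
    for i, t1 in enumerate(terms):
        for t2 in terms[i + 1:]:
            for para1, sent1, pos1 in doc_postings[t1]:
                for para2, sent2, pos2 in doc_postings[t2]:
                    same_sentence = (para1, sent1) == (para2, sent2)
                    if same_sentence and abs(pos1 - pos2) <= max_distance:
                        # bonus proportional to the term weights themselves
                        bonus += bonus_factor * (term_weights[t1]
                                                 + term_weights[t2])
    return bonus
```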
V. CONCLUSIONS

We have successfully implemented an in-memory compressed inverted file text retrieval system on the CM-5 Connection Machine system. Queries that used both words and phrases composed of adjacent words did better than those that used words alone. While our experiments completed so far suggest that converting everything to lower case for adhoc queries is somewhat better, it is not clear whether the minor differences could not be removed by further optimization of other parameters. Experiments that take document length and term position into account suggest that normalizing for document length and increasing the weight of terms appearing earlier in documents lead to significant improvements for both routing and adhoc queries. We have also seen that proximity scores based on nearby terms within a sentence improve retrieval performance.

VI. FUTURE WORK

We would like to compare the effects of cosine normalization with what we have tried so far, and also to explore its interaction with techniques that use term position and proximity measures.

VII. REFERENCES

[1] Cleverdon, C. W., Mills, J. & Keen, E. M. (1966). Factors Determining the Performance of Indexing Systems, Vol. 1: Design, Vol. 2: Test Results. Aslib Cranfield Research Project, Cranfield, England, 1966.

[2] Sparck Jones, K. & Webster, C. A. (1979). Research in Relevance Weighting. British Library Research and Development Report 5553, Computer Laboratory, University of Cambridge.

[3] Fox, E. (1983). Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts. Technical Report TR 83-561, Cornell University: Computing Science Department.

[4] Salton, G. (1989). Automatic Text Processing. New York: Addison-Wesley.

[5] Salton, G. & Buckley, C. (1988). Term-Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24 (5), pp. 513-523.

[6] Harman, D. (1993). The DARPA TIPSTER Project. SIGIR Forum, 26 (2), pp. 26-28.

[7] Harman, D. (1993). Overview of the First TREC Conference. Proceedings, SIGIR-93, pp. 36-47. Pittsburgh.

[8] Hollaar, L. A. (1992). Special-Purpose Hardware for Information Retrieval. In Information Retrieval: Data Structures and Algorithms. Frakes, W. B. & Baeza-Yates, R., eds. Englewood Cliffs, New Jersey: Prentice Hall.

[9] Faloutsos, C. (1992). Signature Files. In Information Retrieval: Data Structures and Algorithms. Frakes, W. B. & Baeza-Yates, R., eds. Englewood Cliffs, New Jersey: Prentice Hall.

[10] Harman, D., Fox, E., Baeza-Yates, R. & Lee, W. (1992). Inverted Files. In Information Retrieval: Data Structures and Algorithms. Frakes, W. B. & Baeza-Yates, R., eds. Englewood Cliffs, New Jersey: Prentice Hall.

[11] Linoff, G. & Stanfill, C. (1993). Compression of Indexes with Full Positional Information in Very Large Text Databases. Proceedings, SIGIR-93, pp. 88-95. Pittsburgh.

[12] Thinking Machines Corporation (1991). Connection Machine CM-5 Technical Summary. Cambridge, MA: Thinking Machines Corporation.

[13] LoVerso, S., Isman, M., Nanopoulos, A., Nesheim, W., Milne, E. & Wheeler, R. (1993). sfs: A Parallel File System for the CM-5. Proceedings, 1993 Summer USENIX Technical Conference, pp. 291-306. Cincinnati, OH.

[14] Stanfill, C. (1992). Parallel Information Retrieval Algorithms. In Information Retrieval: Data Structures and Algorithms. Frakes, W. B. & Baeza-Yates, R., eds. Englewood Cliffs, New Jersey: Prentice Hall.

[15] Salton, G. & Buckley, C. (1992). SMART Trade-offs. Proceedings, TREC-1. Gaithersburg, MD.

[16] Frakes, W. B. (1992). Stemming Algorithms. In Information Retrieval: Data Structures and Algorithms. Frakes, W. B. & Baeza-Yates, R., eds. Englewood Cliffs, New Jersey: Prentice Hall.