A Single Language Evaluation of a Multi-lingual Text Retrieval System

Ted Dunning
Mark Davis
Computing Research Laboratory
New Mexico State University
Las Cruces, NM 88003

1. Introduction

The goal of the participation by the Computing Research Laboratory at New Mexico State University in the Text Retrieval Evaluation Conference was to evaluate our multi-lingual text retrieval system in a mono-lingual setting and in the context of a large set of documents. This system is currently being developed for use in a multi-lingual setting, but we felt that there were novel aspects to the system which should prove useful in retrieval from a large text base. In particular, the system is able to find and make use of phrases in queries which aid in retrieval. Marking which phrases are significant was done using a likelihood ratio test which is more reliable than previously used measures.

2. System Structure

Our Multi-Lingual Text-Retrieval system consists of a fairly conventional design based around an inverted index structure. The positions of individual words are recorded in terms of which document they appeared in and where in the document they appeared, both in bytes and in words. In order to maintain language independence as well as to improve modularity, tokenization during indexing can be done as a separate process. This modularity allows most of the system to remain unchanged when changing languages from English to Japanese, or when changing the indexing structure to support fast access to, say, a traditional dictionary of word definitions where only the various spelling forms of the head word might be indexed. All postings in the final database are constant-length structures in order to facilitate and simplify vector-oriented operations. Document numbers and locations are uniformly stored as 4-byte integers in order to avoid, for all practical purposes, any limitations on the number or size of documents and tokens.
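The constant-length posting convention described above can be illustrated with a minimal sketch. The three-field layout (document number, byte offset, word offset) and the function names below are our illustration, not the CRL data structures themselves.

```python
import struct

# Hypothetical posting layout: three unsigned 4-byte integers per posting,
# (doc_no, byte_offset, word_offset), mirroring the paper's constant-length,
# 4-byte-integer convention.
POSTING = struct.Struct("<III")

def pack_postings(postings):
    """Pack (doc_no, byte_offset, word_offset) tuples into one flat buffer."""
    return b"".join(POSTING.pack(*p) for p in postings)

def unpack_postings(buf):
    """Recover posting tuples; the fixed record size makes random access
    to the i-th posting a trivial offset computation."""
    return [POSTING.unpack_from(buf, i * POSTING.size)
            for i in range(len(buf) // POSTING.size)]

buf = pack_postings([(0, 120, 17), (3, 44, 5)])
print(unpack_postings(buf))  # [(0, 120, 17), (3, 44, 5)]
```

Because every record is exactly 12 bytes, whole posting vectors can be sliced and combined with simple arithmetic, which is what makes the vector-oriented operations cheap.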
Since full positional information is kept, it is relatively fast and simple to search for phrases, or to find instances of words appearing near each other. We plan in the near future to alter one of the position indices to store sentence position instead of word position relative to the beginning of a file. In other CRL systems, the ability to determine whether words appear in the same sentence has proved very useful.

3. Indexing Operations

Indexing for the Multi-Lingual Text-Retrieval system consists of a 4-pass process. The passes are tokenization, relabelling, sorting and stripping.

3.1. Tokenization

Tokenization in English consists of putting each word in the input text on a single line along with positional and document information. Kill list processing is done at this stage, as well as creation of a word and document table. Lemmatization and suffix stripping can be done at this stage, but no savings result, since this merely increases the average length of posting vectors during retrieval. Further, early lemmatization or suffix stripping makes it impossible to later avoid lemmatization or suffix stripping. In Japanese, tokenization down to the level of strings of consecutive kanji or katakana is supported, as well as tokenization down to individual kanji and strings of katakana. Since strings of hiragana are typically used for inflection, they are ignored. This results in some words being lost, since there are words in Japanese (mostly verbs) which are written in hiragana. Using kanji strings results in indexing a large number of phrases, while indexing each kanji ideograph separately results in an increasing number of phrase searches later in the retrieval process, with an accompanying drop in performance. Preprocessing to mark and normalize company, place and person names and dates is best included in the tokenization process.
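The English tokenization pass can be sketched as follows. The kill list, the token pattern, and the record shape are illustrative assumptions, not the CRL implementation; the point is the per-token output of word, document and position information with kill-list filtering applied.

```python
import re

KILL_LIST = {"the", "of", "and", "a"}  # illustrative kill list, not CRL's

def tokenize(doc_no, text):
    """Emit one record per surviving token: (word, doc_no, byte_offset,
    word_offset), the per-line information the tokenization pass produces.
    Kill-list words still advance the word counter, so the positions of
    the remaining words are unaffected by the filtering."""
    records = []
    for word_offset, m in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = m.group().lower()
        if word in KILL_LIST:
            continue
        records.append((word, doc_no, m.start(), word_offset))
    return records

for rec in tokenize(7, "The index of the text"):
    print(rec)  # ('index', 7, 4, 1) then ('text', 7, 17, 4)
```

Swapping this process for a Japanese tokenizer (emitting kanji strings or individual kanji instead of space-delimited words) leaves every later pass unchanged, which is the modularity argument made above.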
We plan to incorporate elements of CRL's Tipster Extraction software into this program at some time to investigate the impact this has on performance, but we have not done so at the present time. Indices for fielded data can be created by using a conventionalized word location such as 0. This approach allows all such data to be handled in a uniform manner.

3.2. Relabelling

The output of the tokenization process is still ASCII or some form of extended ASCII such as Unicode or JIS. The next step is the conversion to a uniform binary form which assists in subsequent sorting. This step is also where word and document tables are created which allow bi-directional conversion between strings and word numbers, or file references and document numbers. The output of the relabelling consists of three files:

1) the string table, in which all strings (including file names) are kept;
2) the document table, in which the file names, starting offsets and lengths of all documents are kept;
3) a vector of sortable posting structures.

In order to improve performance in cases where a large amount of data must be indexed, the first two passes are often integrated. This has several benefits, which primarily include a radical decrease in the number of times the word and document tables must be reread, and the substitution of function call/return for Unix pipes in the transfer of data between the tokenization and relabelling passes.

3.3. Sorting

Next, these sortable postings are sorted using a version of MSB radix sort combined with a heap merge pass for very large files. Our radix sort uses a radix of 65536 and is able to sort at a rate of nearly 1 MB per second. Total time for the radix sort is linear in the number of elements to be sorted, up to the limit of main memory. Merging then takes Θ(n log m) time, where m is the number of internal sorts needed.
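The radix-65536 sort on 32-bit keys can be sketched in a few lines. For brevity this sketch uses the LSD (least-significant-digit-first) variant with two counting-sort passes rather than the MSB organization the system actually uses; both have the same radix and the same linear-time behavior up to memory limits.

```python
def radix_sort_u32(keys, radix_bits=16):
    """Sort unsigned 32-bit keys with two stable counting-sort passes of
    radix 2**16. LSD variant shown for brevity; the result is identical
    to the MSB sort described in the paper."""
    radix = 1 << radix_bits
    mask = radix - 1
    for shift in (0, radix_bits):      # low 16 bits first, then high 16
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        pos, total = [0] * radix, 0
        for d in range(radix):         # prefix sums: each bucket's start
            pos[d] = total
            total += counts[d]
        out = [0] * len(keys)
        for k in keys:                 # stable scatter into buckets
            d = (k >> shift) & mask
            out[pos[d]] = k
            pos[d] += 1
        keys = out
    return keys

print(radix_sort_u32([70000, 3, 65536, 2]))  # [2, 3, 65536, 70000]
```

The stability of each counting pass is what lets the second (high-order) pass preserve the ordering established by the first, and stability is equally what allows sorted runs from separate internal sorts to be combined later by a heap merge.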
In practice, sorting is so fast that tokenization and relabelling are the bottleneck, even when these first two passes are integrated. The major disadvantage of this technology is that a considerable amount of temporary disk space is required; ultimately, at least 100% overhead in temporary disk space is needed. Furthermore, this overhead is needed even if all that is being done is to update the indices. We believe that paying a time penalty to avoid this overhead is probably very much warranted if a system is to be used on large files.

3.4. Stripping

The output of the sort phase consists of a very large vector of postings. In this vector, there are long runs in which the word number is constant. The stripping process consists of creating an index so that the boundaries of these long runs can be found, and deleting the word number from the postings. This change in structure results in about a 25-50% decrease in posting file size (depending on how much positional information is kept), and the index allows very fast access to the vector of postings for any particular word.

4. Primitive Document Access

The posting vector for any particular word can be retrieved from the database with a single procedure call. The posting vectors obtained in this way can be combined using a set of functions which perform the customary boolean operations, as well as functions which find adjacent words or determine when specified words appear within a specified neighborhood of each other. In addition, the basic query routines include a procedure which, given a document number from within a posting, will read the contents of that document into memory and return the resulting string. On top of this primitive layer, another level is built which supports scoring of documents based on a query. Given a vector of strings which represent query terms, procedures in this level return a sorted score vector whose contents contain documents and scores.
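The stripping pass and the resulting fast per-word access can be sketched as follows. The posting shape (word_no, doc_no, position) and the function names are our illustration; the mechanism (a run-boundary index plus word-number deletion) is the one described above.

```python
def strip(sorted_postings):
    """Given postings sorted by word number, build (a) a run index mapping
    word number -> [start, end) into the vector and (b) the postings with
    the word number deleted, as in the stripping pass."""
    runs = {}
    stripped = []
    for i, (word_no, doc_no, pos) in enumerate(sorted_postings):
        if word_no not in runs:
            runs[word_no] = [i, i + 1]
        else:
            runs[word_no][1] = i + 1   # extend the run for this word
        stripped.append((doc_no, pos))
    return runs, stripped

def postings_for(word_no, runs, stripped):
    """Retrieve one word's posting vector: a single index lookup plus a
    slice, which is what makes per-word access fast."""
    start, end = runs.get(word_no, (0, 0))
    return stripped[start:end]

runs, stripped = strip([(1, 0, 5), (1, 2, 9), (4, 0, 1)])
print(postings_for(1, runs, stripped))  # [(0, 5), (2, 9)]
```

Dropping the repeated word number from each record is where the 25-50% size reduction comes from: with less positional information per posting, the deleted field is a larger fraction of the record.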
All scoring is done based on inverse document frequency weighting. We plan to convert to a system based on Bayesian decision rules if time permits under the original contract.

5. Query Formulation

In the CRL TREC effort, conversion from topics to queries was done entirely automatically. The retrieval topics were reduced to lists of one, two, three and four word phrases. Each such phrase was included in the query if it met a statistical test based on generalized likelihood ratios. On average, this resulted in about 80 terms being retained for the retrieval. Had time permitted, we would have included a manual relevance feedback operation in the process of query formulation. We expect that this would have greatly improved the utility of the multi-word phrases.

6. Document Scoring

Retrieval was accomplished by using the or_posting procedure to compute a posting vector which contained references to all documents which potentially had a non-zero score. These documents were then scored using a conventional inverse document frequency weighting scheme, and the results were sorted. Only the first 200 documents in the sorted document vector were printed. In the submitted results, the fact that there was a very significant difference in average document length was partially compensated for by accumulating scores in quadrature rather than simply summing them. There are much better methods available to normalize for document length in a principled way.

7. Overall Results

Analysis of the results as returned by the automatic TREC scoring program shows that the CRL entry performed much better than might be expected, given the severe problems encountered in actually running the tests. Even though the system was not able to complete a clean index of the AP textbase in time, even though the merging process was severely flawed, and even though no relevance feedback was employed, the system was able to perform at a creditable level.
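The generalized likelihood ratio test used for phrase selection in section 5 can be sketched, for the two-word case, as the standard G-squared statistic on a 2x2 contingency table. This is a reconstruction of the standard form of the test, not the CRL code; the count conventions and function name are ours.

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.
    For a candidate bigram: k11 = contexts containing both words,
    k12/k21 = contexts containing only one of the two, k22 = neither.
    Large values mean the pair co-occurs far more than chance predicts,
    so the phrase is worth keeping as a query term."""
    def h(*counts):                      # sum of k*ln(k) over nonzero counts
        return sum(k * math.log(k) for k in counts if k > 0)

    n = k11 + k12 + k21 + k22
    return 2 * (h(k11, k12, k21, k22) + h(n)
                - h(k11 + k12, k21 + k22)      # row totals
                - h(k11 + k21, k12 + k22))     # column totals

# Independent counts score (essentially) zero; a strongly associated
# pair comfortably exceeds the 0.1% chi-square cutoff of 10.83.
print(llr_2x2(10, 10, 10, 10))
print(llr_2x2(110, 2, 2, 1000) > 10.83)  # True
```

Unlike chi-square or mutual-information scores, this statistic remains well behaved for the rare events that dominate text, which is the sense in which the test is more reliable than previously used measures.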
Many of the last-minute problems can be backed out of the results by examining the results for only the WSJ and Ziff textbases. The other primary databases are less interesting because:

1) The FR texts were significantly longer and thus caused severe problems in merging results.
2) The index of the AP textbase was not completed by the revised program in time.
3) Excluding the DOE textbase had little effect on the results (only three queries among the first 50 had relevant documents in the DOE textbase).

When these exclusions are made and a composite recall-precision plot is drawn, it can be seen that, on average, our system provides a maximum precision of about 60%, a minimum precision of about 40% and a maximum recall of about 40%. Unfortunately, there is such wild variability that, in a strong sense, no single average is a terribly valid picture of the system's performance.