A Single Language Evaluation of a Multi-lingual Text Retrieval System

Ted Dunning
Mark Davis
Computing Research Laboratory
New Mexico State University
Las Cruces, NM 88003

1. Introduction

The goal of the participation by the Computing Research Laboratory at New Mexico State University in the Text Retrieval Evaluation Conference was to evaluate our multi-lingual text retrieval system in a mono-lingual setting and in the context of a large set of documents. This system is currently being developed for use in a multi-lingual setting, but we felt that there were novel aspects to the system which should prove useful in retrieval from a large text base. In particular, the system is able to find and make use of phrases in queries which aid in retrieval. Marking which phrases are significant was done using a likelihood ratio test which is more reliable than previously used measures.

2. System Structure

Our Multi-Lingual Text-Retrieval system consists of a fairly conventional design based around an inverted index structure. The positions of individual words are recorded in terms of which document they appeared in and where in the document they appeared, both in bytes and in words. In order to maintain language independence as well as to improve modularity, tokenization during indexing can be done as a separate process. This modularity allows most of the system to remain unchanged when changing languages from English to Japanese, or when changing the indexing structure to support fast access to, say, a traditional dictionary of word definitions where only the various spelling forms of the head word might be indexed. All postings in the final database are constant-length structures in order to facilitate and simplify vector-oriented operations. Document numbers and locations are uniformly stored as 4-byte integers in order to avoid, for all practical purposes, any limitations on the number or size of documents and tokens.
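The constant-length posting convention described above can be illustrated with a minimal sketch. The three-field layout (document number, byte offset, word offset) and the function names below are our illustration, not the CRL data structures themselves.

```python
import struct

# Hypothetical posting layout: three unsigned 4-byte integers per posting,
# (doc_no, byte_offset, word_offset), mirroring the paper's constant-length,
# 4-byte-integer convention.
POSTING = struct.Struct("<III")

def pack_postings(postings):
    """Pack (doc_no, byte_offset, word_offset) tuples into one flat buffer."""
    return b"".join(POSTING.pack(*p) for p in postings)

def unpack_postings(buf):
    """Recover posting tuples; the fixed record size makes random access
    to the i-th posting a trivial offset computation."""
    return [POSTING.unpack_from(buf, i * POSTING.size)
            for i in range(len(buf) // POSTING.size)]

buf = pack_postings([(0, 120, 17), (3, 44, 5)])
print(unpack_postings(buf))  # [(0, 120, 17), (3, 44, 5)]
```

Because every record is exactly 12 bytes, whole posting vectors can be sliced and combined with simple arithmetic, which is what makes the vector-oriented operations cheap.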
Since full positional information is kept, it is relatively fast and simple to search for phrases, or to find instances of words appearing near each other. We plan in the near future to alter one of the position indices to store sentence position instead of word position relative to the beginning of a file. In other CRL systems, the ability to determine whether words appear in the same sentence has proved very useful.

3. Indexing Operations

Indexing for the Multi-Lingual Text-Retrieval system consists of a 4-pass process. The passes are tokenization, relabelling, sorting and stripping.

3.1. Tokenization

Tokenization in English consists of putting each word in the input text on a single line along with positional and document information. Kill list processing is done at this stage, as well as creation of a word and document table. Lemmatization and suffix stripping can be done at this stage, but no savings result, since this merely increases the average length of posting vectors during retrieval. Further, early lemmatization or suffix stripping makes it impossible to later avoid lemmatization or suffix stripping. In Japanese, tokenization down to the level of strings of consecutive kanji or katakana is supported, as well as tokenization down to individual kanji and strings of katakana. Since strings of hiragana are typically used for inflection, they are ignored. This results in some words being lost, since there are words in Japanese (mostly verbs) which are written in hiragana. Using kanji strings results in indexing a large number of phrases, while indexing each kanji ideograph separately results in an increasing number of phrase searches later in the retrieval process, with an accompanying drop in performance. Preprocessing to mark and normalize company, place and person names and dates is best included in the tokenization process.
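The English tokenization pass can be sketched as follows. The kill list, the token pattern, and the record shape are illustrative assumptions, not the CRL implementation; the point is the per-token output of word, document and position information with kill-list filtering applied.

```python
import re

KILL_LIST = {"the", "of", "and", "a"}  # illustrative kill list, not CRL's

def tokenize(doc_no, text):
    """Emit one record per surviving token: (word, doc_no, byte_offset,
    word_offset), the per-line information the tokenization pass produces.
    Kill-list words still advance the word counter, so the positions of
    the remaining words are unaffected by the filtering."""
    records = []
    for word_offset, m in enumerate(re.finditer(r"[A-Za-z]+", text)):
        word = m.group().lower()
        if word in KILL_LIST:
            continue
        records.append((word, doc_no, m.start(), word_offset))
    return records

for rec in tokenize(7, "The index of the text"):
    print(rec)  # ('index', 7, 4, 1) then ('text', 7, 17, 4)
```

Swapping this process for a Japanese tokenizer (emitting kanji strings or individual kanji instead of space-delimited words) leaves every later pass unchanged, which is the modularity argument made above.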
We plan to incorporate elements of CRL's Tipster Extraction software into this program at some time to investigate the impact this has on performance, but we have not done so at the present time. Indices for fielded data can be created by using a conventionalized word location such as 0. This approach allows all such data to be handled in a uniform manner.

3.2. Relabelling

The output of the tokenization process is still ASCII or some form of extended ASCII such as Unicode or JIS. The next step is the conversion to a uniform binary form which assists in subsequent sorting. This step is also where word and document tables are created which allow bi-directional conversion between strings and word numbers, or file references and document numbers. The output of the relabelling consists of three files:

1) the string table, in which all strings (including file names) are kept;
2) the document table, in which the file names, starting offsets and lengths of all documents are kept;
3) a vector of sortable posting structures.

In order to improve performance in cases where a large amount of data must be indexed, the first two passes are often integrated. This has several benefits, which primarily include a radical decrease in the number of times the word and document tables must be reread, and the substitution of function call/return for Unix pipes in the transfer of data between the tokenization and relabelling passes.

3.3. Sorting

Next, these sortable postings are sorted using a version of MSB radix sort combined with a heap merge pass for very large files. Our radix sort uses a radix of 65536 and is able to sort at a rate of nearly 1 MB per second. Total time for the radix sort is linear in the number of elements to be sorted, up to the limit of main memory. Merging then takes Θ(n log m) time, where m is the number of internal sorts needed.
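The radix-65536 sort on 32-bit keys can be sketched in a few lines. For brevity this sketch uses the LSD (least-significant-digit-first) variant with two counting-sort passes rather than the MSB organization the system actually uses; both have the same radix and the same linear-time behavior up to memory limits.

```python
def radix_sort_u32(keys, radix_bits=16):
    """Sort unsigned 32-bit keys with two stable counting-sort passes of
    radix 2**16. LSD variant shown for brevity; the result is identical
    to the MSB sort described in the paper."""
    radix = 1 << radix_bits
    mask = radix - 1
    for shift in (0, radix_bits):      # low 16 bits first, then high 16
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1
        pos, total = [0] * radix, 0
        for d in range(radix):         # prefix sums: each bucket's start
            pos[d] = total
            total += counts[d]
        out = [0] * len(keys)
        for k in keys:                 # stable scatter into buckets
            d = (k >> shift) & mask
            out[pos[d]] = k
            pos[d] += 1
        keys = out
    return keys

print(radix_sort_u32([70000, 3, 65536, 2]))  # [2, 3, 65536, 70000]
```

The stability of each counting pass is what lets the second (high-order) pass preserve the ordering established by the first, and stability is equally what allows sorted runs from separate internal sorts to be combined later by a heap merge.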
In practice, sorting is so fast that tokenization and relabelling are the bottleneck, even when these first two passes are integrated. The major disadvantage of this technology is that a considerable amount of temporary disk space is required; ultimately, at least 100% overhead in temporary disk space is needed. Furthermore, this overhead is needed even if all that is being done is to update the indices. We believe that paying a time penalty to avoid this overhead is probably very much warranted if a system is to be used on large files.

3.4. Stripping

The output of the sort phase consists of a very large vector of postings. In this vector, there are long runs in which the word number is constant. The stripping process consists of creating an index so that the boundaries of these long runs can be found, and deleting the word number from the postings. This change in structure results in about a 25-50% decrease in posting file size (depending on how much positional information is kept), and the index allows very fast access to the vector of postings for any particular word.

4. Primitive Document Access

The posting vector for any particular word can be retrieved from the database with a single procedure call. The posting vectors obtained in this way can be combined using a set of functions which perform the customary boolean operations, as well as functions which find adjacent words or determine when specified words appear within a specified neighborhood of each other. In addition, the basic query routines include a procedure which, given a document number from within a posting, will read the contents of that document into memory and return the resulting string. On top of this primitive layer, another level is built which supports scoring of documents based on a query. Given a vector of strings which represent query terms, procedures in this level return a sorted score vector whose contents contain documents and scores.
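The stripping pass and the resulting fast per-word access can be sketched as follows. The posting shape (word_no, doc_no, position) and the function names are our illustration; the mechanism (a run-boundary index plus word-number deletion) is the one described above.

```python
def strip(sorted_postings):
    """Given postings sorted by word number, build (a) a run index mapping
    word number -> [start, end) into the vector and (b) the postings with
    the word number deleted, as in the stripping pass."""
    runs = {}
    stripped = []
    for i, (word_no, doc_no, pos) in enumerate(sorted_postings):
        if word_no not in runs:
            runs[word_no] = [i, i + 1]
        else:
            runs[word_no][1] = i + 1   # extend the run for this word
        stripped.append((doc_no, pos))
    return runs, stripped

def postings_for(word_no, runs, stripped):
    """Retrieve one word's posting vector: a single index lookup plus a
    slice, which is what makes per-word access fast."""
    start, end = runs.get(word_no, (0, 0))
    return stripped[start:end]

runs, stripped = strip([(1, 0, 5), (1, 2, 9), (4, 0, 1)])
print(postings_for(1, runs, stripped))  # [(0, 5), (2, 9)]
```

Dropping the repeated word number from each record is where the 25-50% size reduction comes from: with less positional information per posting, the deleted field is a larger fraction of the record.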
All scoring is done based on inverse document frequency weighting. We plan to convert to a system based on Bayesian decision rules if time permits under the original contract.

5. Query Formulation

In the CRL TREC effort, conversion from topics to queries was done entirely automatically. The retrieval topics were reduced to lists of one, two, three and four word phrases. Each such phrase was included in the query if it met a statistical test based on generalized likelihood ratios. On average, this resulted in about 80 terms being retained for the retrieval. Had time permitted, we would have included a manual relevance feedback operation in the process of query formulation. We expect that this would have greatly improved the utility of the multi-word phrases.

6. Document Scoring

Retrieval was accomplished by using the or_posting procedure to compute a posting vector which contained references to all documents which potentially had a non-zero score. These documents were then scored using a conventional inverse document frequency weighting scheme, and the results were sorted. Only the first 200 documents in the sorted document vector were printed. In the submitted results, the fact that there was a very significant difference in average document length was partially compensated for by accumulating scores in quadrature rather than simply summing them. There are much better methods available to normalize for document length in a principled way.

7. Overall Results

Analysis of the results as returned by the automatic TREC scoring program shows that the CRL entry performed much better than might be expected, given the severe problems encountered in actually running the tests. Even though the system was not able to complete a clean index of the AP textbase in time, even though the merging process was severely flawed, and even though no relevance feedback was employed, the system was able to perform at a creditable level.
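The generalized likelihood ratio test used for phrase selection in section 5 can be sketched, for the two-word case, as the standard G-squared statistic on a 2x2 contingency table. This is a reconstruction of the standard form of the test, not the CRL code; the count conventions and function name are ours.

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.
    For a candidate bigram: k11 = contexts containing both words,
    k12/k21 = contexts containing only one of the two, k22 = neither.
    Large values mean the pair co-occurs far more than chance predicts,
    so the phrase is worth keeping as a query term."""
    def h(*counts):                      # sum of k*ln(k) over nonzero counts
        return sum(k * math.log(k) for k in counts if k > 0)

    n = k11 + k12 + k21 + k22
    return 2 * (h(k11, k12, k21, k22) + h(n)
                - h(k11 + k12, k21 + k22)      # row totals
                - h(k11 + k21, k12 + k22))     # column totals

# Independent counts score (essentially) zero; a strongly associated
# pair comfortably exceeds the 0.1% chi-square cutoff of 10.83.
print(llr_2x2(10, 10, 10, 10))
print(llr_2x2(110, 2, 2, 1000) > 10.83)  # True
```

Unlike chi-square or mutual-information scores, this statistic remains well behaved for the rare events that dominate text, which is the sense in which the test is more reliable than previously used measures.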
Many of the last-minute problems can be backed out of the results by examining the results for only the WSJ and Ziff textbases. The other primary databases are less interesting because:

1) The FR texts were significantly longer and thus caused severe problems in merging results.
2) The index of the AP textbase was not completed by the revised program in time.
3) Excluding the DOE textbase had little effect on the results (only three queries among the first 50 had relevant documents in the DOE textbase).

When these exclusions are made and a composite recall-precision plot is drawn, it can be seen that, on average, our system provides a maximum precision of about 60%, a minimum precision of about 40% and a maximum recall of about 40%. Unfortunately, there is such wild variability that, in a strong sense, no single average is a terribly valid picture of the system's performance.