Multilevel Ranking in Large Text Collections Using FAIRS

S-C. Chang, H. Dediu, H. Azzam, M-W. Du
GTE Laboratories

Abstract

A description of a general-purpose multilevel ranking information retrieval prototype is presented. The methods used in weighting and ranking the retrieved documents are discussed. Experiments with the TREC92 collection of text and queries have been conducted without manual pre-processing. Initial results have shown the multilevel ranking scheme to be highly competitive in precision and recall relative to other ranking strategies.

1.0 Introduction

Information retrieval research at GTE Laboratories has led to the development of a prototype system called FAIRS (Friendly Adaptable Information Retrieval System). FAIRS has evolved into a functional and flexible system currently running on SunOS, HP-UX, Ultrix, AIX, VMS and PC platforms. It has been used in environments as diverse as literature searching, library operations, research and development, customer support, market analysis and management. FAIRS is being further tested as a retrieval engine for very large collections of text, such as those presented by the TREC92 collection and wide-area distributed collections.

FAIRS is designed to minimize user effort in the preparation of text and the learning of query syntax, while providing a user-modifiable multilevel ranking scheme (patent pending). Experiments with the TREC92 collections of text and queries have been conducted with no human intervention in the processing of either text or queries. The results of experiments against the collection of Wall Street Journal articles are listed in Section 3.4.

2.0 System Overview

FAIRS uses pure text as its information base, while allowing flexible links into non-textual information. FAIRS extracts information out of an unstructured, amorphous collection of data in four main steps:

1. First, it partitions (logically) a raw text file or a collection of such files into retrievable record units. This simple record partitioning is necessary and sufficient for indexing to begin. The goal is to use source information as-is [1].

2. Second, FAIRS automatically constructs an index. A feature exists where deletions are permitted from an index. Statistics on the collections can also be generated at this time. Such statistics are used in normalizing the weighting of retrieved documents.

3. The third step involves the user queries. Queries are accepted and interpreted in an intelligent and sensitive manner [1,2]. A flexible approach to the understanding of the query is essential to providing good responses.

4. Finally, once the query is processed, the relevant hits are retrieved quickly and displayed in an ordered list ranked according to a relevance measure. The records corresponding to the hits can then be viewed, printed or mailed in their entirety on demand.
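As a concrete (and much simplified) illustration of steps 1 and 2, the following Python sketch partitions a raw file into records and builds a small inverted index together with the per-word record counts later usable for weighting. It is not the FAIRS implementation; the marker choice, function names and index layout are assumptions made only for this example.

```python
# Illustrative sketch of steps 1 and 2 (not the FAIRS implementation):
# logically partition a raw ASCII file into records on a user-chosen
# end-of-record marker, then build a simple inverted index plus the
# collection statistic (per-word record count) usable for weighting.
from collections import defaultdict

def partition(path, end_of_record="\n\n"):
    """Split a raw text file into retrievable records, using the file as-is."""
    with open(path, encoding="ascii", errors="replace") as f:
        text = f.read()
    return [rec for rec in text.split(end_of_record) if rec.strip()]

def build_index(records):
    """Map word -> {record id -> occurrence count}; also report word popularity."""
    index = defaultdict(dict)
    for rec_id, rec in enumerate(records):
        for word in rec.lower().split():
            index[word][rec_id] = index[word].get(rec_id, 0) + 1
    # popularity = number of records containing the word
    popularity = {word: len(postings) for word, postings in index.items()}
    return index, popularity
```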
2.1 Characteristics

In FAIRS, both the responses to information requests and the way relevance is determined can be customized by the user. FAIRS can provide a tool for decision making by presenting to the user all the relevant facts in an elegant and timely fashion. There are some interesting and novel features of FAIRS and research issues associated with each of these strategies. They will be discussed in sequence.

2.1.1 Using Files As-Is

Raw textual information is broken into records (i.e., the units that will subsequently be retrieved). Consistent with the goal of using the information as-is (i.e., eliminating file conversion), FAIRS does not require the input files to be in any particular format [1]. Any ASCII file can be indexed and searched. However, if a file does have a logical field structure that the user wants to use in searching (e.g., to restrict a search to text in the NAME field of records), then this structure must be described to FAIRS.

If the information files do not have "record" or "field" structures, then an implicit method must be used to partition these files into records. To use the files as-is, FAIRS allows the user to impose a record structure. Any character string(s) can be designated as end-of-record marker(s). Such a page/record structure has been used successfully, and a number of documents have been quickly transformed into textbases that could be processed by FAIRS. Alternatively, a fixed line count, such as 66 or 78 lines, can be used as an end-of-record marker. Furthermore, each file may be declared as an individual record if multiple files are present. Records can be further divided into fields by using any character string(s) as field markers.

For retrieval, FAIRS requires the information base being searched to be partitioned into indexed, retrievable records. Either the record/field structure in the original text file can be used or, in the absence of such structure, it can be imposed. In either event, a description of the record/field format will control how FAIRS breaks the raw text file into records. Note that a file format description is separate from the described input file, which can be used as-is (without conversion).

These methods of record definition are the result of practical experience with text collections generated in typical business environments. A flexible input text structure or pre-processor is essential to the effectiveness of a general-purpose information retrieval system.

2.1.2 Stop Lists

FAIRS constructs an index to support the full-text retrieval of records. A user-defined stop list can be used to limit the words indexed. For the TREC92 experiments the stop word list consisted of 280 words. It was customized with the addition of several common embedded SGML strings and spurious field definitions such as: docno, doc, text, journal, title, summary, descriptor, author, fileid, file_id, &bullet, &para, &degrees, etc.

2.1.3 Stemming

Words are stemmed at index time and at query parsing time. The stemming rules used are those published by Paice [4]. The rules are incomplete but are claimed to be satisfactory. Words shorter than 4 characters are not stemmed. Conflation is used as an alternative to the inadequacies of truncation. Paice's scheme does not return "correct" roots, but does ensure that members of a family reduce to the same root, and members of different families reduce to different roots.

2.1.4 Query Formation

FAIRS handles free association queries for information [1,2]. Free association is the essence of being "user friendly" in information retrieval, rather than graphical user interface features such as windows, icons, mice, and pulldown menus. Free association means that there is no set of keywords that must be known and used exactly. Queries do not have to be phrased as Boolean expressions with ANDs and ORs. Words, phrases, and even word stems thought to be relevant are listed in a free association fashion. A FAIRS query has the simple form:

    term1[:w1], term2[:w2], ..., termn[:wn]

where termi can be a single word or a phrase, wi is a user-specified weight for that term, and square brackets denote an optional weight. There will be a hit for each record in the information base that contains at least one word from the query. Commas are the only delimiters used to delineate phrases, so that the relative adjacency relation among search words (i.e., the distance between two words that appear as a query term, such as "scenic view" or "software engineering") can be considered automatically. The weight wi is used at ranking time; if omitted, it is set by default to 1.
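To make the syntax concrete, the following minimal Python sketch parses such a query into weighted terms. It is an illustration only, not the FAIRS parser; the function name and the treatment of malformed weights are assumptions.

```python
# Illustrative parser for the query form term1[:w1], term2[:w2], ...
# Commas delimit terms/phrases; an omitted weight defaults to 1.
def parse_query(query):
    terms = []
    for chunk in query.split(","):
        chunk = chunk.strip()
        if not chunk:
            continue
        term, sep, weight = chunk.rpartition(":")
        if sep and weight.strip().replace(".", "", 1).isdigit():
            terms.append((term.strip(), float(weight)))
        else:
            terms.append((chunk, 1.0))   # default weight when none is given
    return terms

# parse_query("scenic view, breakfast:4, software engineering")
# -> [('scenic view', 1.0), ('breakfast', 4.0), ('software engineering', 1.0)]
```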
2.1.5 Previewing Records

The ranking results are shown to the user via several interactive and selectable preview features. The preview features show some aspects of the top-ranked records. The aspects include the first (or all) lines containing query words, coverage and frequency data, or simply the first few lines of highly ranked records. This is designed to give users a better feel for how relevant the retrieved records really are, and thus for how well the ranking rules work and how they should be customized. After examining a preview or the full text of the highly ranked records, the user can then revise the query or even change the ranking strategy.

2.1.6 Synonyms

A synonym definition capability (glossary) is also available within FAIRS. Given a word, FAIRS will retrieve instances of that word, as well as instances of its synonyms. The option can be invoked to broaden a query. Users can build their own vocabulary or invoke and modify the system-wide synonym dictionary. This can be used to ease the problem of different word usages by different people. A very fast elastic string matching algorithm (patent pending) [5] is being evaluated for inclusion among the query expansion features of FAIRS.
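In effect, the glossary broadens the parsed query before retrieval. A tiny illustrative sketch follows; the dictionary format and names are assumptions, not the FAIRS glossary format.

```python
# Tiny illustration of glossary-based query broadening (Section 2.1.6).
def expand(terms, glossary):
    """Return the weighted query terms plus their synonyms at the same weight."""
    expanded = list(terms)
    for term, weight in terms:
        for synonym in glossary.get(term, []):
            expanded.append((synonym, weight))
    return expanded

# expand([("car", 2.0)], {"car": ["automobile", "vehicle"]})
# -> [('car', 2.0), ('automobile', 2.0), ('vehicle', 2.0)]
```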
2.1.7 Displaying Records

When a request is made, users will always be presented with the retrieved records in their full-text form. Furthermore, FAIRS provides several ways to associate non-text information with each record. There are basically two types of links: implicit and explicit. Implicit links use source information as-is, while explicit links involve special fields embedded in the source text. Implicit links are useful in situations where an implied one-to-one mapping may be established between records and image files. Explicit links may be used to express one-to-many relations between records and other media.

2.1.8 Ranking

One of the most interesting aspects of FAIRS is its unconventional ranking scheme for determining the relative relevance of retrieved records. The ranking scheme is designed to mimic the human relevancy judgement process. When a person is asked to determine the relative relevance of two records, he is likely to first weigh them using a set of criteria. If the two records have the same weight under that set of criteria, a secondary set of criteria may be used to differentiate them, and so on. The criteria used can be highly heuristic. The adaptation of this "multilevel ranking scheme" has been filed with the Patent Office in the United States.

To enable FAIRS to use free association in place of Boolean semantics, a multilevel ranking model [1,2] for full-text information retrieval has been developed and implemented. FAIRS ranks records with respect to a particular query according to a set of rules. The default rules consist of six attributes in six levels. The six attributes are the importance, popularity, frequency and location of a search word, and the size and record ID of the record it occurs in. Each attribute may have a positive, negative or no (neutral) impact on the relevance judgement of a record. Such an arrangement also guarantees the automatic consideration of coverage (i.e., the percentage of different query words covered by a record), which is next to impossible to implement in a Boolean environment [1,2]. The ranking rules of FAIRS are always accessible and modifiable by the user. Descriptions of the attributes chosen for FAIRS follow:

FREQUENCY: The number of occurrences of the keyword in the record. This attribute may be used to reflect interest in finding records with more repetitions of a given term. That is, when set to have a positive impact, the more instances of a term in a record, the more relevant the record. Therefore, the record with the higher frequency of a term is more likely to be retrieved.

IMPORTANCE: FAIRS provides the searcher the ability to assign an arbitrary weight (importance) to each query keyword; thus the user has additional control over how records are retrieved. For example, specifying breakfast:4 assigns a weight of 4 to the term breakfast in a query. In general, keyword weighting allows the searcher to change how FAIRS sorts records in order to identify the most relevant ones. If a keyword is not weighted, FAIRS supplies a default weight which is in inverse proportion to its input position in the query (words that came to mind first deserve more weight). Therefore, if the searcher chooses to weight some or all keywords in a query with a range of weights, records strong in the heavily-weighted keywords are ranked before others. This attribute defaults to having a positive impact on the relevance judgement.

POPULARITY: This is the number of records in the entire collection in which a term occurs, as opposed to the number of times it appears within the retrieved record. For example, if the word software appeared, at least once, in 15 records in a collection, its popularity is considered to be 15. This attribute is usually used in the negative sense and, by default, FAIRS assumes that the more popular a term is, the less effective it is in retrieval.

REC_ID: The record ID is the location of a record in the collection. It may indicate the age of the record. This is also useful when the records are arranged according to their degree of significance (in either increasing or decreasing order). This attribute is a good example showing that there is no perfect ranking rule that works in all situations: it may have either a positive or negative impact, depending on user intention. Although there is no obviously better default setting for this attribute, we chose positive as the default, so as to give priority to later records, as they are likely to be more timely.

REC_SIZE: The total word count of the record. This attribute may be used to counter (normalize) the size advantage that a larger record may have over smaller ones during rank judgement. A larger record may have more keywords simply because it contains more words. We therefore set its default to have a negative impact on the relevance judgement. Of course, when records are of similar size, this attribute will have minimal effect, and should probably be disabled to improve response time.

WORD_LOC: The location of the first occurrence of the keyword in the record. A negative setting (the default) of this attribute assumes that important words appear at the beginning of a collection or record. For example, headings or titles which contain keywords describing the contents of a document usually appear at the beginning. Of course, this depends solely on how the contents of the information are organized. This serves as another example of the context-sensitivity of the ranking process.
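The levels act as successive tie-breakers: records are first ordered by their level-1 weight, with ties resolved at level 2, and so on. The following illustrative sketch shows this level-by-level ordering; the rule representation and names are assumptions, and the per-level weights themselves would come from one of the computation methods described in Section 2.1.9.

```python
# Illustrative sketch of the multilevel comparison (not the FAIRS code).
# Each level's rule assigns one of the six attributes an impact of
# +1 (positive), -1 (negative) or 0 (neutral); level_weight(rec_id, level_rules)
# stands in for either weight-computation method of Section 2.1.9.
ATTRIBUTES = ("importance", "popularity", "frequency", "rec_size", "rec_id", "word_loc")

def rank(record_ids, rules, level_weight):
    """Order records by level-1 weight, breaking ties with successive levels."""
    def key(rec_id):
        # One weight per level; Python compares tuples lexicographically,
        # which realizes the level-by-level tie-breaking of the scheme.
        return tuple(level_weight(rec_id, level_rules) for level_rules in rules)
    return sorted(record_ids, key=key, reverse=True)
```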
The following table shows the default ranking rules:

Table 1: Default Ranking Attribute Settings

    Level   Imp   Pop   Freq   Size   ID    Loc
      1     ---   ---   ---    ---    ---   ---
      2
      3     ---   neg   ---    ---    ---   ---
      4     pos   neg   pos    neg    ---   ---
      5
      6     pos

The first level, having no attribute values, automatically accounts for coverage if the weight computation is done using Method 1, as described below in Section 2.1.9.

To perform the TREC92 experiments, we changed the ranking rules as follows: Size was introduced on level one to balance the effect of the very large records found in the Federal Register. Consequently, we also had to specify impacts for the Importance, Popularity and Frequency attributes on the first level, since they are the most important criteria for ranking. The following table shows the TREC92 ranking rules:

Table 2: Ranking Attribute Settings for TREC92

    Level   Imp   Pop   Freq   Size   ID    Loc
      1     pos   neg   pos    neg    ---   ---
      2     pos   neg   ---    neg    ---
      3     pos   neg   pos    neg    ---
      4
      5     ---   ---   ---    ---    pos
      6

2.1.9 Weight Computation

Two methods were available to compute the weight of a document.

Method 1

The weight of a retrieved record r at level l is determined by the following formula:

    W_l[r] = \sum_{i=1}^{K} e[i] \prod_{j=1}^{A} a_j[i]^{s_{l,j}}

where:

    e[i]     = 1 if the ith keyword exists in the record, and 0 otherwise,
    a_j[i]   = the value of attribute j (for the ith keyword),
    A        = the number of attributes (currently 6),
    K        = the total number of query keywords,
    s_{l,j}  = 1 if the impact of attribute j at level l is positive, 0 if it is neutral, and -1 if it is negative.

In other words, the weight at a certain level is the sum over the query keywords of the product of the attribute values at that level. This weight computation method automatically accounts for coverage since, for each keyword with e[i] = 1, the product of attributes is never 0 (a neutral attribute contributes a factor of a_j[i]^0 = 1). The impact value of each attribute is configured by the user either before running FAIRS or before delivering the query. To set the value before running FAIRS, a file must be created containing the initial values. A default value is set during system start-up.
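A minimal Python rendering of Method 1 under these definitions follows; it is a sketch only, and the data layout and the guard against zero-valued attributes are assumptions of this example.

```python
# Illustrative rendering of Method 1 (not the FAIRS code): for each query
# keyword present in the record, multiply the attribute values raised to
# their impact (+1, 0 or -1) and sum the products over the keywords.
# attrs[i][j] is assumed to hold the value of attribute j for keyword i
# in this record; impacts[j] is the s_{l,j} setting for the level.
def method1_weight(present, attrs, impacts):
    weight = 0.0
    for i, e in enumerate(present):
        if not e:                      # keyword i does not occur in the record
            continue
        prod = 1.0
        for j, s in enumerate(impacts):
            a = attrs[i][j] or 1.0     # assumption: guard zero-valued attributes
            prod *= a ** s             # neutral impact (s = 0) contributes a factor of 1
        weight += prod
    return weight
```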
Method 2

One disadvantage of Method 1 is that it lacks a common reference scale that evenly distributes the influence of the attributes on the weight of the document. For example, consider a textbase with one very large record, record number 1, and assume the following settings for the variable attributes:

    FREQUENCY  = neutral  (s[1] = 0)
    POPULARITY = neutral  (s[2] = 0)
    WORD_LOC   = neutral  (s[3] = 0)
    REC_ID     = neutral  (s[4] = 0)
    REC_SIZE   = positive (s[5] = 1)

The contributing factors to the weight of a retrieved document are then IMPORTANCE and REC_SIZE. Because of its large size (a[5] >> 0), record 1 will be irrelevantly retrieved in response to a large percentage of pending queries. To avoid this scenario, Method 2 uses the following formula to compute the normalized weight of a retrieved record r at level l:

    W_l[r] = \sum_{i=1}^{K} e[i] \frac{1}{C} \sum_{j=1}^{A} s_{l,j} \frac{a_j[i]}{a_j^{max}}

where:

    e[i]      = 1 if keyword i exists in the record, and 0 otherwise,
    a_j[i]    = the value of attribute j (for the ith keyword),
    A         = the number of attributes (currently 6),
    K         = the total number of query keywords,
    s_{l,j}   = 1 if the impact of attribute j at level l is positive, 0 if it is neutral, and -1 if it is negative,
    C         = the number of attributes whose impact value is not 0,
    a_j^{max} = the maximum value of attribute j.

To avoid the effect of disproportionately large attribute values, a_j[i]/a_j^{max} is set to 1 if a_j[i] is among the largest 2% of the attribute's values. Method 2 has the advantage of accommodating large attribute values by normalizing them with respect to their maximum. However, this method does not automatically account for coverage, since s_{l,j} contributes as a multiplicand rather than as an exponent as in Method 1. Therefore, if the attributes are all neutral (level 1 in Table 1, the default attributes), the weight would be 0, even though query terms occur in the record r. We are therefore still looking for ways to improve the method.
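A corresponding sketch of Method 2 follows; it is again only an approximation, with all names assumed and the 2% capping rule simplified to clipping the normalized value at 1.

```python
# Illustrative rendering of Method 2 (not the FAIRS code): signed, normalized
# attribute contributions averaged over the C non-neutral attributes.
# attr_max[j] is the collection-wide maximum of attribute j (assumed > 0).
def method2_weight(present, attrs, impacts, attr_max):
    non_neutral = [j for j, s in enumerate(impacts) if s != 0]
    C = len(non_neutral)
    if C == 0:
        return 0.0                     # all-neutral level: no coverage credit (the drawback noted above)
    weight = 0.0
    for i, e in enumerate(present):
        if not e:
            continue
        weight += sum(impacts[j] * min(attrs[i][j] / attr_max[j], 1.0)
                      for j in non_neutral) / C
    return weight
```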
3.0 Experiments

The TREC92 collection of text and topics was used to quantify and analyze the performance of FAIRS in the domain of very large collections. The entire collection includes 2.3 Gb of text from various sources and in various formats.

3.1 System Configuration

The hardware platform used for indexing and query processing was an IBM RS/6000 320 workstation running the AIX 3.0 operating system. The core memory (RAM) installed was 32 Mb. The net disk space available was 4,169,728 Kb. The CPU clock rate was 25 MHz, with a MIPS rating of 27. The disk access time was 9.8 ms on average.

3.2 Indexing Performance

Our experiments showed that performance in indexing is strictly constrained by I/O wait. We used several techniques to reduce this constraint and optimize throughput. We scheduled index runs to run simultaneously, thus keeping the CPU busy during I/O wait. We used a kernel which employs automatic disk caching (providing an order of magnitude improvement in indexing time). To further exploit caching, large amounts of high-speed core memory should be used (our limit of 32 Mb is by no means ideal). The I/O bottleneck also implies that the very fastest disks be used. We chose 9.8 ms disks as the fastest available for the platform at a reasonable price. Using the above configuration, the sustained indexing throughput was on the order of 10 days (240 hrs.) per Gb. The storage overhead for index output was 110%. That is, for 1 Gb of text, 2.1 Gb of space should be available before indexing is initiated. By this measure, the entire TREC92 collection of 2.3 Gb would require 4.83 Gb of storage. Since only 4.17 Gb was available, we participated in category B (Wall Street Journal only).

3.3 Query Processing

Before submitting the raw topics to FAIRS, they were converted automatically into FAIRS-compatible queries of the form described in Section 2.1.4. The TREC92 topics were pre-processed by a simple syntactic term token generator. The pre-processor removed stop words, stemmed topic terms and removed their case. It used an ad-hoc term weighting scheme to assign IMPORTANCE weights to terms according to their positions in the topic (e.g., terms occurring in the title section were assigned an arbitrarily higher importance than those in other sections). The importance weights were assigned with a minimum of attention. As described below, failure analysis later showed that great improvements in precision could be made if pre-processing were better tuned with term expansion, duplicate removal and a heuristic importance measure.

After conversion, queries were submitted to the retrieval engine through a non-interactive interface. The time for processing (ranking) was approximately 30 minutes per query. Again, I/O wait was the dominant factor in running time. The same system solutions which apply to indexing can be applied with equal effectiveness to query processing. Modification of some of the internal data structures for index storage (concordance list optimization) could also improve ranking running time.

3.4 Retrieval Performance

FAIRS participated in the following categories: Ad-hoc Wall Street Journal disk 1 (AW1), Ad-hoc Wall Street Journal disk 2 (AW2), and Routing, Category B (RB). Relevance judgements are as follows: AW1: 50 topics, AW2: 50, RB: 25; total 125. Of these judgements, FAIRS' recall rates were ranked as follows: 77 on or above the median, 45 below the median, and 3 undetermined (1 tied, 2 with 0 relevant); that is, 61.6% on or above the median, 36% below, and 2.4% undetermined.

Table 3: Recall Performance Summary

    Category   % >= med   % < med
    AW1        66.0       32.0
    AW2        46.0       50.0
    RB         84.0       16.0
    Total      61.6       36.0
    % rel/ret: 54

Besides recall, the relevance judgements made available recall/precision figures across 11 points of recall. The average of the 11 recall/precision figures is called the 11-pt. average. FAIRS' 11-pt. average rates were ranked as follows: 79 on or above the median, 43 below the median, and 3 undetermined. The percentages for 11-pt. averages are: 63.2% on or above, 34.4% below, 2.4% undetermined.

Table 4: 11-pt. Average Performance Summary

    Category   % >= med   % < med
    AW1        85.7       31.0
    AW2        45.8       54.2
    RB         84.0       16.0
    Total      63.2       34.4

3.4.1 AW1

Ad-hoc WSJ disk 1 results were submitted by three systems. In this category, across all systems' submissions, 4,056 documents were judged relevant; 10,000 were submitted by FAIRS in response to 50 queries, of which 1,561 were among the relevant. The average 11-point average (over 50 queries and 11 recall rates for each query) was 0.2083. The distribution of relevant retrieved (recall) over the 50 topics was: 20 ranked best, 12 on the median, 16 worst, 2 tied. The distribution of 11-pt. averages over the 50 topics was: 22 ranked best, 14 on the median, 13 worst, 1 tied.

Table 5: AW1 Recall/Precision Performance

              Best       Med        Worst
    Recall    20 (40%)   13 (26%)   16 (32%)
    11-pt.    22 (44%)   14 (28%)   12 (24%)

Total recall placed FAIRS second among the 3 participants, and first in 11-pt. average recall/precision.

3.4.2 AW2

Ad-hoc WSJ disk 2 results were submitted by two systems. In this category 2,172 documents were judged relevant; 10,000 were submitted in response to 50 queries, of which 1,188 were among the relevant. The 11-point average precision was 0.2216. The distribution of relevant retrieved (recall) over the 50 topics was: 17 ranked best, 22 worst, 9 tied, 2 unevaluated. Of the tied, the 11-pt. averages favored FAIRS 6 times and the other system 3 times.

Table 6: AW2 Recall/Precision Performance

              Best       Worst      Tied      Not evaluated
    Recall    17 (34%)   22 (44%)   9 (18%)   2 (4%)
    11-pt.    22 (44%)   26 (52%)   ---       2 (4%)

For those queries for which there was a tie in recall values, there were two queries which had 0 records judged relevant. Of the remaining 9, we considered the 11-pt. average as a tie-breaker; the result was 6 best, 3 worst. Combining the recall and 11-pt. averages, for AW2, FAIRS had 23 submissions on or above the median (46%), 25 below (50%), and 4% undetermined.
3.4.3 RB

Routing, Category B results were submitted by 7 systems. For those topics judged, 3,766 documents were considered relevant; 5,000 were submitted by FAIRS in response to 25 queries. Of those submissions, 1,124 were among the relevant. The distribution of relevant retrieved (recall) over the 25 topics was: 2 ranked best, 13 ranked above the median, 6 on the median, and 4 below, for a total of 21 on or above the median and 4 below.

The following graph illustrates the performance index of the recall rates of FAIRS compared to the group. It shows that FAIRS is above average most of the time. The average recall PI of FAIRS is 65.8.

[Figure: recall performance index per query, FAIRS vs. group]

The next graph shows the performance index of the 11-point average of FAIRS compared to the group. It again shows FAIRS to be above average most of the time. The average 11-point-average PI of FAIRS is 61.8.

[Figure: 11-point-average performance index per query, FAIRS vs. group]

Table 7: RB Recall/Precision Performance

              Relation to Median
              >          =          <
    Recall    15 (60%)   6 (24%)    4 (16%)
    11-pt.    15 (60%)   6 (24%)    4 (16%)

This is the only group that had enough participants to make a comparative performance analysis meaningful. We compared our 11-point average and recall rates for each query to the best, the median, and the worst scores for that query. The performance index (PI) is calculated as follows:

    PI = 50 + 50 (score - median)/(best - median),   if score > median
    PI = 50 (score - worst)/(median - worst),        if score <= median
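A small Python transcription of this index, assuming best > median > worst for the query (illustrative only):

```python
# Illustrative performance-index computation: 100 at the best score,
# 50 at the median, 0 at the worst (assumes best > median > worst).
def performance_index(score, best, median, worst):
    if score > median:
        return 50.0 + 50.0 * (score - median) / (best - median)
    return 50.0 * (score - worst) / (median - worst)

# e.g. performance_index(0.3, best=0.4, median=0.2, worst=0.0) -> 75.0
```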