N-Gram-Based Text Filtering For TREC-2

William B. Cavnar
Environmental Research Institute of Michigan
P.O. Box 134001
Ann Arbor, MI 48113-4001
cavnar@erim.org

Abstract

Most text retrieval and filtering systems depend heavily on the accuracy of the text they process. In other words, the various mechanisms that they use depend on every word in the text and in the queries being correctly and completely spelled. To get around this limitation, our experimental text filtering system uses N-gram-based matching for document retrieval and routing tasks. The system's first application was the TREC-2 retrieval and routing task. Its performance on this task was promising, pointing the way for several types of enhancements, both for speed and effectiveness.

1.0 Background

Even though modern character recognition techniques are steadily improving, scanning and recognizing characters in paper documents still results in many errors. These errors can impact the ability of a standard, classical text retrieval system to process the resulting ASCII text by interfering with word stemming, stopword identification, or even basic indexing. Furthermore, as paper documents age, their print quality can degrade, possibly causing further recognition problems. This is becoming a serious problem, especially given the vast amount of material that exists in a paper form that has a limited lifetime.

Another difficulty is that the texts themselves may contain spelling variations (e.g., British vs. American) or outright spelling errors. The same is true of the queries entered by a human user. Discrepancies between the spelling of a word in a document and a word in a query may prevent a match.

A third problem area is that current word stemming and stopword list construction algorithms sometimes involve a significant amount of knowledge about the documents' language. Although there are automatic approaches to acquiring this kind of knowledge, it is complicated [1]. An ideal text retrieval method would require no special language knowledge, and in fact would be able to handle multiple languages (represented, say, in Unicode) simultaneously.

All of these problems have analogs in a somewhat different problem domain, namely, reading and interpreting postal addresses on pieces of mail. For the last eight years, the Environmental Research Institute of Michigan (ERIM) has conducted research for the United States Postal Service (USPS) in automatic address interpretation. The ultimate goal of this research is to build a high-performance address interpretation unit that can reliably and efficiently read the address information on a mailpiece, and determine its correct 9-digit or 11-digit ZIP code, even if that code is not present on the mailpiece or is in error.

There are two parts to interpreting an address from a mailpiece, and both of them are difficult:

* Recognizing the characters in the address block. This is difficult because of poor print quality, poor contrast, complicated backgrounds, presence of logos and non-address information, use of proportional fonts, and a host of other problems.

* Interpreting the lines of the address. This is difficult because of spelling errors, addressing errors, unusual abbreviations, strange local addressing conventions, and errors and omissions in the USPS databases.

Our current address interpretation systems use a number of different techniques to get around these problems. Among these is the use of N-gram-based matching to find matches in postal databases. (Please see [2] and [3] for details of these systems.)

In this work, N-grams are N-character slices of some longer string. We do not use positional N-grams; i.e., what is important is whether a given string contains a particular N-gram, not where it falls in the string. We use bi-grams (N=2) and tri-grams (N=3) together in the same system. Bi-grams and tri-grams complement each other in the sense that bi-grams provide somewhat better matching for individual words, while tri-grams provide the connections between words to improve phrase matching.
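To make the N-gram notion concrete, here is a minimal C sketch of this kind of non-positional N-gram extraction. The function name, the fixed-size buffer, and the choice to print rather than store the N-grams are illustrative assumptions, not details of our actual programs; padding each word with a leading and trailing space follows the convention described later in Section 2.3.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative sketch: enumerate the bi-grams (N=2) and tri-grams (N=3)
       of a single word, padded with one leading and one trailing space.
       Only the presence of an N-gram matters, not its position. */
    static void emit_ngrams(const char *word)
    {
        char padded[128];
        snprintf(padded, sizeof(padded), " %s ", word);
        size_t len = strlen(padded);

        for (int n = 2; n <= 3; n++)                /* bi-grams, then tri-grams */
            for (size_t i = 0; i + n <= len; i++)
                printf("%.*s\n", n, padded + i);
    }

    int main(void)
    {
        /* For "STRING" this prints the bi-grams _S ST TR RI IN NG G_ and
           then the tri-grams _ST STR TRI RIN ING NG_ (using _ for spaces). */
        emit_ngrams("STRING");
        return 0;
    }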
N-gram-based matching addresses the difficulties mentioned above for postal addresses in ways that have application for document retrieval:

* It usually finds good matches for postal addresses even in the face of a significant number of individual character recognition errors. This is particularly true if the system can use redundant data for the search key, such as the city, state, and ZIP together. This is identical to the problem of matching document text that was scanned from poor quality images. Words that occur near one another in a text also "reinforce" each other for searching, the way that city, state, and ZIP do.

* It easily overcomes problems with spelling and addressing variations and errors on mailpieces or in postal databases. This is analogous to the problem of matching document text that contains spelling variations. It also achieves an effect similar to stemming in that different words having the same root are considered to be similar.

* It can also handle abbreviations fairly easily. This is also similar to the problem of word stemming.

* Some postal address interpretation systems depend heavily on the system's ability to recognize certain key words, such as STREET or BOX. These words play a role similar to stopwords in that they carry little content, and mostly provide structural information. N-gram-based matching can handle errors even in these words.

Drawing on the experience of using N-grams in postal work, ERIM has been adapting these techniques for use in full-text retrieval and filtering. We currently have a working full-text retrieval system, called zview, which provides a simple query facility for accessing a wide variety of ASCII documents, including biographies, abstracts, and documentation.

The zview system's N-gram-based matching provides a very simple, easy-to-use text retrieval system. It is tolerant of spelling variations and errors in both the queries and the documents. It also provides a straightforward ranking of documents that enables the user to see which documents best match the query. This system has given us some confidence in the wider utility of N-gram-based matching for information retrieval. Although the TREC task would not seem to directly require inexact matching (the documents were high quality ASCII), our experience with zview indicates that this kind of tolerance for variations in user input is very useful for general retrieval.

Incidentally, one criticism that can be made of N-gram-based systems is that they require large amounts of storage. Although this is true of the naive implementation, there are a number of variations using various superimposed coding techniques that provide substantial space savings without impacting matching performance. See [2] for details.

2.0 System Description

The zview system mentioned above has two parts: an off-line index builder, and an on-line retrieval interface.
In its current form, zview does not yet use our most compact N-gram representation techniques, and is thus not well-suited for very large (i.e., gigabyte size) databases. Also, it can handle only single query strings, not the multiple query strings necessary for TREC. Instead, for this task we opted to use a simple filter architecture, and concentrated on making it as fast as possible. This architecture has several advantages:

* It requires no disk space for an index. In fact, we were able to save even more disk space by leaving the original documents in their compressed form, and uncompressing them on the fly.

* It is conceptually simple, requiring only a very modest amount of code (987 lines of C source). The whole system took only a few days to design and build.

This architecture is similar in concept to that used by Mark Zimmerman's PARA system, which participated in last year's TREC. That system used a set of handcrafted Unix awk scripts to filter the documents. The PARA system, although it got good results, took 22 days on a more or less dedicated NeXT machine, even though its match criteria generally specified fewer match terms than our system did. Also, Zimmerman's system used the notion of proximity to identify sets of matching lines that occurred near one another, whereas our system counts matches throughout the document without regard to the proximity of match terms. See [4] for details.

Obviously, a significant disadvantage of such a filter-type architecture is that the system has to read all of the data to simulate a retrieval. This is very computationally intensive, but we were able to mitigate this somewhat in our system by handling much of the actual N-gram-based matching with a highly parallel bit-vector approach, which we describe below. Also, we were able to split the processing load up among a suite of computers and run them all in a coarse-grained parallel fashion.

Figure 1 gives an overview of our system in dataflow fashion. We discuss each of the processes in this diagram in the following sections.

Figure 1: Dataflow Diagram for N-Gram-Based Text Retrieval. (Topics -> Generate Queries -> Queries; Compressed Documents -> zcat -> Uncompressed Documents; Queries and Uncompressed Documents -> Find Matches -> Relevance Scores -> Find Top Choices -> Top Choice Scores.)

2.1 Generating Query Strings From TREC Topics

The process labelled "Generate Queries" represents a program gen_query, which takes the TREC topics file and generates a set of query strings for each topic. We considered several different schemes for extracting query strings from the topics, but finally settled on using just the phrases in the concept and nationality sections. The following is a typical topic from TREC-1:

  Number: 007
  Topic: U.S. Budget Deficit
  <desc> Description:
  Document will mention a proposal to decrease the U.S. budget deficit. ...
  <con> Concept(s):
  1. U.S. budget deficit, federal budget shortfall
  2. foreign affairs budget, defense budget, entitlements
  3. increased revenues, tax increase, tax reform, auction quota
  4. reduction in expenditures, spending cuts, cutting domestic programs,
     eliminating government subsidies
  5. NOT financing the U.S. budget deficit
  <nat> Nationality: U.S.

Given this topic, the gen_query program generates the following set of query strings:

  007 000 U.S.
  007 000 U.S. budget deficit
  007 001 federal budget shortfall
  007 002 foreign affairs budget
  007 003 defense budget
  007 004 entitlements
  007 005 increased revenues
  007 006 tax increase
  007 007 tax reform
  007 008 auction quota
  007 009 reduction in expenditures
  007 010 spending cuts
  007 011 cutting domestic programs
  007 012 eliminating government subsidies
  007 013 NOT financing the U.S. budget deficit

Each line of the query output corresponds to one nationality or concept string from the topic. The nationality strings are listed first, followed by the concept strings in the order they appeared. There is also a rank number associated with each query string. Generally, as concept strings were used in the topics, strings that appeared earlier in the list were more important than strings that appeared later. The actual filtering program also uses these rank numbers to weight the query strings in importance. Each string receives a weight equal to 2k - r, where k is the number of query strings for the topic and r is the string's rank value (so the top-ranked string, with r = 0, gets the largest weight, 2k). Any nationality or concept string can also have a NOT prefix, which triggers a slightly different kind of processing for these strings, described below.

2.2 The Problem of Finding Matches While Filtering Text

In Figure 1 above, the process labelled "zcat" represents a Unix utility program that takes a compressed file and returns its uncompressed form. The process labelled "Find Matches" represents a program find_matches, which takes a set of queries for all of the topics at once, and applies them to a single uncompressed text file. Given the filter architecture for this system, it was important for us to minimize the total amount of I/O. By letting find_matches process the queries for all topics simultaneously, the system has to pass over the data only once.

To make this process as efficient as possible, find_matches does most of its matching using bit-vectors representing the presence or absence of particular N-grams. Suppose that we had a string that we wanted to search for in a document using N-grams. An obvious way to do it would be to perform the following steps:

* Break the query string up into N-grams.

* For every line in the document, count how many of those N-grams occurred in the line. This count would be the match score for that line.

* Keep a sorted list of the top-scoring lines, bumping the lowest scoring entry on this list when a new line has a better score.

Unfortunately, this naive implementation of N-gram-based matching, which uses nested loops to compare every N-gram in the query string with every N-gram in a line, is computationally very costly.

2.3 Matching and Scoring Lines

We can do significantly better than the naive N-gram-based algorithm described above by using hashing. Figure 2 illustrates this process for bi-grams. The key idea here is to break the matching process into two parts:

* Initialize a hash-code-based look-up table for all of the N-grams in the query string. Each slot simply records whether any N-gram hashed to that position (1) or not (0). Thus the hash table is a simple bit vector.

* For each line in the document, make a single pass over all of the line's N-grams, checking each one to see if it is in the table. In other words, see if the slot (bit) corresponding to the N-gram contains a 1. If it does, add 1 to the line's score.

Figure 2: Using Bi-Grams to Compare "STRING" and "SPRUNG". ("STRING" yields the bi-grams _S, ST, TR, RI, IN, NG, G_ and "SPRUNG" yields _S, SP, PR, RU, UN, NG, G_, where _ marks a space. Boxes indicate matches between N-grams from the two strings; the match score for these two strings is 3.)
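The following C fragment is a minimal sketch of this two-step scheme for a single query string, using bi-grams only (the real find_matches handles tri-grams the same way and at the same time). The table size, the hash function, and the function names here are assumptions made for illustration rather than details of the actual program.

    #include <stdint.h>
    #include <string.h>

    #define TABLE_BITS 20000                        /* illustrative table size */

    static uint32_t table[(TABLE_BITS + 31) / 32];  /* one bit per hash slot */

    static unsigned hash_ngram(const char *s, int n)  /* any simple string hash */
    {
        unsigned h = 0;
        for (int i = 0; i < n; i++)
            h = h * 31 + (unsigned char)s[i];
        return h % TABLE_BITS;
    }

    /* Step 1: record every bi-gram of the query string in the bit table. */
    static void load_query(const char *query)
    {
        size_t len = strlen(query);
        for (size_t i = 0; i + 2 <= len; i++) {
            unsigned slot = hash_ngram(query + i, 2);
            table[slot / 32] |= 1u << (slot % 32);
        }
    }

    /* Step 2: score a document line in one pass over its bi-grams,
       adding 1 for each bi-gram whose slot is set in the table. */
    static int score_line(const char *line)
    {
        size_t len = strlen(line);
        int score = 0;
        for (size_t i = 0; i + 2 <= len; i++) {
            unsigned slot = hash_ngram(line + i, 2);
            if (table[slot / 32] & (1u << (slot % 32)))
                score++;
        }
        return score;
    }

Because the table is only consulted, never searched, the cost per line is linear in the line's length regardless of how long the query string is.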
If we make the hash table big enough, we can make the probability of collisions arbitrarily small. For example, consider that there are only 52,022 possible bi-grams and tri-grams for the alphabet consisting of A-Z, 0-9, and space. A hash table size of, say, 20,000 slots (625 32-bit words, or 2,500 bytes) provides far more than ample space to avoid collisions. If there is a collision between the hash of an N-gram in the query string and that of some different N-gram in the line, its only effect is to cause that line's score to be one higher than it should be. It is even more unlikely that two or more of the N-grams would also collide. Thus, by allowing collisions in the hash table, we get a simple, quick algorithm at the cost of a small probability of small distortions in the line scores. See [2] for a further discussion of how to estimate the probability of these kinds of collisions.

Note that we include leading and trailing spaces for words in counting N-grams. Also recall that in our full implementation we process tri-grams the same way and at the same time. The match score between two strings includes the count of both matching bi-grams and matching tri-grams.

When this algorithm finishes, we will have a list of the top scoring matches for our query string. We could make this process a little more sophisticated by putting a threshold of some sort on it. If a line's match score was below this threshold, say 80% of the number of N-grams in the query string, we could ignore this line as simply not containing enough of the query string to consider.

To get an additional gain in efficiency, we can carry this hash table scheme for counting N-grams another step by using parallelism to match a single line from a document against a number of query strings all at the same time. Figure 3 illustrates this scheme. Again, matching is a two-step process:

* Initialize a hash table for the entire set of query strings. Each slot in the hash table again represents one N-gram, but it is now a bit vector in which the kth bit indicates whether the N-gram occurs in the kth query string.

* For every line in the document, hash each N-gram to find the corresponding bit-vector. Then add that bit vector to the current score vector for the document. The kth element of this score vector keeps the current match score for the document against the kth query string.

One other significant aspect of this multi-query-string implementation is that we also represent the final score vector for all of the queries as a set of parallel bit-vectors. This effectively allows us to perform 32 additions in parallel by adding one 32-bit word of an N-gram bit-vector from the table to one 32-bit word representing the 1-bits of 32 query string scores, and then propagating the carry bits up to bit-vectors representing the higher order bits of the scores. Effectively, this lets us compute scores for 32 query strings in close to the same amount of time as it takes to compute the score for a single query string.

Figure 3: Parallel Implementation for Multiple Query Strings. (Each N-gram of a document line is hashed to an index; the bit-vector recording that N-gram's presence in every query string is added to the accumulated parallel score bit-vectors, and the carry bits are propagated up to the bit-vectors holding the higher-order score bits.)
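Below is a minimal C sketch of this bit-sliced accumulation. The number of score bits and the variable names are assumptions made for the sketch; the essential idea, taken from the scheme above, is that score[b] holds bit b of the scores of 32 query strings at once, and the word fetched from the N-gram hash table supplies the 1-bits to be added.

    #include <stdint.h>

    #define SCORE_BITS 16   /* enough bits for any line score (assumed) */

    /* score[b] holds bit b of the match scores of 32 query strings. */
    static uint32_t score[SCORE_BITS];

    /* Add 1, in parallel, to the score of every query string whose bit is
       set in hits, the bit-vector fetched from the N-gram hash table. */
    static void add_hits(uint32_t hits)
    {
        uint32_t carry = hits;
        for (int b = 0; b < SCORE_BITS && carry; b++) {
            uint32_t sum = score[b] ^ carry;    /* sum bits without the carry  */
            carry = score[b] & carry;           /* carries into next bit plane */
            score[b] = sum;
        }
    }

    /* Recover the ordinary integer score of the kth query string afterwards. */
    static unsigned score_of(int k)
    {
        unsigned s = 0;
        for (int b = 0; b < SCORE_BITS; b++)
            s |= ((score[b] >> k) & 1u) << b;
        return s;
    }

Each call performs at most SCORE_BITS word operations, so the scores of 32 query strings advance for roughly the cost of one.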
2.4 Scoring and Ranking Documents

After the system computes the N-gram match scores of a single line against all of the query strings, it applies a threshold criterion. If the score for a particular query string is below its threshold, the system treats it as a zero score. The threshold is defined as a percentage of the maximum possible N-gram score for the query string. For example, if a particular query string has a maximum possible score of 20 matching N-grams, and the threshold percentage (as determined by a command line argument to the program) is 70%, then the system will ignore any line whose score is less than 14. There is a similar command line argument which determines the threshold percentage for negation query strings (those tagged with a NOT prefix in the query string set). If a document has a line whose score for a negation query string exceeds its threshold, the system will discard the document. For most runs of the system, we got the best results with a match threshold of 70% and a negation threshold of 95%. In other words, we were willing to accept a line as a match for a given query string if it had 70% of the requisite N-grams, but needed a nearly exact match in order to discard it for matching a negation query.

To compute the scores for a whole document, we accumulated the line scores for each query string, but applied a cap at twice the maximum possible N-gram score. In other words, if a document mentions a particular query string twice, that is twice as good as mentioning it once. However, mentioning it three or more times is of no additional value. We put this cap into the system after observing that initial versions of the system would overrate some documents because they mentioned some low ranking query string for a given topic many times, but completely missed any of the higher ranking query strings for the topic.

To generate the final scores for a document against a topic, the system accumulates the scores for each query string in the topic, multiplying them by the weighting factor computed earlier. If this aggregate weighted score exceeds a hard-coded threshold of 40, the system then writes out this score as a relevance judgement for the document. Later, a separate program reads the file containing the relevance scores, computes the top 1000 scores for each topic, and writes them out as the final system result.
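As a rough illustration of how these pieces fit together, the C fragment below folds thresholded, capped line scores into per-query-string totals and then forms the weighted document score. All of the names, and the assumption that thresholds, caps, and weights have been precomputed for each query string, are ours for the sketch; they are not taken from the actual program.

    /* Per-query-string bookkeeping assumed to be filled in elsewhere. */
    struct query {
        int max_score;   /* maximum possible N-gram score for this string    */
        int threshold;   /* e.g. 70% of max_score (95% for negation strings) */
        int weight;      /* e.g. 2k - r, as described in Section 2.1         */
        int negation;    /* nonzero if the string carries a NOT prefix       */
        int accum;       /* running document score for this string          */
    };

    /* Fold one line's score for one query string into the document totals.
       Returns nonzero if a negation string fired, i.e. discard the document. */
    static int fold_line_score(struct query *q, int line_score)
    {
        if (line_score < q->threshold)
            return 0;                        /* below threshold: treat as zero */
        if (q->negation)
            return 1;                        /* near-exact negation match */
        q->accum += line_score;
        if (q->accum > 2 * q->max_score)
            q->accum = 2 * q->max_score;     /* cap at twice the maximum score */
        return 0;
    }

    /* After all lines: weighted aggregate, reported only if it exceeds 40. */
    static long document_score(const struct query *queries, int nqueries)
    {
        long total = 0;
        for (int i = 0; i < nqueries; i++)
            if (!queries[i].negation)
                total += (long)queries[i].accum * queries[i].weight;
        return total > 40 ? total : 0;
    }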
3.0 Multi-Processor Execution

Even with all of this attention to efficient implementation, the system still required a large amount of computation. Fortunately, ERIM had available a network of Sun workstations over which we could distribute the processing load. This network consisted of two Sun SPARCstations and five or six Sun SPARCstation 2 machines. We used a very simple partitioning scheme, assigning each machine a fixed set of documents to evaluate against all of the topics. Given this arrangement, a full run of topic set 3 against Disks 1 and 2 took 12 to 13 hours if all of the machines were fully available. When there was significant contention on some of the machines, the run might take as much as 24 hours.

An obvious drawback of the fixed partitioning scheme was that often one or more of the machines would not finish until many hours after all of the others had. This was usually due to unexpected heavy contention on one or more machines. If we had used a more intelligent and dynamic document file assignment mechanism (implemented in, say, the Linda coordination language), we would have had better elapsed execution times.

TABLE 1. Selected TREC-2 Test Results

  Test     Note                              Test     Query    Negation  Average    At 10   At 100  R-
  Number                                     Type     Thresh.  Thresh.   Precision  Docs    Docs    Precision
  -------  --------------------------------  -------  -------  --------  ---------  ------  ------  ---------
  erima1   official                          ad hoc   80       80        0.1589     0.3960  0.3016  0.2179
  erima2   official                          ad hoc   75       80        0.1885     0.4480  0.3426  0.2494
  a30      official                          ad hoc   70       95        0.1604     0.4040  0.3038  0.2198
  a31      official w/ X queries             ad hoc   70       95        0.1153     0.4000  0.259   0.1904
  a33      spanning lines                    ad hoc   70       95        0.0734     0.15    0.1484  0.1333
  a34      span lines; exp. decay weighting  ad hoc   90       95        0.1099     0.314   0.2276  0.1776
  a35      span lines; exp. decay weighting  ad hoc   80       95        0.1336     0.3000  0.2442  0.2004
  a35-AP   AP only                           ad hoc   80       95        0.2717     0.4740  0.2838  0.3140
  a36      span lines; exp. decay weighting  ad hoc   70       95        0.0792     0.1780  0.1678  0.1387
  erimr1   official                          routing  80       80        0.1219     0.3580  0.2304  0.1814
  erimr2   official                          routing  75       80        0.1415     0.4240  0.2524  0.2031
  r5       official                          routing  70       95        0.1225     0.3600  0.2310  0.1818
  r6       span lines; exp. decay weighting  routing  80       95        0.1015     0.2880  0.1954  0.1604
  r7       span lines; exp. decay weighting  routing  70       95        0.0602     0.2200  0.1388  0.1123

4.0 Results

To test our system, we ran it a number of times, varying different system parameters. After the conference, we were also able to make a few changes and run it again. Table 1 above summarizes some highlights of our results, which show a number of interesting points:

* The tests numbered erima1, erima2, erimr1 and erimr2 were the official results turned in on June 1. The first two were the ad hoc results, and the second two were the routing results.

* The tests numbered a30, a35, a36, r5, and r7 were all attempts to determine good settings for the query threshold and negation threshold parameters. Unfortunately, the results from these test runs simply do not provide anywhere near enough data to perform a complete sensitivity analysis for these parameters. Also, we noticed in some of our testing on the TREC-1 queries that there is considerable difference in the optimum values of these thresholds for different topic set/data set combinations.

* One of the motivations for using N-gram-based matching was that it provides good matching performance in the face of textual errors. To test this idea, we ran test a31 using deliberately damaged query strings. In this test, we took each query string produced by gen_query, and replaced the third character with the letter "X". This works out to be an effective character recognition error rate of 4.5% over the whole body of query strings. Although the system took a considerable hit in performance, the interesting thing was that it still functioned at all. Many of the other systems in the TREC evaluation would most likely have completely failed in an analogous test, since they depend heavily on exact word matches.

* One serious drawback to the original system was that it did not span lines when matching.
That is, if the text that a particular query should match was split across two document lines, then the match would fail. After the conference, we found a relatively easy way to rectify this design deficiency and were able to run tests a33 through a36, and r6 and r7. Unfortunately, this change greatly affected the system's response to the query and negation thresholds, and we were not able to run enough tests to find the optimum values for these parameters. The end result is that the best results we have for this improved system are still not as good as the official results we reported.

* Another serious problem with the original system came to light during the official conference. As Donna Harman pointed out in her closing talk, no one has yet really explored the full ramifications of changes in term weighting strategies. Our original system used a very simple-minded linear-falloff weighting scheme. Our assumption was that concept strings appeared in reverse order of importance. However, we began to suspect, based on different concepts that were mentioned in a number of the conference presentations, that this was overly simplistic. We decided to implement a straightforward exponential-decay weighting scheme. In this approach, the first query string gets a weight of 100, the second gets a weight of, say, 90, the third a weight of 81, the fourth a weight of 72, and so on, with each succeeding weight taking 90% of its predecessor's value. Unfortunately, we did not have time to tune the system's response to this change either, and its results (a34, a35, a36, r6, and r7) are worse than the official results as well. However, it appears that there is plenty of room for experimentation with different term weighting schemes and we will continue working with them.

* In our early work with the system before sending in the official results, we did a fair amount of testing using just the Associated Press document set. We were perhaps misled by how well the system did on these documents, and missed some chances to improve the system earlier. The test labelled a35-AP shows what the results look like for a35 if we restrict the new system to just returning AP documents, and restrict the relevance judgements to just AP documents. Even with this imperfectly tuned new version, we see that the system is capable of significantly better performance. It is unclear why there should be such variation between the retrievability of the AP documents and the other document collections.

At its best, our system performed as well as most of the systems that participated in TREC-1. However, there is ample room for improvement, as we have noted above, especially in comparison to many of the systems that came back for TREC-2.

5.0 Further Research

The TREC-2 task is the first real application for our N-gram-based multiple-query system. As in any experiment of this nature, the results and problems suggest many more possible avenues of research. These ideas fall into two categories.

5.1 Analyzing the Current System's Performance

Further analysis of the existing system will allow us to better understand its behavior and limitations. Some ways to do that include:

* It is likely that generating query strings from the topic concept strings may have significantly limited performance. For example, Topic 74, about instances where the U.S. government propounds conflicting policies, completely failed to mention terms such as policy or regulation in the concept list.
Thus, our system had only a very small chance of finding matching documents. Zimmerman's filtering system [4] did well with handcrafted queries, so we should also try manually generated queries.

* Currently the system has a hard-coded cutoff threshold of 40 for the weighted aggregate score. The purpose of the threshold was to prevent the system from returning results that were guaranteed to be noise because of their very low score. This value was set more or less arbitrarily, so we should experiment with changing this threshold to determine its true effect. In all likelihood, it could be a fair amount higher, preventing the system from generating other useless low-scoring results.

* Currently the system sets a cap of three times the maximum N-gram score for any query string score. Again, this value was determined only by a very rough empirical process, so we should experiment with changing this cap, to see how much impact it has.

5.2 Extending the System

We can also make some significant changes to the system to explore possibilities for other performance improvements.

* Currently the system treats upper and lower case alike for both documents and queries. Since acronyms and brand names sometimes have different meanings from uncapitalized words having the same letters, perhaps there is a way to take the case of letters into account when computing a match. That is, we could count a match between two words with different capitalizations as a good match, but a match between two words with the same capitalizations as a better one.

* Our system is basically just a text filter. As such, it is already close to being useful for various routing tasks. However, to use N-gram-based matching for on-line retrieval, we will have to implement a true index-based system. Ideally, we would like to integrate the text retrieval capability of our zview system with the flexibility of the multi-query filtering system described here. Although we have an initial design sketched out, it will take an investment of further time and machine resources to implement this idea and test it. We will also be using some of our new compact N-gram representation techniques to reduce the large amount of index storage and computation required.

Acknowledgments

We would like to thank our sponsors at DARPA/SISTO for allowing us the opportunity to participate in TREC-2. We would also like to thank Donna Harman and her staff at NIST for all of their help. Finally, we are grateful for the United States Postal Service's sponsorship of the original N-gram-based matching technology that inspired our TREC research.

References

[1] W. B. Frakes. Stemming Algorithms. In William B. Frakes and Ricardo Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 131-160. Prentice Hall, Englewood Cliffs, NJ, 1992.

[2] William B. Cavnar and Alan J. Vayda. Using superimposed coding of N-gram lists for efficient inexact matching. In Proceedings of the Fifth USPS Advanced Technology Conference, pages 253-267, Washington, DC, 1992.

[3] William B. Cavnar and Alan J. Vayda. N-gram-based matching for multi-field database access in postal applications. In Proceedings of the 1993 Symposium on Document Analysis and Information Retrieval, pages 287-297, University of Nevada, Las Vegas, 1993.

[4] Mark Zimmerman. Proximity-correlation for document ranking: The PARA Group's TREC experiment. In Proceedings of the First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, 1992.