NATURAL LANGUAGE PROCESSING IN LARGE-SCALE TEXT RETRIEVAL TASKS

Tomek Strzalkowski
Courant Institute of Mathematical Sciences
New York University
715 Broadway, rm. 704
New York, NY 10003
tomek@cs.nyu.edu

ABSTRACT

We developed a prototype text retrieval system which uses advanced natural language processing techniques to enhance the effectiveness of key-word based document retrieval. The backbone of our system is a traditional statistical engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process the user's natural language requests into effective search queries. For the present TREC effort, a total of 500 MBytes of Wall Street Journal articles has been processed in two batches of 250 MBytes each. Due to time and space limits, two separate inverted indexes were produced, one for each half of the data, with a partial concept hierarchy built from the first 250 MBytes only but used for retrieval on either half. Retrieval was performed independently on both halves of the database and the partial results were merged to produce the final rankings.

INTRODUCTION

A typical information retrieval (IR) task is to select documents from a database in response to a user's query, and rank these documents according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their contents, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. A subsequent search process will attempt to match a preprocessed user query (or queries) against term-based representations of documents, in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of contents for both the documents and the queries.

The simplest word-based representations of contents are usually inadequate since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In the retrieval experiments with the training TREC database, we noticed that both joint and venture were dropped from the list of terms by the system because their idf (inverted document frequency) weights were too low. In large databases, such as TIPSTER, the use of phrasal terms is not just desirable, it becomes necessary.

The question thus becomes, how to identify the correct phrases in the text? Both statistical and syntactic methods have been used before with only limited success. Statistical methods based on word co-occurrences and mutual information are prone to high error rates, turning out many unwanted associations.
Syntactic methods suffered from the low quality of generated parse structures that could be attributed to limited-coverage grammars and the lack of adequate lexicons. In fact, the difficulties encountered in applying computational linguistics technologies to text processing have contributed to a wide-spread belief that automated natural language processing may not be suitable in IR. These difficulties included inefficiency, lack of robustness, and the prohibitive cost of the manual effort required to build lexicons and knowledge bases for each new text domain. On the other hand, while numerous experiments did not establish the usefulness of linguistic methods in IR, they cannot be considered conclusive because of their limited scale.[1]

The rapid progress in Computational Linguistics over the last few years has changed this equation in various ways. First of all, large-scale resources became available: on-line lexicons, including the Oxford Advanced Learner's Dictionary (OALD), the Longman Dictionary of Contemporary English (LDOCE), Webster's Dictionary, the Oxford English Dictionary, the Collins Dictionary, and others, as well as large text corpora, many of which can now be obtained for research purposes. Robust text-oriented software tools have been built, including part of speech taggers (stochastic and otherwise), and fast parsers capable of processing text at speeds of 2600 words per minute or more (e.g., the TTP parser developed by the author). While many of the fast parsers are not very accurate (they are usually partial analyzers by design),[2] some, like TTP, perform in fact no worse than standard full-analysis parsers which are many times slower and far less robust.[3]

An accurate syntactic analysis is an essential prerequisite for term selection, but it is by no means sufficient. Syntactic parsing of the database contents is usually attempted in order to extract linguistically motivated phrases, which presumably are better indicators of contents than "statistical phrases" where words are grouped solely on the basis of physical proximity (e.g., "college junior" is not the same as "junior college"). However, creation of such compound terms makes the term matching process more complex since in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., "college junior" is the same as "junior in college"). In order to deal with structure, the parser's output needs to be "normalized" or "regularized" so that complex terms with the same or closely related meanings would indeed receive matching representations. This goal has been achieved to a certain extent in the present work. As will be discussed in more detail below, indexing terms were selected from among head-modifier pairs extracted from predicate-argument representations of sentences.

1. Standard IR benchmark collections are statistically too small and the experiments can easily produce counterintuitive results. For example, the Cranfield collection is only approx. 180,000 English words, while the CACM-3204 collection is approx. 200,000 words.
2. Partial parsing is usually fast enough, but it also generates noisy data: as many as 50% of all generated phrases could be incorrect (Lewis and Croft, 1990).
3. TTP has been shown to produce parse structures which are no worse in recall, precision and crossing rate than those generated by full-scale linguistic parsers when compared to hand-coded Treebank parse trees.
Introduction of compound terms also complicates the task of discovery of various semantic relationships among them, including synonymy and subsumption. For example, the term natural language can be considered, in certain domains at least, to subsume any term denoting a specific human language, such as English. Therefore, a query containing the former may be expected to retrieve documents containing the latter. The same can be said about language and English, unless language is in fact a part of the compound term programming language, in which case the association language - Fortran is appropriate. This is a problem because (a) it is a standard practice to include both simple and compound terms in document representation, and (b) term associations have thus far been computed primarily on word level (including fixed phrases) and therefore care must be taken when such associations are used in term matching. This may prove particularly troublesome for systems that attempt term clustering in order to create "meta-terms" to be used in document representation.

The system presented here computes term associations from text on word and fixed-phrase level and then uses these associations in query expansion. A fairly primitive filter is employed to separate synonymy and subsumption relationships from others, including antonymy and complementation, some of which are strongly domain-dependent. This process has led to an increased retrieval precision in experiments with smaller and more cohesive collections (CACM-3204), but may be less effective with large databases. We are presently studying more advanced clustering methods along with the changes in interpretation of the resulting associations, as signalled in the previous paragraph.

Working with TREC topics has also helped to identify other, perhaps unanticipated problems, some of which may render the traditional statistical approach to information retrieval quite unworkable. A typical document retrieval query will specify (in one way or another) a set of concepts that are of interest to the originator of the query. Thus if the user is interested in documents that report on anticipated rail strikes, the following query may be appropriate (TREC topic 058): A relevant document will report an impending rail strike .... Any other wording can be used so long as it is going to denote a concept to be found in a document. The system's task is then to discover that the same concept is being denoted in both the query and a document, no matter how different the surface descriptions happen to be. In other words, no new information is requested, the query being entirely self-contained and completely specified.[4]

An altogether different situation arises when the query actually requests that certain underspecified information be found before a document could be judged relevant. In TREC topics (e.g., 058, as in many others) the following request is commonplace: to be relevant, the document will identify the location of the strike or potential strike. What we ask the system here is to extract the value of a certain variable that satisfies certain conditions, i.e., find X such that location-of-strike(X). It is impossible to properly evaluate such a query using any kind of constant term based retrieval. What is required, at the minimum, is a general pattern matching capability, and an appropriately advanced representation of contents (including more careful NLP processes).

4. This does not mean that the query has to accurately reflect the user's intentions; we take what we've got and give it our best try.
In the remainder of this paper we discuss particulars of the present system and some of the observations made while processing TREC data. The above comments will provide the background for situating our present effort and state-of-the-art with respect to where we should be in the future.

OVERALL DESIGN

Our information retrieval system consists of a traditional statistical backbone (Harman and Candela, 1989) augmented with various natural language processing components that assist the system in database processing (stemming, indexing, word and phrase clustering, selectional restrictions), and translate a user's information request into an effective query. This design is a careful compromise between purely statistical non-linguistic approaches and those requiring rather accomplished (and expensive) semantic analysis of data, often referred to as 'conceptual retrieval'. The conceptual retrieval systems, though quite effective, are not yet mature enough to be considered in serious information retrieval applications, the major problems being their extreme inefficiency and the need for manual encoding of domain knowledge (Mauldin, 1991). However, as pointed out in the previous section, more careful text processing may be required for certain types of requests.

In our system the database text is first processed with a fast syntactic parser. Subsequently certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.) after which they are used to transform a user's request into a search query.

The user's natural language request is also parsed, and all indexing terms occurring in it are identified. Certain highly ambiguous, usually single-word terms may be dropped, provided that they also occur as elements in some compound terms. For example, "natural" is deleted from a query already containing "natural language" because "natural" occurs in many unrelated contexts: "natural number", "natural logarithm", "natural approach", etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, "unlawful activity" is added to a query (TREC topic 055) containing the compound term "illegal activity" via a synonymy link between "illegal" and "unlawful". After the final query is constructed, the database search follows, and a ranked list of documents is returned.

It should be noted that all the processing steps, those performed by the backbone system and those performed by the natural language processing components, are fully automated, and no human intervention or manual encoding is required.

FAST PARSING WITH TTP PARSER

TTP (Tagged Text Parser) is based on the Linguistic String Grammar developed by Sager (1981). The parser currently encompasses some 400 grammar productions, but it is by no means complete. The parser's output is a regularized parse tree representation of each sentence, that is, a representation that reflects the sentence's logical predicate-argument structure.
For example, logical subject and logical object are identified in both passive and active sentences, and noun phrases are organized around their head elements. The significance of this representation will be discussed below. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure. In the runs with approximately 83 million words of TREC's Wall Street Journal texts,[5] the parser's speed averaged between 0.45 and 0.5 seconds per sentence, or up to 2600 words per minute, on a 21 MIPS SparcStation ELC.

5. Approximately 0.5 GBytes of text, over 4 million sentences.

TTP is a full grammar parser, and initially, it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode in which it will try to "fit" the parse. While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of input in order to restart processing at a next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out; instead they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure. As an illustration, consider the following sentence taken from the CACM-3204 corpus:

The method is illustrated by the automatic construction of both recursive and iterative programs operating on natural numbers, lists, and trees; in order to construct a program satisfying certain specifications a theorem induced by those specifications is proved, and the desired program is extracted from the proof.

The italicized fragment is likely to cause additional complications in parsing this lengthy string, and the parser may be better off ignoring this fragment altogether. To do so successfully, the parser must close the currently open constituent (i.e., reduce a program satisfying certain specifications to NP), and possibly a few of its parent constituents, removing corresponding productions from further consideration, until an appropriate production is reactivated. In this case, TTP may force the following reductions: SI -> to V NP; SA -> SI; S -> NP V NP SA, until the production S -> S and S is reached. Next, the parser skips input to find and, and resumes normal processing.

As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical level ambiguity must be removed from the input text prior to parsing. We achieve this using a stochastic part of speech tagger to preprocess the text. Full details of the parser can be found in (Strzalkowski, 1992).
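The control flow just described can be summarized in a short sketch. The following Python fragment is only an illustration of the time-budgeted parse with a skip-and-fit fallback, not the actual TTP implementation; the toy_* functions are stand-ins for the real full parser, the forced-reduction pass, and the phrasal parser.

    import time
    from typing import List, Optional, Tuple

    TIME_LIMIT = 0.5  # seconds allotted per sentence, as reported for TTP

    def toy_full_parse(words: List[str], deadline: float) -> Optional[list]:
        """Stand-in for the full grammar parser: give up (return None) when the
        deadline passes; here long sentences simply pretend to time out."""
        if time.monotonic() > deadline or len(words) > 40:
            return None
        return [("S", words)]

    def toy_skip_and_fit(words: List[str], deadline: float) -> Tuple[list, List[List[str]]]:
        """Stand-in for skip-and-fit: keep a forcibly reduced prefix, skip the rest."""
        keep, skipped = words[:20], words[20:]
        return [("S*", keep)], ([skipped] if skipped else [])

    def toy_phrasal_parse(span: List[str]) -> list:
        """Stand-in for the phrasal parser that recovers NPs from skipped spans."""
        return [("NP?", span)]

    def parse_with_budget(words: List[str]) -> list:
        deadline = time.monotonic() + TIME_LIMIT
        tree = toy_full_parse(words, deadline)
        if tree is None:                      # timer expired: approximate parse
            tree, skipped = toy_skip_and_fit(words, deadline)
            for span in skipped:              # recovered material is re-attached
                tree.extend(toy_phrasal_parse(span))
        return tree

    print(parse_with_budget("the cat sat on the mat".split()))
    print(parse_with_budget(("word " * 50).split()))   # triggers the fallback path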
PART OF SPEECH TAGGER

One way of dealing with lexical ambiguity is to use a tagger to preprocess the input, marking each word with a tag that indicates its syntactic categorization: a part of speech with selected morphological features such as number, tense, mode, case and degree. The following are tagged sentences from the CACM-3204 collection:[6]

The/dt paper/nn presents/vbz a/dt proposal/nn for/in structured/vbn representation/nn of/in multiprogramming/vbg in/in a/dt high/jj level/nn language/nn ./per

The/dt notation/nn used/vbn explicitly/rb associates/vbz a/dt data/nns structure/nn shared/vbn by/in concurrent/jj processes/nns with/in operations/nns defined/vbn on/in it/pp ./per

The tags are understood as follows: dt - determiner, nn - singular noun, nns - plural noun, in - preposition, jj - adjective, vbz - verb in present tense third person singular, to - particle "to", vbg - present participle, vbn - past participle, vbd - past tense verb, vb - infinitive verb, cc - coordinate conjunction.

6. Tagged using the 35-tag Penn Treebank Tagset created at the University of Pennsylvania.

Tagging of the input text substantially reduces the search space of a top-down parser since it resolves most of the lexical level ambiguities. In the examples above, tagging of presents as "vbz" in the first sentence cuts off a potentially long and costly "garden path" with presents as a plural noun followed by a headless relative clause starting with (that) a proposal .... In the second sentence, tagging resolves the ambiguity of used (vbn vs. vbd), and associates (vbz vs. nns). Perhaps more importantly, elimination of word-level lexical ambiguity allows the parser to make projections about the input which is yet to be parsed, using a simple lookahead; in particular, phrase boundaries can be determined with a degree of confidence (Church, 1988). This latter property is critical for implementing the skip-and-fit recovery technique outlined in the previous section.

Tagging of the input also helps to reduce the number of parse structures that can be assigned to a sentence, decreases the demand for consulting the dictionary, and simplifies dealing with unknown words. Since every item in the sentence is assigned a tag, so are the words for which we have no entry in the lexicon.[7] Many of these words will be tagged as "np" (proper noun); however, the surrounding tags may force other selections. In the following example, chinese, which does not appear in the dictionary, is tagged as "jj":

7. We use the machine readable version of the Oxford Advanced Learner's Dictionary (OALD).

this/dt paper/nn dates/vbz back/rb the/dt genesis/nn of/in binary/jj conception/nn circa/in 5000/cd years/nns ago/rb ,/com as/rb derived/vbn by/in the/dt chinese/jj ancients/nns ./per

The tagger which we use to process the input text prior to parsing is based upon a bi-gram model; it selects the most likely tag for a word given co-occurrence probabilities computed from a relatively small training set.[8] While the peak accuracy of the best-tag option of the tagger is predicted to approach 97% (Meteer et al., 1991), we noted that the actual performance on unprocessed WSJ text was in fact somewhat worse. The main problem, it appears, were frequent mistakes in tokenization of the input, especially in recognizing sentence boundaries. For example, when a sentence ended with a period but wasn't followed by at least two blanks or an end-of-line, this and the next sentence would be collapsed together.
On the other hand, intra-sentential periods (like those following abbreviated words) were occasionally found followed by a new-line character, and the sentence was split into two. While the parser contains a provision to deal with the case of collapsed sentences, the tags were likely to be incorrect. The following example is typical; note the tagging errors at the second apostrophe, and plans.

Gorbachev was running into trouble at home, including the August coup, "which I thought would be the end of it," Mr. Costa says. Still, plans to send the tank to the U.S. somehow moved ahead.

Gorbachev/np was/vbd running/vbg into/in trouble/nn at/in home/nn ,/com including/vbg the/dt August/np coup/nn ,/com "/apos which/wdt I/pp thought/vbd would/md be/vb the/dt end/nn of/in it/pp ,/com "/nn Mr/nn ./per Costa/np says/vbz ./per still/rb ,/com plans/vbz to/to send/vb the/dt tank/nn to/to the/dt U.S./np somehow/rb moved/vbd ahead/rb ./per

8. The program, supplied to us by Bolt Beranek and Newman, operates in two alternative modes, either selecting a single most likely tag for each word (best-tag option, the one we use at present), or supplying a short ranked list of alternatives (Meteer et al., 1991).

WORD SUFFIX TRIMMER

Word stemming has been an effective way of improving document recall since it reduces words to their common morphological root, thus allowing more successful matches. On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer.[9] The suffix trimmer performs essentially two tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of the corresponding verbs (i.e., "implement", "store"). This is accomplished by removing a standard suffix, e.g., "stor+age", replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary, i.e., we check whether the new root ("store") is indeed a legal word, and whether the original root ("storage") is defined using the new root ("store") or one of its standard inflectional forms (e.g., "storing").[10] For example, the following definitions are excerpted from the Oxford Advanced Learner's Dictionary (OALD):

storage n [U] (space used for, money paid for) the storing of goods ...
diversion n [U] diverting ...
procession n [C] number of persons, vehicles, etc moving forward and following each other in an orderly way.

Therefore, we can reduce "diversion" to "divert" by removing the suffix "+sion" and adding the root form suffix "+t". On the other hand, "process+ion" is not reduced to "process".

9. Dealing with prefixes is a more complicated matter, since they may have quite a strong effect upon the meaning of the resulting term, e.g., un- usually introduces explicit negation.
10. Definition checking is not implemented yet.

Earlier experiments with the CACM-3204 collection showed an improvement in retrieval precision by 6% to 8% over the base system equipped with a standard morphological stemmer (the SMART stemmer). Due to time limitations these numbers are not available for the TREC database at this time.
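The two-step trimming procedure can be sketched as follows. This is only an illustration under toy assumptions, not the system's code: the suffix table and the tiny "dictionary" (including its definition strings) are invented stand-ins for the OALD entries, and the definition check shown here is the one described above, even though footnote 10 notes it was not yet implemented at the time.

    # Toy dictionary: headword -> definition text (stand-ins for OALD entries).
    DICTIONARY = {
        "store": "keep or put aside for future use",
        "storage": "(space used for, money paid for) the storing of goods",
        "divert": "turn in another direction",
        "diversion": "diverting",
        "process": "series of actions or operations",
        "procession": "number of persons, vehicles, etc moving forward in an orderly way",
        "implement": "carry out; put into effect",
        "implementation": "the act of implementing",
    }

    SUFFIX_RULES = [          # (nominal suffix, replacement root ending)
        ("ation", ""),        # implement+ation -> implement
        ("age", "e"),         # stor+age        -> store
        ("sion", "t"),        # diver+sion      -> divert
        ("ion", ""),          # process+ion     -> process (rejected by the check)
    ]

    def inflections(root):
        """Crude stand-in for the 'standard inflectional forms' of a verb root."""
        forms = {root, root + "s", root + "ed", root + "ing"}
        if root.endswith("e"):
            forms.update({root[:-1] + "ing", root[:-1] + "ed"})
        return forms

    def trim_suffix(word):
        """Return the verb root for a nominalized form, or the word unchanged."""
        definition = DICTIONARY.get(word, "")
        for suffix, ending in SUFFIX_RULES:
            if word.endswith(suffix):
                candidate = word[: -len(suffix)] + ending
                # Accept only if the candidate is a legal word AND the original
                # word's definition uses the candidate or one of its inflections.
                if candidate in DICTIONARY and any(
                    form in definition.split() for form in inflections(candidate)
                ):
                    return candidate
        return word

    for w in ["storage", "diversion", "procession", "implementation"]:
        print(w, "->", trim_suffix(w))

Run on the four examples above, the sketch reduces storage, diversion and implementation to their verb roots while leaving procession untouched, mirroring the behaviour described in the text.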
HEAD-MODIFIER STRUCTURES

Syntactic phrases extracted from TTP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. In the TREC experiments reported here we extracted head-modifier word and fixed-phrase pairs only. While the TREC WSJ database is large enough to warrant generation of larger compounds, we were in no position to verify their effectiveness in indexing (largely because of the tight schedule). We discuss some options below. Let us consider a specific example from the WSJ database:

The former Soviet president has been a local hero ever since a Russian tank invaded Wisconsin.

The tagged sentence is given below, followed by the regularized parse structure generated by TTP, given in Figure 1.

The/dt former/jj Soviet/jj president/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

[assert
  [[perf [HAVE]]
   [verb [BE]]
   [subject [np [n PRESIDENT] [t_pos THE] [adj [FORMER]] [adj [SOVIET]]]]
   [object [np [n HERO] [t_pos A] [adj [LOCAL]]]]
   [adv EVER]
   [sub_ord [SINCE
              [[verb [INVADE]]
               [subject [np [n TANK] [t_pos A] [adj [RUSSIAN]]]]
               [object [np [name [WISCONSIN]]]]]]]]]

Figure 1. Predicate-argument parse structure.

It should be noted that the parser's output is a predicate-argument structure centered around the main elements of various phrases. In Figure 1, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_ord). INVADE is the predicate in the subordinate clause with 2 arguments (subject, object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE). From this structure, we extract head-modifier pairs that become candidates for compound terms. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process. In the example at hand, the following head-modifier pairs are extracted (pairs containing low-contents elements, such as BE and FORMER, or names, such as WISCONSIN, will be later discarded):

[PRESIDENT,BE] [PRESIDENT,FORMER] [PRESIDENT,SOVIET] [BE,HERO]
[HERO,LOCAL] [TANK,INVADE] [TANK,RUSSIAN] [INVADE,WISCONSIN]

We may note that the three-word phrase former Soviet president has been broken into two pairs, former president and Soviet president, both of which denote things that are potentially quite different from what the original phrase refers to, and this fact may have a potentially negative effect on retrieval precision. This is one place where a longer phrase appears more appropriate. A further example is shown in Figure 2.[11]

One difficulty in obtaining head-modifier pairs of highest accuracy is the notorious ambiguity of nominal compounds. For example, the phrase natural language processing should generate language+natural and processing+language, while dynamic information processing is expected to yield processing+dynamic and processing+information.
Since our parser has no knowledge about the text domain, and uses no semantic preferences, it does not attempt to guess any internal associations within such phrases. Instead, this task is passed to the pair extractor module which processes ambiguous parse structures in two phases. In phase one, all and only unambiguous head-modifier pairs are extracted, and the frequencies of their occurrences are recorded. In phase two, frequency information about pairs generated in the first pass is used to form associations from ambiguous structures. For example, if language+natural has occurred unambiguously a number of times in contexts such as parser for natural language, while processing+natural has occurred significantly fewer times or perhaps none at all, then we will prefer the former association as valid.[11]

11. Note that working with the parsed text ensures a degree of precision in capturing the meaningful phrases, which is especially evident when compared with the results usually obtained from either unprocessed or only partially processed text (Lewis and Croft, 1990). Note also that names, pronouns and dummy verbs are not allowed to create pairs.

Although the noun phrase disambiguation routine has been implemented to work with the pair extractor program, it has not been used in the current installment of TREC. A more conservative version of the pair extractor was used instead (it generates fewer pairs), since that version was found effective in test runs.[12]

12. In a few test runs with TREC topics 001 to 005 against the training database (disk 1), we observed that the inclusion of compound terms obtained from head-modifier pairs increased both recall and precision quite significantly. In particular, for topic 003 no relevant documents could be found without compound terms.

SENTENCE: Gorbachev ordered the tank shipped to a plant in the Ukraine, where seven mechanics worked for three months restoring it.

PARSE STRUCTURE:
[assert
 [[verb [ORDER]]
  [subject [np [name [GORBACHEV]]]]
  [object
   [[verb []]
    [subject
     [np [n TANK] [t_pos THE]
      [m_wh
       [[verb [SHIP]]
        [subject ANYONE]
        [object VAR]
        [TO [np [n PLANT] [t_pos A]]]
        [IN
         [np [name [UKRAINE]] [t_pos THE]
          [m_wh
           [[verb [WORK]]
            [subject [np [n MECHANIC] [count [SEVEN]]]]
            [FOR [np [n MONTH] [count [THREE]]]]
            [sa_wh
             [[verb [RESTORE]]
              [subject PRO]
              [object [np ...]]]]]]]]]]]]]]]]

EXTRACTED PAIRS:
[MECHANIC,WORK] [SHIP,TANK] [TANK,SHIP]

Figure 2. Extraction of syntactic pairs.
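A minimal sketch of these two ideas follows: (a) pulling head-modifier pairs out of a simplified predicate-argument structure, and (b) the two-phase treatment of ambiguous noun compounds, where unambiguous pair frequencies collected in phase one decide the bracketing of three-word compounds in phase two. The data structures and the frequency numbers are illustrative assumptions, not the actual TTP output format or corpus counts.

    from collections import Counter

    def pairs_from_clause(clause):
        """Extract [head, modifier] candidates from a toy clause dict."""
        pairs = []
        verb = clause["verb"]
        subj, obj = clause.get("subject"), clause.get("object")
        if subj:
            pairs.append((subj["head"], verb))            # subject head + main verb
            pairs += [(subj["head"], m) for m in subj.get("adjuncts", [])]
        if obj:
            pairs.append((verb, obj["head"]))             # main verb + object head
            pairs += [(obj["head"], m) for m in obj.get("adjuncts", [])]
        return pairs

    clause = {  # "... a Russian tank invaded Wisconsin"
        "verb": "invade",
        "subject": {"head": "tank", "adjuncts": ["russian"]},
        "object": {"head": "wisconsin", "adjuncts": []},
    }
    print(pairs_from_clause(clause))
    # [('tank', 'invade'), ('tank', 'russian'), ('invade', 'wisconsin')]

    # Phase one: frequencies of unambiguous pairs seen elsewhere in the corpus.
    unambiguous = Counter({("language", "natural"): 12,
                           ("processing", "natural"): 0,
                           ("processing", "language"): 7})

    def bracket(compound):
        """Phase two: choose [[w1 w2] w3] vs [w1 [w2 w3]] for 'w1 w2 w3'."""
        w1, w2, w3 = compound
        left = unambiguous[(w2, w1)]      # evidence that w1 modifies w2
        right = unambiguous[(w3, w1)]     # evidence that w1 modifies w3
        return [(w2, w1), (w3, w2)] if left >= right else [(w3, w1), (w3, w2)]

    print(bracket(("natural", "language", "processing")))
    # [('language', 'natural'), ('processing', 'language')]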
TERM CORRELATIONS FROM TEXT

Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. In order to determine whether such pairs signify any important association between terms, we calculate the value of the Informational Contribution (IC) function for each element in a pair. This is important because not every syntactic association necessarily translates into a semantic one, and we may also need to eliminate spurious pairs generated by the parser. Higher values of IC indicate a stronger association, and moreover the element which has the larger value is considered semantically dominant. These values are context dependent, and will vary from one corpus to another.[13]

13. For more details please refer to (Strzalkowski and Vauthey, 1992).

The likelihood of a given word being paired with another word, within one predicate-argument structure, can be expressed in statistical terms as a conditional probability. In our present approach, the required measure had to be uniform for all word occurrences, covering different types of predicate-argument links, i.e., verb-object, noun-adjunct, etc. This is reflected by an additional dispersion parameter, introduced to evaluate the heterogeneity of word associations. The resulting new formula IC(x,[x,y]) is based on (an estimate of) the conditional probability of seeing a word y as a modifier of the word x, normalized with a dispersion parameter for x:

    IC(x,[x,y]) = f_{x,y} / (n_x + d_x - 1)

where f_{x,y} is the frequency of [x,y] in the corpus, n_x is the number of pairs in which x occurs at the same position as in [x,y], and d_x is the dispersion parameter understood as the number of distinct words with which x is paired. When IC(x,[x,y]) = 0, x and y never occur together (i.e., f_{x,y} = 0); when IC(x,[x,y]) = 1, x occurs only with y (i.e., f_{x,y} = n_x and d_x = 1).

So defined, the IC function is asymmetric, a property found desirable by Wilks et al. (1990) in their study of word co-occurrences in the Longman dictionary. In addition, IC is quite stable even for relatively low frequency words (the dispersion parameter helps here), and in this respect it compares favourably to Fano's mutual information formula. The lack of stability on low frequency terms is particularly worrisome for IR applications since many important indexing terms could be eliminated from consideration.

It should also be pointed out that the particular way of generating syntactic pairs was dictated, to some degree at least, by statistical considerations. Our original experiments with the IC formula were performed on the relatively small CACM-3204 collection, and therefore we combined pairs obtained from different syntactic relations (e.g., verb-object, subject-verb, noun-adjunct, etc.) in order to increase frequencies of some associations. This became largely unnecessary in a large collection such as TIPSTER, but we had no means to test alternative options, and thus decided to stay with the original. It should not be difficult to see that this was a compromise solution, since many important distinctions were potentially lost, and strong associations could be produced where there weren't any. A way to improve things is to consider different syntactic relations independently, perhaps as independent sources of evidence that could lend support (or not) to certain term similarity predictions. We have already started testing this option. A few examples of IC coefficients obtained from the CACM-3204 corpus are listed in Table 1.

IC values for terms become the basis for calculating term-to-term similarity coefficients. If two terms tend to be modified with a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information contents do they carry, do their information contributions over contexts vary greatly, are the common contexts in which these terms occur specific enough? In general we will credit high-contents terms appearing in identical contexts, especially if these contexts are not too commonplace.[14]

14. It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.
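The IC computation can be sketched directly from the formula above. The fragment below runs over a toy list of [head, modifier] pairs (the pair list and counts are illustrative, not corpus data); in the system the same computation is applied symmetrically for the modifier position as well, i.e., IC(y,[x,y]) with counts taken over pairs in which y occurs as modifier.

    from collections import Counter, defaultdict

    pairs = [                       # (head x, modifier y) occurrences
        ("language", "natural"), ("language", "natural"), ("language", "program"),
        ("processing", "information"), ("processing", "language"),
    ]

    pair_freq = Counter(pairs)                       # f_{x,y}
    n = Counter(x for x, _ in pairs)                 # n_x: pairs with x as head
    partners = defaultdict(set)
    for x, y in pairs:
        partners[x].add(y)                           # for d_x: distinct modifiers of x

    def IC(x, y):
        """IC(x,[x,y]) = f_{x,y} / (n_x + d_x - 1); 0 if x never heads a pair."""
        if n[x] == 0:
            return 0.0
        d = len(partners[x])
        return pair_freq[(x, y)] / (n[x] + d - 1)

    print(IC("language", "natural"))     # 2 / (3 + 2 - 1) = 0.5
    print(IC("processing", "natural"))   # 0.0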
word         head+modifier        IC coeff.
distribute   distribute+normal    0.040
normal       distribute+normal    0.115
minimum      minimum+relative     0.200
relative     minimum+relative     0.016
retrieve     retrieve+inform      0.086
inform       retrieve+inform      0.004
size         size+medium          0.009
medium       size+medium          0.250
editor       editor+text          0.142
text         editor+text          0.025
system       system+parallel      0.001
parallel     system+parallel      0.014
read         read+character       0.023
character    read+character       0.007
implicate    implicate+legal      0.035
legal        implicate+legal      0.083
system       system+distribute    0.002
distribute   system+distribute    0.037
make         make+recommend       0.024
recommend    make+recommend       0.142
infer        infer+deductive      0.095
deductive    infer+deductive      0.142
share        share+resource       0.054
resource     share+resource       0.042

Table 1. Examples of IC coefficients.

The relative similarity between two words x1 and x2 is obtained using the following formula (alpha is a large constant):[15]

    SIM(x1,x2) = log(alpha * SUM_y sim_y(x1,x2))

where

    sim_y(x1,x2) = MIN(IC(x1,[x1,y]), IC(x2,[x2,y])) * MIN(IC(y,[x1,y]), IC(y,[x2,y]))

The similarity function is further normalized with respect to SIM(x1,x1).

15. This is inspired by a formula used by Hindle (1990), and subsequently modified to take into account the asymmetry of the IC measure.

In addition, we require that words x1 and x2 appear in at least two distinct common contexts, where a common context is a couple of pairs [x1,y] and [x2,y], or [y,x1] and [y,x2], such that they each occurred at least twice. Thus, banana and baltic will not be considered for a similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context comparably frequent (there wasn't any in the TREC WSJ database).[16]

16. However, SIM(banana,dominican) was generated since two independent contexts were indeed found: republic and plant, even though the word plant apparently occurs in different senses in banana plant and dominican plant.

It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990). Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser.[17]

17. Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts; see also (Grishman et al., 1986).
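A minimal sketch of the similarity computation defined above is given next. The IC tables and the value of alpha are illustrative assumptions (the paper does not report alpha), and dividing by SIM(x1,x1) is one plausible reading of "normalized with respect to SIM(x1,x1)".

    import math
    from collections import defaultdict

    ALPHA = 10000.0   # the "large constant" alpha; illustrative value only

    # IC_head[(x, y)] stands for IC(x,[x,y]); IC_mod[(x, y)] for IC(y,[x,y]).
    IC_head = defaultdict(float, {("baby", "care"): 0.02, ("infant", "care"): 0.03,
                                  ("baby", "food"): 0.05, ("infant", "food"): 0.04})
    IC_mod  = defaultdict(float, {("baby", "care"): 0.01, ("infant", "care"): 0.02,
                                  ("baby", "food"): 0.03, ("infant", "food"): 0.02})

    def sim_y(x1, x2, y):
        return (min(IC_head[(x1, y)], IC_head[(x2, y)]) *
                min(IC_mod[(x1, y)], IC_mod[(x2, y)]))

    def SIM(x1, x2, contexts):
        total = sum(sim_y(x1, x2, y) for y in contexts)
        return math.log(ALPHA * total) if total > 0 else 0.0

    def SIM_norm(x1, x2, contexts):
        """Normalize with respect to SIM(x1,x1), as described above."""
        return SIM(x1, x2, contexts) / SIM(x1, x1, contexts)

    contexts = ["care", "food"]          # the common contexts of the two terms
    print(SIM_norm("baby", "infant", contexts))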
QUERY EXPANSION

Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations).[18] It follows that not all similarity relations will be equally useful in query expansion; for instance, complementary and antonymous relations like the one between australian and canadian, or accept and reject, may actually harm the system's performance, since we may end up retrieving many irrelevant documents. Similarly, the effectiveness of a query containing vitamin is likely to diminish if we add a similar but far more general term such as acid. On the other hand, the database search is likely to miss relevant documents if we overlook the fact that fortran is a programming language, or that infant is a baby and baby is a child. We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our earlier experiments where a relatively crude filter visibly increased retrieval precision.

18. Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. An alternative is to use term clusters to create new terms, "metaterms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.

In order to create an appropriate filter, we expanded the IC function into a global specificity measure called the cumulative informational contribution function (ICW). ICW is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts. In this respect, ICW is similar to the standard inverted document frequency (idf) measure except that term frequency is measured over syntactic units rather than document size units.[19] Terms with higher ICW values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula:

    ICW(w) = ICL(w) * ICR(w)   if both exist
             ICR(w)            if only ICR(w) exists
             0                 otherwise

where (with n_w, d_w > 0):

    ICL(w) = IC([w,_]) = n_w / (d_w * (n_w + d_w - 1))
    ICR(w) = IC([_,w]) = n_w / (d_w * (n_w + d_w - 1))

with n_w and d_w computed over the pairs in which w occurs in the left (head) and right (modifier) position, respectively.

19. We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.

For any two terms w1 and w2, and constants delta1 > 1, delta2 > 1, the following situations were considered. If ICW(w2) >= delta1 * ICW(w1), then w2 is considered more specific than w1. If ICW(w2) < delta2 * ICW(w1) and ICW(w2) > ICW(w1), then w2 is considered synonymous with w1.[20] In addition, if SIMnorm(w1,w2) = sigma > theta, where theta is an empirically established threshold, then w2 can be added to the query containing term w1 with weight sigma.[21] The cutoffs could be different for synonymy and specialization.

20. In TREC runs we used delta1 = 10 and delta2 = 3.
21. For the CACM-3204 collection the filter was most effective at theta = 0.57. For TREC we changed the similarity formula slightly in order to obtain correct normalizations in all cases. This however lowered similarity coefficients in general and a new threshold had to be selected. We used theta = 0.1 in TREC runs.
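The decision rules just given can be condensed into a short sketch. This is one plausible reading of the filter, not the system's code; the delta and theta values follow the footnotes above, and the test case uses the child/baby figures from the WSJ training database quoted in the next paragraph.

    DELTA1, DELTA2, THETA = 10.0, 3.0, 0.1

    def relate(icw1, icw2, sim_norm):
        """Classify the relation of w2 to w1 and decide whether to expand."""
        if sim_norm <= THETA:
            return "rejected"                       # similarity too weak to use
        if icw2 >= DELTA1 * icw1:
            return "w2 specializes w1 (add with weight %.3f)" % sim_norm
        if icw1 < icw2 < DELTA2 * icw1:
            return "w2 synonymous with w1 (add with weight %.3f)" % sim_norm
        return "similar but not added"

    # ICW(child)=0.000001, ICW(baby)=0.000013, SIM(baby,child)=0.183064
    print(relate(0.000001, 0.000013, 0.183064))     # baby specializes child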
For example, the following were obtained from the TREC WSJ training database:

    ICW(child)  = 0.000001
    ICW(baby)   = 0.000013
    ICW(infant) = 0.000055

    SIM(child,infant) = 0.131381
    SIM(baby,child)   = 0.183064
    SIM(baby,infant)  = 0.323121

Therefore both baby and infant can be used to specialize child (with delta1 = 10), while baby and infant can be considered synonyms (with delta2 = 5). Note that if theta is well chosen, then the above filter will also help to reject antonymous and complementary relations, such as SIMnorm(back,front) = 0.305287 with ICW(back) = 0.000001 and ICW(front) = 0.000006. We continue working to develop more effective filters. Examples of filtered similarity relations obtained from the TREC corpus are listed in Tables 2 and 3.

word1          word2                SIMnorm
abm            *anti+ballistic      0.534894
absence        *maternity           0.233082
accept         acquire              0.179078
accord         pact                 0.492332
acquire        purchase             0.449362
speech         address              0.263789
adjustable     one+year             0.824053
maxsaver       *advance+purchase    0.734008
affair         scandal              0.684877
affordable     low+income           0.181795
disease        *ailment             0.247382
medium+range   *air+to+air          0.874508
aircraft       *jetliner            0.166777
aircraft       plane                0.423831
airline        carrier              0.345490
alien          immigrate            0.270412
anniversary    *bicentennial        0.588210
anti+age       anti+wrinkle         0.153918
anti+clot      cholesterol+lower    0.856712
contra         *anti+sandinista     0.294677
candidate      *aspirant            0.116025
contend        *aspirant            0.143459
property       asset                0.285299
attempt        bid                  0.641592
await          pend                 0.572960
stealth        *b+1                 0.877582
child          *baby                0.183064
baggage        luggage              0.607333
ban            restrict             0.321943
bearish        bullish              0.847103
bee            *honeybee            0.461023
roller+coast   *bumpy               0.898278
two+income     two+earner           0.293104
television     tv                   0.806018
soldier        troop                0.374410
treasury       *short+term          0.661133
research       study                0.209257
withdrawal     *pullout             0.622558

Table 2. Filtered word similarities (* indicates the more specific term).

word1        word2              SIMnorm
lord         abuser             0.261507
break        accelerator        0.106775
accept       reject             0.275802
accord       level              0.152023
cent         accrue             0.259478
acquire      deficient          0.615525
fix+rate     adjustable         0.930180
advertise    environmental      0.124026
petroleum    aerospace          0.207406
afghan       nicaraguan         0.421478
banana       dominican          0.444193
begin        month              0.346558
german       british            0.112465
superpower   reagan+gorbachev   0.400145
republic     sea                0.227662
sunglasses   sneakers           0.126749

Table 3. Some (apparently?) "bad" similarities generated.

SUMMARY OF RESULTS

We have processed a total of 500 MBytes of articles from the Wall Street Journal section of the TREC database. Runs were made independently on the training corpus (250 MBytes) and the test corpus (another 250 MBytes). Natural language processing of each half of the corpus, including tagging, stemming, parsing and pair extraction, required approximately 2 weeks of nearly uninterrupted processing on two Sparc workstations (we use an assortment of SparcStation 1, ELC and 2, with MIPS rates varying from 13 to 21 to 28.5).[22] Computing term similarities from the first 250 MBytes of text took up another week. This time, limitations in our computing resources, mostly memory, were the main reason. With a sufficient amount of RAM (on the order of 256 MBytes or more) we estimated that this process should take no more than 20 hours. It should be noted that we computed term similarities based only on the training corpus, but we used them in retrieval from either database. We assumed that the underlying set of concepts and the terms describing them do not vary greatly between the two databases (the first covered the Wall Street Journal from 1987 to 1989, the second from 1990 to 1992).

22. Processing of the training corpus took significantly longer in practice (real time) since we had only one SparcStation available at any given time, and also due to errors made and the resulting need for re-processing some parts.

The subsequent indexing process performed by the NIST IR system required an additional 2 weeks for each half of the database, on a single SparcStation 2.[23] In total, we created three inverted file indexes: two were derived from the training corpus, and one from the test corpus. The first index created from the training corpus did not include compound terms; the remaining two indexes included them. No cumulative index was produced, since we estimated it would require more than 6 weeks to build. Some interesting discrepancies were observed. While the postings file generated from the training corpus was 25% larger than the one generated from the test corpus, the corresponding dictionary of terms was about 25% smaller in the training corpus. This seems to suggest that in the 1990-92 volumes of the Wall Street Journal (test corpus) a greater number of unique terms were used, while these terms were on average occurring more sparsely in the text. This fact may partially undermine an earlier assumption that term similarities generated from the training corpus were adequate for the test corpus as well.

During the training stage of TREC we performed only very limited tests, mostly because the base system retrieval was quite slow on a database of this size (approx. 173 MBytes, 99+K documents).[24]
We ran experiments with topics 001 to 005, using relevance judgements supplied with the training database. The purpose of these tests was to (1) observe any improvement in performance resulting from the use of compound phrase terms, (2) tune the term similarity filter by adjusting certain threshold values in an attempt to obtain the most effective set of similarities, and (3) note if any manual intervention into the queries could improve retrieval results. As may be expected, none of these experiments was conclusive, but this was all we could rely upon in the short time given. First of all, we noticed that the system's overall performance (cumulative recall and precision levels) was below what we could have expected from runs on smaller benchmark collections such as CACM-3204. This may be partially explained by the diversity of the subject matter covered in the TREC data, and also by the character of the queries (i.e., requests to extract information), but some problem with the base search engine may also be to blame.[25] We were also quite disappointed with the effect that the query expansion had (or did not have) on either precision or recall figures.

23. The indexing process could not be done in parallel because the NIST system requires serial processing. Moreover, as indexing progressed and the partial database grew in size, the process slowed down significantly, from approx. 6 min/file to 37 min/file.
24. The system needed approx. 60 minutes to process a query.
25. In the middle of the training run we discovered that the NIST system had been designed to handle databases of up to 65K documents (2^16). Subsequently we rewrote portions of the code to fix this, but we were not certain if other decisions made by the retrieval module (e.g., bitmapping, idf cutoffs) were not in fact counterproductive with larger databases. Later experiments have confirmed this as we obtained new code from NIST.
This was somewhat more directly related to the wide domain scope of the WSJ articles, and we noticed many obvious word sense mixups leading to unwanted term associations. Also, the rather straightforward method of adding query expansion terms proved less adequate here, and we are considering new methods as outlined briefly in the introduction to this paper. Our final test with manually altered queries (terms were deleted or added after consulting the original topic) brought only negative results: any kind of intervention appeared only to further decrease both recall and precision.

Final runs were performed as follows:

(1) Topics 1-25 were processed using only the direct fields (<title>, <desc>, and <narr>) with respect to the training database (disk 1). They were subsequently run in the routing mode against the test database (disk 2).[26] Unfortunately, the routing results were hindered by a bug in our program that removed duplicate terms from queries while preparing them for a routing run.

26. While other fields provided much useful information about various aspects of a given topic (including definitions of specialized terms), we thought that their inclusion in the final query would make it difficult to assess the advantages of linguistic processing. This especially applies to the Concepts field <con>, which listed hand-extracted keywords. Subsequently, we realized that more often than not this field in fact provided further clues not found in other sections of a topic.

(2) Topics 51-75 were processed using the same 3 fields as above with respect to the test database only (disk 2). They were then run in ad-hoc mode independently against the training database and the test database, and the top 200 documents from each run (for each query) were merged using document scores (see the sketch below). The upper 200 documents were selected for the final result. It should be noted that we ran topics against disk 1 using idf weights on terms from disk 2 (a routing-like mode). This, it turned out, created an even distribution of hits between the two databases. We call this method the uniform merge.

(3) We repeated run (2), but this time we ran topics 51-75 in ad-hoc mode on both disks (i.e., different weights were used in these runs). Again, the top 200 documents were selected. We call this method the brute force merge.

(4) Finally, we repeated the uniform merge with queries manually pruned before actual retrieval.

Summary statistics for these runs are shown in Tables 4 and 5. It has to be pointed out that these results are lower than expected because of various problems with the database search program and because the Concepts field in TREC topics was not included in search queries. Subsequent runs with the Concepts field included in search produced the results shown in Table 6.
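The score-based merging step used in runs (2) and (3) amounts to pooling the two per-database rankings and keeping the best 200 documents. The sketch below is an illustration only; the document identifiers and scores are invented, and the real runs worked on the NIST system's output.

    import heapq

    def merge_runs(run_disk1, run_disk2, cutoff=200):
        """run_diskN: list of (doc_id, score) pairs, already ranked per database."""
        pooled = run_disk1[:cutoff] + run_disk2[:cutoff]
        # In the 'uniform merge' both runs were scored with the same (disk 2)
        # idf weights, so scores are directly comparable; the 'brute force'
        # variant pools scores computed with different weights anyway.
        return heapq.nlargest(cutoff, pooled, key=lambda pair: pair[1])

    run1 = [("WSJ870213-0001", 14.2), ("WSJ880705-0032", 11.9)]   # hypothetical
    run2 = [("WSJ910502-0107", 13.1)]                             # hypothetical
    print(merge_runs(run1, run2, cutoff=200))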
Run                       nyuir1     nyuir2
Name                      routing    routing
Queries                   25         25
Total number of docs over all queries:
  Ret                     5000       5000
  Rel                     3766       3766
  RelRet                  834        1177
Recall - Precision Averages:
  0.00                    0.5691     0.6983
  0.10                    0.3618     0.5072
  0.20                    0.1846     0.3508
  0.30                    0.1314     0.2997
  0.40                    0.1065     0.2286
  0.50                    0.0481     0.1481
  0.60                    0.0295     0.0604
  0.70                    0.0104     0.0296
  0.80                    0.0104     0.0260
  0.90                    0.0000     0.0125
  1.00                    0.0000     0.0000
Average precision for all points:
  11-pt                   0.1320     0.2147
Average precision at 0.20, 0.50, 0.80:
  3-pt                    0.0811     0.1750
Recall at:
  5 docs                  0.0338     0.0374
  15 docs                 0.0671     0.0961
  30 docs                 0.0979     0.1533
  100 docs                0.2117     0.3119
  200 docs                0.2988     0.4298
Precision at:
  5 docs                  0.3360     0.4480
  15 docs                 0.2960     0.4587
  30 docs                 0.2787     0.3973
  100 docs                0.2276     0.3080
  200 docs                0.1668     0.2354

Table 4. Routing run statistics: (1) duplicate terms (unintentionally) removed from queries; and (2) duplicate terms retained and Concepts fields included.

Run                       nyuir1     nyuir2     nyuir3
Name                      brute      prune      uniform
Queries                   25         25         25
Total number of docs over all queries:
  Ret                     4986       4990       4989
  Rel                     3318       3318       3318
  RelRet                  1161       958        1039
Recall - Precision Averages:
  0.00                    0.7830     0.7168     0.7368
  0.10                    0.6216     0.4858     0.5592
  0.20                    0.4891     0.3722     0.4660
  0.30                    0.3815     0.2032     0.3235
  0.40                    0.2350     0.1382     0.2099
  0.50                    0.1871     0.0599     0.1385
  0.60                    0.1447     0.0419     0.0801
  0.70                    0.0705     0.0058     0.0560
  0.80                    0.0229     0.0030     0.0056
  0.90                    0.0000     0.0000     0.0000
  1.00                    0.0000     0.0000     0.0000
Average precision for all points:
  11-pt                   0.2669     0.1843     0.2341
Average precision at 0.20, 0.50, 0.80:
  3-pt                    0.2330     0.1450     0.2034
Recall at:
  5 docs                  0.0713     0.0444     0.0571
  15 docs                 0.1326     0.0929     0.1297
  30 docs                 0.1868     0.1424     0.1734
  100 docs                0.3350     0.2550     0.2918
  200 docs                0.4294     0.3466     0.3908
Precision at:
  5 docs                  0.5680     0.5200     0.5360
  15 docs                 0.5200     0.4267     0.4907
  30 docs                 0.4320     0.3627     0.4147
  100 docs                0.3140     0.2684     0.2916
  200 docs                0.2322     0.1916     0.2078

Table 5. Ad-hoc run statistics with Automatic Brute Force Merge, Uniform Merge with hand pruning, and Automatic Uniform Merge.

Run                       nyuir1     nyuir4     nyuir5
Name                      brute      concepts   negations
Queries                   25         25         25
Total number of docs over all queries:
  Ret                     4986       4984       4984
  Rel                     3318       3318       3318
  RelRet                  1161       1291       1309
Recall - Precision Averages:
  0.00                    0.7830     0.7685     0.7823
  0.10                    0.6216     0.6521     0.6625
  0.20                    0.4891     0.5396     0.5460
  0.30                    0.3815     0.4306     0.4419
  0.40                    0.2350     0.2671     0.2755
  0.50                    0.1871     0.2094     0.2085
  0.60                    0.1447     0.1457     0.1457
  0.70                    0.0705     0.0660     0.0660
  0.80                    0.0229     0.0385     0.0385
  0.90                    0.0000     0.0000     0.0000
  1.00                    0.0000     0.0000     0.0000
Average precision for all points:
  11-pt                   0.2669     0.2834     0.2879
Average precision at 0.20, 0.50, 0.80:
  3-pt                    0.2330     0.2625     0.2643
Recall at:
  5 docs                  0.0713     0.0738     0.0741
  15 docs                 0.1326     0.1362     0.1386
  30 docs                 0.1868     0.2007     0.2011
  100 docs                0.3350     0.3513     0.3565
  200 docs                0.4294     0.4739     0.4828
Precision at:
  5 docs                  0.5680     0.6080     0.6080
  15 docs                 0.5200     0.5360     0.5493
  30 docs                 0.4320     0.4760     0.4773
  100 docs                0.3140     0.3432     0.3484
  200 docs                0.2322     0.2582     0.2618

Table 6. Ad-hoc run statistics using Automatic Brute Force Merge: without Concepts field, with Concepts field, and with Concepts excluding negated terms.

ACKNOWLEDGEMENTS

We would like to thank Donna Harman of NIST for making her IR system available to us. We would also like to thank Ralph Weischedel, Marie Meteer and Heidi Fox of BBN for providing and assisting in the use of the part of speech tagger. Jose Perez Carballo has contributed a number of valuable observations during the course of this work, and his assistance in processing the TREC data was critical for our being able to finish on time. This paper is based upon work supported by the Defense Advanced Research Projects Agency under Contract N00014-90-J-1851 from the Office of Naval Research, under Contract N00600-88-D-3717 from PRC Inc., and the National Science Foundation under Grant IRI-89-02304. We also acknowledge support from the Canadian Institute for Robotics and Intelligent Systems (IRIS).

REFERENCES

Church, Kenneth Ward and Hanks, Patrick. 1990. "Word association norms, mutual information, and lexicography." Computational Linguistics, 16(1), MIT Press, pp. 22-29.

Crouch, Carolyn J. 1988. "A cluster-based approach to thesaurus construction." Proceedings of ACM SIGIR-88, pp. 309-320.

Grishman, Ralph, Lynette Hirschman, and Ngo T. Nhan. 1986. "Discovery procedures for sublanguage selectional patterns: initial experiments." Computational Linguistics, 12(3), pp. 205-215.

Grishman, Ralph and Tomek Strzalkowski. 1991. "Information Retrieval and Natural Language Processing." Position paper at the workshop on Future Directions in Natural Language Processing in Information Retrieval, Chicago.

Harman, Donna. 1988. "Towards interactive query expansion." Proceedings of ACM SIGIR-88, pp. 321-331.

Harman, Donna and Gerald Candela. 1989. "Retrieving Records from a Gigabyte of Text on a Minicomputer Using Statistical Ranking." Journal of the American Society for Information Science, 41(8), pp. 581-589.

Hindle, Donald. 1990. "Noun classification from predicate-argument structures." Proc. 28th Meeting of the ACL, Pittsburgh, PA, pp. 268-275.

Lewis, David D. and W. Bruce Croft. 1990. "Term Clustering of Syntactic Phrases." Proceedings of ACM SIGIR-90, pp. 385-405.

Mauldin, Michael. 1991. "Retrieval Performance in Ferret: A Conceptual Information Retrieval System." Proceedings of ACM SIGIR-91, pp. 347-355.

Meteer, Marie, Richard Schwartz, and Ralph Weischedel. 1991. "Studies in Part of Speech Labelling." Proceedings of the 4th DARPA Speech and Natural Language Workshop, Morgan-Kaufman, San Mateo, CA, pp. 331-336.

Sager, Naomi. 1981. Natural Language Information Processing. Addison-Wesley.

Sparck Jones, Karen. 1972. "Statistical interpretation of term specificity and its application in retrieval." Journal of Documentation, 28(1), pp. 11-20.

Sparck Jones, K. and E. O. Barber. 1971. "What makes automatic keyword classification effective?" Journal of the American Society for Information Science, May-June, pp. 166-175.

Sparck Jones, K. and J. I. Tait. 1984. "Automatic search term variant generation." Journal of Documentation, 40(1), pp. 50-66.

Strzalkowski, Tomek and Barbara Vauthey. 1991. "Fast Text Processing for Information Retrieval." Proceedings of the 4th DARPA Speech and Natural Language Workshop, Morgan-Kaufman, pp. 346-351.

Strzalkowski, Tomek and Barbara Vauthey. 1992. "Information Retrieval Using Robust Natural Language Processing." Proc. of the 30th ACL Meeting, Newark, DE, June-July, pp. 104-111.

Strzalkowski, Tomek. 1992. "TTP: A Fast and Robust Parser for Natural Language." Proceedings of the 14th International Conference on Computational Linguistics (COLING), Nantes, France, July 1992, pp. 198-204.

Wilks, Yorick A., Dan Fass, Cheng-Ming Guo, James E. McDonald, Tony Plate, and Brian M. Slator. 1990. "Providing machine tractable dictionary tools." Machine Translation, 5, pp. 99-154.

APPENDIX A: EXAMPLE QUERY

We show TREC topic 057 and a part of the resulting search query, with only the top ranked terms shown, in Table A.1. Please note that we rank terms by their idf scores even though their actual scores are idf * weight. It is worth pointing out, however, that the NIST system uses idf scores to decide if a term falls below a preset "significance" threshold. We only show fields used for query generation.

<top>
<num> Number: 057
<title> Topic: MCI
<desc> Description:
Document will discuss how MCI has been doing since the Bell System breakup.
<narr> Narrative:
A relevant document will discuss the financial health of MCI Communications Corp. since the breakup of the Bell System (AT&T and the seven regional Baby Bells) in January 1984.
The status indicated may not necessarily be a direct or indirect result of the breakup of the system and ensuing regulation and deregulation of Ma Bell or of the restrictions placed upon the seven Bells; it may result from any number of factors, such as advances in telecommunications technology, MCI initiative, etc. MCI's financial health may be reported directly: a broad statement about its earnings or cash flow, or a report containing financial data such as a quarterly report; or it may be reflected by one or more of the following: credit ratings, share of customers, volume growth, cuts in capital spending, figure net loss, pre-tax charge, analysts' or MCI's own forecast about how well they will be doing, or MCI's response to price cuts that AT&T makes at its own initiative or under orders from the Federal Communications Commission (FCC), such as price reductions, layoffs of employees out of a perceived need to cut costs, etc. Daily OTC trading stock market and monthly short interest reports are NOT relevant; the inventory must be longer term, at least quarterly.
</top>

term                    idf          weight
deposit+financial       16.185341    0.192091
result+breakup          16.185341    1.000000
restrict+action         16.185341    0.441260
number+factor           16.185341    1.000000
deposit+financial       16.185341    0.192091
layoff+job              15.600377    0.199677
report+health           15.185340    1.000000
share+cut               15.185340    1.000000
benefit+price           14.863412    0.263889
share+annual            14.600377    0.249365
document+relevant       14.377985    1.000000
growth+custom           14.185340    1.000000
need+perceive           14.185340    1.000000
report+short            14.185340    1.000000
earnings+cash           14.015415    1.000000
report+interest         13.725908    1.000000
share+benefit           13.484900    0.263889
restrict+place          13.185340    1.000000
share+growth            13.097878    1.000000
layoff+employee         13.015415    1.000000
breakup+system          12.661778    1.000000
bell+alarm              12.484900    0.303537
number+adjust           12.377985    0.421292
hoard+cash              12.278449    0.253572
rebate+cash             12.015415    0.112116
growth+volume           11.661778    1.000000
flow+earnings           11.512915    1.000000
pre+tax                 11.430452    1.000000
report+monthly          11.056057    1.000000
flow+capital            10.827788    0.434692
report+quarterly        10.827788    1.000000
bell+baby               10.541484    1.000000
reduce+price            10.314976    1.000000
health+financial        10.005431    1.000000
hoard                   9.956521     0.253572
infusion+cash           9.717734     0.332548
ensue                   9.505860     1.000000
stock+trade             9.327359     1.000000
boost+price             9.225338     0.280143
breakup                 9.135491     1.000000
disposable              9.108524     0.216766
rebate                  8.918553     0.112116
fcc                     8.810301     1.000000
cut+price               8.763275     1.000000
mci                     8.555984     1.000000

Table A.1. Partial search query for topic 057