RECENT DEVELOPMENTS IN NATURAL LANGUAGE TEXT RETRIEVAL

Tomek Strzalkowski and Jose Perez Carballo
Courant Institute of Mathematical Sciences
New York University
715 Broadway, rm. 704
New York, NY 10003
tomek@cs.nyu.edu

ABSTRACT

This paper reports on some recent developments in our natural language text retrieval system. The system uses advanced natural language processing techniques to enhance the effectiveness of term-based document retrieval. The backbone of our system is a traditional statistical engine which builds inverted index files from pre-processed documents, and then searches and ranks the documents in response to user queries. Natural language processing is used to (1) preprocess the documents in order to extract content-carrying terms, (2) discover inter-term dependencies and build a conceptual hierarchy specific to the database domain, and (3) process the user's natural language requests into effective search queries. For the present TREC-2 effort, a total of 550 MBytes of Wall Street Journal articles (the ad-hoc queries database) and 300 MBytes of San Jose Mercury articles (the routing data) have been processed. In terms of text quantity this represents approximately 130 million words of English. Unlike in TREC-1, we were able to create a single compound index for each database, and therefore avoid merging of results. While the general design of the system has not changed since the TREC-1 conference, we nonetheless replaced several components and added a number of new features which are described in the present paper.

INTRODUCTION

A typical information retrieval (IR) task is to select documents from a database in response to a user's query, and rank these documents according to relevance. This has usually been accomplished using statistical methods (often coupled with manual encoding) that (a) select terms (words, phrases, and other units) from documents that are deemed to best represent their content, and (b) create an inverted index file (or files) that provides easy access to documents containing these terms. A subsequent search process will attempt to match a preprocessed user query (or queries) against term-based representations of documents, in each case determining a degree of relevance between the two which depends upon the number and types of matching terms. Although many sophisticated search and matching methods are available, the crucial problem remains that of an adequate representation of content for both the documents and the queries.

The simplest word-based representations of content are usually inadequate since single words are rarely specific enough for accurate discrimination, and their grouping is often accidental. A better method is to identify groups of words that create meaningful phrases, especially if these phrases denote important concepts in the database domain. For example, joint venture is an important term in the Wall Street Journal (WSJ henceforth) database, while neither joint nor venture is important by itself. In the retrieval experiments with the training TREC database, we noticed that both joint and venture were dropped from the list of terms by the system because their idf (inverted document frequency) weights were too low. In large databases, such as TIPSTER, the use of phrasal terms is not just desirable, it becomes necessary. An accurate syntactic analysis is an essential prerequisite for the selection of phrasal terms.
Various statistical methods, e.g., those based on word co-occurrences and mutual information, as well as partial parsing techniques, are prone to high error rates (sometimes as high as 50%), turning out many unwanted associations. Therefore a good, fast parser is necessary, but it is by no means sufficient. While syntactic phrases are often better indicators of content than `statistical phrases' -- where words are grouped solely on the basis of physical proximity (e.g., "college junior" is not the same as "junior college") -- the creation of compound terms makes the term matching process more complex, since in addition to the usual problems of synonymy and subsumption, one must deal with their structure (e.g., "college junior" is the same as "junior in college"). In order to deal with structure, the parser's output needs to be "normalized" or "regularized" so that complex terms with the same or closely related meanings would indeed receive matching representations. This goal has been achieved to a certain extent in the present work. As will be discussed in more detail below, indexing terms were selected from among head-modifier pairs extracted from predicate-argument representations of sentences.

Introduction of compound terms also complicates the task of discovering various semantic relationships among them, including synonymy and subsumption. For example, the term natural language can be considered, in certain domains at least, to subsume any term denoting a specific human language, such as English. Therefore, a query containing the former may be expected to retrieve documents containing the latter. The same can be said about language and English, unless language is in fact a part of the compound term programming language, in which case the association language - Fortran is appropriate. This is a problem because (a) it is a standard practice to include both simple and compound terms in document representation, and (b) term associations have thus far been computed primarily at the word level (including fixed phrases), and therefore care must be taken when such associations are used in term matching. This may prove particularly troublesome for systems that attempt term clustering in order to create "meta-terms" to be used in document representation.

The system presented here computes term associations from text at the word and fixed-phrase level and then uses these associations in query expansion. A fairly primitive filter is employed to separate synonymy and subsumption relationships from others, including antonymy and complementation, some of which are strongly domain-dependent. This process has led to increased retrieval precision in experiments with both ad-hoc and routing queries for the TREC-1 and TREC-2 experiments. However, the actual improvement levels can vary substantially between different databases, types of runs (ad-hoc vs. routing), as well as the degree of prior processing of the queries. We continue to study more advanced clustering methods along with the changes in interpretation of the resulting associations, as signaled in the previous paragraph.

In the remainder of this paper we discuss particulars of the present system and some of the observations made while processing TREC-2 data. The above comments will provide the background for situating our present effort and the state-of-the-art with respect to where we should be in the future.
OVERALL DESIGN

Our information retrieval system consists of a traditional statistical backbone (NIST's PRISE system; Harman and Candela, 1989) augmented with various natural language processing components that assist the system in database processing (stemming, indexing, word and phrase clustering, selectional restrictions), and translate a user's information request into an effective query. This design is a careful compromise between purely statistical non-linguistic approaches and those requiring rather accomplished (and expensive) semantic analysis of data, often referred to as `conceptual retrieval'.

In our system the database text is first processed with a fast syntactic parser. Subsequently certain types of phrases are extracted from the parse trees and used as compound indexing terms in addition to single-word terms. The extracted phrases are statistically analyzed as syntactic contexts in order to discover a variety of similarity links between smaller subphrases and words occurring in them. A further filtering process maps these similarity links onto semantic relations (generalization, specialization, synonymy, etc.), after which they are used to transform a user's request into a search query.

The user's natural language request is also parsed, and all indexing terms occurring in it are identified. Certain highly ambiguous, usually single-word terms may be dropped, provided that they also occur as elements in some compound terms. For example, "natural" is deleted from a query already containing "natural language" because "natural" occurs in many unrelated contexts: "natural number", "natural logarithm", "natural approach", etc. At the same time, other terms may be added, namely those which are linked to some query term through admissible similarity relations. For example, "unlawful activity" is added to a query (TREC topic 055) containing the compound term "illegal activity" via a synonymy link between "illegal" and "unlawful".

One of the striking observations made during the course of TREC-2 was that removing low-quality terms from the queries is at least as important (and often more so) as adding synonyms and specializations. In some instances (e.g., routing runs) low-quality terms had to be removed (or inhibited) before similar terms could be added to the query, or else the effect of query expansion was all but drowned out by the increased noise.[1]

[1] We would like to thank Donna Harman for turning our attention to the importance of term weighting schemes, including term deletion.

After the final query is constructed, the database search follows, and a ranked list of documents is returned. It should be noted that all the processing steps, those performed by the backbone system and those performed by the natural language processing components, are fully automated, and no human intervention or manual encoding is required.
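To make the term-deletion step above concrete, here is a minimal sketch (not the actual system code) of how single-word query terms that also appear inside compound terms could be dropped; the "+"-joined pair notation mirrors the compound-term notation used later in this paper, and the function name is ours.

```python
def prune_single_word_terms(terms):
    """Drop a single-word term when it also occurs as an element of some
    compound term in the same query (e.g., drop 'natural' when the query
    already contains the compound 'language+natural')."""
    compounds = [t for t in terms if "+" in t]
    elements = {part for c in compounds for part in c.split("+")}
    pruned = []
    for t in terms:
        if "+" not in t and t in elements:
            continue  # highly ambiguous single word already covered by a compound
        pruned.append(t)
    return pruned

# 'natural' is removed because 'language+natural' covers it; 'process' stays.
print(prune_single_word_terms(["natural", "language+natural", "process"]))
```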
FAST PARSING WITH TTP PARSER

TTP (Tagged Text Parser) is based on the Linguistic String Grammar developed by Sager (1981). The parser currently encompasses some 400 grammar productions, but it is by no means complete. The parser's output is a regularized parse tree representation of each sentence, that is, a representation that reflects the sentence's logical predicate-argument structure. For example, logical subject and logical object are identified in both passive and active sentences, and noun phrases are organized around their head elements. The parser is equipped with a powerful skip-and-fit recovery mechanism that allows it to operate effectively in the face of ill-formed input or under severe time pressure. In the runs with approximately 130 million words of TREC's Wall Street Journal and San Jose Mercury texts,[2] the parser's speed averaged between 0.3 and 0.5 seconds per sentence, or up to 70 words per second, on a Sun SparcStation. In addition, TTP has been shown to produce parse structures which are no worse than those generated by full-scale linguistic parsers when compared to hand-coded Treebank parse trees.

[2] Approximately 0.85 GBytes of text, over 6 million sentences.

TTP is a full grammar parser, and initially it attempts to generate a complete analysis for each sentence. However, unlike an ordinary parser, it has a built-in timer which regulates the amount of time allowed for parsing any one sentence. If a parse is not returned before the allotted time elapses, the parser enters the skip-and-fit mode in which it will try to "fit" the parse. While in the skip-and-fit mode, the parser will attempt to forcibly reduce incomplete constituents, possibly skipping portions of input in order to restart processing at a next unattempted constituent. In other words, the parser will favor reduction over backtracking while in the skip-and-fit mode. The result of this strategy is an approximate parse, partially fitted using top-down predictions. The fragments skipped in the first pass are not thrown out; instead they are analyzed by a simple phrasal parser that looks for noun phrases and relative clauses and then attaches the recovered material to the main parse structure. Full details of the TTP parser have been described in the TREC-1 report (Strzalkowski, 1993a), as well as in other works (Strzalkowski, 1992; Strzalkowski & Scheyen, 1993).

As may be expected, the skip-and-fit strategy will only be effective if the input skipping can be performed with a degree of determinism. This means that most of the lexical-level ambiguity must be removed from the input text prior to parsing. We achieve this using a stochastic parts of speech tagger to preprocess the text (see the TREC-1 report for details). For TREC-2 a number of problems have been corrected in the tagger, including improper tokenization of input and handling of abbreviations.

WORD SUFFIX TRIMMER

Word stemming has been an effective way of improving document recall since it reduces words to their common morphological root, thus allowing more successful matches. On the other hand, stemming tends to decrease retrieval precision if care is not taken to prevent situations where otherwise unrelated words are reduced to the same stem. In our system we replaced a traditional morphological stemmer with a conservative dictionary-assisted suffix trimmer.[3] The suffix trimmer performs essentially two tasks: (1) it reduces inflected word forms to their root forms as specified in the dictionary, and (2) it converts nominalized verb forms (e.g., "implementation", "storage") to the root forms of the corresponding verbs (i.e., "implement", "store"). This is accomplished by removing a standard suffix, e.g., "stor+age", replacing it with a standard root ending ("+e"), and checking the newly created word against the dictionary, i.e., we check whether the new root ("store") is indeed a legal word.

[3] Dealing with prefixes is a more complicated matter, since they may have quite a strong effect upon the meaning of the resulting term, e.g., un- usually introduces explicit negation.
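As an illustration only, here is a minimal sketch of the kind of dictionary-assisted suffix trimming just described. The suffix-to-ending substitutions and the tiny lexicon are hypothetical; the actual system works from a full dictionary and also handles ordinary inflectional endings.

```python
# Hypothetical suffix -> replacement-ending rules; the real trimmer is dictionary-driven.
SUFFIX_RULES = [("ation", ""), ("age", "e"), ("ment", ""), ("ing", "")]

def trim_suffix(word, dictionary):
    """Strip a nominalizing suffix and attach the standard root ending, accepting the
    reduction only if the resulting root is a legal word in the dictionary."""
    for suffix, ending in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + ending
            if candidate in dictionary:  # dictionary check prevents over-stemming
                return candidate
    return word

lexicon = {"store", "implement", "retrieve"}
print(trim_suffix("storage", lexicon))         # -> store
print(trim_suffix("implementation", lexicon))  # -> implement
print(trim_suffix("mirage", lexicon))          # -> mirage ("mire" is not in this lexicon)
```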
HEAD-MODIFIER STRUCTURES

Syntactic phrases extracted from TTP parse trees are head-modifier pairs. The head in such a pair is a central element of a phrase (main verb, main noun, etc.), while the modifier is one of the adjunct arguments of the head. In the TREC experiments reported here we extracted head-modifier word and fixed-phrase pairs only. While TREC databases are large enough to warrant generation of larger compounds, we were in no position to verify their effectiveness in indexing. This was largely because of the tight schedule, but also because of the rapidly escalating complexity of the indexing process: even with two-word phrases, compound terms accounted for nearly 88% of all index entries; in other words, including two-word phrases increased the index size approximately 8 times.

Let us consider a specific example from the WSJ database:

The former Soviet president has been a local hero ever since a Russian tank invaded Wisconsin.

The tagged sentence is given below, followed by the regularized parse structure generated by TTP, given in Figure 1.

The/dt former/jj Soviet/jj president/nn has/vbz been/vbn a/dt local/jj hero/nn ever/rb since/in a/dt Russian/jj tank/nn invaded/vbd Wisconsin/np ./per

[assert
  [[perf [HAVE]]
   [[verb [BE]]
    [subject [np [n PRESIDENT] [t_pos THE] [adj [FORMER]] [adj [SOVIET]]]]
    [object [np [n HERO] [t_pos A] [adj [LOCAL]]]]
    [adv EVER]
    [sub_ord [SINCE
      [[verb [INVADE]]
       [subject [np [n TANK] [t_pos A] [adj [RUSSIAN]]]]
       [object [np [name WISCONSIN]]]]]]]]]

Figure 1. Predicate-argument parse structure.

It should be noted that the parser's output is a predicate-argument structure centered around the main elements of various phrases. In Figure 1, BE is the main predicate (modified by HAVE) with 2 arguments (subject, object) and 2 adjuncts (adv, sub_ord). INVADE is the predicate in the subordinate clause with 2 arguments (subject, object). The subject of BE is a noun phrase with PRESIDENT as the head element, two modifiers (FORMER, SOVIET) and a determiner (THE). From this structure, we extract head-modifier pairs that become candidates for compound terms. The following types of pairs are considered: (1) a head noun and its left adjective or noun adjunct, (2) a head noun and the head of its right adjunct, (3) the main verb of a clause and the head of its object phrase, and (4) the head of the subject phrase and the main verb. These types of pairs account for most of the syntactic variants for relating two words (or simple phrases) into pairs carrying compatible semantic content. For example, the pair retrieve+information will be extracted from any of the following fragments: information retrieval system; retrieval of information from databases; and information that can be retrieved by a user-controlled interactive search process. In the example at hand, the following head-modifier pairs are extracted (pairs containing low-content elements, such as BE and FORMER, or names, such as WISCONSIN, will be later discarded):

PRESIDENT+BE, PRESIDENT+FORMER, PRESIDENT+SOVIET, BE+HERO, HERO+LOCAL, TANK+INVADE, TANK+RUSSIAN, INVADE+WISCONSIN

We may note that the three-word phrase former Soviet president has been broken into two pairs, former president and Soviet president, both of which denote things that are potentially quite different from what the original phrase refers to, and this fact may have a potentially negative effect on retrieval precision. This is one place where a longer phrase appears more appropriate.
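For illustration, the following sketch extracts the four pair types listed above from a heavily simplified clause representation; the dictionary layout is ours and is not TTP's actual output format.

```python
# A simplified clause (not TTP's actual output format): a verb plus subject and
# object noun phrases, each with a head noun and its adjective/noun adjuncts.
clause = {
    "verb": "invade",
    "subject": {"head": "tank", "adjuncts": ["russian"]},
    "object": {"head": "wisconsin", "adjuncts": []},
}

def extract_pairs(clause):
    """Emit head-modifier pairs of the four types described above."""
    pairs = []
    subj, obj, verb = clause["subject"], clause["object"], clause["verb"]
    for np in (subj, obj):
        for adj in np["adjuncts"]:
            pairs.append((np["head"], adj))   # (1)/(2) head noun + adjunct
    if obj["head"]:
        pairs.append((verb, obj["head"]))     # (3) main verb + object head
    pairs.append((subj["head"], verb))        # (4) subject head + main verb
    return pairs

print(extract_pairs(clause))
# -> [('tank', 'russian'), ('invade', 'wisconsin'), ('tank', 'invade')]
```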
The representation of this sentence may therefore contain the following terms (along with their inverted document frequency weights):

PRESIDENT            2.623519
SOVIET               5.416102
PRESIDENT+SOVIET    11.556747
PRESIDENT+FORMER    14.594853
HERO                 7.896426
HERO+LOCAL          14.314775
INVADE               8.435012
TANK                 6.848128
TANK+INVADE         17.402237
TANK+RUSSIAN        16.030809
RUSSIAN              7.383342
WISCONSIN            7.785689

While generating compound terms we took care to identify `negative' terms, that is, those whose denotations have been explicitly excluded by negation. Even though matching of negative terms was not used in retrieval (nor did we use negative weights), we could easily prevent matching a negative term in a query against its positive counterpart in the database by removing known negative terms from queries. As an example consider the following fragment from topic 067:

It should NOT be about economically-motivated civil disturbances and NOT be about a civil disturbance directed against a second country.

The corresponding compound terms are: NOT disturb+civil, NOT country+second, NOT direct+disturb.

The particular way of interpreting syntactic contexts was dictated, to some degree at least, by statistical considerations. Our initial experiments were performed on a relatively small collection (CACM-3204), and therefore we combined pairs obtained from different syntactic relations (e.g., verb-object, subject-verb, noun-adjunct, etc.) in order to increase the frequencies of some associations. This became largely unnecessary in a large collection such as TIPSTER, but we had no means to test alternative options, and thus decided to stay with the original. It should not be difficult to see that this was a compromise solution, since many important distinctions were potentially lost, and strong associations could be produced where there weren't any. A way to improve things is to consider different syntactic relations independently, perhaps as independent sources of evidence that could lend support (or not) to certain term similarity predictions. We started investigating this option during TREC-2; however, it has not been sufficiently tested yet.

One difficulty in obtaining head-modifier pairs of highest accuracy is the notorious ambiguity of nominal compounds. For example, the phrase natural language processing should generate language+natural and processing+language, while dynamic information processing is expected to yield processing+dynamic and processing+information. Still another case is executive vice president, where the association president+executive may be stretching things a bit too far. Since our parser has no knowledge about the text domain, and uses no semantic preferences, it does not attempt to guess any internal associations within such phrases. Instead, this task is passed to the pair extractor module, which processes ambiguous parse structures in two phases. In phase one, all and only unambiguous head-modifier pairs are extracted, and the frequencies of their occurrences are recorded. In phase two, frequency information about pairs generated in the first pass is used to form associations from ambiguous structures. For example, if language+natural has occurred unambiguously a number of times in contexts such as parser for natural language, while processing+natural has occurred significantly fewer times or perhaps not at all, then we will prefer the former association as valid. In TREC-2 phrase disambiguation was not used; instead we decided to avoid ambiguous phrases altogether.
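The two-phase scheme can be sketched as follows; the triple layout, the margin parameter, and the counts in the example are hypothetical and serve only to illustrate the idea of letting unambiguous frequencies arbitrate ambiguous attachments.

```python
from collections import Counter

def disambiguate(ambiguous_triples, pair_counts, margin=2.0):
    """Phase two: for a noun compound (mod, n1, n2) with an ambiguous attachment of
    mod, prefer the association whose unambiguous frequency (recorded in phase one)
    is clearly higher; otherwise leave the triple unresolved."""
    resolved = []
    for mod, n1, n2 in ambiguous_triples:
        inner = pair_counts[(n1, mod)]   # e.g., language+natural
        outer = pair_counts[(n2, mod)]   # e.g., processing+natural
        if inner >= margin * max(outer, 1):
            resolved.append((n1, mod))
        elif outer >= margin * max(inner, 1):
            resolved.append((n2, mod))
        # else: still ambiguous; the TREC-2 runs simply avoided such phrases
    return resolved

counts = Counter({("language", "natural"): 12, ("processing", "natural"): 0})
print(disambiguate([("natural", "language", "processing")], counts))
# -> [('language', 'natural')]
```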
While our disambiguation program worked generally satisfactorily, we could resolve only a small fraction of cases (about 7%), and thus its impact on the overall system's performance was limited. However, query-level disambiguation may be more important.

TERM CORRELATIONS FROM TEXT

Head-modifier pairs form compound terms used in database indexing. They also serve as occurrence contexts for smaller terms, including single-word terms. If two terms tend to be modified with a number of common modifiers and otherwise appear in few distinct contexts, we assign them a similarity coefficient, a real number between 0 and 1. The similarity is determined by comparing distribution characteristics for both terms within the corpus: how much information content do they carry, does their information contribution over contexts vary greatly, are the common contexts in which these terms occur specific enough? In general we will credit high-content terms appearing in identical contexts, especially if these contexts are not too commonplace.[4]

[4] It would not be appropriate to predict similarity between language and logarithm on the basis of their co-occurrence with natural.

For the TREC-2 runs we used a similarity formula which, unlike the similarity formula used in TREC-1, produces clusters of related words and phrases but will not generate a uniform term similarity ranking across clusters. This new formula, however, appeared better suited to handle the diverse subject matter of the WSJ database. We used a (revised) variant of the weighted Tanimoto measure described in (Grefenstette, 1992):

SIM(x_1,x_2) = \frac{\sum_{att} \min\bigl(W([x_1,att]), W([x_2,att])\bigr)}{\sum_{att} \max\bigl(W([x_1,att]), W([x_2,att])\bigr)}

with

W([x,y]) = GEW(x) \cdot \log(f_{x,y})

GEW(x) = 1 + \sum_{y} \left[ \frac{f_{x,y}}{n_x} \cdot \frac{\log(f_{x,y}/n_x)}{\log(N)} \right]

In the above, f_{x,y} stands for the absolute frequency of pair [x,y], n_x is the frequency of term x, and N is the number of single-word terms. Sample clusters obtained from an approx. 250 MByte (42 million word) subset of WSJ (years 1990-1992) are given in Table 1.

In order to generate better similarities, we require that words x_1 and x_2 appear in at least M distinct common contexts, where a common context is a couple of pairs [x_1,y] and [x_2,y], or [y,x_1] and [y,x_2], such that each occurred at least three times. Thus, banana and Baltic will not be considered for a similarity relation on the basis of their occurrences in the common context of republic, no matter how frequent, unless there is another such common context comparably frequent (there wasn't any in TREC's WSJ database). For smaller or narrow-domain databases M=2 is usually sufficient. For large databases covering a rather diverse subject matter, like WSJ or SJMN (San Jose Mercury News), we used M=5.[5]

[5] For example, banana and Dominican were found to have two common contexts: republic and plant, although the second occurred in apparently different senses in Dominican plant and banana plant.
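The following is a schematic, dictionary-based rendering of this similarity computation under the reconstruction of the formulas given above; it ignores the left/right position distinction and the M-common-contexts requirement, and all identifiers are ours.

```python
import math

def gew(x, pair_freq, term_freq, num_terms):
    """Global entropy weight of term x over its syntactic contexts (between 0 and 1)."""
    entropy = 0.0
    for (head, att), f in pair_freq.items():
        if head == x and f > 0:
            p = f / term_freq[x]
            entropy += p * math.log(p) / math.log(num_terms)
    return 1.0 + entropy

def sim(x1, x2, pair_freq, term_freq, num_terms):
    """Weighted Tanimoto similarity of x1 and x2 over their syntactic contexts.
    pair_freq[(x, att)] is the frequency of pair [x, att]; term_freq[x] is the
    frequency of term x; num_terms is the number of single-word terms."""
    g = {x: gew(x, pair_freq, term_freq, num_terms) for x in (x1, x2)}

    def weight(x, att):
        f = pair_freq.get((x, att), 0)
        return g[x] * math.log(f) if f > 0 else 0.0

    atts = {att for (head, att) in pair_freq if head in (x1, x2)}
    num = sum(min(weight(x1, a), weight(x2, a)) for a in atts)
    den = sum(max(weight(x1, a), weight(x2, a)) for a in atts)
    return num / den if den else 0.0
```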
The M=5 threshold, however, turned out not to be sufficient. We would still generate fairly strong similarity links between terms such as aerospace and pharmaceutical, where 6 or more common contexts were found. In the example at hand, the following common contexts were located, all occurring at the head (left) position of a pair (at right are their global entropy weights and frequencies with aerospace and pharmaceutical, respectively):[6]

firm       GEW=0.58   f_x1y=9     f_x2y=22
industry   GEW=0.51   f_x1y=?4    f_x2y=56
sector     GEW=0.61   f_x1y=5     f_x2y=9
concern    GEW=0.50   f_x1y=130   f_x2y=115
analyst    GEW=0.62   f_x1y=23    f_x2y=8
division   GEW=0.53   f_x1y=36    f_x2y=28
giant      GEW=0.62   f_x1y=15    f_x2y=12

[6] Other common contexts, such as company or market, have already been rejected because they were paired with too many different words (a high dispersion ratio, see note 12).

Note that while some of these weights are quite low (less than 0.6 -- GEW takes values between 0 and 1), thus indicating a low-importance context, the frequencies with which these contexts occurred with both terms were high and balanced on both sides (e.g., concern), thus adding to the strength of association. We are now considering additional thresholds to bar low-importance contexts from being used in similarity calculation.

It may be worth pointing out that the similarities are calculated using term co-occurrences in syntactic rather than in document-size contexts, the latter being the usual practice in non-linguistic clustering (e.g., Sparck Jones and Barber, 1971; Crouch, 1988; Lewis and Croft, 1990). Although the two methods of term clustering may be considered mutually complementary in certain situations, we believe that more and stronger associations can be obtained through syntactic-context clustering, given a sufficient amount of data and a reasonably accurate syntactic parser.[7]

[7] Non-syntactic contexts cross sentence boundaries with no fuss, which is helpful with short, succinct documents (such as CACM abstracts), but less so with longer texts; see also (Grishman et al., 1986).

QUERY EXPANSION

Similarity relations are used to expand user queries with new terms, in an attempt to make the final search query more comprehensive (adding synonyms) and/or more pointed (adding specializations).[8] It follows that not all similarity relations will be equally useful in query expansion; for instance, complementary and antonymous relations like the one between Australian and Canadian, accept and reject, or even generalizations like from aerospace to industry, may actually harm the system's performance, since we may end up retrieving many irrelevant documents. On the other hand, database search is likely to miss relevant documents if we overlook the fact that vice director can also be deputy director, or that takeover can also be merge, buy-out, or acquisition. We noted that an average set of similarities generated from a text corpus contains about as many "good" relations (synonymy, specialization) as "bad" relations (antonymy, complementation, generalization), as seen from the query expansion viewpoint. Therefore any attempt to separate these two classes and to increase the proportion of "good" relations should result in improved retrieval. This has indeed been confirmed in our experiments, where a relatively crude filter has visibly increased retrieval precision.

[8] Query expansion (in the sense considered here, though not quite in the same way) has been used in information retrieval research before (e.g., Sparck Jones and Tait, 1984; Harman, 1988), usually with mixed results. An alternative is to use term clusters to create new terms, "meta-terms", and use them to index the database instead (e.g., Crouch, 1988; Lewis and Croft, 1990). We found that the query expansion approach gives the system more flexibility, for instance, by making room for hypertext-style topic exploration via user feedback.
In order to create an appropriate filter, we devised a global term specificity measure (GTS) which is calculated for each term across all contexts in which it occurs. The general philosophy here is that a more specific word/phrase would have a more limited use, i.e., a more specific term would appear in fewer distinct contexts. In this respect, GTS is similar to the standard inverted document frequency (idf) measure except that term frequency is measured over syntactic units rather than document-size units.[9] Terms with higher GTS values are generally considered more specific, but the specificity comparison is only meaningful for terms which are already known to be similar. The new function is calculated according to the following formula:

GTS(w) =
  IC_L(w) \cdot IC_R(w)   if both exist
  IC_R(w)                 if only IC_R(w) exists
  IC_L(w)                 otherwise

where (with n_w, d_w > 0):

IC_L(w) = IC([w,\_]) = \frac{n_w}{d_w(n_w + d_w - 1)}

IC_R(w) = IC([\_,w]) = \frac{n_w}{d_w(n_w + d_w - 1)}

with n_w and d_w computed over the pairs in which w occurs as the left (respectively, right) element.

[9] We believe that measuring term specificity over document-size contexts (e.g., Sparck Jones, 1972) may not be appropriate in this case. In particular, syntax-based contexts allow for processing texts without any internal document structure.

For any two terms w_1 and w_2, and a constant b > 1, if GTS(w_2) > b * GTS(w_1) then w_2 is considered more specific than w_1. In addition, if SIM_norm(w_1,w_2) = sigma > theta, where theta is an empirically established threshold, then w_2 can be added to the query containing term w_1 with weight sigma.[10] For example, the following were obtained from the WSJ training database:

GTS(takeover) = 0.00145576
GTS(merge)    = 0.00094518
GTS(buy-out)  = 0.00272580
GTS(acquire)  = 0.00057906

with

SIM(takeover, merge)   = 0.190444
SIM(takeover, buy-out) = 0.157410
SIM(takeover, acquire) = 0.139497
SIM(merge, buy-out)    = 0.133800
SIM(merge, acquire)    = 0.263772
SIM(buy-out, acquire)  = 0.109106

[10] For TREC-2 we used theta = 0.2; b varied between 10 and 100.

Therefore both takeover and buy-out can be used to specialize merge or acquire. With this filter, the relationships between takeover and buy-out and between merge and acquire are either both discarded or accepted as synonymous. At this time we are unable to tell synonymous or near-synonymous relationships from those which are primarily complementary, e.g., man and woman.

In TREC-1 the impact of query expansion through term similarities on the system's overall performance was generally disappointing. For TREC-2 we made a number of changes to the term correlation model, but again time limitations prevented us from properly testing all options. Among the most important changes are:

(1) Exclusion of pairs obtained from SUBJECT-VERB relations: we determined that these contexts are generally of little use as neither subject nor verb subcategorizes well for the other. Moreover, we observed that the presence of these pairs was the source of many unwanted term associations.[11]

[11] Subject-Verb pairs were retained as compound terms, however.

(2) Automatic pruning of low-content terms from the queries: terms with low idf weights, and terms with low information contribution weights[12] that are elements of compound terms, are removed from queries before database search. As we tuned various cutoff thresholds we noted that a significant increase in both recall and precision could be obtained.

[12] The Information Contribution measure indicates the strength of word pairings, and is defined as IC(x,[x,y]) = f_{x,y} / (n_x + d_x - 1), where f_{x,y} is the absolute frequency of pair [x,y] in the corpus, n_x is the frequency of term x at the head position, and d_x is a dispersion parameter understood as the number of distinct syntactic contexts in which term x is found.
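A minimal sketch of this filtering step follows; the data structures, function names, and parameter values in the example are illustrative only (the thresholds are loosened so that the takeover/merge example passes).

```python
def expand_query(query_terms, similar, gts, b=10.0, theta=0.2):
    """Add a similar term w2 to the query only if it is sufficiently more specific than
    the query term w1 (GTS(w2) > b * GTS(w1)) and SIM(w1, w2) > theta; the added term
    receives the similarity value as its weight."""
    weights = {w: 1.0 for w in query_terms}
    for w1 in query_terms:
        for w2, s in similar.get(w1, []):
            if s > theta and gts.get(w2, 0.0) > b * gts.get(w1, 0.0):
                weights[w2] = max(weights.get(w2, 0.0), s)
    return weights

gts = {"merge": 0.00094518, "takeover": 0.00145576, "buy-out": 0.00272580}
similar = {"merge": [("takeover", 0.190444), ("buy-out", 0.133800)]}
# Thresholds loosened for this toy illustration: both specializations are added.
print(expand_query(["merge"], similar, gts, b=1.5, theta=0.1))
# -> {'merge': 1.0, 'takeover': 0.190444, 'buy-out': 0.1338}
```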
word         cluster
takeover     merge, buy-out, acquire, bid
benefit      compensate, aid, expense
capital      cash, fund, money
staff        personnel, employee, force
attract      lure, draw, woo
sensitive    crucial, difficult, critical
speculate    rumor, uncertainty, tension
president    director, executive, chairman
vice         deputy
outlook      forecast, prospect, trend
law          rule, policy, legislate, bill
earnings     profit, revenue, income
portfolio    asset, invest, loan
inflate      growth, demand, earnings
industry     business, company, market
growth       increase, rise, gain
firm         bank, concern, group, unit
environ      climate, condition, situation
debt         loan, secure, bond
lawyer       attorney
counsel      attorney, administrator, secretary
compute      machine, software, equipment
competitor   rival, competition, buyer
alliance     partnership, venture, consortium
big          large, major, huge, significant
fight        battle, attack, war, challenge
base         facile, source, reserve, support
shareholder  creditor, customer, client, investor, stockholder

Table 1. Selected clusters obtained from syntactic contexts, derived from approx. 40 million words of WSJ text, with the weighted Tanimoto formula.

CREATING AN INDEX

Most of the problems we encountered when adapting NIST's PRISE system for use in TREC-2 had to do with the size of the data that had to be indexed. We had to deal with the restrictions imposed by the resources we had (e.g., only [?] MBytes of virtual memory). The rest of this section describes some of the changes we made to the NIST system in order to deal with our restrictions.

The original system would request twice the previously requested amount of memory each time it needed more. As a result, the system would reach the limit of virtual memory after only a relatively small portion of the total number of documents had been indexed. In our version, the memory requested by the system grows linearly. The increments are estimated in such a way that the system never requests too much memory.

The indexing process became too fragile when the limits of the environment were approached. When a large portion of the virtual memory and of the disk space was being used by the indexing process, crashes became very likely. Unfortunately, it turned out that the process was very difficult to restart after some crashes (e.g., in the rebuild phase), thus leading to time-consuming repeats. Indexing also takes too long at present. Given the size of the data to be indexed, the whole process takes at least 250 hours if everything goes well, which happens seldom. Given TREC-2's deadlines we could not afford to perform too many experiments: we barely had time to index the corpus once.

Most of the previous problems could be solved by distributing the indexing process to several different machines and performing the indexing in parallel. We believe that it is possible to create several small indexes instead of a single very large one. If certain rules are followed when creating the distributed index, it should be possible to merge the results of querying the set of small indexes and to obtain a performance (recall and precision) comparable to the results obtained using a single index. The test setup we built in order to perform the experiments required for TREC-2 should allow us to test these hypotheses.
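The paper does not specify a merging procedure; purely as an illustration of the idea, the sketch below merges the ranked lists returned by several small indexes by score, assuming the scores are comparable across indexes (e.g., because idf-style weights come from global collection statistics).

```python
import heapq

def merge_rankings(partial_results, k=1000):
    """Merge per-index rankings (lists of (doc_id, score), each sorted by descending
    score) into a single ranking of at most k documents."""
    merged = heapq.merge(*partial_results, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in list(merged)[:k]]

index_a = [("WSJ870304-0091", 12.2), ("WSJ870429-0078", 7.5)]
index_b = [("WSJ920226-0034", 8.9), ("WSJ911212-0022", 4.0)]
print(merge_rankings([index_a, index_b], k=3))
# -> ['WSJ870304-0091', 'WSJ920226-0034', 'WSJ870429-0078']
```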
The advantages of a distributed index are clear:

(1) The indexing process would be faster.
(2) Each one of the distributed indexing processes would be smaller and less fragile.
(3) Even if one of the distributed processes crashes, restarting it would be less expensive.
(4) A distributed system would be much easier to update, i.e., adding a new document would not require reindexing the whole corpus.
(5) A distributed system would be more useful for studying the kinds of problems and solutions that are likely to be encountered in a real-world situation.

SUMMARY OF RESULTS

We have processed a total of 850 MBytes of text during TREC-2. The first 550 MBytes were articles from the Wall Street Journal which were previously processed for TREC-1; we had to repeat most of the processing to correct early tokenization errors introduced by the tagger. The entire process (tagging, parsing, phrase extraction) took just over 4 weeks on two Sun SparcStations (1 and 2). Building a comprehensive index for the WSJ database took up another 2 weeks. This time we were able to create a single index thanks to the improved indexing software received from NIST. The final index size was 204 MBytes, and included 2,274,775 unique terms (of which about 310,000 were single-word terms, and the remaining 1,865,000 were syntactic word pairs) occurring in 173,219 documents, or more than 13 (unique) terms per document. Note that this gives potentially much better coverage of document content than single-word terms alone, with less than 2 unique terms per document. We say `potentially' since the process of deriving phrase-level terms from text is still only partially understood, including the complex problem of `normalization' of representation.

The remaining 300 MBytes were articles from the San Jose Mercury News, which were contained on TIPSTER disk 3. Processing of this part, and creating an index for routing purposes, took about 3 weeks. While natural language processing required 2 weeks to complete (at approximately the same speed as the WSJ database), we were able to cut indexing time in half by using a faster in-memory version of the NIST system. This new version reduces the time required by the first phase of indexing from days to hours; however, the second phase remains slow (days) and fragile (we had to redo it 3 times). The final size of the SJMN index was 101 MBytes, with 1,535,971 unique terms occurring in 86,270 documents (nearly 18 unique terms per document).[13]

[13] It has to be noted that the ratios at which new terms are generated are nearly identical in both databases; at 86,319 documents (or about half way through the WSJ database) 1,335,622 unique terms had been recorded.

Two types of retrieval have been done: (1) new topics 101-150 were run in the ad-hoc mode against the WSJ database, and (2) topics 51-100, previously used in TREC-1, were run in the routing mode against the SJMN database. In each category several runs were attempted with a different combination of fields from the topics used to create search queries. A typical topic is shown below:

<head> Tipster Topic Description
Number: 107
Domain: International Economics
Topic: Japanese Regulation of Insider Trading
<desc> Description:
Document will inform on Japan's regulation of insider trading.
<narr> Narrative:
A relevant document will provide data on Japanese laws, regulations, and/or practices which help the foreigner understand how Japan controls, or does not control, stock market practices which could be labeled as insider trading.
<con> Concept(s):
1. insider trading
2. Japan
3. Ministry of Finance, Securities and Exchange Council, Osaka Securities Exchange, Tokyo Stock Exchange
4. Securities and Exchange Law, Article 58, law, legislation, guidelines, self-regulation
5. Nikko Securities, Yamaichi Securities, Nomura Securities, Daiwa Securities, Big Four brokerage firms
<fac> Factor(s):
<nat> Nationality: Japan
</fac>
</top>

This topic actually consists of two different statements of the same query: the natural language specification consisting of the <desc> and <narr> fields, and an expert-selected list of key terms which are often far more informative than the narrative part (in some cases these terms were selected via feedback in actual retrieval attempts). The table below shows the search query obtained from fields <desc> and <narr> of topic 107, after an expansion with similar terms and deletion of low-content terms.

Query 107

standard+trade      idf=16.81  weight=0.38
standard+trade      idf=16.81  weight=0.38
regulate+japanese   idf=15.40  weight=1.00
standard+japanese   idf=14.08  weight=0.38
regulate+trade      idf=12.84  weight=1.00
regulate+trade      idf=12.84  weight=1.00
controls            idf=9.97   weight=1.00
labele              idf=?.20   weight=1.00
trade+inside        idf=8.62   weight=1.00
trade+inside        idf=8.62   weight=1.00
inside              idf=?.49   weight=1.00
inside              idf=?.49   weight=1.00
inside              idf=?.49   weight=1.00
regulate            idf=5.66   weight=1.00
regulate            idf=5.66   weight=1.00
regulate            idf=5.66   weight=1.00
practice            idf=5.46   weight=1.00
practice            idf=5.46   weight=1.00
data                idf=4.91   weight=1.00
data                idf=4.91   weight=0.51
data                idf=4.91   weight=0.26
japanese            idf=4.84   weight=1.00
japanese            idf=4.84   weight=1.00
standard            idf=4.81   weight=0.38
standard            idf=4.81   weight=0.38
standard            idf=4.81   weight=0.38
inform              idf=4.71   weight=1.00
inform              idf=4.71   weight=0.26
inform              idf=4.71   weight=0.51
protect             idf=4.69   weight=0.41

Note that many `function' words have been removed from the query, e.g., provide, understand, as well as other `common words' such as document and relevant (this is in addition to our regular list of `stopwords'). Some still remain, however, e.g., data and inform, because these could not be uniformly considered as `common' across all queries.

Results obtained for queries using text fields only and those based primarily on keyword fields are reported separately. The purpose of this distinction was to demonstrate whether intensive natural language processing can turn an imprecise and frequently convoluted narrative into a better query than an expert would create.[14] The ad-hoc category runs were done as follows (these are the official TREC-2 results):

(1) nyuir1: An automatic run of topics 101-150 against the WSJ database with the following fields used: <title>, <desc>, and <narr> only. Both syntactic phrases and term similarities were included.

(2) nyuir2: An automatic run of topics 101-150 against the WSJ database with the following fields used: <title>, <desc>, <con> and <fac> only. Both syntactic phrases and term similarities were included.
(3) nyuir3: A run of manually pruned topics 101-150 against the WSJ database with the following fields used: <title>, <desc>, <con> and <fac> only. Both syntactic phrases and term similarities were included. Manual intervention involved removing some terms from queries before database search.

[14] Some results on the impact of different fields in TREC topics on the final recall/precision results were reported by Broglio and Croft (1993) at the ARPA HLT workshop, although text-only runs were not included. One of the most striking observations they have made is that the narrative field is entirely disposable, and moreover that its inclusion in the query actually hurts the system's performance. Croft (personal communication, 1992) has suggested that excluding all expert-made fields (i.e., <con> and <fac>) would make the queries quite ineffective. Broglio (personal communication, 1993) confirms this, showing that text-only retrieval (i.e., with <desc> and <narr>) shows an average precision more than 30% below that of <con>-based retrieval.

Summary statistics for these runs are shown in Table 2. In addition, the `base' column reports the system's performance on text fields with no language preprocessing, and no phrase terms or similarities used. We must note, however, that in all cases the topics have been processed with our suffix-trimmer, which means some NLP has been done already (tagging + lexicon), and therefore what we see here is not the performance of a `pure' statistical system.

In the routing category only automatic runs were done (again, these are the official TREC-2 results):

(1) nyuir1: An automatic run of topics 51-100 against the SJMN database with the following fields used: <title>, <desc>, and <narr> only. Both syntactic phrases and term similarities were included.

(2) nyuir2: An automatic run of topics 51-100 against the SJMN database with the following fields used: <title>, <desc>, <con> and <fac> only. Both syntactic phrases and term similarities were included.

A (simulated) routing mode run means that queries (i.e., terms and their weights) were derived with respect to a (different) training database (here WSJ), and were subsequently run against the new database (here SJMN). In particular, this means that the terms and their relative importance (reflected primarily through idf weights) were those of the WSJ database rather than the SJMN database. Routing runs are summarized in Table 3. Again a column `base' is added to show the system's performance without the NLP module. We may note that the routing results are generally well below the ad-hoc results, both because the base system performance is inferior and because query processing has a different effect on the final statistics. The last column is a post-TREC run.[15]

[15] It should be noted that in category B runs, three topics (63, 65, and 88) had no relevant documents in the SJMN database. Unfortunately, the evaluation program counts those as if there were relevant documents but none had been found, thus underestimating the system's performance by 5 to 8%. Excluding these three topics from consideration we obtain, in the last column, the average precision of 0.2624 and the R-precision of 0.3000.
Run                 base     nyuir1   nyuir2   nyuir3
Name                ad-hoc   ad-hoc   ad-hoc   ad-hoc
Queries             50       50       50       50
Total number of docs over all queries
  Ret               49387    49834    49876    49877
  Rel               3929     3929     3929     3929
  RelRet            2740     2983     3274     3281
Recall (interp) Precision Averages
  0.00              0.7038   0.7013   0.7528   0.7528
  0.10              0.4531   0.4874   0.5567   0.5574
  0.20              0.3708   0.4326   0.4721   0.4724
  0.30              0.3028   0.3531   0.4060   0.4076
  0.40              0.2550   0.3076   0.3617   0.3621
  0.50              0.2059   0.2637   0.3135   0.3142
  0.60              0.1641   0.2175   0.2703   0.2711
  0.70              0.1180   0.1617   0.2231   0.2237
  0.80              0.0766   0.1176   0.1667   0.1697
  0.90              0.0417   0.0684   0.0915   0.0916
  1.00              0.0085   0.0102   0.0154   0.0160
Average precision over all rel docs
  Avg               0.2224   0.2649   0.3111   0.3118
Precision at
  5 docs            0.4640   0.4920   0.5360   0.5360
  10 docs           0.4140   0.4420   0.4880   0.4880
  15 docs           0.3867   0.4240   0.4693   0.4707
  20 docs           0.3670   0.4050   0.4390   0.4410
  30 docs           0.3253   0.3640   0.4067   0.4080
  100 docs          0.2304   0.2720   0.3094   0.3094
  200 docs          0.1626   0.1886   0.2139   0.2140
  500 docs          0.0911   0.1026   0.1137   0.1140
  1000 docs         0.0548   0.0597   0.0655   0.0656
R-Precision (after RelRet)
  Exact             0.2605   0.3003   0.3320   0.3321

Table 2. Automatic ad-hoc run statistics for queries 101-150 against the WSJ database: (1) base - statistical terms only with <desc> and <narr> fields; (2) nyuir1 - using syntactic phrases and similarities with <desc> and <narr> fields only; (3) nyuir2 - same as 2 but with <desc>, <con>, and <fac> fields only; and (4) nyuir3 - same as 3 but queries manually pruned before search.

Run                 base     nyuir1   nyuir2   nyuir2a
Name                routing  routing  routing  routing
Queries             50       50       50       50
Total number of docs over all queries
  Ret               50000    50000    50000    50000
  Rel               2064     2064     2064     2064
  RelRet            1349     1390     1610     1623
Recall (interp) Precision Averages
  0.00              0.5276   0.5400   0.6435   0.6458
  0.10              0.3685   0.3937   0.4610   0.5021
  0.20              0.3054   0.3423   0.3705   0.4151
  0.30              0.2373   0.2572   0.3031   0.3185
  0.40              0.2039   0.2263   0.2637   0.2720
  0.50              0.1824   0.2032   0.2282   0.2379
  0.60              0.1596   0.1674   0.1934   0.1899
  0.70              0.1167   0.1295   0.1542   0.1571
  0.80              0.0854   0.0905   0.1002   0.1163
  0.90              0.0368   0.0442   0.0456   0.0434
  1.00              0.0228   0.0284   0.0186   0.0158
Average precision over all rel docs
  Avg               0.1884   0.2038   0.2337   0.2466
Precision at
  5 docs            0.3160   0.3360   0.4280   0.4440
  10 docs           0.3100   0.3240   0.4000   0.4180
  15 docs           0.2813   0.2933   0.3613   0.3800
  20 docs           0.2670   0.2790   0.3260   0.3530
  30 docs           0.2240   0.2404   0.2760   0.2993
  100 docs          0.1306   0.1412   0.1708   0.1698
  200 docs          0.0865   0.0939   0.1078   0.1107
  500 docs          0.0464   0.0489   0.0575   0.0570
  1000 docs         0.0270   0.0278   0.0322   0.0325
R-Precision (after Rel)
  Exact             0.21??   0.2267   0.2513   0.2820

Table 3. Automatic routing run statistics for queries 51-100 against the SJMN database: (1) base - statistical terms only with <desc> and <narr> fields; (2) nyuir1 - using syntactic phrases and similarities with <desc> and <narr> fields only; (3) nyuir2 - same as 2 but with <desc>, <con>, and <fac> fields only; and (4) nyuir2a - run nyuir2 repeated with new weighting for phrases.

TERM WEIGHTING ISSUES

Finding a proper term weighting scheme is critical in term-based retrieval, since the rank of a document is determined by the weights of the terms it shares with the query. One popular term weighting scheme, known as tf.idf, weights terms proportionately to their inverted document frequency scores and to their in-document frequencies (tf). The in-document frequency factor is usually normalized by the document length, that is, it is more significant for a term to occur 5 times in a short 20-word document than to occur 10 times in a 1000-word article.[16]

[16] This is not always true, for example when all occurrences of a term are concentrated in a single section or a paragraph rather than spread around the article. See the following section for more discussion.
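For concreteness, here is a generic sketch of length-normalized tf.idf document scoring of the kind referred to above; it is not PRISE's exact formula, and the names and example numbers are ours.

```python
import math

def tfidf_score(query_terms, doc_term_freqs, doc_length, num_docs, doc_freq):
    """Score a document: each matching term contributes its length-normalized
    in-document frequency (tf) times its inverted document frequency (idf)."""
    score = 0.0
    for term in query_terms:
        tf = doc_term_freqs.get(term, 0)
        df = doc_freq.get(term, 0)
        if tf == 0 or df == 0:
            continue
        idf = math.log(num_docs / df)
        score += (tf / doc_length) * idf
    return score

doc = {"insider": 3, "trading": 5, "japan": 1}
print(tfidf_score(["insider", "trading"], doc, doc_length=200,
                  num_docs=173219, doc_freq={"insider": 2000, "trading": 9000}))
```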
In our official TREC runs we used the normalized tf.idf weights for all terms alike: single `ordinary-word' terms, proper names, as well as phrasal terms consisting of 2 or more words. Whenever phrases were included in the term set of a document, the length of this document was increased accordingly. This had the effect of decreasing the tf factors for `regular' single-word terms. A standard tf.idf weighting scheme (and, we suspect, any other uniform scheme based on frequencies) is inappropriate for mixed term sets (ordinary concepts, proper names, phrases) because:

(1) It favors terms that occur fairly frequently in a document, which supports only general-type queries (e.g., "all you know about `star wars'"). Such queries are not typical in TREC.

(2) It attaches low weights to infrequent, highly specific terms, such as names and phrases, whose only occurrences in a document often decide relevance. Note that such terms cannot be reliably distinguished using their distribution in the database as the sole factor, and therefore syntactic and lexical information is required.

(3) It does not address the problem of inter-term dependencies arising when phrasal terms and their component single-word terms are all included in a document representation, i.e., launch+satellite and satellite are not independent, and it is unclear whether they should be counted as two terms.

In our post-TREC-2 experiments we considered (1) and (2) only. We changed the weighting scheme so that the phrases (but not the names, which we did not distinguish in TREC-2) were more heavily weighted by their idf scores, while the in-document frequency scores were replaced by logarithms multiplied by sufficiently large constants. In addition, the top N highest-idf matching terms (simple or compound) were counted more toward the document score than the remaining terms. This `hot-spot' retrieval option is discussed in the next section.

The table below illustrates the problem of weighting phrasal terms using topic 101 and a relevant document (WSJ870226-0091).

Topic 101 matches WSJ870226-0091 (duplicate terms not shown)

TERM                 TF.IDF   NEW WEIGHT
sdi                  1750     1750
eris                 3175     3175
star                 1072     1072
wars                 1670     1670
laser                1456     1456
weapon               1639     1639
missile              872      872
space+base           2641     2105
interceptor          2075     2075
exoatmospheric       1879     3480
system+defense       2846     2219
reentry+vehicle      1879     3480
initiative+defense   1646     2032
system+interceptor   2526     3118
DOC RANK             30       10

Changing the weighting scheme for compound terms, along with other minor improvements (such as expanding the stopword list for topics, or correcting a few parsing bugs), has led to an overall increase of precision of nearly 20% over our official TREC-2 ad-hoc results. Table 4 summarizes these new runs for queries 101-150 against the WSJ database. Similar improvements have been obtained for queries 51-100.

Run                 nyuir1   nyuir1a  nyuir2   nyuir2a
Name                ad-hoc   ad-hoc   ad-hoc   ad-hoc
Queries             50       50       50       50
Total number of docs over all queries
  Ret               49884    50000    49876    50000
  Rel               3929     3929     3929     3929
  RelRet            2983     3108     3274     3401
Recall (interp) Precision Averages
  0.00              0.7013   0.7201   0.7528   0.8063
  0.10              0.4874   0.5239   0.5567   0.6198
  0.20              0.4326   0.4751   0.4721   0.5566
  0.30              0.3531   0.4122   0.4060   0.4786
  0.40              0.3076   0.3541   0.3617   0.4257
  0.50              0.2637   0.3126   0.3135   0.3828
  0.60              0.2175   0.2752   0.2703   0.3380
  0.70              0.1617   0.2142   0.2231   0.2817
  0.80              0.1176   0.1605   0.1667   0.2164
  0.90              0.0684   0.1014   0.0915   0.1471
  1.00              0.0102   0.0194   0.0154   0.0474
Average precision over all rel docs
  Avg               0.2649   0.3070   0.3111   0.3759
Precision at
  5 docs            0.4920   0.5200   0.5360   0.6040
  10 docs           0.4420   0.4900   0.4880   0.5580
  15 docs           0.4240   0.4653   0.4693   0.5253
  20 docs           0.4050   0.4420   0.4390   0.4980
  30 docs           0.3640   0.3993   0.4067   0.4607
  100 docs          0.2720   0.2914   0.3094   0.3346
  200 docs          0.1886   0.2064   0.2139   0.2325
  500 docs          0.1026   0.1103   0.1137   0.1229
  1000 docs         0.0597   0.0622   0.0655   0.0680
R-Precision (after Rel)
  Exact             0.3003   0.3332   0.3320   0.3950

Table 4. Automatic ad-hoc run statistics for queries 101-150 against the WSJ database: (1) nyuir1 - TREC-2 official run with <desc> and <narr> fields only; (2) nyuir1a - revised term weighting run; (3) nyuir2 - official TREC-2 run with <desc>, <con>, and <fac> fields only; and (4) nyuir2a - revised weighting run.

The results of the routing runs against the SJMN database are somewhat more troubling. Applying the new weighting scheme we did see the average precision increase by some 5 to 12% (see column 4 in Table 3), but the results remain far below those for the ad-hoc runs. Direct runs of queries 51-100 against the SJMN database produce results that are about the same as in the routing runs (which may indicate that our routing scheme works fine); however, the same queries run against the WSJ database have retrieval precision some 25% above the SJMN runs. This may indicate some problems with the SJMN database or the relevance judgements for it.
`HOT SPOT' RETRIEVAL

Another difficulty with frequency-based term weighting arises when a long document needs to be retrieved on the basis of a few short relevant passages. If the bulk of the document is not directly relevant to the query, then there is a strong possibility that the document will score low in the final ranking, despite some strongly relevant material in it. This problem can be dealt with by subdividing long documents at paragraph breaks, or into approximately equal-length fragments, and indexing the database with respect to these (e.g., Kwok 1993). While such approaches are effective, they also tend to be costly because of increased index size and more complicated access methods.

Efficiency considerations have led us to investigate an alternative approach to hot spot retrieval which would not require re-indexing of the existing database or any changes in document access. In our approach, the maximum number of terms on which a query is permitted to match a document is limited to the N highest-weight terms, where N can be the same for all queries or may vary from one query to another. Note that this is not the same as simply taking the N top terms from each query. Rather, for each document for which there are M matching terms with the query, only min(M,N) of them, namely those which have highest weights, will be considered when computing the document score. Moreover, only the global importance weights for terms are considered (such as idf), while the local in-document frequency (e.g., tf) is suppressed by either taking a log or replacing it with a constant.
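A minimal sketch of this scoring rule and of merging the two rankings by score is given below; the particular tf dampening and the data layout are illustrative assumptions, not the exact constants used in the official runs.

```python
import math

def hot_spot_score(query_weights, doc_term_freqs, n=20):
    """Score a document using only its min(M, N) highest-weight matching terms;
    the local tf contribution is suppressed with a logarithm.
    query_weights: term -> global importance weight (e.g., idf);
    doc_term_freqs: term -> in-document frequency."""
    matches = [(query_weights[t], f) for t, f in doc_term_freqs.items()
               if t in query_weights]
    matches.sort(reverse=True)          # highest global weights first
    return sum(w * (1.0 + math.log(f)) for w, f in matches[:n])

def merge_by_score(full_ranking, hot_spot_ranking, k=1000):
    """Merge the full tf.idf ranking and the hot-spot ranking by score, keeping the
    better score for documents appearing in both lists."""
    best = {}
    for doc, score in full_ranking + hot_spot_ranking:
        best[doc] = max(best.get(doc, 0.0), score)
    return sorted(best, key=best.get, reverse=True)[:k]
```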
The effect of this `hot spot' retrieval is shown below in the ranking of relevant documents within the top 1000 retrieved documents for topic 65:

Full tf.idf retrieval
DOCUMENT ID       RANK   SCORE
WSJ870304-0091    4      12228
WSJ891017-0156    7      9771
WSJ920226-0034    14     8921
WSJ870429-0078    26     7570
WSJ870205-0078    33     6972
WSJ880712-0033    34     6834
WSJ920116-0002    37     6580
WSJ910328-0013    74     4872
WSJ910830-0140    80     4701
WSJ890804-0138    102    4134
WSJ911212-0022    104    4065
WSJ870825-0026    113    3922
WSJ850712-0023    135    3654
WSJ871202-0145    153    3519

Hot-spot idf-dominated retrieval with N=20
DOCUMENT ID       RANK   SCORE
WSJ920226-0034    1      11955
WSJ870304-0091    3      11565
WSJ870429-0078    5      9997
WSJ920116-0002    7      9997
WSJ910830-0140    11     8792
WSJ870205-0078    20     8402
WSJ910328-0013    29     8402
WSJ880712-0033    71     6834
WSJ850712-0023    72     6834
WSJ891017-0156    87     6834
WSJ890804-0138    92     6834
WSJ911212-0022    111    6834
WSJ871202-0145    124    6834

The final ranking is obtained by merging the two rankings by score. While some of the recall may be sacrificed (`hot spot' retrieval has, understandably, lower recall than full query retrieval, and this becomes the lower bound on recall for the combined ranking), the combined ranking precision has been consistently better than in either of the original rankings: an average improvement is 10-12% above the tf.idf run precision (which is often the stronger of the two).

CONCLUSIONS

We presented in some detail our natural language information retrieval system consisting of an advanced NLP module and a `pure' statistical core engine. While many problems remain to be resolved, including the question of the adequacy of term-based representation of document content, we attempted to demonstrate that the architecture described here is nonetheless viable. In particular, we demonstrated that natural language processing can now be done on a fairly large scale and that its speed and robustness can match those of traditional statistical programs such as key-word indexing or statistical phrase extraction. We suggest, with some caution until more experiments are run, that natural language processing can be very effective in creating appropriate search queries out of a user's initial specifications, which can be frequently imprecise or vague.

On the other hand, we must be aware of the limits of the NLP technologies at our disposal. While part-of-speech tagging, lexicon-based stemming, and parsing can be done on large amounts of text (hundreds of millions of words and more), other, more advanced processing involving conceptual structuring, logical forms, etc., is still beyond reach, computationally. It may be assumed that these super-advanced techniques will prove even more effective, since they address the problem of representation-level limits; however, the experimental evidence is sparse and necessarily limited to rather small-scale tests (e.g., Mauldin, 1991).

ACKNOWLEDGEMENTS

We would like to thank Donna Harman of NIST for making her PRISE system available to us. We would also like to thank Ralph Weischedel and Heidi Fox of BBN for providing and assisting in the use of the part of speech tagger. This paper is based upon work supported by the Advanced Research Projects Agency under Contract N00014-90-J-1851 from the Office of Naval Research, under Contract N00600-88-D-3717 from PRC Inc., and the National Science Foundation under Grant IRI-93-02615. We also acknowledge support from the Canadian Institute for Robotics and Intelligent Systems (IRIS).

REFERENCES

Broglio, John and W. Bruce Croft. 1993. "Query Processing for Retrieval from Large Text Bases."
Proceedings of ARPA HLT Workshop, March 21-24, Plainsboro, NJ.

Church, Kenneth Ward and Patrick Hanks. 1990. "Word association norms, mutual information, and lexicography." Computational Linguistics, 16(1), MIT Press, pp. 22-29.

Crouch, Carolyn J. 1988. "A cluster-based approach to thesaurus construction." Proceedings of ACM SIGIR-88, pp. 309-320.

Grefenstette, Gregory. 1992. "Use of Syntactic Context To Produce Term Association Lists for Text Retrieval." Proceedings of SIGIR-92, Copenhagen, Denmark, pp. 89-97.

Grishman, Ralph, Lynette Hirschman, and Ngo T. Nhan. 1986. "Discovery procedures for sublanguage selectional patterns: Initial experiments." Computational Linguistics, 12(3), pp. 205-215.

Grishman, Ralph and Tomek Strzalkowski. 1991. "Information Retrieval and Natural Language Processing." Position paper at the workshop on Future Directions in Natural Language Processing in Information Retrieval, Chicago.

Harman, Donna. 1988. "Towards interactive query expansion." Proceedings of ACM SIGIR-88, pp. 321-331.

Harman, Donna and Gerald Candela. 1989. "Retrieving Records from a Gigabyte of Text on a Minicomputer Using Statistical Ranking." Journal of the American Society for Information Science, 41(8), pp. 581-589.

Hindle, Donald. 1990. "Noun classification from predicate-argument structures." Proceedings of the 28th Meeting of the ACL, Pittsburgh, PA, pp. 268-275.

Kwok, K. L., L. Papadopoulos and Kathy Y. Y. Kwan. 1993. "Retrieval Experiments with a Large Collection using PIRCS." Proceedings of the TREC-1 conference, NIST Special Publication 500-207, pp. 153-172.

Lewis, David D. and W. Bruce Croft. 1990. "Term Clustering of Syntactic Phrases." Proceedings of ACM SIGIR-90, pp. 385-405.

Mauldin, Michael. 1991. "Retrieval Performance in Ferret: A Conceptual Information Retrieval System." Proceedings of ACM SIGIR-91, pp. 347-355.

Meteer, Marie, Richard Schwartz, and Ralph Weischedel. 1991. "Studies in Part of Speech Labeling." Proceedings of the 4th DARPA Speech and Natural Language Workshop, Morgan-Kaufman, San Mateo, CA, pp. 331-336.

Sager, Naomi. 1981. Natural Language Information Processing. Addison-Wesley.

Sparck Jones, Karen. 1972. "Statistical interpretation of term specificity and its application in retrieval." Journal of Documentation, 28(1), pp. 11-20.

Sparck Jones, K. and E. O. Barber. 1971. "What makes automatic keyword classification effective?" Journal of the American Society for Information Science, May-June, pp. 166-175.

Sparck Jones, K. and J. I. Tait. 1984. "Automatic search term variant generation." Journal of Documentation, 40(1), pp. 50-66.

Strzalkowski, Tomek and Barbara Vauthey. 1991. "Fast Text Processing for Information Retrieval." Proceedings of the 4th DARPA Speech and Natural Language Workshop, Morgan-Kaufman, pp. 346-351.

Strzalkowski, Tomek and Barbara Vauthey. 1992. "Information Retrieval Using Robust Natural Language Processing." Proceedings of the 30th ACL Meeting, Newark, DE, June-July, pp. 104-111.

Strzalkowski, Tomek. 1992. "TTP: A Fast and Robust Parser for Natural Language." Proceedings of the 14th International Conference on Computational Linguistics (COLING), Nantes, France, July 1992, pp. 198-204.

Strzalkowski, Tomek. 1993. "Natural Language Processing in Large-Scale Text Retrieval Tasks." Proceedings of the First Text REtrieval Conference (TREC-1), NIST Special Publication 500-207, pp. 173-187.

Strzalkowski, Tomek. 1993. "Robust Text Processing in Automated Information Retrieval." Proceedings of the ACL-sponsored workshop on Very Large Corpora, Ohio State University,
Columbus, OH, June 22.

Strzalkowski, Tomek and Peter Scheyen. 1993. "An Evaluation of TTP Parser: a preliminary report." Proceedings of the International Workshop on Parsing Technologies (IWPT-93), Tilburg, Netherlands and Durbuy, Belgium, August 10-13.