TREC-2 Routing and Ad-Hoc Retrieval Evaluation using the INQUERY System

Bruce Croft, James Callan, and John Broglio
Computer Science Department
University of Massachusetts
Amherst, MA 01003

1 Project Goals

The ARPA TIPSTER project, which is the source of the data and funding for TREC, has involved four sites in the area of text retrieval and routing. The TIPSTER project in the Information Retrieval Laboratory of the Computer Science Department, University of Massachusetts, Amherst (which includes MCC as a subcontractor), has focused on the following goals:

* Improving the effectiveness of information retrieval techniques for large, full-text databases,

* Improving the effectiveness of routing techniques appropriate for long-term information needs, and

* Demonstrating the effectiveness of these retrieval and routing techniques for Japanese full-text databases [4].

Our general approach to achieving these goals has been to use improved representations of text and information needs in the framework of a new model of retrieval. This model uses Bayesian networks to describe how text and queries should be used to identify relevant documents [6, 3, 7]. Retrieval (and routing) is viewed as a probabilistic inference process which compares text representations based on different forms of linguistic and statistical evidence to representations of information needs based on similar evidence from natural language queries and user interaction. Learning techniques are used to modify the initial queries both for short-term and long-term information needs (relevance feedback and routing, respectively).

This approach (generally known as the inference net model and implemented in the INQUERY system) emphasizes retrieval based on combination of evidence. Different text representations (such as words, phrases, paragraphs, or manually assigned keywords) and different versions of the query (such as natural language and Boolean) can be combined in a consistent probabilistic framework. This type of "data fusion" has been known to be effective in the information retrieval context for a number of years, and was one of the primary motivations for developing the inference net approach.

Another feature of the inference net approach is the ability to capture complex structure in the network representing the information need (i.e., the query). A practical consequence of this is that complex Boolean queries can be evaluated as easily as natural language queries and produce ranked output. It is also possible to represent "rule-based" or "concept-based" queries in the same probabilistic framework. This has led us to concentrate on automatic analysis of queries and techniques for enhancing queries, rather than on in-depth analysis of the documents in the database. In general, it is more effective (as well as efficient) to analyze short query texts than millions of document texts. The results of the query analysis are represented in the INQUERY query language, which contains a number of operators, such as #SUM, #AND, #OR, #NOT, #PHRASE, and #SYN. These operators implement different methods of combining evidence and describing concepts.
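To make the combination of evidence concrete, the following is a minimal sketch of the closed-form belief combinations usually given for these operators in the inference net literature [6]. The Python below is purely illustrative and is not the INQUERY implementation; the function names are ours.

    # Illustrative belief combination for INQUERY-style operators, given
    # the beliefs (probabilities) assigned to a document by child concepts.
    from math import prod

    def op_and(beliefs):   # #AND: conjunction of all concepts
        return prod(beliefs)

    def op_or(beliefs):    # #OR: at least one concept holds
        return 1.0 - prod(1.0 - p for p in beliefs)

    def op_not(belief):    # #NOT: complement of a single belief
        return 1.0 - belief

    def op_sum(beliefs):   # #SUM: evidence averaging
        return sum(beliefs) / len(beliefs)

    beliefs = [0.6, 0.4, 0.7]      # beliefs from three query concepts
    print(op_and(beliefs))         # 0.168  -- strict
    print(op_or(beliefs))          # 0.928  -- tolerant
    print(op_sum(beliefs))         # 0.5667 -- averaged

Note how #SUM sits between the strictness of #AND and the leniency of #OR, which is consistent with its use later in this paper as a "more tolerant" combining operator for query concepts.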
Some of the specific research issues we are addressing are morphological analysis in English and Japanese, word sense disambiguation in English, the use of phrases and other syntactic structure in English and Japanese, the use of special purpose recognizers (for example, company, country and people name recognizers) in representing documents and queries, analyzing natural language queries to build structured representations of information needs, learning techniques appropriate for routing and structured queries, techniques for acquiring domain knowledge by corpus analysis, and probability estimation techniques for indexing.

The first TREC evaluation and the two previous TIPSTER evaluations have made it clear that a lot remains to be learned about retrieval in large, full-text databases based on complex information needs. Issues such as phrases, relevance feedback, and probability estimation have proven to be quite difficult in such environments. On the other hand, the effectiveness levels achieved have been quite good. The experiments done in the TREC-2 evaluation, together with the 24-month TIPSTER evaluation which followed it, were designed to improve our understanding of which IR techniques work and why.

2 System Description

The document retrieval and routing system that has been developed on the basis of the inference net model is called INQUERY [2]. The main processes in INQUERY are document indexing, query processing, query evaluation, and relevance feedback.

In the document indexing process, documents are parsed and index terms representing the content of documents are identified. INQUERY supports a variety of indexing techniques, including simple word-based indexing, indexing based on part-of-speech tagging and phrase identification, and indexing by domain-dependent features such as company names, dates, locations, etc. The last type of indexing is a first step towards integrating detection and extraction systems.

In more detail, the document structure is used to identify which parts will be used for indexing. The first step of this process is then to scan for word tokens. Most types of words (including numbers) are indexed, although a stopword list is used to remove very common words. Stopwords can be indexed, however, if they are capitalized (but not at the start of sentences) or joined with other words (e.g., "the The-i system"). Words are then stemmed to conflate variants. Although the Porter stemmer was used for the TREC-2 experiments, we have developed a new stemming algorithm that has a number of advantages for operational systems. A number of recognizers written in flex are then used to identify objects such as company names and mark their presence in the document using "meta" index terms. A company name such as IBM in the text, for example, will result in a meta term #COMPANY being recorded at that position in the text. The use of these meta terms extends the range of queries that can be specified. This completes the usual processing for document text.

The document indexing process also involves building the compressed inverted files that are necessary for efficient performance with very large databases. Since positional information is stored, overhead rates are typically about 40% of the original database size.
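The indexing steps just described can be summarized in a short sketch. This is an illustrative approximation in Python, not the INQUERY code: the stemmer is a trivial stand-in for the Porter stemmer, the company recognizer (a set lookup here) stands in for the flex-based recognizers, and all names are hypothetical.

    import re

    STOPWORDS = {"the", "of", "a", "an", "and", "to", "in"}
    COMPANIES = {"ibm", "mcc"}        # stand-in for a flex recognizer

    def stem(word):
        # Placeholder suffix stripping; INQUERY used the Porter stemmer.
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def index_terms(text):
        """Return (position, term) pairs, including #COMPANY meta terms."""
        terms = []
        for pos, token in enumerate(re.findall(r"[A-Za-z0-9-]+", text)):
            word = token.lower()
            # Drop stopwords unless capitalized mid-sentence or hyphenated
            # (a rough approximation of the rule described above).
            if (word in STOPWORDS and "-" not in token
                    and not (token[0].isupper() and pos > 0)):
                continue
            terms.append((pos, stem(word)))
            if word in COMPANIES:
                # Record a meta term at the company name's position.
                terms.append((pos, "#COMPANY"))
        return terms

    print(index_terms("IBM announced a new retrieval system"))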
The query processing stage involves a series of steps to identify the important concepts and structure describing a user's information need. INQUERY is unique in that it can represent and use complex structured descriptions in a probabilistic framework. Many of the steps in query processing are the same as those done in document indexing. In addition, a part-of-speech tagger is used to identify candidate search phrases. Domain-dependent features are recognized and meta-terms inserted into the query representation. The relative importance of query concepts is also estimated, and relationships between concepts are suggested based on simple grammar rules. An evaluation of some of the query processing techniques is presented in [1].

INQUERY also has the capability of expanding the query using relationships between concepts found either by using manually specified domain knowledge in the form of a simple thesaurus or by corpus analysis. The WORDFINDER system is a version of INQUERY that retrieves concepts that are related to the query. WORDFINDER is constructed by identifying noun groups in the text and representing them by the words that are closely associated with them (i.e., occur in the same text windows). Concept "documents" are then stored in INQUERY. This technique of query expansion was not tested in TREC-2.

The query evaluation process uses the inverted files and the query represented as an inference net to produce a document ranking. The evaluation involves probabilistic inference based on the operators defined in the INQUERY language. These operators define new concepts and how to calculate the belief in those concepts using linguistic and statistical evidence. We are constantly experimenting with and refining these operators (for example, the operator defining a phrase-based concept) in order to improve retrieval performance.

The relevance feedback process uses information from user evaluations of retrieved documents to modify the original query in detection or routing environments. The INQUERY system, because it can represent structured queries, supports a wide range of learning techniques for query modification [5]. In general, new words and phrases are identified in the sample of relevant documents. These are added to the original query and all the terms in the query are then reweighted. With the amount of relevance information available in TIPSTER, relatively simple automatic techniques appear to produce good levels of effectiveness. We are also investigating the effect of using more limited information and more complex learning techniques, such as neural networks.
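As an illustration of the term-selection step, the sketch below ranks terms from a sample of relevant documents by a df.idf score and returns the top 30 for addition to the query. This is a loose reading of the approach in [5]; the exact weighting, reweighting, and tie-breaking used by INQUERY may differ, and the function names are ours.

    import math
    from collections import Counter

    def select_feedback_terms(relevant_docs, collection_df, num_docs, k=30):
        """relevant_docs: list of term lists from judged-relevant documents.
        collection_df: document frequency of each term in the collection."""
        rel_df = Counter()
        for doc in relevant_docs:
            rel_df.update(set(doc))       # df within the relevant sample

        def score(term):
            idf = math.log(num_docs / collection_df.get(term, 1))
            return rel_df[term] * idf     # df.idf score

        return sorted(rel_df, key=score, reverse=True)[:k]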
3 Query Processing

In order to clarify the query processing done for the TREC and TIPSTER experiments with INQUERY, the following sections give more detailed descriptions. There are two main query styles: a natural language query and a keyword or key concept query. For example, the <desc> and <narr> fields of a TIPSTER topic represent natural language queries of varying levels of abstraction. The <con> and <fac> fields represent key concepts in the query. The main difference between the two types of processing is that the key concept query has more controlled information. The phrasing and emphasis are already given and do not have to be conjectured from the language structure. It is valuable to discover how to treat both styles of query, because a good user interface will make it easy for a user to input both styles. For example, a user may enter a prose query and then highlight the important words and phrases in the query in some convenient manner. These highlighted words would then be treated as key concepts in the query processing.

3.1 Prose query processing

Natural language query fields are tagged for syntactic category by a part-of-speech (POS) tagger. Currently we use the tagger developed by Ken Church. We have developed our own POS tagger, and we expect to begin using it in the fall of 1993. There are some pre-tagging and post-tagging "housekeeping" operations, such as removing parentheses. (The current version of INQUERY does not permit parentheses except as part of an operator, and we do not yet make any inferences from the presence of parentheses during the text processing.) Additionally, we change operator phrases to single words in order to simplify later processing. An example of this simplification is replacing the phrase in order to with the infinitive particle to, or replacing with respect to with the word regarding. The goal of this replacement is to remove phrases which resemble noun phrases syntactically but which are really syntactic operators (e.g., phrasal prepositions) with no substantive content. At this stage, stop phrases are also removed.

3.1.1 Noun and adjective phrase capture: orthographic and syntactic clues

When the text is tagged and the potentially irrelevant material has been removed, syntactically-based noun group capture is performed. Certain kinds of noun phrase patterns are enfolded in a #PHRASE operator:

1. A noun phrase which contains more than one modifying adjective and noun is enclosed in a #PHRASE operator;

2. A head noun with no premodifiers and followed by a prepositional phrase is enclosed in a #PHRASE operator with the head noun of the prepositional phrase.

3.1.2 Constraint capture

All text in the query is searched for constraint expressions. Among these expressions are the word company, the phrase not U.S., and a restriction in the nationality section of the <fac> field to U.S. or another nationality. A restriction to U.S. nationality as the area of interest is implemented by penalizing documents for references to foreign countries. A restriction to other nationalities is implemented by repeating that country as a term. This asymmetry depends on the fact that the document collection is drawn solely from U.S. sources, and therefore the U.S., as the default area of interest, is rarely referred to unless the government or foreign policy implementation is under discussion.

There is some recognition of simple time expressions, such as since 1984, which are expanded to the set of years that might be intended by the phrase in question. Countries are recognized as such and are handled so that expressions like South Africa are phrased as #1 ( south africa ) even when they appear in the middle of a larger group of capitalized words. In addition, proper names such as country names are moved out of the scope of #PHRASE operators, since it generally increases the effectiveness of a #PHRASE to reduce the number of words in it. Nationality constraints can better be maintained within the scope of the larger and more tolerant #SUM operator. For example, the phrase "import ban on South African diamonds" becomes, by stages,

    #PHRASE ( import ban on #SYN ( #1 (south african) #1 (south africa) ) diamonds )

and finally

    #SUM ( #SYN ( #1 (south african) #1 (south africa) ) #PHRASE ( import ban on diamonds ) )

3.2 Key concept query processing

Key concept query processing is different from prose query processing since the concept separation provided by the user can presumably be trusted. Instead of using a part-of-speech tagger, we rely on comma delimitation of concepts, and enclose the words found between each pair of delimiters in a #PHRASE operator.
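A minimal sketch of this transformation follows, under the assumption that the resulting clauses are combined with a top-level #SUM (the combining operator is our assumption; the comma-splitting and #PHRASE wrapping are as described above).

    def key_concepts_to_query(con_field):
        """Turn a comma-delimited concept field into an INQUERY query string."""
        clauses = []
        for concept in con_field.split(","):
            words = concept.strip().lower().split()
            if len(words) == 1:
                clauses.append(words[0])
            elif words:
                clauses.append("#PHRASE ( %s )" % " ".join(words))
        return "#SUM ( %s )" % " ".join(clauses)

    print(key_concepts_to_query("import ban, South African diamonds"))
    # -> #SUM ( #PHRASE ( import ban ) #PHRASE ( south african diamonds ) )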
Additionally, if any constraints were found anywhere else in the query, e.g., a mention of the word company or an exclusionary geographical constraint (e.g., not USA or only USA), the query will be modified according to these constraints. For example, only USA becomes #NOT ( #FOREIGNCOUNTRY ) and not USA becomes #NOT ( #USA ).

If the word company is found in a query, then a second copy of the key concepts (the <con> field) is produced, where each item in the field appears in an unordered window operator with the special concept #COMPANY. For example, if South Africa appears as a key concept (and company appears somewhere in the query), then the preprocessor would produce the term #UW50 ( #COMPANY #1 ( south africa ) ), which would match any document which had a company name within fifty words of South Africa.
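The company-constraint expansion can be sketched as follows. This is illustrative only; the handling of single-word versus multi-word concepts is our assumption.

    def add_company_constraint(concepts):
        """Pair each key concept with #COMPANY in a 50-word unordered window."""
        clauses = []
        for concept in concepts:
            words = concept.strip().lower().split()
            inner = "#1 ( %s )" % " ".join(words) if len(words) > 1 else words[0]
            clauses.append("#UW50 ( #COMPANY %s )" % inner)
        return clauses

    print(add_company_constraint(["South Africa"]))
    # -> ['#UW50 ( #COMPANY #1 ( south africa ) )']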
4 The TREC Experiments

Four experiments were submitted to the TREC evaluation, two "ad-hoc" and two "routing". In these experiments, we emphasized automatic query processing and automatic feedback algorithms for routing. The following is a summary:

* Ad-hoc: topics 101-150 against TIPSTER volumes 1 and 2.

  INQ001: Created automatically from TIPSTER topics. Contains phrases. Details of the query processing used are described below.

  INQ002: INQ001 queries, modified manually. Modifications were restricted to eliminating words and phrases, and adding paragraph-level operators around existing words and phrases. This was done somewhat differently than at last year's TREC conference, as discussed below.

* Routing: topics 51-100 against TIPSTER volume 3.

  INQ003: Created automatically from TIPSTER topics and relevance judgements from volumes 1 and 2. Baseline queries (from a previous TIPSTER evaluation) were modified by reweighting and adding single-word terms. The term weighting and selection function used was df.idf, as described in [5]. Only the top 120 relevant documents found by INQUERY were used for feedback, and 30 terms were added to each query.

  INQ004: Formed by combining (using the #SUM operator) INQ001 queries and INQRYP queries (used in the TIPSTER 18-month evaluation). The INQRYP queries were produced automatically and then modified manually. Modifications were restricted to eliminating words and phrases, and adding paragraph-level operators around existing words and phrases.

              ------------ Average Precision ------------
Query Type    5 Docs       30 Docs      100 Docs     11-Pt Avg
INQ001        .62          .57          .49          .36
INQ002        .60 (-2.6%)  .59 (+3.5%)  .51 (+4.1%)  .36 (0%)

Table 1: Results for adhoc queries

Table 1 gives the results for the adhoc queries. These show that there is little difference in effectiveness between the automatically processed queries and the semi-automatically processed queries. The query processing for the automatically processed queries has been significantly improved, as described in the previous section, but there is another effect. Compared to the manual query run in the last TREC conference, paragraph-level concepts were formed in a much more mechanistic way and were constrained by the language of the description and the narrative. In the previous conference, the only constraint was the vocabulary used in the queries, and the user's "world knowledge" was used to group concepts. This resulted in considerably better retrieval performance. Additional experiments using manually edited queries are discussed in the next section.

              ------------ Average Precision ------------
Query Type    5 Docs       30 Docs      100 Docs     11-Pt Avg
INQ003        .64          .56          .45          .35
INQ004        .67 (+3.7%)  .58 (+2.7%)  .45 (0%)     .36 (+2.4%)

Table 2: Results for routing queries

The routing results show that some improvement is obtained by combining the manual queries with the queries that were automatically modified using relevance feedback techniques. The difference in performance between the two types of queries is considerably less than last year, however. Our own experiments have also shown that no additional gains in performance were obtained by using more than the top 150 documents from the INQUERY output. This is a significant result from a practical viewpoint, since in an operational environment we will not want to rely on having output from other systems or need thousands of relevance judgements before performance improves.

5 Other Experiments

In the TIPSTER 24-month evaluation, which took place soon after the TREC-2 evaluation, we did a number of experiments that complement those done in TREC. In particular, we evaluated paragraph-based retrieval, expansion using an automatically generated thesaurus, feedback techniques that use phrases, and Japanese indexing techniques. In this section, we report some of the most interesting results. The precision figures given here are calculated using the TREC-2 relevance judgements, rather than the TIPSTER judgements.

The first two experiments were with adhoc queries. INQ041 (the numbers are consistent with those used in TIPSTER and other publications) is a run that used a different manually modified version of INQ001. That is, the manual modifications were the same as those done in the first TIPSTER and TREC evaluations, rather than the more restricted modifications done for INQ002. INQ042 is a run that combines INQ041 with INQ001.

              ------------ Average Precision ------------
Query Type    5 Docs       30 Docs      100 Docs     11-Pt Avg
INQ041        .68          .60          .50          .36
INQ042        .65 (-4.6%)  .61 (+1.7%)  .51 (+2.0%)  .38 (+5.6%)

Table 3: Results for TIPSTER adhoc queries

These results show that the manually modified queries can achieve significantly better precision at low recall levels. For example, at the 5 document cutoff level, the average precision for INQ041 is 9.7% higher than INQ001. The overall average is the same, however. This is a much smaller difference than was seen in the first TREC and TIPSTER evaluations of the INQUERY system, and it indicates that the automatic query processing has improved considerably. The combination search (INQ042) is slightly worse than INQ041 at the 5 document cutoff level, but overall is better than either the automatic or manual queries on their own. An adhoc search that incorporates automatic paragraph-level matching was also tested in TIPSTER and this resulted in a further 5% improvement.

INQ023 and INQ024 are routing query sets that were created automatically using relevance judgements from volumes 1 and 2. In addition to the single-word terms added in INQ003, 10 phrase-level concepts and 20 paragraph-level concepts were added to the query. A phrase-level concept is a #UW5 two-word pattern that occurs frequently in the relevant documents, and a paragraph-level concept is a #UW50 two-word pattern. The #UWn operator looks for co-occurrence, in any order, in a text window of size n. The difference between INQ023 and INQ024 is that INQ023 contains the original query terms in addition to terms extracted from relevant documents, whereas INQ024 contains only terms from relevant documents.
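The matching rule for #UWn can be sketched directly from this description. Positions would come from the positional inverted file built at indexing time; whether the window bound is strict is our assumption.

    def uw_match(positions_a, positions_b, n):
        """True if the two terms co-occur, in either order, within n words."""
        return any(abs(pa - pb) < n
                   for pa in positions_a
                   for pb in positions_b)

    # "import" at word 3 and "ban" at word 40 match #UW50 but not #UW5.
    print(uw_match([3], [40], 50))   # True
    print(uw_match([3], [40], 5))    # False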
              ------------ Average Precision ------------
Query Type    5 Docs       30 Docs      100 Docs     11-Pt Avg
INQ023        .67          .60          .47          .38
INQ024        .68 (+1.5%)  .59 (-1.7%)  .46 (-2.2%)  .39 (+2.6%)

Table 4: Results for TIPSTER routing queries

These results show that there is little difference between using the original query or just the relevant documents. This is probably due to the large number of relevance judgements available in this routing experiment. In a relevance feedback situation, where there are far fewer relevant documents, the original query is very important. It is clear that the addition of phrase- and paragraph-level structure to the routing queries has improved performance. The average precision for INQ023 is 8.6% higher than INQ003. Combining these new runs with manually modified routing queries produced further improvements.

6 Summary

The TREC-2 runs, both in the adhoc and routing categories, provided further evidence that manually generated queries are not, in general, superior to automatically processed natural language queries. In the case of routing, in fact, the manual queries are significantly less effective. They do, however, improve the effectiveness of retrieval when used in combination with the automatic queries. This combination of query types has been a theme of the research at the University of Massachusetts and has been established as effective in a number of experiments.

The additional TIPSTER runs showed that learning structure in the form of phrases and paragraph-level co-occurrences is effective for routing. They also showed that learning techniques significantly improve performance (the best routing runs were more than 20% higher in terms of average precision than the best queries that were not modified using relevance judgements). It is becoming apparent that techniques that may not work well in relevance feedback situations with few identified relevant documents may be very effective in routing, where there are many more relevant documents identified. We are currently doing experiments with different forms of weighting, including the use of identified non-relevant documents.

With regard to improving the performance of adhoc queries, we are continuing to carry out experiments with different ways of estimating the probabilities (or tf.idf weights) needed for the inference net, and with different forms of paragraph-level matching. Finally, as mentioned earlier, we have seen some significant improvements using automatic query expansion based on corpus analysis.

References

[1] J. P. Callan and W. B. Croft. An evaluation of query processing strategies using the TIPSTER collection. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 347-356, 1993.

[2] J. P. Callan, W. B. Croft, and S. M. Harding. The INQUERY retrieval system. In Proceedings of the 3rd International Conference on Database and Expert Systems Applications, pages 78-83, 1992.

[3] W. Bruce Croft and Howard R. Turtle. Text retrieval and inference. In P. Jacobs, editor, Text-Based Intelligent Systems, pages 127-156. Lawrence Erlbaum, 1992.

[4] Hideo Fujii and W. Bruce Croft. A comparison of indexing techniques for Japanese text retrieval. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 237-246. ACM, 1993.

[5] David Haines and W. Bruce Croft. Relevance feedback and inference networks. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 2-11. ACM, 1993.
[6] H. R. Turtle and W. B. Croft. Evaluation of an inference network-based retrieval model. ACM Transactions on Information Systems, 9(3):187-222, 1991.

[7] H. R. Turtle and W. B. Croft. A comparison of text retrieval models. Computer Journal, 1992. To appear.