The University of Massachusetts TIPSTER Project

W. Bruce Croft
Computer Science Department
University of Massachusetts
Amherst, MA 01003

The TIPSTER project in the Information Retrieval Laboratory of the Computer Science Department, University of Massachusetts, Amherst (which includes MCC and David Lewis of the University of Chicago as subcontractors), is focusing on the following goals:

* Improving the effectiveness of information retrieval techniques for large, full-text databases,

* Improving the effectiveness of routing techniques appropriate for long-term information needs, and

* Demonstrating the effectiveness of these retrieval and routing techniques for Japanese full-text databases.

Our general approach to achieving these goals has been to use improved representations of text and information needs in the framework of a new model of retrieval. Retrieval (and routing) is viewed as a probabilistic inference process which "compares" text representations based on different forms of linguistic and statistical evidence to representations of information needs based on similar evidence from natural language queries and user interaction. New techniques for learning (relevance feedback) and extracting term relationships from text are also being studied. The details and evaluation (with smaller test databases) of the new model, known as the inference net model, can be found in other papers [3, 2, 4].

Some of the specific research issues we are addressing are morphological analysis in English and Japanese, word sense disambiguation in English, the use of phrases and other syntactic structure in English and Japanese, the use of special purpose recognizers in representing documents and queries, analyzing natural language queries to build structured representations of information needs, learning techniques appropriate for routing and structured queries, and probability estimation techniques for indexing.
Comparing the TIPSTER experiments to previous IR experiments done using the standard test collections (e.g. CACM, CISI, NPL, etc.), there are a number of interesting differences:

* The size of the corpus is much larger than previous collections, both in terms of the number of documents and the amount of text. This presents a challenge to the robustness and efficiency of experimental information retrieval systems. Experiments with indexing, for example, can take days instead of minutes.

* The documents in TIPSTER are nearly all full text, rather than abstracts.

* The documents in TIPSTER are heterogeneous in terms of both subject and length. They cover the general area of science, technology and economics, but the sources are the Wall Street Journal, Associated Press newswire, Ziff magazines in the high technology area, Department of Energy abstracts, and the Federal Register.

* The queries (known as "topics" in TIPSTER) are longer and have more structure than those found in other test collections.

* The queries have specific and strict criteria specified for documents to be relevant. These criteria (specified in the "narrative" part of the topic) will reduce inconsistency between relevance judges but are sometimes difficult to handle in the context of an information retrieval system.

* The routing experiments are unlike any carried out before.

* The retrieval and routing experiments with Japanese are also unique.

The first TIPSTER evaluation was limited by a number of factors, the primary one being the lack of relevance judgements for the initial query set. This made it difficult to carry out experiments to select techniques appropriate for large, full-text databases. The results from this evaluation should, therefore, be regarded as preliminary, and indeed raise more questions than they answer.
In the retrieval experiment, 50 new "topics" were used to search the "old" database, which consisted of approximately 1 GByte of text. One of the major subjects of the evaluation was to try different forms of queries produced by processing the topics. Our basic approach to topic processing is to parse them, selecting parts to be indexed, recognizing phrases and "factors" such as locations, dates, companies, etc. Some factors, such as "developing country", which have been specifically identified as important in the topic, will be expanded using a synonym operator. Weights reflecting relative importance are attached to the concepts (words and phrases). Phrase-based concepts are represented by operators defined in the inference net language. These operators use proximity of the words making up the phrase as the major form of evidence for the presence of the concept [1]. The result of topic processing is an inference net representing the information need.

In addition to the automatic query processing, some query versions were generated by simulating simple user interaction with the results of the topic processing. The modifications to the automatically processed topics were limited to changing the weight of concepts, deleting concepts considered unimportant, and adding structure (such as specifying synonymous concepts). The most significant change in the last category was the introduction of "unordered window" operators to simulate paragraph-level retrieval. The equivalent in terms of a user interface would be to ask users to group concepts that should occur together.

The results of the first evaluation are described here in terms of the average precision in the top 5, 30 and 200 documents in the ranking produced by the inference net retrieval engine (INQUERY). This evaluation method was chosen because only the top 200 documents for each query were judged for relevance.
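An "unordered window" operator treats a group of concepts as present when all of them occur, in any order, within some span of tokens. A minimal sketch of that evidence test (the function name and exact window semantics are our illustrative assumptions, not the INQUERY implementation):

```python
import itertools

def unordered_window(tokens, terms, window):
    """True if every term in `terms` occurs, in any order, within a span
    of fewer than `window` tokens.  (Illustrative semantics only.)"""
    positions = {t: [i for i, tok in enumerate(tokens) if tok == t]
                 for t in terms}
    if any(not p for p in positions.values()):
        return False  # some concept is missing entirely
    # Check every combination of occurrence positions for a tight span.
    for combo in itertools.product(*positions.values()):
        if max(combo) - min(combo) < window:
            return True
    return False
```

Grouping concepts this way approximates paragraph-level retrieval: a window of roughly paragraph size demands that the grouped concepts co-occur locally rather than anywhere in a long full-text document.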
The results were as follows:

Query Type        5 docs          30 docs         200 docs
T+D+C+F+phrase    .64             .52             .35
T+D+C+F           .62 (-3.1%)     .52 (0%)        .35 (0%)
1+N               .60 (-6.7%)     .50 (-3.8%)     .34 (-2.8%)
T+C+phrase        .66 (+3.1%)     .53 (+1.9%)     .36 (+2.8%)
1+man             .65 (+1.6%)     .56 (+7.7%)     .36 (+2.8%)
1+man+para        .72 (+12.5%)    .61 (+17.3%)    .39 (+10.3%)

Table 1: TIPSTER Retrieval Results (average precision, 50 topics): Query types refer to topic fields used. T is topic, D is description, C is concepts, F is factors, N is narrative, phrase means phrase constructs used, 1 refers to the baseline (the first line), man means manual modification, para means paragraph retrieval.

These results support two main conclusions: the first being that the effectiveness of the retrieval techniques is surprisingly good considering the difficulty of the queries; the second is that paragraph-level retrieval as simulated by manual creation of "unordered window" queries significantly improves effectiveness. Much of the short-term development of the inference net retrieval system will concentrate on techniques to accomplish paragraph-level retrieval automatically.

The major question raised by the results concerns the effectiveness of phrases. In previous experiments with medium-sized full-text collections, phrase-based retrieval led to significant effectiveness improvements. This is not evident in the results shown here. A possible explanation for this is the size of the TIPSTER topics, where queries may have more than 50 terms, but it should also be remembered that these results are very preliminary.

The routing experiments used 20 "old" topics to search the "new" database (approximately 1 GByte of text from the same sources as the "old" database, with the exception of DOE abstracts).
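The figures in Table 1 are precision values at fixed ranking cutoffs, averaged over the topic set. A minimal sketch of that computation (function names are ours, chosen for illustration):

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k ranked document ids that are judged relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def mean_precision_at_k(runs, k):
    """Average precision-at-k over (ranking, relevant_set) pairs, one per topic."""
    return sum(precision_at_k(rank, rel, k) for rank, rel in runs) / len(runs)
```

Reporting precision only at cutoffs of 5, 30 and 200 matches the judging depth: with relevance judgements available only for the top 200 documents per query, recall-based measures over the full collection cannot be computed reliably.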
Since the aim of these experiments was to study techniques for representing and using long-term information needs, we assumed that users would be more involved in query formulation and thus the baseline used was the "1+man+para" queries. The other query types in this experiment used variations of relevance feedback to modify the baseline queries. These modifications consist of adding concepts to the query and reweighting the query concepts based on their frequency of occurrence in the identified relevant documents. For this experiment, we had a small number of relevance judgements based on documents retrieved by another system. The techniques used to select concepts to add to the query were based on local and global application of the EMIM measure of association [5]. The number of terms added to a query was limited to 5.

The results show that, once again, the effectiveness levels are quite good (note the 50% precision value at the 200 document level). The relevance feedback techniques were not effective, except at the high precision end of retrieval. The features selected were, on inspection, reasonable, but they do not appear to be the features required by the narrative in order to make a document relevant. No definite conclusions can be made about the feedback techniques until experiments with larger sets of relevance judgements are carried out.

Query Type           5 docs         30 docs        200 docs
man                  .66            .65            .50
man+weights          .68 (+3.0%)    .63 (-3.1%)    .48 (-4.0%)
man+EMIM+weights     .71 (+7.6%)    .61 (-6.2%)    .49 (-2.0%)
man+LEMIM+weights    .68 (+3.0%)    .64 (-1.5%)    .50 (0%)

Table 2: TIPSTER Routing Results (average precision, 20 topics): weights are based on frequency in relevant documents, EMIM is a global selection measure, LEMIM is a local (window-based) selection measure.

The third set of results are related to the retrieval of Japanese text. The goal of these experiments was to compare different approaches to morphological analysis or word segmentation.
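The EMIM (expected mutual information measure) used above for feedback term selection scores how strongly a term's occurrence is associated with relevance across the judged documents. A minimal sketch (the measure follows van Rijsbergen [5]; the interface and function names are our own illustrative assumptions):

```python
import math

def emim(judged, has_term):
    """Expected mutual information between a term's occurrence and relevance.

    `judged` is a list of (doc, is_relevant) pairs from the judged set;
    `has_term(doc)` reports whether the candidate term occurs in the document.
    """
    n = len(judged)
    counts = {(t, r): 0 for t in (0, 1) for r in (0, 1)}
    for doc, rel in judged:
        counts[(int(has_term(doc)), int(rel))] += 1
    score = 0.0
    for (t, r), c in counts.items():
        if c == 0:
            continue  # a zero count contributes nothing to the sum
        p_tr = c / n
        p_t = (counts[(t, 0)] + counts[(t, 1)]) / n
        p_r = (counts[(0, r)] + counts[(1, r)]) / n
        score += p_tr * math.log(p_tr / (p_t * p_r))
    return score
```

Terms whose presence perfectly predicts relevance score highest; terms distributed independently of relevance score near zero, which is why the measure can be ranked to pick a small number of expansion terms (here, at most 5 per query).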
Japanese text is made up of characters from a number of alphabets (Kanji, Katakana, Hiragana, and English). There are, however, no word separators and therefore a major part of indexing is deciding what to index. We tested two alternatives:

1. An efficient, relatively crude technique where individual Kanji (Chinese) characters and strings of Katakana characters are indexed.

2. A more sophisticated dictionary and grammar-based segmentation algorithm developed at Kyoto University (JUMAN).

There is a significant difference in the indexing times required by these techniques. With a database of 1,100 documents from a Japanese newspaper, the character-based indexing took 4 minutes while the word-based (JUMAN) indexing took 31 minutes. The relative effectiveness of the two text representations was then tested using the average precision in the top 10 documents for 30 queries. The queries were either treated as strings of characters, or were automatically structured using the JUMAN segmenter. In the character-based approach, words found in the query were expressed using the phrase operator to combine Kanji and Katakana characters. The results show that the retrieval performance using Japanese seems to be comparable to similar experiments with English databases, and the relatively simple character-based indexing technique is surprisingly effective compared to more sophisticated word-based techniques. The latter result is interesting, but the experiment must be repeated when the larger TIPSTER Japanese database and query set becomes available.

We are currently carrying out a range of more detailed experiments using the relevance judgements that are now available. The results from these experiments will allow us to tune the techniques being used and to make more definite conclusions about their relative effectiveness. In addition, we will continue to incorporate new approaches into the retrieval and routing software for the upcoming evaluations.
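The crude character-based alternative (option 1 above) can be sketched using standard Unicode block ranges; the function name and the exact ranges are assumptions of this sketch, not details taken from the TIPSTER system:

```python
import re

# Standard Unicode block assignments (an assumption of this sketch,
# not taken from the original system):
KANJI = r'[\u4e00-\u9fff]'        # one index term per Kanji character
KATAKANA = r'[\u30a0-\u30ff]+'    # a run of Katakana becomes one term

def char_index_terms(text):
    """Crude character-based indexing: individual Kanji plus Katakana runs.

    Hiragana and other characters are skipped, mirroring the idea that
    content words are carried mainly by Kanji and Katakana strings.
    """
    return re.findall(f'{KATAKANA}|{KANJI}', text)
```

Because no dictionary or grammar is consulted, this runs far faster than full segmentation, which is consistent with the 4-minute versus 31-minute indexing times reported above.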
Query Type                     Average Precision in Top Ten (30 queries)
Characters                     .61
Words using phrase operator    .63 (+3.3%)
Words                          .65 (+6.6%)

Table 3: TIPSTER Japanese Retrieval Results: the first two query types were run against the character-based index, the last against the word-based (JUMAN) index.

References

[1] W.B. Croft, H. Turtle, D. Lewis, "The Use of Phrases and Structured Queries in Information Retrieval", Proceedings of SIGIR 91, 32-45, (1991).

[2] W.B. Croft and H. Turtle, "Text Retrieval and Inference", in Text-Based Intelligent Systems, Paul Jacobs (ed.), Lawrence Erlbaum, New Jersey, 127-156, (1992).

[3] H.R. Turtle and W.B. Croft, "Evaluation of an Inference Network-Based Retrieval Model", ACM Transactions on Information Systems, 9(3), 187-222, (1991).

[4] H. Turtle and W.B. Croft, "A Comparison of Retrieval Models", Computer Journal, 35(3), 279-290, (1992).

[5] C.J. van Rijsbergen, Information Retrieval, Butterworths, (1979).