OCLC Online Computer Library Center, Inc.
Roger Thompson

Introduction

We are interested in determining the extent to which the effects of syntactic phrase indexing scale up to large databases. Previous investigations of the hypothesis that syntactic phrase indexing improves retrieval performance were conducted on databases ranging from 250 records to a few thousand records (Dillon and Gray 1983, Fagan 1987, Lewis 1991, Burgin and Dillon 1992). The results have been conflicting or equivocal. However, we believe that the issue is not settled. Technologies for phrase extraction and thesaurus construction have been evolving, and the arguments favoring syntactic phrase indexing have not been convincingly dispelled.

The major argument in favor of phrase indexing is that it is a precision enhancer, for two reasons: words in context are less ambiguous than isolated terms, and documents represented in indexes as key phrases indicative of their content, instead of as the sum of their individual terms, have undergone considerable noise reduction. Nevertheless, we believe that there are upper limits to the effectiveness of syntactic phrase indexing. One of our goals in this project is to understand and elucidate these limits.

In keeping with our interest in answering the simple question of whether syntactic phrase indexing scales up, we tested the effectiveness of an existing program for phrase extraction, FASIT (described in Dillon and Gray 1983, Dillon and McDonald 1983, and Burgin and Dillon 1992), in the SMART retrieval environment. The primary advantages of FASIT are that the phrase extraction process is fully automatic, the parse is shallow and time-efficient, and the logic is table-driven and easily modified. Thus, FASIT represents a kind of lower-bound estimate of what is necessary to enhance retrieval performance through automatic indexing.
FASIT Description

FASIT identifies noun phrases appropriate for indexing by determining the part of speech of each word in the input text. This is done by looking up the word in a dictionary created by assigning tags derived from the Brown Corpus to all entries in the Oxford Advanced Learner's Dictionary. If the word is not found in the dictionary, its part of speech is determined from the word's suffix. Words with more than one part of speech receive multiple tag assignments and are later disambiguated by examining the tags of the words in the surrounding context.

Once tagging is complete, the concept selection module consults a template to identify index phrases. Concepts in FASIT are a subset of the noun phrases encountered in the input text which are judged by syntactic criteria to be useful for indexing. These include all proper nouns; adjective-noun combinations such as "federal agency"; noun-noun combinations such as "metals technology"; and noun-prepositional phrase combinations such as "maker of furniture", which might be paraphrased as the noun-noun construction "furniture maker". The selected concepts are normalized by eliminating determiners and pronouns, and the head noun is stemmed. Table I shows a portion of a sentence as it passes through the major phases of FASIT processing.

Table I -- Stages of FASIT Processing

Input        Tagging    Disambiguation   Selected Concepts
Most         AP
of           OF
these        DTS
homocides    NNS VBZ    NNS              homocides
have         HV
been         BEN
related      VBD VBN    VBD
to           RI TO
the          AT
city's       NN$
burgeoning   VBG
drug         NN VB      NN               drug
trade        NN VB      NN               trade

In previously reported work on FASIT (Dillon and Gray 1983, Dillon and McDonald 1983), additional processing was done on the extracted concepts. Nouns and adjectives which might be judged not to be indicative of the document's content, such as "current account" or "little flexibility", and which might therefore be expected to reduce precision, were filtered out.
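The tagging, disambiguation, and concept-selection stages described in this section can be illustrated with a minimal sketch. It is not FASIT's code: the lexicon, suffix rules, default tag, disambiguation rule, and templates below are all hypothetical stand-ins chosen for illustration, and stemming of the head noun is omitted.

```python
# Hypothetical toy lexicon: word -> candidate Brown-style tags (not FASIT's dictionary).
LEXICON = {
    "the": ["AT"],
    "of": ["IN"],
    "federal": ["JJ"],
    "agency": ["NN"],
    "regulates": ["VBZ"],
    "drug": ["NN", "VB"],
    "trade": ["NN", "VB"],
}

# Hypothetical suffix rules, used only when a word is missing from the lexicon.
SUFFIX_TAGS = [("ing", ["VBG"]), ("ly", ["RB"]), ("s", ["NNS", "VBZ"])]

# Hypothetical templates over tag sequences, longest first.
TEMPLATES = [("NN", "IN", "NN"), ("JJ", "NN"), ("NN", "NN")]


def candidate_tags(word):
    """Look the word up in the lexicon; fall back to its suffix, then to NN (assumed default)."""
    w = word.lower()
    if w in LEXICON:
        return list(LEXICON[w])
    for suffix, tags in SUFFIX_TAGS:
        if w.endswith(suffix):
            return list(tags)
    return ["NN"]


def disambiguate(candidates):
    """Pick one tag per word, using the tag already chosen for the word to its left."""
    resolved = []
    for tags in candidates:
        if len(tags) == 1:
            resolved.append(tags[0])
            continue
        prev = resolved[-1] if resolved else None
        noun_tags = [t for t in tags if t.startswith("NN")]
        # Toy rule: after a determiner, adjective, or noun, prefer the noun reading.
        if noun_tags and prev in ("AT", "JJ", "NN", "NNS", "NN$"):
            resolved.append(noun_tags[0])
        else:
            resolved.append(tags[0])
    return resolved


def select_concepts(words, tags):
    """Scan left to right and emit the word span for the first template matching at each position."""
    concepts = []
    i = 0
    while i < len(words):
        for template in TEMPLATES:
            n = len(template)
            if tuple(tags[i:i + n]) == template:
                concepts.append(" ".join(w.lower() for w in words[i:i + n]))
                i += n
                break
        else:
            i += 1
    return concepts


def extract(text):
    words = text.split()
    tags = disambiguate([candidate_tags(w) for w in words])
    return select_concepts(words, tags)
```

With this toy configuration, `extract("The federal agency regulates the drug trade")` resolves both "drug" and "trade" to nouns from their left context and selects the adjective-noun and noun-noun phrases. Because determiners never appear in a template, they are dropped from the selected concepts without a separate normalization step.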
The earlier work also reported an algorithm which might increase precision by performing a rudimentary degree of phrase clustering. Phrases such as "life insurance" and "life insurance policy" were collected to form a type of thesaurus entry because they share many stems. Both refinements were omitted from the present study because of our interest in establishing a baseline performance for FASIT. However, both are active areas of research.

Data Preparation

All data is processed automatically; there is no manual intervention. Queries are in the narrative-concept format. Queries and documents are represented as FASIT output. In addition, the component words of the phrases identified by FASIT are represented as terms. All words and phrases are stemmed.

Because we are Plan B participants, we are working with the 350M subset of the Wall Street Journal data. A SMART database with ATC weighting was built from the FASIT output and submitted to the TREC sponsors for evaluation. We have also performed extensive testing and failure analysis on a 46,449-record subset of the TREC training database which consists of all Wall Street Journal articles from 1987. We chose this subset because all of the documents relevant to the training queries for the Plan B participants were drawn from the 1987 subset. A baseline database for these experiments was created using SMART, ATC weighting, and stem indexing.

Results

The results reported here are from the 1987 subset of the Wall Street Journal database because the results for the full database were not available at the time of this writing. Table II shows precision scores for eleven levels of recall in the baseline database; Table III shows results for a test database constructed with FASIT processing of queries and documents.
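The summary rows of Tables II and III can be checked with a short calculation. The sketch below assumes that the 11-point figure is the plain mean of the eleven tabulated precision values and that the 3-point figure is the mean of the values at recall 0.20, 0.50, and 0.80; the tabulated numbers are consistent with that reading. The function and variable names are illustrative only.

```python
# Precision at recall 0.00, 0.10, ..., 1.00, copied from Table II (stem indexing)
# and Table III (FASIT indexing).
STEM = [0.3590, 0.3031, 0.2118, 0.1641, 0.1444, 0.1269,
        0.0946, 0.0819, 0.0599, 0.0322, 0.0292]
FASIT = [0.3867, 0.3092, 0.2369, 0.1918, 0.1449, 0.1034,
         0.0816, 0.0666, 0.0501, 0.0458, 0.0337]


def eleven_point_average(precisions):
    """Mean precision over the eleven standard recall levels."""
    if len(precisions) != 11:
        raise ValueError("expected precision at 11 recall levels")
    return sum(precisions) / 11


def three_point_average(precisions):
    """Mean precision at recall 0.20, 0.50, and 0.80 (list indices 2, 5, 8)."""
    if len(precisions) != 11:
        raise ValueError("expected precision at 11 recall levels")
    return (precisions[2] + precisions[5] + precisions[8]) / 3


def levels_where_first_wins(a, b):
    """Recall levels (as fractions) at which run a has higher precision than run b."""
    return [i / 10 for i, (x, y) in enumerate(zip(a, b)) if x > y]


if __name__ == "__main__":
    for name, run in (("stem", STEM), ("FASIT", FASIT)):
        print(name, round(eleven_point_average(run), 4),
              round(three_point_average(run), 4))
    print("FASIT ahead at recall:", levels_where_first_wins(FASIT, STEM))
```

Rounded to four places, the computed averages match the summary rows of both tables, and the per-level comparison makes it easy to see at which recall levels each indexing method is ahead.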
Table II -- Precision Results for Eleven Levels of Recall with Stem Indexing

Num_queries: 25
Total number of documents over all queries
    Retrieved:  5000
    Relevant:    245
    Rel_ret:     166
    Trunc_ret:  4432
Recall - Precision Averages:
    at 0.00    0.3590
    at 0.10    0.3031
    at 0.20    0.2118
    at 0.30    0.1641
    at 0.40    0.1444
    at 0.50    0.1269
    at 0.60    0.0946
    at 0.70    0.0819
    at 0.80    0.0599
    at 0.90    0.0322
    at 1.00    0.0292
Average precision for all points
    11-pt Avg: 0.1461
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg:  0.1329

Table III -- Precision Results for Eleven Levels of Recall with FASIT Indexing

Num_queries: 25
Total number of documents over all queries
    Retrieved:  5000
    Relevant:    245
    Rel_ret:     164
    Trunc_ret:  4382
Recall - Precision Averages:
    at 0.00    0.3867
    at 0.10    0.3092
    at 0.20    0.2369
    at 0.30    0.1918
    at 0.40    0.1449
    at 0.50    0.1034
    at 0.60    0.0816
    at 0.70    0.0666
    at 0.80    0.0501
    at 0.90    0.0458
    at 1.00    0.0337
Average precision for all points
    11-pt Avg: 0.1501
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg:  0.1301

The higher precision scores at the lowest levels of recall (0.00 through 0.40) replicate the results of the experiments with FASIT reported in Dillon and McDonald (1983) and Burgin and Dillon (1992), and support the hypothesis that phrase indexing serves primarily as a precision enhancer. At the higher levels of recall, the effects of phrase indexing and stem indexing are essentially the same, indicating that retrieval performance is not jeopardized by the reduced document representation. We expect larger gains in precision with phrase indexing as we fine-tune our phrase-extraction algorithms.

References

Burgin, R. and Dillon, M. (1992). Improving disambiguation in FASIT. Journal of the American Society for Information Science, 43, 101-114.
Dillon, M. and Gray, A.S. (1983). FASIT: A fully automatic syntactically based indexing system. Journal of the American Society for Information Science, 35, 3-10.
Dillon, M. and McDonald, L.K. (1983). Fully automatic book indexing.
Journal of Documentation, 39, 135-154.
Fagan, J.L. (1987). Experiments in automatic phrase indexing for document retrieval: A comparison of syntactic and non-syntactic methods. (Ph.D. dissertation, Cornell University.) Technical Report No. 87-868. Ithaca, NY: Cornell University.
Lewis, D. (1991). Representation and learning in information retrieval. (Ph.D. dissertation, University of Massachusetts, Amherst.) COINS Technical Report 91-93.