OCLC Online Computer Library Center,  Inc.
Roger Thompson


Introduction

  We are interested in determining the extent to which the effects of
  syntactic phrase indexing scale up to large databases.  Previous
  investigations of the hypothesis that syntactic phrase indexing leads to
  improvements in retrieval performance were conducted on databases ranging
  from 250 records to a few thousand records (Dillon and Gray 1983, Fagan
  1987, Lewis 1991, Burgin and Dillon 1992).  The results have been
  conflicting or equivocal, and we believe that the issue is not
  settled.  Technologies for phrase extraction and thesaurus construction
  have been evolving, and the arguments favoring syntactic phrase indexing
  have not been convincingly dispelled.

  The major argument in favor of phrase indexing is that it is a precision
  enhancer because words in context are less ambiguous than isolated
  terms, and because documents represented in indexes as key phrases
  indicative of their content instead of the sum of their individual terms
  have undergone considerable noise reduction.  Nevertheless, we believe that
  there are upper limits to the effectiveness of syntactic phrase indexing.
  One of our goals in this project is to understand and elucidate these limits.

  In keeping with our interest in answering the simple question of whether
  syntactic phrase indexing scales up, we tested the effectiveness of an
  existing program for phrase extraction, FASIT (described in Dillon and
  Gray 1983, Dillon and McDonald 1983, and Burgin and Dillon 1992), in the
  SMART retrieval environment.  The primary advantages of FASIT are that the
  phrase extraction process is fully automatic, the parse is shallow and
  time-efficient, and the logic is table-driven and easily modified.  Thus,
  FASIT represents a kind of lower-bound estimate of what is necessary to
  enhance retrieval performance through automatic indexing.

FASIT Description

  FASIT identifies noun phrases appropriate for indexing by first
  determining the part of speech of each word in the input text.
  This is done by looking up the word in a dictionary created by
  assigning tags derived from the Brown Corpus to all entries in the
  Oxford Advanced Learner's Dictionary.  If the word is not found in
  the dictionary, its part of speech is determined from the word's suffix.
  Words with more than one part of speech have multiple tag assignments and
  are eventually disambiguated by examining the tags of the words in the
  surrounding context.
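
  The lookup, suffix fallback, and context-based disambiguation described
  above can be sketched as follows.  This is a minimal illustration only:
  the dictionary entries, the suffix rules, and the single context rule are
  hypothetical stand-ins, since the text does not list FASIT's actual tables.

```python
# Hypothetical mini-dictionary: word -> candidate tags (not FASIT's real data).
DICTIONARY = {
    "the": ["AT"],
    "city's": ["NN$"],
    "burgeoning": ["VBG"],
    "drug": ["NN", "VB"],
    "trade": ["NN", "VB"],
}

# Hypothetical suffix rules, tried in order when a word is not in the dictionary.
SUFFIX_RULES = [("ing", ["VBG"]), ("ly", ["RB"]), ("s", ["NNS", "VBZ"])]
DEFAULT_TAGS = ["NN"]

# Assumed context rule: after one of these tags, prefer a noun reading.
NOUN_CONTEXT = {"AT", "NN$", "VBG", "JJ", "NN"}


def candidate_tags(word):
    """Return all possible tags for a word: dictionary first, then suffix."""
    w = word.lower()
    if w in DICTIONARY:
        return list(DICTIONARY[w])
    for suffix, tags in SUFFIX_RULES:
        if w.endswith(suffix):
            return list(tags)
    return list(DEFAULT_TAGS)


def disambiguate(tagged):
    """Reduce each word's candidate tags to one, using the previous word's tag."""
    resolved = []
    for word, tags in tagged:
        choice = tags[0]
        if len(tags) > 1:
            prev = resolved[-1][1] if resolved else None
            nouns = [t for t in tags if t.startswith("NN")]
            if prev in NOUN_CONTEXT and nouns:
                choice = nouns[0]
        resolved.append((word, choice))
    return resolved


def tag(words):
    """Tag a list of words: candidate assignment followed by disambiguation."""
    return disambiguate([(w, candidate_tags(w)) for w in words])
```

  A real tagger would consult context on both sides of an ambiguous word;
  the one-sided rule here is only meant to show where such logic plugs in.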

  Once tagging is complete, the concept selection module consults a
  template to identify index phrases.  Concepts in FASIT are a subset
  of the noun phrases encountered in the input text which are judged
  by syntactic criteria to be useful for indexing.  These include
  all proper nouns; adjective-noun combinations such as "federal
  agency"; noun-noun combinations such as "metals technology"; and
  noun-prepositional phrase combinations such as "maker of furniture",
  which might be paraphrased as the noun-noun construction "furniture
  maker".  The selected concepts are normalized by eliminating determiners
  and pronouns, and the head noun is stemmed.
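
  One way to picture template-driven concept selection is to encode each
  tag as a single letter and match the template as a regular expression
  over the resulting string.  The sketch below makes several assumptions
  that are not taken from FASIT itself: the tag classes, the template, the
  rule for locating the head noun, and the toy stemmer are all hypothetical.

```python
import re

# Assumed mapping from tags to one-letter classes (illustrative only).
TAG_CLASS = {"NN": "N", "NNS": "N", "NP": "P", "JJ": "J",
             "OF": "O", "AT": "D", "DTS": "D", "PP$": "D"}

# Hypothetical template: an optional determiner, modifiers, and a noun,
# optionally followed by "of" plus another noun phrase; or a run of proper nouns.
TEMPLATE = re.compile(r"D?[JN]*N(?:OD?[JN]*N)?|P+")


def toy_stem(word):
    """Very rough plural stripper standing in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word


def select_concepts(tagged):
    """Return normalized index phrases from a list of (word, tag) pairs."""
    # One letter per token, so regex offsets are also token offsets.
    encoded = "".join(TAG_CLASS.get(t, "x") for _, t in tagged)
    concepts = []
    for m in TEMPLATE.finditer(encoded):
        span = zip(tagged[m.start():m.end()], m.group())
        # Normalization step 1: drop determiners and possessive pronouns.
        kept = [(word.lower(), cls) for (word, _), cls in span if cls != "D"]
        classes = [cls for _, cls in kept]
        # Assumed head rule: the noun before "of", otherwise the last word.
        head = classes.index("O") - 1 if "O" in classes else len(kept) - 1
        words = [w for w, _ in kept]
        # Normalization step 2: stem only the head noun.
        words[head] = toy_stem(words[head])
        concepts.append(" ".join(words))
    return concepts
```

  Because the template is data rather than code, changing which phrase
  shapes are selected only requires editing the pattern, which mirrors the
  table-driven design described earlier.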

Table I shows a portion of a sentence as it passes through the major
phases of FASIT processing.

Table I -- Stages of FASIT Processing


   Input        Tagging      Disambiguation  Selected Concepts

   Most         AP
   of           OF
   these        DTS
   homocides    NNS VBZ      NNS             homocides
   have         HV
   been         BEN
   related      VBD VBN      VBD
   to           RI  TO
   the          AT
   city's       NN$
   burgeoning   VBG
   drug         NN VB        NN              drug
   trade        NN VB        NN              trade

  In previously reported work on FASIT (Dillon and Gray 1983, Dillon
  and McDonald 1983), additional processing was done on the extracted
  concepts.  First, nouns and adjectives which might be judged not to be
  indicative of the document's content, such as "current account" or
  "little flexibility", and might therefore be expected to reduce
  precision, were filtered out.  Second, the earlier work reported
  an algorithm which might increase precision by performing a rudimentary
  degree of phrase clustering.  Phrases such as "life insurance" and "life
  insurance policy" were collected to form a type of thesaurus entry because
  they share many stems.  Both refinements were omitted from the present
  study because of our interest in establishing a baseline performance for
  FASIT.  However, both remain active areas of research.
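
  The clustering step from the earlier work is not specified in detail
  here, so the following is only a guess at its general shape: phrases are
  linked when their sets of stems overlap strongly, and linked phrases are
  collected into one thesaurus-like entry.  The Dice coefficient, the 0.75
  threshold, and the single-link grouping are all assumptions made for
  illustration.

```python
def stem_overlap(a, b):
    """Dice coefficient between the stem sets of two (already stemmed) phrases."""
    sa, sb = set(a.split()), set(b.split())
    if not sa or not sb:
        return 0.0
    return 2 * len(sa & sb) / (len(sa) + len(sb))


def cluster_phrases(phrases, threshold=0.75):
    """Group phrases whose stem overlap meets the threshold (single link)."""
    parent = list(range(len(phrases)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(phrases)):
        for j in range(i + 1, len(phrases)):
            if stem_overlap(phrases[i], phrases[j]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i, phrase in enumerate(phrases):
        clusters.setdefault(find(i), []).append(phrase)
    return list(clusters.values())
```

  With these settings "life insurance" and "life insurance policy" fall
  into one entry, while a single shared stem (for example "life" alone) is
  not enough to link two phrases.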

Data Preparation

  All data is processed automatically; there is no hand-guiding.  Queries
  are in the narrative-concept format.

  Queries and documents are represented as FASIT output.  In addition, the
  component words of the phrases identified by FASIT are represented as terms.
  All words and phrases are stemmed.

  Because we are Plan B participants, we are working with the 350M subset
  of the Wall Street Journal data.  A SMART database with ATC weighting was
  built from the FASIT output and submitted to the TREC sponsors for
  evaluation.

  We have also performed extensive testing and failure analysis on a
  46,449-record subset of the TREC training database, which consists of all
  Wall Street Journal articles from 1987.  We chose this subset because all
  of the documents relevant to the training queries for the Plan B participants
  were drawn from the 1987 subset.  A baseline database for these experiments
  was created using SMART, ATC weighting, and stem indexing.
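
  For readers unfamiliar with SMART's weighting codes, "ATC" is
  conventionally read as augmented term frequency (A), inverse document
  frequency (T), and cosine normalization (C).  The sketch below follows
  that reading; it is not SMART's own code, and the natural logarithm and
  the sample documents (stemmed phrases alongside their component stems)
  are assumptions chosen for illustration.

```python
import math
from collections import Counter


def atc_weights(docs):
    """Compute ATC-style weights for a list of documents (lists of terms)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        if not tf:
            vectors.append({})
            continue
        max_tf = max(tf.values())
        # A: augmented tf in [0.5, 1.0];  T: idf = log(N / df).
        raw = {t: (0.5 + 0.5 * f / max_tf) * math.log(n_docs / df[t])
               for t, f in tf.items()}
        # C: cosine normalization to unit length (guarding an all-zero vector).
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return vectors


# Hypothetical documents: each holds phrase terms plus their component stems.
SAMPLE_DOCS = [
    ["drug", "trade", "drug trade"],
    ["drug", "polici"],
    ["trade", "deficit", "trade deficit"],
]
```

  In the sample, the phrase term "drug trade" occurs in only one document,
  so it receives a larger weight than either of its component stems, each
  of which occurs in two.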

Results

  The results reported here are from the 1987 subset of the Wall Street
  Journal database because the results for the full database were not
  available at the time of this writing.  Table II shows precision scores
  for eleven levels of recall in the baseline database; Table III shows
  results for a test database constructed with FASIT processing of
  queries and documents.


Table II

Precision Results for Eleven Levels of Recall
with Stem Indexing


Num_queries:          25
Total number of documents over all queries
    Retrieved:      5000
    Relevant:        245
    Rel_ret:         166
    Trunc_ret:      4432
Recall - Precision Averages:
    at 0.00        0.3590
    at 0.10        0.3031
    at 0.20        0.2118
    at 0.30        0.1641
    at 0.40        0.1444
    at 0.50        0.1269
    at 0.60        0.0946
    at 0.70        0.0819
    at 0.80        0.0599
    at 0.90        0.0322
    at 1.00        0.0292
Average precision for all points
   11-pt Avg:      0.1461
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg:      0.1329

Table III

Precision Results for Eleven Levels of Recall
with FASIT Indexing


Num_queries:          25
Total number of documents over all queries
    Retrieved:      5000
    Relevant:        245
    Rel_ret:         164
    Trunc_ret:      4382
Recall - Precision Averages:
    at 0.00        0.3867
    at 0.10        0.3092
    at 0.20        0.2369
    at 0.30        0.1918
    at 0.40        0.1449
    at 0.50        0.1034
    at 0.60        0.0816
    at 0.70        0.0666
    at 0.80        0.0501
    at 0.90        0.0458
    at 1.00        0.0337
Average precision for all points
   11-pt Avg:      0.1501
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
    3-pt Avg:      0.1301

  The higher precision scores at the lowest levels of recall replicate
  the results of the experiments with FASIT reported in Dillon and
  McDonald (1983) and Burgin and Dillon (1992), and support the
  hypothesis that phrase indexing serves primarily as a precision
  enhancer.  At the higher levels of recall, the effects of phrase
  indexing and stem indexing are essentially the same, indicating that
  retrieval performance is not jeopardized by the reduced document
  representation.  We expect larger gains in precision with phrase
  indexing as we fine-tune our phrase-extraction algorithms.
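
  The summary figures in Tables II and III can be checked directly from
  the per-level precision values.  The short sketch below recomputes the
  11-point and 3-point averages and the per-level difference between the
  two runs; the function and variable names are ours, and the values are
  copied from the tables.

```python
# Recall levels 0.00, 0.10, ..., 1.00 (index i corresponds to recall i/10).
RECALL_LEVELS = [i / 10 for i in range(11)]

# Precision at each recall level, copied from Table II (stem indexing)
# and Table III (FASIT indexing).
STEM = [0.3590, 0.3031, 0.2118, 0.1641, 0.1444, 0.1269,
        0.0946, 0.0819, 0.0599, 0.0322, 0.0292]
FASIT = [0.3867, 0.3092, 0.2369, 0.1918, 0.1449, 0.1034,
         0.0816, 0.0666, 0.0501, 0.0458, 0.0337]


def eleven_point_average(precisions):
    """Mean precision over all eleven recall levels."""
    return sum(precisions) / len(precisions)


def three_point_average(precisions):
    """Mean precision at recall 0.20, 0.50, and 0.80."""
    return (precisions[2] + precisions[5] + precisions[8]) / 3


def differences(run, baseline):
    """Per-level precision difference (run minus baseline), rounded to 4 places."""
    return [round(r - b, 4) for r, b in zip(run, baseline)]


if __name__ == "__main__":
    for name, run in (("Stem", STEM), ("FASIT", FASIT)):
        print(name,
              round(eleven_point_average(run), 4),
              round(three_point_average(run), 4))
    for level, diff in zip(RECALL_LEVELS, differences(FASIT, STEM)):
        print(f"recall {level:.2f}: {diff:+.4f}")
```

  Rounded to four places, the recomputed averages agree with the values
  printed in both tables.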

References

Burgin and Dillon, M. (1992). Improving disambiguation in FASIT.
  Journal of the American Society for Information Science, 43,
  101-114.

Dillon and Gray, A.S. (1983). FASIT: A fully automatic syntactically
  based indexing system.  Journal of the American Society for
  Information Science, 35, 3-10.

Dillon and McDonald, L.K. (1983). Fully automatic book indexing.
  Journal of Documentation, 39, 135-154.

Fagan, J.L. (1987). Experiments in automatic phrase indexing for
  document retrieval: A comparison of syntactic and non-syntactic
  methods.  (Ph.D. dissertation, Cornell University.)  Technical
  Report No. 87-868. Ithaca, NY: Cornell University.

Lewis, D. (1991). Representation and learning in information
  retrieval.  (Ph.D. dissertation, University of Massachusetts,
  Amherst.)  COINS Technical Report 91-93.

