Workshop on: Use of Natural Language Processing at TREC
[summary by: David Lewis and Alan Smeaton]

A working group of individuals concerned with the impact and use of natural language processing (NLP) techniques in TREC-1 met in two sessions for lively discussion.

The first question addressed was what exactly it means to say that an information retrieval (IR) system uses NLP. There were some interesting disagreements about which TREC systems could be considered NLP-based. It was agreed that there are a number of levels of NLP techniques that might be used in IR systems, including the traditional distinctions between lexical, morphological, syntactic, and other levels. It was also agreed that there are many "boxes," or linguistic processes, that could be useful for information retrieval, including lexicons and gazetteers, syntactic analyzers, knowledge bases, etc. As such boxes become more robust and widely available, one or more may be plugged into IR systems with varying degrees of attention paid to linguistic issues, making the dividing line between NLP and non-NLP IR systems increasingly fuzzy.

The second issue discussed was whether NLP techniques had had an impact on retrieval quality in the TREC-1 experiments. The consensus was that it was impossible to tell from the results presented. For the most part, groups did not present controlled comparisons between using and not using their NLP components. Furthermore, the significant number of alternative retrieval models (vector, probabilistic, connectionist, and others) precluded making NLP vs. non-NLP comparisons across groups.

The discussion turned to the challenges of experimentation with NLP in the context of TREC-1, revealing ample reasons that TREC-1 participants should not be faulted for the lack of controlled experiments. All groups found getting their systems to operate on 2 gigabytes of text very challenging. The tight schedule and limited funding led to a "one mistake syndrome": the limited time available allowed groups to make only one mistake if they were to get their results in on time. Groups using NLP techniques were doubly challenged, given the generally higher computational costs of these methods compared to traditional word-based indexing.

It was noted that, with the exception of the University of Massachusetts TIPSTER group, there was little work presented on focusing NLP on queries rather than documents. This is an area of research where the computational demands are much less than in applying NLP to document texts, and the richness and subtlety of the TREC/TIPSTER topic descriptions suggest that such an approach could have significant payoffs. It was agreed that this was an important area of exploration for TREC-2.
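To make the query-side idea concrete, the sketch below (ours, not taken from any TREC-1 system) extracts crude multi-word phrases from a short topic statement and turns them into a weighted query for a conventional ranked-retrieval engine. The stopword list, phrase extractor, and weighting scheme are illustrative stand-ins; a real system would substitute a part-of-speech tagger and noun-phrase chunker. The point is only that even expensive linguistic analysis is affordable when applied to a few hundred words of topic text rather than to gigabytes of documents.

    # Minimal sketch of query-side NLP (illustrative names throughout).
    # A TREC topic statement is short, so analyzing it is cheap compared
    # with parsing a 2-gigabyte document collection.

    import re
    from collections import Counter

    # Toy stopword list; a real system would use a fuller one.
    STOPWORDS = {
        "a", "an", "and", "are", "as", "between", "by", "discuss",
        "document", "documents", "for", "identify", "in", "is", "of",
        "on", "or", "relevant", "that", "the", "to", "which", "with",
    }

    def candidate_phrases(topic_text: str) -> Counter:
        """Extract crude multi-word phrases from a topic statement.

        A regex over stopword boundaries stands in for a genuine
        noun-phrase chunker, to keep the sketch self-contained.
        """
        tokens = re.findall(r"[a-z]+", topic_text.lower())
        phrases, current = Counter(), []
        for tok in tokens:
            if tok in STOPWORDS:
                if len(current) > 1:        # keep multi-word runs only
                    phrases[" ".join(current)] += 1
                current = []
            else:
                current.append(tok)
        if len(current) > 1:
            phrases[" ".join(current)] += 1
        return phrases

    def build_weighted_query(topic_text: str) -> list[tuple[str, float]]:
        """Turn extracted phrases into (term, weight) pairs for a
        conventional ranked-retrieval engine; phrases seen more often
        in the topic get proportionally more weight."""
        phrases = candidate_phrases(topic_text)
        total = sum(phrases.values()) or 1
        return [(p, n / total) for p, n in phrases.most_common()]

    if __name__ == "__main__":
        topic = ("Identify documents that discuss joint ventures between "
                 "United States companies and foreign companies in the "
                 "machine translation industry.")
        for phrase, weight in build_weighted_query(topic):
            print(f"{weight:.2f}  {phrase}")

Run on the sample topic, this yields equally weighted phrases such as "joint ventures", "united states companies", and "machine translation industry", which could then be submitted to any of the retrieval models discussed above.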
The issue of sharing resources among TREC participants was discussed, and the consensus was that although NLP work is broadly similar across groups, fine differences in approach make sharing resources such as dictionaries and parsers troublesome, at the least. Still, everyone agreed that sharing is desirable when the technological and legal (copyright, etc.) barriers can be overcome.

Another interesting topic raised was how the effectiveness of NLP methods in IR changes when one scales up from traditional small IR test collections to gigabyte-scale databases. Once again, the conclusion reached was that the TREC-1 results simply do not tell us this. We were unable to distinguish the "large collection factor" from other factors such as long vs. short queries, long vs. short documents, and different query types, as well as the other factors discussed above.

A tentative hypothesis put forward by some was that NLP-based methods may be best for paragraph or sub-document retrieval (termed "nugget extraction" during discussions) and that more traditional methods may be better for more general types of queries. It was suggested that testing this hypothesis, and in general getting a real understanding of the effect of NLP techniques on IR, would require a more careful analysis of the kinds of queries used (suggestions were made about how the query set might be improved or augmented), as well as details of how relevance judgments are made and what parts of documents are relevant.

In conclusion, it was acknowledged that the emphasis of researchers in TREC-1 had quite reasonably been simply on getting their systems to work at all with such a large collection of text. It was hoped that for TREC-2 more controlled comparisons and detailed analyses of failures and successes could be done, giving us more insight into the strengths and weaknesses of NLP methods in IR.