Workshop on: Use of Natural Language Processing at TREC
[summary by: David Lewis and Alan Smeaton]

A working group of individuals concerned with the impact and use of natural language processing (NLP) techniques in TREC-1 met in two sessions for lively discussion.

The first question addressed was what exactly it means to say that an information retrieval (IR) system uses NLP. There were some interesting disagreements about which TREC systems could be considered NLP-based. It was agreed that there are a number of levels of NLP techniques that might be used in IR systems, including the traditional distinctions between lexical, morphological, syntactic, and other levels. It was also agreed that there are many "boxes," or linguistic processes, that could be useful for information retrieval, including lexicons and gazetteers, syntactic analyzers, knowledge bases, etc. As such boxes become more robust and widely available, one or more may be plugged into IR systems with varying degrees of attention paid to linguistic issues, making the dividing line between NLP and non-NLP IR systems increasingly fuzzy.

The second issue discussed was whether NLP techniques had had an impact on retrieval quality in the TREC-1 experiments. The consensus was that it was impossible to tell from the results presented. For the most part, groups did not present controlled comparisons between using and not using their NLP components. Furthermore, the significant number of alternative retrieval models (vector, probabilistic, connectionist, and others) precluded making NLP vs. non-NLP comparisons across groups.

The discussion turned to the challenges of experimentation with NLP in the context of TREC-1, revealing ample reasons that TREC-1 participants should not be faulted for the lack of controlled experiments. All groups found getting their systems to operate on 2 gigabytes of text very challenging. The tight schedule and limited funding led to a "one mistake syndrome": the limited time available allowed groups to make only one mistake if they were to get their results in on time. Groups using NLP techniques were doubly challenged, given the generally higher computational costs of these methods compared to traditional word-based indexing.

It was noted that, with the exception of the University of Massachusetts TIPSTER group, there was little work presented on focusing NLP on queries rather than documents. This is an area of research where the computational demands are much less than in applying NLP to document texts, and the richness and subtlety of the TREC/TIPSTER topic descriptions suggest that such an approach could have significant payoffs. It was agreed that this was an important area of exploration for TREC-2.
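To make the query-side idea concrete, the sketch below (ours, not taken from any TREC-1 system) extracts crude multi-word phrases from a short topic statement and turns them into a weighted query for a conventional ranked-retrieval engine. The stopword list, phrase extractor, and weighting scheme are illustrative stand-ins; a real system would substitute a part-of-speech tagger and noun-phrase chunker. The point is only that even expensive linguistic analysis is affordable when applied to a few hundred words of topic text rather than to gigabytes of documents.

    # Minimal sketch of query-side NLP (illustrative names throughout).
    # A TREC topic statement is short, so analyzing it is cheap compared
    # with parsing a 2-gigabyte document collection.

    import re
    from collections import Counter

    # Toy stopword list; a real system would use a fuller one.
    STOPWORDS = {
        "a", "an", "and", "are", "as", "between", "by", "discuss",
        "document", "documents", "for", "identify", "in", "is", "of",
        "on", "or", "relevant", "that", "the", "to", "which", "with",
    }

    def candidate_phrases(topic_text: str) -> Counter:
        """Extract crude multi-word phrases from a topic statement.

        A regex over stopword boundaries stands in for a genuine
        noun-phrase chunker, to keep the sketch self-contained.
        """
        tokens = re.findall(r"[a-z]+", topic_text.lower())
        phrases, current = Counter(), []
        for tok in tokens:
            if tok in STOPWORDS:
                if len(current) > 1:        # keep multi-word runs only
                    phrases[" ".join(current)] += 1
                current = []
            else:
                current.append(tok)
        if len(current) > 1:
            phrases[" ".join(current)] += 1
        return phrases

    def build_weighted_query(topic_text: str) -> list[tuple[str, float]]:
        """Turn extracted phrases into (term, weight) pairs for a
        conventional ranked-retrieval engine; phrases seen more often
        in the topic get proportionally more weight."""
        phrases = candidate_phrases(topic_text)
        total = sum(phrases.values()) or 1
        return [(p, n / total) for p, n in phrases.most_common()]

    if __name__ == "__main__":
        topic = ("Identify documents that discuss joint ventures between "
                 "United States companies and foreign companies in the "
                 "machine translation industry.")
        for phrase, weight in build_weighted_query(topic):
            print(f"{weight:.2f}  {phrase}")

Run on the sample topic, this yields equally weighted phrases such as "joint ventures", "united states companies", and "machine translation industry", which could then be submitted to any of the retrieval models discussed above.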
The issue of sharing resources among TREC participants was discussed, and the consensus was that although NLP work is broadly similar across groups, fine differences in approach make sharing resources such as dictionaries and parsers troublesome, at the least. Still, everyone agreed that sharing is desirable when the technological and legal (copyright, etc.) barriers can be overcome.

Another interesting topic raised was how the effectiveness of NLP methods in IR changes when one scales up from traditional small IR test collections to gigabyte-scale databases. Once again, the conclusion reached was that the TREC-1 results simply do not tell us this. We were unable to distinguish the "large collection factor" from other factors such as long vs. short queries, long vs. short documents, and different query types, as well as the other factors discussed above.

A tentative hypothesis put forward by some was that NLP-based methods may be best for paragraph or sub-document retrieval (termed "nugget extraction" during discussions) and that more traditional methods may be better for more general types of queries. It was suggested that testing this hypothesis, and in general getting a real understanding of the effect of NLP techniques on IR, would require a more careful analysis of the kinds of queries used (suggestions were made about how the query set might be improved or augmented), as well as details of how relevance judgments are made and what parts of documents are relevant.

In conclusion, it was acknowledged that the emphasis of researchers in TREC-1 had quite reasonably been simply on getting their systems to work at all with such a large collection of text. It was hoped that for TREC-2 more controlled comparisons and detailed analyses of failures and successes could be done, giving us more insight into the strengths and weaknesses of NLP methods in IR.