Workshop on: Automatically Generating Adhoc and Routing Queries [Summary by: Susan T. Dumais, Bellcore, std@bellcore.com]

About 20 people attended the two workshops on automatic query generation. Many different issues were addressed, and I've tried to organize the important points under a few general headings.

Topic Statements: We spent some time initially talking about how the topic statements were developed, what retrieval scenarios they are representative of, and some consequences of this for research. The topic statements are much more detailed, structured, and specific than queries associated with most previous IR test collections, averaging about 150 words in length. Most topics (routing topics 001-025 and adhoc topics 051-100) require that fairly specific facts be retrieved. Routing topics 026-050 are more general. The topic statements were generated by subject domain experts and reformulated using search results from two different retrieval systems. While this might be characteristic of routing applications or of dedicated searchers, there was some question about how likely more casual users would be to generate such queries. There was some interest in developing a companion set of shorter topic descriptions that could be used to better explore the effects of term expansion, feedback, and iterative query formulation. In contrast, there was also some interest in having expert human searchers carry out much deeper searches for a few topics in order to cast a wider net and increase the variety of documents retrieved.

Term Extraction: Most of the fields in the topic description were used, and there was some evidence that the field was the most useful. Almost all systems used a stop-list and some kind of stemmer. A few systems recognized and tagged common abbreviations or acronyms, proper names, company names, place names, etc. Everyone agreed that a compendium of this information would be a valuable common resource.
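The extraction pipeline most systems shared (tokenize, drop stop-words, stem) can be sketched as follows. This is a minimal illustration, not any participant's actual system: the stop-list is a tiny placeholder for the much longer lists systems used, and `crude_stem` is a deliberately crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Tiny illustrative stop-list; real systems used far longer ones.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to",
              "is", "are", "be", "that", "which", "for", "with"}

def crude_stem(word):
    """Very crude suffix stripper standing in for a real stemmer
    (e.g. Porter); shown only to illustrate the pipeline."""
    for suffix in ("ational", "ization", "ing", "ions", "ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def extract_terms(topic_text):
    """Tokenize a topic statement, drop stop-words, and stem the rest."""
    tokens = re.findall(r"[a-z]+", topic_text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `extract_terms("Documents discussing the routing of financial transactions")` yields stemmed content terms with the function words removed. The tagging of acronyms, proper names, and place names mentioned above would be additional passes layered on top of this basic pipeline.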
Many systems used differential term weighting. Typically, weights derived from a statistical analysis of the documents were also used to weight query terms. Term weights sometimes depended on the topic field or syntactic slot the term occupied. About half of the systems used phrases in addition to single words. Phrases were usually derived by simple statistical means using word adjacency (or co-occurrence within k positions), with high thresholds on overall frequency of occurrence to limit the number of phrases. Some systems used syntactic analysis to discover phrases, but most of these groups did not automatically generate their queries. Phrases appeared to improve performance somewhat by increasing both precision and (somewhat unexpectedly) recall.

Term Expansion: Term expansion has long been used to increase recall by making the search query more comprehensive. Not all relations are equally useful in expansion, and the most commonly used relation was synonymy. Queries were expanded using several different sources of information - a thesaurus to generate semantic categories; a general, manually-constructed lexical system (WordNet); associations automatically derived from an analysis of word usage in the documents or smaller syntactic units; and automatic pronoun disambiguation. Relevance feedback is closely related to term expansion. It is not fully automatic in the sense that human judgements about the relevance of some small number of documents are required. However, the routing queries were specifically designed to take advantage of relevance judgements from a training corpus. More importantly, many of the same issues that arise in term expansion also occur in the context of relevance feedback. The most common implementation of relevance feedback was to modify the query by adding some words from relevant documents. For the TREC experiments, as few as 5 words and as many as 250 words were added, with most systems adding from 10-30 words.
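The most common feedback implementation described above (add some words from relevant documents to the query) can be sketched in a few lines. This is a simplified, Rocchio-flavored illustration under my own assumptions, not any participant's method: documents are assumed to be pre-tokenized term lists, and added terms are chosen purely by frequency in the judged-relevant documents.

```python
from collections import Counter

def expand_query(query_terms, relevant_docs, n_added=10):
    """Expand a query with the n_added most frequent terms from
    judged-relevant documents that are not already in the query.
    Most TREC systems added somewhere between 10 and 30 such words."""
    original = set(query_terms)
    counts = Counter()
    for doc in relevant_docs:          # each doc is a list of terms
        counts.update(t for t in doc if t not in original)
    added = [term for term, _ in counts.most_common(n_added)]
    return list(query_terms) + added
```

Variations noted at the workshop slot naturally into this skeleton: down-weighting the `added` terms relative to `query_terms`, using counts from non-relevant documents as a penalty, or capping `n_added` aggressively to avoid the over-expansion problems discussed below.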
Some systems also modified term weights, used information about words in non-relevant documents, and gave less weight to added words (compared with words in the original query). There were few within-system comparisons of term expansion (or feedback) against no expansion. Feedback improvements were somewhat smaller than expected based on experiments with smaller test collections. It is too early to tell for sure, but part of this may simply be that the original queries were very good. The single common theme in the discussion of query expansion was: be careful! Results were quite variable - appropriate term expansion can improve recall, but inappropriate expansion can just as easily harm performance. One major problem is that expansion is not easily limited to the intended meaning of a word. Some groups first disambiguated the word sense by hand before automatic expansion; others used automatic heuristics for disambiguation with some success. Other methods discussed to help limit undesirable associations included: expanding only "hot spots"; matching on smaller subtexts; giving less weight to added words relative to original query words; limiting the total number of words added; limiting the syntactic or semantic relations of added words; and limiting the influence that any single word can have in overall similarity.

Miscellaneous Observations: Few systems did anything more than extract single words and phrases. A few systems removed negated words (often by hand), and a few systems automatically generated Boolean queries. Some groups used what might be called a "two-pass method", first using a standard global match to obtain a smaller group of documents which then receive more detailed processing. Some of the more detailed processing involved breaking the query down into smaller sub-units for matching.

Summary: There were few really novel methods used for automatically generating either adhoc or routing queries.
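The "two-pass method" can be sketched as a coarse global ranking followed by a finer re-ranking of the survivors. The scoring functions here are my own placeholders (simple term overlap, and a best-matching sliding window standing in for "matching on smaller sub-units"); actual systems used more sophisticated measures.

```python
def overlap_score(query_terms, doc_terms):
    """First pass: cheap global match by term overlap."""
    return len(set(query_terms) & set(doc_terms))

def best_subtext_score(query_terms, doc_terms, window=20):
    """Second pass: best overlap over sliding sub-windows, rewarding
    documents whose query matches are locally concentrated."""
    qset = set(query_terms)
    best = 0
    for i in range(max(1, len(doc_terms) - window + 1)):
        best = max(best, len(qset & set(doc_terms[i:i + window])))
    return best

def two_pass_retrieve(query_terms, docs, first_pass_k=100):
    """Rank all docs cheaply, keep the top first_pass_k candidates,
    then re-rank only those with the more detailed sub-text match."""
    candidates = sorted(docs, key=lambda d: overlap_score(query_terms, d),
                        reverse=True)[:first_pass_k]
    return sorted(candidates,
                  key=lambda d: best_subtext_score(query_terms, d),
                  reverse=True)
```

The appeal of the design is purely computational: the expensive second-pass scoring runs on at most `first_pass_k` documents rather than the whole collection.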
There are now some general and fairly comprehensive lexical resources that might be useful. The problems with over-expanding queries were quite noticeable in the TREC application. Systems that automatically generated queries often performed quite well compared to other systems. However, there were few direct comparisons of manual vs. automatic query generation, or of individual components (term expansion vs. no expansion) within a system, and this is what is needed to understand the usefulness of such methods. Hopefully this will happen in TREC-2.