Workshop on: Machine Learning and Relevance Feedback
[summary by: Norbert Fuhr and Stephen Robertson]

Note: This report covers the two separate workshop sessions on the topic.

A machine might use a number of different sources of information from which to learn. One major source, but not the only one, is relevance feedback information. Others include, for example, the user's selection of terms or information about the user's context or background.

One can also distinguish possible objects of a machine learning process -- what it might learn about. First, the object may be the current topic or query only -- this is the usual domain of relevance feedback. Second, the machine may learn about other topics or queries. This clearly requires a degree of abstraction. For example, the behaviour of a specific term in respect of one query is unlikely to tell us much about its behaviour in relation to a different query; however, it may tell us something about the behaviour of other terms sharing certain characteristics with it. Somewhere in between these two kinds of object is a third: the machine may learn about this particular user, in a way that is applicable to his or her subsequent queries or uses of the system. It is not possible to study this kind of machine learning within the framework of the TREC experiment, because of the way in which topics and relevance judgments have been obtained. However, there is some work on such systems.

We should also ask the purpose of machine learning. In the context of TREC, this should presumably be the improvement of retrieval performance. It is arguable whether some of the processes included in the above discussion actually qualify as Machine Learning in the sense in which the phrase is usually used: for example, a form of "learning" which affects only the next iteration of a single search hardly counts as ML. However, in an IR context there appears to be some continuity between such systems and those which come closer to "proper" ML.

One area of concern is the extraction of features. If there is a candidate set of features, then relevance feedback can contribute to their selection, but the original identification of features in text is a much more difficult task.

Another point worth noting for TREC is the desirability of considering learning methods in the context of interactive systems. The present TREC experimental design is not suited to evaluating interactive systems; it would be of great value in this context if the experimental design could be adapted to allow such evaluation.

For the areas where machine learning (ML) methods can be applied within IR, one may distinguish between models and representations. For retrieval, representations of documents and queries (e.g. both as sets of terms) first have to be formed. Based on these representations, models are applied in order to compute the retrieval status value for query-document pairs. In order to improve the representations, learning methods may be applied, e.g. for the detection of phrases or for word sense disambiguation. Examples of learning methods used within models are regression methods, classification trees or genetic algorithms. The features which ML methods are based upon may be derived from the text, but also from additional attributes (e.g. the source of a document may already indicate that it is not relevant for certain queries).
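To make the representation/model distinction concrete, the following sketch (not taken from the report; the tf-idf weighting, inner-product scoring and all names are illustrative assumptions) first represents documents and queries as bags of terms and then applies a simple model to compute a retrieval status value for each query-document pair.

    # Illustrative sketch only: representation step (bag of terms) followed by
    # a model step (tf-idf weighted inner product as retrieval status value).
    import math
    from collections import Counter

    def represent(text):
        """Representation step: map raw text to a bag (multiset) of terms."""
        return Counter(text.lower().split())

    def idf(term, doc_reps):
        """Inverse document frequency of a term over the collection."""
        n = sum(1 for d in doc_reps if term in d)
        return math.log((len(doc_reps) + 1) / (n + 1))

    def rsv(query_rep, doc_rep, doc_reps):
        """Model step: retrieval status value for a query-document pair."""
        return sum(q_tf * doc_rep[t] * idf(t, doc_reps)
                   for t, q_tf in query_rep.items())

    docs = ["relevance feedback improves retrieval",
            "genetic algorithms learn term weights",
            "classification trees for text"]
    doc_reps = [represent(d) for d in docs]
    query_rep = represent("relevance feedback")
    ranking = sorted(range(len(docs)),
                     key=lambda i: rsv(query_rep, doc_reps[i], doc_reps),
                     reverse=True)
    print(ranking)

Learning methods of the kind discussed above would then act either on the representation step (e.g. phrase detection) or on the parameters of the model step.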
Most ML methods assume that the features to be considered form a set, whereas tree- or graph-like structures have to be mapped onto this flat structure first.

The level of abstraction used in the definition of the features plays an important role for the applicability or non-applicability of certain learning methods, since most methods require a certain amount of data. In general, a higher level of abstraction yields more learning data; on the other hand, the decision resulting from the learning algorithm may become too unspecific. For this reason, there is a need for different levels of abstraction, from which one may choose the most appropriate one for the actual circumstances. However, for text there are effectively only two levels of abstraction, namely either the term itself or its statistical parameters (e.g. within-document frequency and inverse document frequency). A possible intermediate level would be sets of synonym terms.

In order to improve the effectiveness of ML methods for a given learning sample, prior knowledge plays an important role. For example, we may assume that the weight of a term with respect to a document is a monotonic function of its within-document frequency. For this reason, a regression method which is implicitly based on this type of prior knowledge is more appropriate than e.g. a classification tree which makes no such assumption.

We may distinguish different sources of learning data. Most important, we have relevance feedback data. Here one may think of different levels of response, either by using multivalued relevance scales or by indicating important paragraphs within a long document. For some applications, it may also be necessary to get more specific feedback with respect to internal decisions of the system. For example, for tuning a phrase detection algorithm, it would be useful to get decisions about each specific phrase. As a third possible source of learning data, the combination of different sources of knowledge (e.g. thesauri and corpora) also might yield new information.

For the TREC initiative, there are two possible improvements which would ease the further application of ML methods. First, relevance feedback data should be enriched by indicating the most important paragraph of the document. As a precondition, there should be some method for identifying single paragraphs, e.g. by additional SGML-like tags. Since the assessors will not give the requested type of judgements, the TREC participants would have to do this job, and NIST should act as collector and distributor for these judgements.
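As a concrete (hypothetical) illustration of learning from relevance feedback data for the current query only -- the first kind of object distinguished above -- the sketch below applies classical Rocchio query modification. Neither the technique nor the parameter values are prescribed by the report; they are shown purely as one well-known instance of the idea.

    # Illustrative sketch only: Rocchio-style query modification, a classical
    # way of learning from relevance feedback for the current query.
    def rocchio(query_vec, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Move the query vector towards relevant and away from non-relevant documents."""
        terms = set(query_vec)
        for d in rel_docs + nonrel_docs:
            terms |= set(d)
        new_query = {}
        for t in terms:
            rel = sum(d.get(t, 0.0) for d in rel_docs) / max(len(rel_docs), 1)
            non = sum(d.get(t, 0.0) for d in nonrel_docs) / max(len(nonrel_docs), 1)
            w = alpha * query_vec.get(t, 0.0) + beta * rel - gamma * non
            if w > 0:                      # keep only positively weighted terms
                new_query[t] = w
        return new_query

    # Toy example: one relevant and one non-relevant document (term -> weight).
    q = {"relevance": 1.0, "feedback": 1.0}
    rel = [{"relevance": 0.8, "feedback": 0.6, "query": 0.4}]
    non = [{"genetic": 0.9, "algorithms": 0.7}]
    print(rocchio(q, rel, non))

Richer feedback of the kinds mentioned above (multivalued relevance scales, judgements on important paragraphs) could be incorporated by weighting the document contributions instead of treating all judged documents equally.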