- (1) Introduction
-
Definition of the Task:
Given a topic description and some example relevant documents,
build a filtering profile which will select the most relevant documents
from an incoming stream.
In the TREC 2002 filtering task we will continue to stress adaptive
filtering. However, the batch filtering and routing tasks
will also be available. The details of the requirements
of each sub-task are provided below.
- (2) Detailed Description of Sub-Tasks
-
The Filtering Track will consist of three separate sub-tasks:
Adaptive Filtering, Batch Filtering, and Routing.
- (A) Adaptive Filtering
-
The system starts with a set of topics and a document stream,
and a small number of examples of relevant documents. For each
topic, it must create an initial filtering profile and then analyze
the incoming documents and make a binary decision to retrieve or not
to retrieve each document on a profile-by-profile basis. If a document
is retrieved for a given profile and a relevance judgement exists for
that document/topic pair, this judgement is provided to the system and
the information can be used to update the profile. This step is
designed to simulate interactive user
feedback. Systems are specifically prohibited from using relevance
judgements for documents which are not retrieved by the appropriate
filtering profile, or any relevance judgements (including training data)
for any other topics.
Documents must be processed in the specified time order; once a document
has been selected or rejected for a specific topic it is no longer
available for subsequent reconsideration for that topic.
(The accumulating collection of processed documents may however be
used for such information as term statistics or score distributions.)
The final
output is an unranked set of documents for each topic. Evaluation will
be based on the measures defined in section (4).
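For concreteness, a minimal sketch of this loop is given below in Python. The
profile object, its scoring and update methods, and the retrieval threshold are
placeholders invented for illustration; the track does not prescribe any
particular profile representation or adaptation method.

# Minimal sketch of the adaptive filtering loop described above.
# The profile object, its score/update methods, and the threshold are
# hypothetical placeholders -- the track does not prescribe any of them.

def adaptive_filter(profile, document_stream, judgements):
    """Process documents in time order for one topic, adapting on feedback.

    judgements maps (doc_id, topic_id) -> True/False where a relevance
    judgement exists; unjudged pairs are simply absent.
    """
    retrieved = []
    for doc in document_stream:                 # strict time order
        score = profile.score(doc)              # placeholder scoring function
        if score >= profile.threshold:          # binary retrieval decision
            retrieved.append(doc.doc_id)
            key = (doc.doc_id, profile.topic_id)
            if key in judgements:               # feedback only on retrieved docs
                profile.update(doc, relevant=judgements[key])
        # Retrieved or not, the document text may still be used for
        # collection statistics (term counts, score distributions).
        profile.update_statistics(doc)
    return retrieved                            # unranked set for this topic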
- (B) Batch Filtering
-
The system starts with a set of topics and a set of training documents
for each topic which have been labelled as relevant or not relevant.
For each topic, it must use this information to create a filtering
profile and a binary classification rule which will be applied to an
incoming stream of new documents. Systems are specifically prohibited
from using any relevance judgements (including training data)
for any other topics.
Note that this year's definition does not
include a batch-adaptive task -- a fixed binary classification rule
should be applied to the complete test set.
Thus it may be regarded as similar to routing, except that the final
output is an unranked set of documents for
each topic. Evaluation will be based on the measures defined
in section (4).
- (C) Routing
-
The system starts with a set of topics and a set of training documents
for each topic which have been labelled as relevant or not relevant.
For each topic, it must use this information to create a routing
profile which assigns retrieval scores to the incoming documents.
Systems are specifically prohibited
from using any relevance judgements (including training data)
for any other topics.
The final output is the top 1000 ranked documents
for each topic. The primary evaluation measure will be
uninterpolated average precision as defined in section (4).
Each task will use slightly
different resources and evaluation strategies, as described below.
There are no mandatory task requirements this year: groups may
participate in any or all of these tasks. The
Adaptive Filtering sub-task will be the primary task in the
Filtering track; all groups are strongly encouraged to participate in this task.
The Batch Filtering and Routing tasks are included mainly for historical
continuity. However, it is recognised that some groups have systems
which are not suitable for Adaptive Filtering.
These tasks can also be used as a testing ground
for text categorization and batch-style machine learning systems.
- (3) System Training
-
Since the Filtering Track is working with old data, it is
important to be very clear about what resources may be used to train
the systems. In general, participating groups are free to take
advantage of topics, documents, and relevance judgements included in
the TREC collection or in other collections
which are not being used for the Filtering Track
this year (i.e. any resource not covered in the rest of this
section). Furthermore, other kinds of external resources are also permitted
(dictionaries, thesauri, ontologies,
etc...).
Specific rules apply to the test corpus. The only data to be
used from the test collection is specified in 4(iv) below; specifically,
systems may not use the Reuters categories assigned to the documents.
Furthermore, although the distributed relevance judgements for the
present test collection would allow participants to evaluate their own
runs and choose which runs to submit, participants should take care to
avoid doing this. Ideally, decisions about all aspects of the system
(including the settings of any tuning parameters) should be made on the
basis of external data, including for example
experiments on other test corpora. Participants should decide in
advance exactly what the parameters of their submitted test runs are
to be, and make only those runs. (It is recognised that a run may
fail for technical reasons; under these conditions, of course a new
run may be made and submitted.)
The use of any other Reuters data, including the 1987
Reuters corpus, is also restricted. Participants should avoid training
their systems in any way on the old Reuters corpus.
Notes: (a) We do not anticipate that any significant benefits are likely to be
obtained from the use of the 1987 Reuters data,
since the training set described below
provides more closely related term statistics.
Similarly (although this is more debatable),
we do not anticipate any great benefits from training on the TREC-10 Filtering
Track, since the present topics have very different characteristics.
(b) Nevertheless, we would like participants, as far as possible, to
minimise the effects of such training. We recognise that, in previous
experiments (including the TREC-10 filtering track), some participants may
have trained or tested their systems on the 1987 Reuters corpus or on the
Reuters Corpus Volume 1 to be used in the present experiments.
We understand that it may be difficult
to disentangle the effects of such experiments. The fairest way would be to
retrain on some different dataset those system parameters that were trained or
modified as a result of experiments on Reuters data. We ask participants who
have used Reuters data in the past,
and who may be unclear about what is required, to contact one of the
coordinators.
Participants will be asked to provide
information about the
resources which were used, as detailed in (5)(v).
- (4) Topics, documents and relevance judgements
- (i) Documents
The TREC 2002 filtering track will use the new Reuters Corpus.
This corpus (RCV1 and RCV2) can be obtained from NIST; see the detailed
instructions at Reuters Corpora.
The document collection is divided into training and test sets.
The training set may be used, in the limited fashion specified below
for each task, as part of the construction of the initial profiles.
The training set consists of all documents with dates up to and
including 30 September 1996 (83,650 documents). The test set
consists of all remaining documents (723,141).
- (A) Adaptive Filtering
For each topic, the system begins with a very small number of training
documents (positive examples) from the training set, and the topic
statement. For TREC 2002 the number of training examples per topic has
been set at three. The system may also use any non-relevance-related information
from the training set. During processing of the test set, the system
may use additional
relevance information on documents retrieved by the profile, as specified
in (iii) below. Output is a set of documents from the test set.
- (B) Batch filtering
For each topic, the system may use the full training set
(all relevance information for that topic and any non-relevance-related
information), and the topic
statement, for building the filtering profile. The test set is processed
in its entirety against the profile thus constructed. Output is a
set of documents from the test set.
- (C) Routing
For each topic, the system may use the full training set
(all relevance information for that topic and any non-relevance-related
information), and the topic
statement, for building the routing profile. The test set is processed
in its entirety against the profile thus constructed. Output is a ranked
list of documents from the test set.
- (ii) Topics
The topics have been specially constructed for this TREC, and are of
two types. A set of 50 are of the usual TREC type, developed by the NIST
assessors (with assessor relevance judgements, see below). A second set of
50 have been constructed artificially from intersections of pairs of Reuters
categories. The relevant documents are taken to be those to which both
category labels have been assigned. This second set is included as an
experiment, to help assess whether such a method of construction produces
useful topics for experimental purposes. Participants are expected to process
both sets of topics together.
The topics will be in standard TREC format, and are expected to
be ready in early June. The two sets will form a single sequence.
- (iii) Relevance Judgements
As part of the process of preparing the assessor topics, the NIST assessors
have been making extensive relevance judgements. These judgements will be
made available for the purposes of training and adaptation, to be used in
the manner specified. In addition, some further judgements will be made,
after the submissions are in, on documents submitted by participants. The final
evaluation will be based on all the available judgements.
The file of judgements provided will indicate which documents have been
judged for a particular topic, and which are judged relevant or not relevant
(positive or negative).
Any document not in this list will not have been judged by the assessors.
Within the rules given, participants are free to use these judgements in
any way. For example, unjudged documents may be treated as non-relevant
or ignored or used in some other way.
For the intersection topics, documents which have been assigned both
category labels in the Reuters database are taken to be the (complete)
relevance judgements, and the final evaluation will be based on them.
As there is no natural equivalent to the "judged negative" / "unjudged"
distinction, one has been artificially created for these topics.
For each intersection topic, a sample of documents has been taken from those
that have been assigned one but not both of the category labels. This sample
is taken as the explicit negative judgements; all other documents are to be
regarded as unjudged. This method is intended to imitate the fact that
negative judgements are more likely to have been made on documents "close"
in some sense to the relevant ones, and also to provide a similar number of
negative examples to topics in the assessor set. As before, participants
are free to use the "judged negative" and "unjudged" documents in any way,
in any combination.
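The following sketch illustrates one way such judgements could be derived from
category assignments. The function and its parameters are illustrative only;
the official judgement files distributed for the track are authoritative.

import random

# Illustrative sketch of the sampling scheme described above (not the
# official procedure): for an intersection topic defined by categories
# cat_a and cat_b, documents carrying both labels are relevant, and a
# sample of documents carrying exactly one of the two labels stands in
# for the explicit negative judgements.

def intersection_judgements(doc_categories, cat_a, cat_b, n_negatives, seed=0):
    """doc_categories maps doc_id -> set of Reuters category codes."""
    relevant = [d for d, cats in doc_categories.items()
                if cat_a in cats and cat_b in cats]
    near_misses = [d for d, cats in doc_categories.items()
                   if (cat_a in cats) != (cat_b in cats)]   # exactly one label
    rng = random.Random(seed)
    judged_negative = rng.sample(near_misses, min(n_negatives, len(near_misses)))
    return relevant, judged_negative   # everything else is treated as unjudged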
Please see (3) above for the rules on system training.
- (A) Adaptive Filtering
For initial profile
construction, a small number of positive examples (documents
judged relevant) may be used -- the specified three per profile.
Statistics not relating to relevance from the training data
may also be used (e.g. term statistics). However, no further relevance
information from the training set should be used at any time.
As the system is processing the test
collection, it may use the relevance judgement from any document
that is retrieved to update the profile for filtering future
documents. That is, each adaptive filtering topic can make use of
any training documents (for it or any other adaptive filtering topics)
and the relevance judgements for the appropriate topics on any documents
already retrieved (for
it or any other adaptive filtering topics). No information concerning
any future documents, and no relevance information about any
document-topic pair for which the document was not retrieved, may be used.
In particular, the percentage of documents in the entire test set that are
relevant to a particular topic cannot be used (though of course such
statistics accumulated over previously retrieved documents may be used).
The text
of any document processed (retrieved or unretrieved) may be used to update
term frequency statistics or auxiliary data structures.
- (B) Batch filtering and Routing
The documents and relevance judgements from the training set may
be used to construct the initial filtering profiles. Since neither task is
adaptive, no information from any documents in the test set may be used
in any way in profile construction. Thus neither term statistics, nor
relevance judgements, nor the percentage judged relevant over the test
set, may be used.
For Routing, the top 1000 documents should be returned in score order.
For batch filtering, the binary retrieval decision should not make any
use of information about the test set. Specifically, thresholds cannot be
adjusted according to the distribution of scores in the test set.
- (iv) Processing of documents
The only fields (elements) of the document records
that may be used are: "newsitem" (as indicated above), "headline",
"text", "dateline" and "byline". The category
code fields may not be used: all relevance judgements derived from those fields
(for training or test) will be provided as separate files.
Note: the same restrictions apply to the use of the 1987 Reuters data,
as discussed at the beginning of (3).
For Adaptive Filtering, the test documents should be processed in
order of date, and within date
in order of itemid. Both date and itemid appear in the "newsitem"
element at the start of each document, as the "date" and "itemid" attributes.
They also appear embedded in filenames in the Reuters distribution:
the date (in shortened form, without the hyphens) as the name of
the zipfile containing all documents for that day, and the itemid
in the name of the zipped file containing the document.
(Note that
itemids are unique, but not necessarily appropriately ordered between
days. Within one
day's zipfile the individual document files may not be in itemid order.
Itemids range from 2286 to 86967 (training set) and 86968 to 810936
(test set), but there are some gaps. A file
specifying the required processing order will be available from the NIST
website.)
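For illustration, the required ordering can be reproduced roughly as follows,
assuming (date, itemid) pairs have already been extracted from the documents;
the ordering file on the NIST website remains the authoritative source.

# Sketch of the required processing order for adaptive filtering:
# by date, and within date by itemid.  Assumes (date, itemid) pairs
# have been extracted from each document's "newsitem" attributes.

def processing_order(docs):
    """docs is an iterable of (date, itemid) pairs, e.g. ("1996-10-01", 86968)."""
    return sorted(docs, key=lambda d: (d[0], int(d[1])))

# Example: itemids alone are not a safe sort key across days.
docs = [("1996-10-02", 87010), ("1996-10-01", 87200), ("1996-10-01", 86968)]
assert processing_order(docs) == [
    ("1996-10-01", 86968), ("1996-10-01", 87200), ("1996-10-02", 87010)]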
Documents may be processed individually
or in batches, but in the batched case no information derived
from the batch (e.g. statistics or relevance judgements) can
be used to modify the profiles applied to the batch.
- (v) Summary
To summarize, participants may use any material (for example
TREC documents, topics and relevance judgements) other than Reuters
for initially developing their system. Restrictions
concerning the use of the Reuters document collection, other Reuters
data, and the relevance
judgements for the test topics are defined on a task specific basis, as
described above.
- (5) Evaluation
For each sub-task, a system will return a set of documents for
evaluation. For the filtering tasks the retrieved set is assumed to be
unordered and can be of arbitrary size. The retrieved set of the
routing task is the top 1000 documents. Evaluation will be based on
the relevance judgements already provided for adaptive filtering,
together with a small number of new relevance judgements, made by
NIST assessors on the basis of the sampled and pooled output from
the returned search results for each of the assessor topics.
The filtering sub-tasks will be evaluated primarily
according to a utility measure and a version of the F measure,
and the routing task will be
evaluated primarily according to average uninterpolated precision.
The exact method of pooling and sampling output for the additional
relevance assessments has yet to be decided. It is likely to involve a
selection of runs from the different subtasks, and sampling to give a
fixed number of documents per run selected. These documents will then
be pooled, and any documents already judged will be removed from the
pool, before they are passed to the assessors.
All runs will be evaluated based on the full document test set.
Adaptive filtering systems will also be evaluated at
earlier points to test the learning rates of the systems. This will be
done automatically by the evaluation program. There will be no need to
submit separate retrieved sets for each time point used for adaptive
filtering evaluation.
- (A) Utility
-
Linear utility assigns a positive worth or negative cost to each element in
the contingency table defined below.
                    Relevant      Not Relevant
  Retrieved         R+ / A        N+ / B
  Not Retrieved     R- / C        N- / D

Linear utility = A*R+ + B*N+ + C*R- + D*N-
The variables R+/R-/N+/N- refer to the number of documents in each
category. The utility parameters (A,B,C,D) determine the relative
value of each possible category. The larger the utility score, the
better the filtering system is performing for that query. For this TREC
we will use a single specific linear function:
T11U = 2*R+ - N+
(that is, A=2, B=-1, C=D=0).
This is the same linear utility function as for TRECs 9 and 10
(but see below concerning averaging).
Filtering according to a linear utility function is equivalent to filtering
by estimated probability of relevance: the expected utility of retrieving a
document with relevance probability p is 2p - (1-p), which is positive when
p > 1/3. T11U is therefore equivalent to the retrieval rule:
retrieve if P(rel) > 0.33
For the purpose of averaging across topics, the method used will be that
proposed by Ault, a variant on the scaled utilities used in TRECs 8 and 10.
Topic utilities are first normalised by the maximum for the topic, then scaled
according to some minimum acceptable level,
and the scaled normalised values are averaged. The maximum utility for
the topic is
MaxU = 2*(Total relevant)
so the normalised utility is
T11NU = T11U / MaxU
The lower limit is some negative normalised utility, MinNU,
which may be thought of as the minimum (maximum negative) utility
that a user would tolerate, over the lifetime of the profile.
If the T11NU value falls below this minimum, it will be assumed that the
user stops looking at documents, and therefore the minimum is used.
T11SU = (max(T11NU, MinNU) - MinNU) / (1 - MinNU)   for each topic, and
MeanT11SU = Mean T11SU over topics
Different values of MinNU may be chosen. The primary evaluation measure will use
MinNU = -0.5
but results will also be presented using other values of MinNU.
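The following sketch shows how these quantities might be computed from
per-topic contingency counts; the function and variable names are ours, not
part of any official evaluation software.

# Sketch of the utility measures defined above, computed from per-topic
# contingency counts.  r_plus = relevant retrieved (R+), n_plus =
# non-relevant retrieved (N+), total_relevant = all judged-relevant
# documents for the topic (assumed to be at least one).

def t11u(r_plus, n_plus):
    return 2 * r_plus - n_plus                  # A=2, B=-1, C=D=0

def t11su(r_plus, n_plus, total_relevant, min_nu=-0.5):
    max_u = 2 * total_relevant                  # MaxU
    t11nu = t11u(r_plus, n_plus) / max_u        # normalised utility
    return (max(t11nu, min_nu) - min_nu) / (1 - min_nu)

def mean_t11su(per_topic_counts, min_nu=-0.5):
    """per_topic_counts: list of (R+, N+, total_relevant) tuples."""
    scores = [t11su(r, n, tot, min_nu) for r, n, tot in per_topic_counts]
    return sum(scores) / len(scores)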
- (B) F-beta
-
This measure, used in TREC-10, is a function of recall
and precision, together with a free parameter beta which determines the relative
weighting of recall and precision. For any beta, the measure lies in the range
zero (bad) to 1 (good). For this TREC, as for TREC-10,
a value of beta=0.5 has been chosen,
corresponding to an emphasis on precision (beta=1 is neutral). The measure (with this
choice of beta) may be expressed in terms of the quantities above as follows:
T11F = 0                                         if R+ = N+ = 0
T11F = 1.25*R+ / (0.25*R- + N+ + 1.25*R+)        otherwise
This measure will be calculated for each topic and averaged
across topics.
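A corresponding sketch of T11F, again using our own variable names rather than
any official evaluation code:

# Sketch of the T11F measure defined above (beta = 0.5).
# r_plus = relevant retrieved (R+), n_plus = non-relevant retrieved (N+),
# r_minus = relevant not retrieved (R-).

def t11f(r_plus, n_plus, r_minus):
    if r_plus == 0 and n_plus == 0:             # nothing retrieved
        return 0.0
    return 1.25 * r_plus / (0.25 * r_minus + n_plus + 1.25 * r_plus)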
- (C) Range of measures for filtering
-
Both the above measures, and recall and precision, will be
calculated for all adaptive and batch filtering
runs across the full test set. The reason for this is to
provide rich information about each run, and also to investigate the
behaviour of the different measures.
For adaptive filtering, further analyses will be performed on
different time periods. We may also report variations on the
chosen measures (e.g. for different values of MinNU or beta), again
for the purpose of investigating the behaviour of the measures.
All batch or adaptive filtering runs should be declared as being
optimized for one particular measure, namely either T11U or T11F.
Specifically, participants are encouraged to submit one adaptive
filtering run optimized for the T11U measure.
- (D) Average Uninterpolated Precision
Average uninterpolated precision is the primary measure of evaluation
for the routing sub-task and is defined over a ranked list of
documents. For each relevant document, compute the precision at its
position in the ranked list. Add these numbers up and divide by the
total number of relevant documents. Relevant documents which do not
appear in the top 1000 receive a precision score of zero.
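For illustration, a sketch of this computation, assuming a ranked list of
document ids (the top 1000) and the set of judged-relevant ids for a topic:

# Sketch of average uninterpolated precision as described above.
# ranked is the top-1000 document ids in rank order; relevant is the set
# of all judged-relevant document ids for the topic.

def average_precision(ranked, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at this position
    # relevant documents outside the top 1000 contribute zero
    return precision_sum / len(relevant) if relevant else 0.0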
(6) Submission requirements
The deadline for submitting results to the TREC 2002 filtering track is
September 3, 2002.
All runs should follow the traditional TREC submission format and
should be sent to NIST by the date given above.
The format to use when submitting results is as follows, using a space
as the delimiter between columns. The width of the columns in the format
is not important, but it is important to include all columns and have at
least one space between the columns.
...
R125 Q0 285736 0 4238 xyzT11af5
R125 Q0 644554 1 4223 xyzT11af5
R125 Q0 92760 2 4207 xyzT11af5
R125 Q0 111222 3 4194 xyzT11af5
R125 Q0 801326 4 4189 xyzT11af5
etc.
where:
- the first column is the topic number. Topics are numbered R101-R200.
- the second column is the query number within that topic. This is
currently unused and should always be Q0.
- the third column is the official document number of the retrieved
document -- the itemid as described in (4)(iv).
- the fourth column is the rank at which the document was retrieved, and the
fifth column shows the score (integer or floating point) that generated the
ranking. For routing runs the scores must be in descending
(non-increasing) order; they must be included so that we can handle tied
scores (for a given run) in a uniform fashion (the evaluation routines rank
documents from these scores, not from your ranks).
For batch and adaptive filtering runs these columns will be
ignored in the evaluation, but should nevertheless be present. The fourth
column is arbitrary and could be 0. It is suggested that the fifth column
could still contain the score which the document achieved, for the
possible benefit of future researchers. It is understood that some
systems may not base the retrieval decision on a single score for each
document, and that scores may not be comparable over the lifetime of an
adaptive filtering profile.
- the sixth column is called the "run tag" and should be a unique
identifier for your group AND for the method used. That is, each
run should have a different tag that identifies the group and
the method that produced the run. Please change the tag from year
to year, since often we compare across years (for graphs and such)
and having the same name show up for both years is confusing.
Also please use 12 or fewer letters and numbers, and NO
punctuation, to facilitate labeling graphs and such with the tags.
NIST has a routine that checks for common errors in the result files
including duplicate document numbers for the same topic, invalid document
numbers, wrong format, and duplicate tags across runs. This routine will
be made available to participants to check their runs for errors
prior to submitting them. Submitting runs is an automatic process
done through a web form, and runs that contain errors cannot be processed.
As a safeguard, please send in a sample of results from at least two topics
as soon as possible if you are new to TREC. These will not be used for
anything other than to make sure your formats are correct.
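For illustration only (this is not the official NIST checking routine), a
minimal sketch that writes result lines in the six-column format above and
flags duplicate document numbers within a topic:

# Minimal local sketch (not the official NIST checker): write result lines
# in the six-column format shown above and flag duplicate document numbers
# within a topic.

def write_run(path, rows):
    """rows: iterable of (topic, itemid, rank, score, tag) tuples,
    e.g. ("R125", 285736, 0, 4238, "xyzT11af5")."""
    seen = set()
    with open(path, "w") as out:
        for topic, itemid, rank, score, tag in rows:
            if (topic, itemid) in seen:
                raise ValueError(f"duplicate document {itemid} for topic {topic}")
            seen.add((topic, itemid))
            out.write(f"{topic} Q0 {itemid} {rank} {score} {tag}\n")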
In addition to the retrieved document file, each run should be tagged
with the following information. This should be supplied to NIST via the
usual web form:
- (i) Document File Name - the name of the retrieved document file to
which these tags apply
- (ii) Sub-Task: (ADAPTIVE, BATCH, ROUTING)
-
Exactly one of these
- ADAPTIVE - Adaptive Filtering Run (see 2A)
- BATCH - Batch Filtering Run (see 2B)
- ROUTING - Routing Run (see 2C)
- (iv) Optimization measure: (LINEAR-UTILITY, F-BETA)
Exactly one of these for ADAPTIVE or BATCH.
Since experimental filtering systems are generally optimized for
a specific performance measure, groups should also
specify the measure for which each filtering run has been optimized.
The only two choices this year are:
- LINEAR-UTILITY: utility using the T11U function
- F-BETA: F measure using the T11F function
All routing runs will be automatically evaluated using average
uninterpolated precision.
- (v) Additional Tags: (RESOURCE-TREC, RESOURCE-REUTERS, RESOURCE-OTHER)
Any, all or none of these.
- RESOURCE-TREC - Did this run make any use of any part(s) of the TREC collection
for training, term collection statistics, or building auxiliary data
structures? If yes, include this tag.
- RESOURCE-REUTERS - Did this run make any use of any Reuters data,
for example the 1987 Reuters corpus for training, term collection statistics,
or building auxiliary data
structures? If yes, include this tag.
- RESOURCE-OTHER - Did your system make any use of external resources (other than
TREC documents, topics, and relevance judgements, or Reuters data)
for training, term collection statistics, or building auxiliary data
structures? If yes, include this tag, and please describe them.
- (vi) Number of runs allowed
-
Here are the limits on the number of runs allowed for submission:
(A) Adaptive filtering runs 4
(B) Batch filtering runs 2
(C) Routing runs 2
Therefore, each group may submit between 1 and 8 runs. However, the
limits are defined per category (e.g. submitting 8 batch filtering
runs will not be allowed). As always, groups are encouraged to
generate and evaluate unofficial runs for comparative purposes.
It is likely that not all submitted runs will be sampled for additional
relevance judgements. A final decision on this procedure has not yet been
taken, but if it involves selection of runs, participants may be asked
to nominate which runs they would prefer to have sampled.