- (1) Introduction
-
Definition of the Task:
Given a topic description, build a filtering profile which will select
the most relevant examples from an incoming stream of documents. As
the document stream is processed, the system may be provided with a
binary judgement of relevance for some of the retrieved documents.
This information can be used to adaptively update the filtering
profile.
In the TREC 2001 filtering task we will continue to stress adaptive
filtering. By adaptive filtering, we mean that the system
will not receive a large pool of evaluated documents in advance,
rather the relevance judgements will be provided one by one for
documents which are retrieved as the filtering system is operating.
Once a document has been considered by the system (and a filtering
decision against each topic has been taken) it is no longer available
for subsequent retrieval by the system. The accumulated document
collection, and any statistics derived therefrom, may however be used
in any other way to improve the profiles and thresholds.
The batch filtering task starts with a training set, but is subject
to the same constraints as regards the test set. The traditional routing
option will also be available. The exact details of the requirements
of each sub-task are provided below.
- (2) Detailed Description of Sub-Tasks
-
The TREC 2001 Filtering Track will consist of three separate sub-tasks:
Adaptive Filtering, Batch Filtering, and Routing.
- (A) Adaptive Filtering
-
The system starts with a set of topics and a document stream,
and a small number of examples of relevant documents. For each
topic, it must create an initial filtering profile and then analyze
the incoming documents and make a binary decision to retrieve or not
to retrieve each document on a profile by profile basis. If a document
is retrieved for a given profile and a relevance judgement exists for
that document/topic pair, this judgement is provided to the system and
the information can be used to update the profile. (What constitutes
a relevance judgement for the specific collection and topics to be used
is discussed below.) This step is designed to simulate interactive user
feedback. Systems are specifically prohibited from using relevance
judgements for documents which are not retrieved by the appropriate
filtering profile, or any relevance judgements (including training data)
for any other topics.
The document stream will be ordered (roughly) as a
function of time. Documents must be processed in this order. The final
output is an unranked set of documents for each topic. Evaluation will
be based on the measures defined in section (4).
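For illustration only, here is a minimal sketch (in Python) of the adaptive
filtering loop described above. The profile object, its score, threshold and
update methods, and the judgements lookup are hypothetical names introduced
for this sketch, not part of the track definition:

    # Hypothetical sketch of the adaptive filtering protocol for one topic.
    # `profile` supplies score(), threshold, update() and update_stats();
    # `judgements` maps (itemid, topic) pairs to binary relevance labels.
    def adaptive_filter(profile, doc_stream, judgements, topic):
        retrieved = []
        for doc in doc_stream:                      # documents arrive in date order
            if profile.score(doc) >= profile.threshold:
                retrieved.append(doc.itemid)
                # Feedback exists only for retrieved document/topic pairs.
                label = judgements.get((doc.itemid, topic))
                if label is not None:
                    profile.update(doc, label)      # adapt profile and/or threshold
            # The text of every processed document (retrieved or not) may still
            # feed accumulated collection statistics.
            profile.update_stats(doc)
        return retrieved                            # unranked retrieved set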
- (B) Batch Filtering
-
The system starts with a set of topics and a set of training documents
for each topic which have been labelled as relevant or not relevant.
For each topic, it must use this information to create a filtering
profile and a binary classification rule which will be applied to an
incoming stream of new documents. Systems are specifically prohibited
from using any relevance judgements (including training data)
for any other topics.
Note that this year's definition does not
include a batch-adaptive task -- a fixed binary classification rule
should be applied to the complete test set.
Thus it may be regarded as similar to routing, except that the final
output is an unranked set of documents for
each topic. Evaluation will be based on the measures defined
in section (4).
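As a sketch only, a batch filtering run could be built along the following
lines. The TF-IDF representation and logistic regression classifier used here
are purely illustrative assumptions; any fixed binary rule fitted on the
training set alone satisfies the task definition:

    # Illustrative only: fit a fixed binary rule on the labelled training set
    # for one topic, then apply it unchanged to every test document.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression

    def build_batch_profile(train_texts, train_labels):
        vectorizer = TfidfVectorizer()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vectorizer.fit_transform(train_texts), train_labels)
        return vectorizer, clf

    def apply_profile(vectorizer, clf, test_texts):
        # No adaptation: the same rule is applied to the complete test set.
        return clf.predict(vectorizer.transform(test_texts))   # 1 = retrieve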
- (C) Routing
-
The system starts with a set of topics and a set of training documents
for each topic which have been labelled as relevant or not relevant.
For each topic, it must use this information to create a routing
profile which assigns retrieval scores to the incoming documents.
Systems are specifically prohibited
from using any relevance judgements (including training data)
for any other topics.
The final output is the top 1000 ranked documents
for each topic. The primary evaluation measure will be
uninterpolated average precision as defined in section (4).
In the adaptive filtering task, documents are processed in date order, and
once a document has been selected or rejected
for a specific topic it is no longer available for retrieval for that
topic. However, this document (after selection or rejection) may be added to
an accumulating collection, and this collection
may be used in any other way to improve the profiles. For example, any
term statistics or score distributions from the accumulated collection may
be used to modify a profile or its associated threshold, for application
to the not-yet-considered documents.
Each task will use slightly
different resources and evaluation strategies, as described below.
There are no mandatory task requirements this year: groups may
participate in any or all of these tasks. However, the
Adaptive Filtering sub-task will be the primary task in the TREC 2001
Filtering track. All groups are strongly encouraged to participate in this task.
The Batch Filtering and Routing tasks are included mainly for historical
continuity. Some groups may have systems which are not suitable for
Adaptive Filtering and we don't want to exclude them completely from
the Filtering Track. These tasks can also be used as a testing ground
for text categorization and batch-style machine learning systems.
- (3) Topics, Documents, and Relevance Judgements
-
Since the Filtering Track is working entirely with old data, it is
important to be very clear about what resources may be used to train
the systems. In general, participating groups are free to take
advantage of topics, documents, and relevance judgements included in
the TREC collection or in other collections
which are not being used for the Filtering Track
this year (i.e. any resource not covered in the rest of this
section). Furthermore, other kinds of external resources are also permitted
(dictionaries, thesauri, ontologies,
etc...).
Specific rules apply to any data from Reuters. The hierarchical structure
and relationships in the categories, as indicated in the "codes" files
distributed with the Reuters corpus, may be used. Concerning
the 1987 Reuters corpus, which has already been used by many people for various
experiments: participants may make use of restricted fields of
the 1987 Reuters documents (essentially the text fields), for example for term
statistics. The fields concerned are those specified below for the test corpus
(see 3(iv) below). They may
not however use any Reuters data, or data from any other source,
involving Reuters categories in any form.
Notes: (a) We do not anticipate any great benefit from the use of the text
of the 1987 Reuters data, since the training set described below
provides more closely associated term statistics. (b) It is
possible that some participants have trained their systems on the 1987
Reuters corpus, including categories, in previous experiments.
We understand that it may be difficult
to disentangle the effects of such experiments. The fairest way would be to
retrain on some different dataset those system parameters that were trained or
modified as a result of experiments on 1987 Reuters. We ask participants who
have used 1987 (or any other) Reuters data with categories in the past,
and who may be unclear about what is required, to contact one of the
coordinators.
Participants will be asked to provide
information about the
resources which were used, as detailed in (5)(v).
- (i) Documents
The TREC 2001 filtering track will use the new Reuters Corpus.
This corpus (RCV1 and RCV2) can be obtained from NIST; see the detailed
instructions on the Reuters Corpora page.
The document collection is divided into training and test sets.
The training set may be used, in the limited fashion specified below
for each task, as part of the construction of the initial profiles.
The training set consists of the documents dated August 1996, that is
all documents with itemids up to and including 26150. The test set
consists of all remaining documents.
- (A) Adaptive Filtering
For each topic, the system begins with a very small number of training
documents (positive examples) from the training set, and the topic
statement. The system may also use any non-relevance-related information
from the training set. During processing of the test set, the system
may use additional
relevance information on documents retrieved by the profile, as specified
in (iii) below. Output is a set of documents from the test set.
- (B) Batch filtering
For each topic, the system may use the full training set
(all relevance information for that topic and any non-relevance-related
information), and the topic
statement, for building the filtering profile. The test set is processed
in its entirety against the profile thus constructed. Output is a
set of documents from the test set.
- (C) Routing
For each topic, the system may use the full training set
(all relevance information for that topic and any non-relevance-related
information), and the topic
statement, for building the routing profile. The test set is processed
in its entirety against the profile thus constructed. Output is a ranked
list of documents from the test set.
- (ii) Topics
The topics are derived from the Reuters category codes.
However, they will be provided on the NIST website as a separate
file (details to follow). The category codes on individual document
records from the Reuters corpus
should not be used. Any hierarchical information given in the "codes"
files distributed with the Reuters corpus may be used.
- (iii) Relevance Judgements
Similarly, the relevance judgements are derived from the
assigned category codes, but will be provided in a separate file,
and should not be taken from the document records.
- (A) Adaptive Filtering
For initial profile
construction, a small number of positive examples (documents
judged relevant) may be used -- two per profile.
Statistics not relating to relevance from the training data
may also be used (e.g. term statistics).
As the system is processing the test
collection, it may use the relevance judgement from any document
that is retrieved to update the profile for filtering future
documents. That is, each adaptive filtering topic can make use of its own
training documents and the relevance judgements for that topic on documents
already retrieved for it. No information concerning
any future documents, and no relevance information about any
document-topic pair for which the document was not retrieved, may be used.
In particular, the percentage of documents that
are relevant over the entire test set for a particular topic cannot be
used (though of course such statistics accumulated over previously
retrieved documents may be used).
The text
of any document processed (retrieved or unretrieved) may be used to update
term frequency statistics or auxiliary data structures.
- (B) Batch filtering and Routing
The documents and relevance judgements from the training set may
be used to construct the initial filtering profiles. Since neither task is
adaptive, no information from any documents in the test set may be used
in any way in profile construction. Thus neither term statistics, nor
relevance judgements, nor the percentage judged relevant over the test
set, may be used.
- (iv) Processing of documents
The only fields (elements) of the document records
that may be used are: "newsitem" (as indicated above), "headline",
"text", "dateline" and "byline". The category
code fields may not be used: all relevance judgements derived from those fields
(for training or test) will be provided as separate files.
Note: the same restrictions apply to the use of the 1987 Reuters data,
as discussed at the beginning of (3).
For Adaptive Filtering, the test documents should be processed in
order of date, and within date
in order of itemid. Both date and itemid appear in the "newsitem"
element at the start of each document, as the "date" and "itemid" tags.
They also appear embedded in filenames in the Reuters distribution:
the date (in shortened form, without the hyphens) as the name of
the zipfile containing all documents for that day, and the itemid
in the name of the zipped file containing the document.
(Note that
itemids are unique, but not necessarily appropriately ordered between
days. Within one
day's zipfile the individual document files may not be in itemid order.
Itemids range from 2286 to 26150 (training set) and 26151 to 810936
(test set), but there are some gaps. A file
specifying the required processing order is available from the NIST
website.)
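As an illustration, assuming each test document record exposes its date and
itemid, the required ordering could be produced as follows; alternatively the
ordering file from the NIST website may be used directly:

    # Sketch: sort test documents by date, then by itemid within a date.
    def processing_order(documents):
        return sorted(documents, key=lambda d: (d.date, d.itemid))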
Documents may be processed individually
or in batches, but in the batched case no information derived
from the batch (e.g. statistics or relevance judgements) can
be used to modify the profiles applied to the batch.
- (v) Summary
To summarize, participants may use any material (for example
TREC documents, topics and relevance judgements) other than Reuters
for initially developing their system. Restrictions
concerning the use of the Reuters document collection, other Reuters
data, and the relevance
judgements for the test topics are defined on a task specific basis, as
described above.
- (4) Evaluation
For each sub-task, a system will return a set of documents for
evaluation. For the filtering tasks the retrieved set is assumed to be
unordered and can be of arbitrary size. The retrieved set of the
routing task is the top 1000 documents. Please note that there will
be no additional relevance judgements on the retrieved documents.
All evaluation will be on the basis of the existing relevance
judgements (it is assumed that the Reuters categories have been applied
to all documents to which they are applicable).
The filtering sub-tasks will be evaluated
according to a utility measure and a version of the F measure,
and the routing task will be
evaluated according to average uninterpolated precision.
All runs will be evaluated based on the full document test set.
Adaptive filtering systems will also be evaluated at
earlier points to test the learning rates of the systems. This will be
done automatically by the evaluation program. There will be no need to
submit separate retrieved sets for each time point used for adaptive
filtering evaluation.
- (A) Utility
-
Linear utility assigns a positive worth or negative cost to each element in
the contingency table defined below.
                  Relevant      Not Relevant
  Retrieved       R+ / A        N+ / B
  Not Retrieved   R- / C        N- / D
Linear utility = A*R+ + B*N+ + C*R- + D*N-
The variables R+/R-/N+/N- refer to the number of documents in each
category. The utility parameters (A,B,C,D) determine the relative
value of each possible category. The larger the utility score, the
better the filtering system is performing for that query. For TREC 2001
we will use a single specific linear function:
T10U = 2*R+ - N+
(that is, A=2, B=-1, C=D=0).
This is the same linear utility function as for TREC-9, but without the lower
bound (see below concerning averaging).
Filtering according to a linear utility function is equivalent to filtering
by estimated probability of relevance. T10U is equivalent to the retrieval rule:
retrieve if P(rel) > 1/3
since the expected contribution of retrieving a document, 2*P(rel) - (1 - P(rel)),
is positive exactly when P(rel) exceeds 1/3.
For the purpose of averaging across topics, the method used will be that
used in TREC-8. That is, the topic utilities are scaled between limits,
and the scaled values are averaged. The upper limit is the maximum utility for
that topic, namely
MaxU = 2*(Total relevant)
The lower limit is some negative utility, MinU,
which may be thought of as minus the maximum number of non-relevant documents
that a user would tolerate, with no relevant documents retrieved, over the
lifetime of the profile.
If the T10U value falls below this minimum, it will be assumed that the
user stops looking at documents, and therefore the minimum is used.
        max(T10U, MinU) - MinU
T10SU = ----------------------    for each topic, and
             MaxU - MinU
MeanT10SU = Mean T10SU over topics
Different values of MinU may be chosen: a value of zero means taking all
negative utilities as zero and then normalising by the maximum. A value of
minus infinity is equivalent (in terms of comparing systems) to using
unnormalised utility. The primary evaluation measure will have
MinU = -100
but results will also be presented using different values of MinU.
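As a minimal sketch, the T10U and T10SU computations above can be transcribed
directly as follows; the figures in the final comment are invented purely to
show the arithmetic:

    # Direct transcription of T10U (A=2, B=-1, C=D=0) and the scaled T10SU.
    def t10u(r_plus, n_plus):
        return 2 * r_plus - n_plus

    def t10su(r_plus, n_plus, total_relevant, min_u=-100.0):
        max_u = 2.0 * total_relevant                 # MaxU for this topic
        return (max(t10u(r_plus, n_plus), min_u) - min_u) / (max_u - min_u)

    # Example: 30 relevant and 50 non-relevant retrieved, 80 relevant in total:
    # t10u(30, 50) == 10 and t10su(30, 50, 80) == 110 / 260, roughly 0.42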
- (B) F-beta
-
This measure, based on one defined by van Rijsbergen, is a function of recall
and precision, together with a free parameter beta which determines the relative
weighting of recall and precision. For any beta, the measure lies in the range
zero (bad) to 1 (good). For TREC 2001, a value of beta=0.5 has been chosen,
corresponding to an emphasis on precision (beta=1 is neutral). The measure (with this
choice of beta) may be expressed in terms of the quantities above as follows:
       {  0                          if R+ = N+ = 0
       {
T10F = {         1.25*R+
       {  ----------------------     otherwise
       {  0.25*R- + N+ + 1.25*R+
This measure will be calculated for each topic and averaged
across topics.
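A direct transcription of the T10F formula above, for a single topic:

    # T10F: F measure with beta = 0.5 (R+ relevant retrieved, N+ non-relevant
    # retrieved, R- relevant not retrieved).
    def t10f(r_plus, n_plus, r_minus):
        if r_plus == 0 and n_plus == 0:
            return 0.0
        return 1.25 * r_plus / (0.25 * r_minus + n_plus + 1.25 * r_plus)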
- (C) Range of measures for filtering
-
Both the above measures, and recall and precision, will be
calculated for all adaptive and batch filtering
runs across the full test set. The reason for this is to
provide rich information about each run, and also to investigate the
behaviour of the different measures.
For adaptive filtering, further analyses will be performed on
different time periods. We may also report variations on the
chosen measures (e.g. for different values of MinU or beta), again
for the purpose of investigating the behaviour of the measures.
All batch or adaptive filtering runs should be declared as being
optimized for one particular measure, namely either T10U or T10F.
Specifically, participants are encouraged to submit one adaptive
filtering run optimized for the T10U measure.
- (D) Average Uninterpolated Precision
Average uninterpolated precision is the primary measure of evaluation
for the routing sub-task and is defined over a ranked list of
documents. For each relevant document, compute the precision at its
position in the ranked list. Add these numbers up and divide by the
total number of relevant documents. Relevant documents which do not
appear in the top 1000 receive a precision score of zero.
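For reference, a minimal sketch of this computation, assuming ranked_ids is
the submitted ranking in descending score order and relevant_ids is the full
set of relevant documents for the topic:

    # Average uninterpolated precision over the top 1000 ranked documents.
    def average_precision(ranked_ids, relevant_ids):
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_ids[:1000], start=1):
            if doc_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank    # precision at this relevant doc
        # Relevant documents outside the top 1000 contribute zero.
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0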
- (5) Submission requirements
The deadline for submitting results to the TREC 2001 filtering track is
September 5, 2001.
All runs should follow the traditional TREC submission format and
should be sent to NIST by the date given above. For convenience, we
have taken the relevant section from the general TREC guidelines and
reprint it verbatim below. Note that columns 4 and 5 (rank and score)
are irrelevant for most filtering runs. Please put something in these
columns to make sure that your runs pass the NIST validation checks,
but the columns will only be used for routing and for comparison runs
based on ranked retrieval (4D).
The format to use when submitting results is as follows, using a space
as the delimiter between columns. The width of the columns in the format
is not important, but it is important to include all columns and have at
least one space between the columns.
...
R25 Q0 285736 0 4238 xyzT10af5
R25 Q0 644554 1 4223 xyzT10af5
R25 Q0 12760 2 4207 xyzT10af5
R25 Q0 111222 3 4194 xyzT10af5
R25 Q0 8013262 4 4189 xyzT10af5
etc.
where:
- the first column is the topic number -- these will be numbered
R1-R84.
- the second column is the query number within that topic. This is
currently unused and should always be Q0.
- the third column is the official document number of the retrieved
document -- the itemid as described in (3)(iv).
- the fourth column is the rank at which the document is retrieved, and the fifth
column shows the score (integer or floating point) that generated the
ranking. This score MUST be in descending (non-increasing) order and
is important to include so that we can handle tied scores (for a given
run) in a uniform fashion (the evaluation routines rank documents from
these scores, not from your ranks). Note: For filtering runs where
these two columns are not meaningful, please nevertheless generate
entries consistent with the rules -- e.g. for each topic, count up in
column 4 and down from some large number in column 5 (a short sketch
illustrating this follows the list below). For routing runs use actual
ranks and scores.
- the sixth column is called the "run tag" and should be a unique
identifier for your group AND for the method used. That is, each
run should have a different tag that identifies the group and
the method that produced the run. Please change the tag from year
to year, since often we compare across years (for graphs and such)
and having the same name show up for both years is confusing.
Also please use 12 or fewer letters and numbers, and NO
punctuation, to facilitate labeling graphs and such with the tags.
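As an illustration, result lines for a filtering run (where rank and score are
not meaningful) could be written as follows, counting up in column 4 and down
from an arbitrary large number in column 5; routing runs should instead write
real ranks and scores. The function name and the large constant are assumptions
made for this sketch only:

    # Sketch: append result lines in the required six-column format.
    def write_filtering_run(path, topic, doc_ids, run_tag):
        with open(path, "a") as out:
            for i, doc_id in enumerate(doc_ids):
                out.write(f"{topic} Q0 {doc_id} {i} {100000 - i} {run_tag}\n")

    # write_filtering_run("results.txt", "R25", [285736, 644554, 12760], "xyzT10af5")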
NIST has a routine that checks for common errors in the result files
including duplicate document numbers for the same topic, invalid document
numbers, wrong format, and duplicate tags across runs. This routine will
be made available to participants to check their runs for errors
prior to submitting them. Submitting runs is an automatic process
done through a web form, and runs that contain errors cannot be processed.
As a safeguard, please send in a sample of results from at least two topics
as soon as possible if you are new to TREC. These will not be used for
anything other than to make sure your formats are correct.
In addition to the retrieved document file, each run should be tagged
with the following information. This should be supplied to NIST via the
usual web form:
- (o) Document File Name - the name of the retrieved document file to
which these tags apply
- (i) Sub-Task: (ADAPTIVE, BATCH, ROUTING)
-
Exactly one of these
- ADAPTIVE - Adaptive Filtering Run (see 2A)
- BATCH - Batch Filtering Run (see 2B)
- ROUTING - Routing Run (see 2C)
- (iv) Optimization measure: (LINEAR-UTILITY, F-BETA)
Exactly one of these for ADAPTIVE or BATCH.
Since experimental filtering systems are generally optimized for
a specific performance measure, groups should also
specify the measure for which each filtering run has been optimized.
The only two choices this year are:
- LINEAR-UTILITY: utility using the T10U function
- F-BETA: F measure using the T10F function
All routing runs will be automatically evaluated using average
uninterpolated precision.
- (v) Additional Tags: (RESOURCE-TREC, RESOURCE-REUTERS, RESOURCE-OTHER)
Any, all or none of these.
- RESOURCE-TREC - Did this run make any use of any part(s) of the TREC collection
for training, term collection statistics, or building auxiliary data
structures? If yes, include this tag.
- RESOURCE-REUTERS - Did this run make any use of any Reuters data,
for example the 1987 Reuters corpus for training, term collection statistics,
or building auxiliary data
structures? If yes, include this tag.
- RESOURCE-OTHER - Did your system make any use of external resources (other than
TREC documents, topics, and relevance judgements, or Reuters data)
for training, term collection statistics, or building auxiliary data
structures? If yes, include this tag, and please describe them.
- (vi) Number of runs allowed
-
Here are the limits on the number of runs allowed for submission:
(A) Adaptive filtering runs 4
(B) Batch filtering runs 2
(C) Routing runs 2
Therefore, each group may submit between 1 and 8 runs. However, the
limits are defined per category (e.g. submitting 8 batch filtering
runs will not be allowed). As always, groups are encouraged to
generate and evaluate unofficial runs for comparative purposes.
- (vii) Example Results File
-
--- Start Results File (xyzT10af5) ---
R1 Q0 576243 0 999 xyzT10af5
R1 Q0 213721 1 998 xyzT10af5
R1 Q0 617777 2 997 xyzT10af5
R1 Q0 619543 3 996 xyzT10af5
R1 Q0 20775 4 995 xyzT10af5
R1 Q0 323344 5 994 xyzT10af5
R1 Q0 230506 6 993 xyzT10af5
R1 Q0 131197 7 992 xyzT10af5
R1 Q0 54321 8 991 xyzT10af5
R1 Q0 766778 9 990 xyzT10af5
R1 Q0 200331 10 989 xyzT10af5
...
--- End Results File (xyzT10af5) ---