Guidelines for the TREC 2001 Filtering Track
Stephen Robertson and Jamie Callan
Version 1.0

(replacing Version 0.9--Final Draft)

(1) Introduction

Definition of the Task:

Given a topic description, build a filtering profile which will select the most relevant examples from an incoming stream of documents. As the document stream is processed, the system may be provided with a binary judgement of relevance for some of the retrieved documents. This information can be used to adaptively update the filtering profile.

In the TREC 2001 filtering task we will continue to stress adaptive filtering. By adaptive filtering, we mean that the system will not receive a large pool of evaluated documents in advance; rather, the relevance judgements will be provided one by one for documents which are retrieved as the filtering system is operating. Once a document has been considered by the system (and a filtering decision against each topic has been taken), it is no longer available for subsequent retrieval by the system. The accumulated document collection, and any statistics derived therefrom, may however be used in any other way to improve the profiles and thresholds.

The batch filtering task starts with a training set, but is subject to the same constraints as regards the test set. The traditional routing option will also be available. The exact details of the requirements of each sub-task are provided below.

(2) Detailed Description of Sub-Tasks

The TREC 2001 Filtering Track will consist of three separate sub-tasks: Adaptive Filtering, Batch Filtering, and Routing.

(A) Adaptive Filtering

The system starts with a set of topics and a document stream, and a small number of examples of relevant documents. For each topic, it must create an initial filtering profile and then analyze the incoming documents and make a binary decision to retrieve or not to retrieve each document on a profile by profile basis. If a document is retrieved for a given profile and a relevance judgement exists for that document/topic pair, this judgement is provided to the system and the information can be used to update the profile. (What constitutes a relevance judgement for the specific collection and topics to be used is discussed below.) This step is designed to simulate interactive user feedback. Systems are specifically prohibited from using relevance judgements for documents which are not retrieved by the appropriate filtering profile, or any relevance judgements (including training data) for any other topics. The document stream will be ordered (roughly) as a function of time. Documents must be processed in this order. The final output is an unranked set of documents for each topic. Evaluation will be based on the measures defined in section (4).
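
To make the control flow above concrete, here is a minimal Python sketch of an adaptive filtering loop. The profile object, its score/threshold/update operations, and the judgement lookup are hypothetical placeholders; the sketch is illustrative only and is not part of the track requirements.

    # Minimal sketch of the adaptive filtering loop described above.
    # The profile object and the judgements lookup are hypothetical
    # placeholders, not part of any official track software.

    def adaptive_filter(profile, document_stream, judgements):
        """Process documents in stream order; return the retrieved itemids.

        judgements maps itemid -> True/False for judged document/topic pairs
        and may only be consulted for documents the profile retrieves.
        """
        retrieved = []
        for doc in document_stream:                    # strict date/itemid order
            if profile.score(doc) >= profile.threshold:
                retrieved.append(doc.itemid)
                if doc.itemid in judgements:           # feedback only on retrieved documents
                    profile.update(doc, relevant=judgements[doc.itemid])
            # retrieved or not, the document text may still feed accumulated
            # collection statistics (term frequencies, score distributions)
            profile.add_to_statistics(doc)
        return retrieved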

(B) Batch Filtering

The system starts with a set of topics and a set of training documents for each topic which have been labelled as relevant or not relevant. For each topic, it must use this information to create a filtering profile and a binary classification rule which will be applied to an incoming stream of new documents. Systems are specifically prohibited from using any relevance judgements (including training data) for any other topics. Note that this year's definition does not include a batch-adaptive task -- a fixed binary classification rule should be applied to the complete test set. Thus it may be regarded as similar to routing, except that the final output is an unranked set of documents for each topic. Evaluation will be based on the measures defined in section (4).

(C) Routing

The system starts with a set of topics and a set of training documents for each topic which have been labelled as relevant or not relevant. For each topic, it must use this information to create a routing profile which assigns retrieval scores to the incoming documents. Systems are specifically prohibited from using any relevance judgements (including training data) for any other topics. The final output is the top 1000 ranked documents for each topic. The primary evaluation measure will be uninterpolated average precision as defined in section (4).

In the adaptive filtering task, documents are processed in date order, and once a document has been selected or rejected for a specific topic it is no longer available for retrieval for that topic. However, this document (after selection or rejection) may be added to an accumulating collection, and this collection may be used in any other way to improve the profiles. For example, any term statistics or score distributions from the accumulated collection may be used to modify a profile or its associated threshold, for application to the not-yet-considered documents.

Each task will use slightly different resources and evaluation strategies, as described below.

There are no mandatory task requirements this year: groups may participate in any or all of these tasks. However, the Adaptive Filtering sub-task will be the primary task in the TREC 2001 Filtering track. All groups are strongly encouraged to participate in this task. The Batch Filtering and Routing tasks are included mainly for historical continuity. Some groups may have systems which are not suitable for Adaptive Filtering and we don't want to exclude them completely from the Filtering Track. These tasks can also be used as a testing ground for text categorization and batch-style machine learning systems.

(3) Topics, Documents, and Relevance Judgements

Since the Filtering Track is working entirely with old data, it is important to be very clear about what resources may be used to train the systems. In general, participating groups are free to take advantage of topics, documents, and relevance judgements included in the TREC collection or in other collections which are not being used for the Filtering Track this year (i.e. any resource not covered in the rest of this section). Furthermore, other kinds of external resources are also permitted (dictionaries, thesauri, ontologies, etc...).

Specific rules apply to any data from Reuters. The hierarchical structure and relationships in the categories, as indicated in the "codes" files distributed with the Reuters corpus, may be used. Concerning the 1987 Reuters corpus, which has already been used by many groups for various experiments: participants may make use of restricted fields of the 1987 Reuters documents, essentially the text fields, for purposes such as term statistics. The fields concerned are those specified below for the test corpus (see 3(iv) below). They may not, however, use any Reuters data, or data from any other source, involving Reuters categories in any form.

Notes: (a) We do not anticipate that any great benefit is likely to be obtained from the use of the text of the 1987 Reuters data, since the training set described below provides more closely associated term statistics. (b) It is possible that some participants have trained their systems on the 1987 Reuters corpus, including categories, in previous experiments. We understand that it may be difficult to disentangle the effects of such experiments. The fairest approach would be to retrain, on some different dataset, those system parameters that were trained or modified as a result of experiments on 1987 Reuters. We ask participants who have used 1987 (or any other) Reuters data with categories in the past, and who may be unclear about what is required, to contact one of the coordinators.

Participants will be asked to provide information about the resources which were used, as detailed in (5)(v).

(i) Documents

The TREC 2001 Filtering Track will use the new Reuters Corpus. This corpus (RCV1 and RCV2) can be obtained from NIST; see detailed instructions at Reuters Corpora.

The document collection is divided into training and test sets. The training set may be used, in the limited fashion specified below for each task, as part of the construction of the initial profiles. The training set consists of the documents dated August 1996, that is all documents with itemids up to and including 26150. The test set consists of all remaining documents.

  • (A) Adaptive Filtering

    For each topic, the system begins with a very small number of training documents (positive examples) from the training set, and the topic statement. The system may also use any non-relevance-related information from the training set. During processing of the test set, the system may use additional relevance information on documents retrieved by the profile, as specified in (iii) below. Output is a set of documents from the test set.

  • (B) Batch filtering

    For each topic, the system may use the full training set (all relevance information for that topic and any non-relevance-related information), and the topic statement, for building the filtering profile. The test set is processed in its entirety against the profile thus constructed. Output is a set of documents from the test set.

  • (C) Routing

    For each topic, the system may use the full training set (all relevance information for that topic and any non-relevance-related information), and the topic statement, for building the routing profile. The test set is processed in its entirety against the profile thus constructed. Output is a ranked list of documents from the test set.

(ii) Topics

The topics are derived from the Reuters category codes. However, they will be provided on the NIST website as a separate file (details to follow). The category codes on individual document records from the Reuters corpus should not be used. Any hierarchical information given in the "codes" files distributed with the Reuters corpus may be used.

(iii) Relevance Judgements

Similarly, the relevance judgements are derived from the assigned category codes, but will be provided in a separate file, and should not be taken from the document records.

  • (A) Adaptive Filtering

    For initial profile construction, a small number of positive examples (documents judged relevant) may be used -- two per profile. Statistics not relating to relevance from the training data may also be used (e.g. term statistics). As the system is processing the test collection, it may use the relevance judgement from any document that is retrieved to update the profile for filtering future documents. That is, each adaptive filtering topic can make use of any training documents (for it or any other adaptive filtering topic) and the relevance judgements, for the appropriate topics, on any documents already retrieved (for it or any other adaptive filtering topic). No information concerning any future documents, and no relevance information about any document-topic pair for which the document was not retrieved, may be used. In particular, the percentage of relevant documents over the entire test set for a particular topic cannot be used (though such statistics accumulated over previously retrieved documents may, of course, be used).

    The text of any document processed (retrieved or unretrieved) may be used to update term frequency statistics or auxiliary data structures.

  • (B) Batch filtering and Routing

    The documents and relevance judgements from the training set may be used to construct the initial filtering profiles. Since neither task is adaptive, no information from any documents in the test set may be used in any way in profile construction. Thus neither term statistics, nor relevance judgements, nor the percentage judged relevant over the test set, may be used.

(iv) Processing of documents

The only fields (elements) of the document records that may be used are: "newsitem" (as indicated above), "headline", "text", "dateline" and "byline". The category code fields may not be used: all relevance judgements derived from those fields (for training or test) will be provided as separate files.

Note: the same restrictions apply to the use of the 1987 Reuters data, as discussed at the beginning of (3).

For Adaptive Filtering, the test documents should be processed in order of date, and within date in order of itemid. Both date and itemid appear in the "newsitem" element at the start of each document, as the "date" and "itemid" tags. They also appear embedded in filenames in the Reuters distribution: the date (in shortened form, without the hyphens) as the name of the zipfile containing all documents for that day, and the itemid in the name of the zipped file containing the document.

(Note that itemids are unique, but not necessarily appropriately ordered between days. Within one day's zipfile the individual document files may not be in itemid order. Itemids range from 2286 to 26150 (training set) and 26151 to 810936 (test set), but there are some gaps. A file specifying the required processing order is available from the NIST website.)
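
As an illustration, the required ordering can be reproduced with a simple sort on (date, itemid), as in the Python sketch below. It assumes documents have already been parsed into (date, itemid, path) tuples, and the itemid boundary of 26150 separates the training and test sets as described in (3)(i); the order file from the NIST website remains the authoritative source.

    # Sketch of deriving the required processing order for the test set.
    # Assumes documents have been parsed into (date, itemid, path) tuples.

    def test_set_processing_order(documents):
        """Return test-set documents in date order, and itemid order within a date."""
        test_docs = [d for d in documents if d[1] > 26150]   # itemids 26151+ form the test set
        return sorted(test_docs, key=lambda d: (d[0], d[1])) # sort by (date, itemid)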

Documents may be processed individually or in batches, but in the batched case no information derived from the batch (e.g. statistics or relevance judgements) can be used to modify the profiles applied to the batch.

(v) Summary

To summarize, participants may use any material (for example TREC documents, topics and relevance judgements) other than Reuters for initially developing their system. Restrictions concerning the use of the Reuters document collection, other Reuters data, and the relevance judgements for the test topics are defined on a task specific basis, as described above.

(4) Evaluation

For each sub-task, a system will return a set of documents for evaluation. For the filtering tasks the retrieved set is assumed to be unordered and can be of arbitrary size. The retrieved set of the routing task is the top 1000 documents. Please note that there will be no additional relevance judgements on the retrieved documents. All evaluation will be on the basis of the existing relevance judgements (it is assumed that the Reuters categories have been applied to all documents to which they are appropriate). The filtering sub-tasks will be evaluated according to a utility measure and a version of the F measure, and the routing task will be evaluated according to average uninterpolated precision.

All runs will be evaluated based on the full document test set. Adaptive filtering systems will also be evaluated at earlier points to test the learning rates of the systems. This will be done automatically by the evaluation program. There will be no need to submit separate retrieved sets for each time point used for adaptive filtering evaluation.

(A) Utility

Linear utility assigns a positive worth or negative cost to each element in the contingency table defined below.

                          Relevant        Not Relevant
        Retrieved          R+ / A            N+ / B
        Not Retrieved      R- / C            N- / D

        Linear utility = A*R+ + B*N+ + C*R- + D*N-

The variables R+/R-/N+/N- refer to the number of documents in each category. The utility parameters (A,B,C,D) determine the relative value of each possible category. The larger the utility score, the better the filtering system is performing for that query. For TREC 2001 we will use a single specific linear function:

    T10U = 2*R+ - N+

(that is, A=2, B=-1, C=D=0). This is the same linear utility function as for TREC-9, but without the lower bound (but see below concerning averaging). Filtering according to a linear utility function is equivalent to filtering by estimated probability of relevance. T10U is equivalent to the retrieval rule:

    retrieve if P(rel) > .33
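
To see why this threshold arises, note that retrieving a single document whose estimated probability of relevance is P(rel) changes T10U by +2 with probability P(rel) and by -1 with probability 1 - P(rel), so retrieval is worthwhile exactly when:

    2*P(rel) - 1*(1 - P(rel)) = 3*P(rel) - 1 > 0,   i.e.   P(rel) > 1/3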

For the purpose of averaging across topics, the method used will be that used in TREC-8. That is, the topic utilities are scaled between limits, and the scaled values are averaged. The upper limit is the maximum utility for that topic, namely

    MaxU = 2*(Total relevant)

The lower limit is some negative utility, MinU, which may be thought of as the maximum number of non-relevant documents that a user would tolerate, with no relevants, over the lifetime of the profile. If the T10U value falls below this minimum, it will be assumed that the user stops looking at documents, and therefore the minimum is used.

             max(T10U, MinU) - MinU
    T10SU =  ---------------------    for each topic, and
                  MaxU - MinU

    MeanT10SU = Mean T10SU over topics

Different values of MinU may be chosen: a value of zero means taking all negative utilities as zero and then normalising by the maximum. A value of minus infinity is equivalent (in terms of comparing systems) to using unnormalised utility. The primary evaluation measure will have

    MinU = -100

but results will also be presented using different values of MinU.
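
The computation above can be summarized in a short Python sketch, assuming per-topic counts of relevant retrieved (R+), non-relevant retrieved (N+), and total relevant documents; this is an illustration, not the official evaluation code.

    # Sketch of T10U and the TREC-8-style scaled utility T10SU defined above.
    # Inputs are per-topic counts; MinU = -100 is the primary setting.

    MIN_U = -100

    def t10u(r_plus, n_plus):
        """Linear utility: 2 * (relevant retrieved) - (non-relevant retrieved)."""
        return 2 * r_plus - n_plus

    def t10su(r_plus, n_plus, total_relevant, min_u=MIN_U):
        """Scale a topic's utility between MinU and MaxU = 2 * (total relevant)."""
        max_u = 2 * total_relevant
        u = max(t10u(r_plus, n_plus), min_u)
        return (u - min_u) / (max_u - min_u)

    def mean_t10su(topic_counts, min_u=MIN_U):
        """Average T10SU over an iterable of (R+, N+, total relevant) triples."""
        scores = [t10su(r, n, tot, min_u) for r, n, tot in topic_counts]
        return sum(scores) / len(scores)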

(B) F-beta

This measure, based on one defined by van Rijsbergen, is a function of recall and precision, together with a free parameter beta which determines the relative weighting of recall and precision. For any beta, the measure lies in the range zero (bad) to 1 (good). For TREC 2001, a value of beta=0.5 has been chosen, corresponding to an emphasis on precision (beta=1 is neutral). The measure (with this choice of beta) may be expressed in terms of the quantities above as follows:

            { 0                          if R+=N+=0
            {
    T10F =  {       1.25*R+
            { ----------------------     otherwise
            { 0.25*R- + N+ + 1.25*R+

This measure will be calculated for each topic and averaged across topics.
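
For completeness, the same per-topic counts give the T10F value directly; the sketch below mirrors the formula above and, like the utility sketch, is illustrative rather than the official evaluation code.

    # Sketch of T10F (F measure with beta = 0.5) as defined above.

    def t10f(r_plus, r_minus, n_plus):
        """1.25*R+ / (0.25*R- + N+ + 1.25*R+); zero if nothing is retrieved."""
        if r_plus == 0 and n_plus == 0:
            return 0.0
        return 1.25 * r_plus / (0.25 * r_minus + n_plus + 1.25 * r_plus)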

(C) Range of measures for filtering

Both the above measures, and recall and precision, will be calculated for all adaptive and batch filtering runs across the full test set. The reason for this is to provide rich information about each run, and also to investigate the behaviour of the different measures. For adaptive filtering, further analyses will be performed on different time periods. We may also report variations on the chosen measures (e.g. for different values of MinU or beta), again for the purpose of investigating the behaviour of the measures.

All batch or adaptive filtering runs should be declared as being optimized for one particular measure, namely either T10U or T10F. Specifically, participants are encouraged to submit one adaptive filtering run optimized for the T10U measure.

(D) Average Uninterpolated Precision

Average uninterpolated precision is the primary measure of evaluation for the routing sub-task and is defined over a ranked list of documents. For each relevant document, compute the precision at its position in the ranked list. Add these numbers up and divide by the total number of relevant documents. Relevant documents which do not appear in the top 1000 receive a precision score of zero.
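
The following Python sketch illustrates this computation for one topic, assuming a ranked list of itemids (best first) and the set of all relevant itemids; it is not the official evaluation program.

    # Sketch of average uninterpolated precision over a ranked list (top 1000).

    def average_uninterpolated_precision(ranked, relevant):
        """ranked: submitted ranking, best first; relevant: set of relevant itemids."""
        hits = 0
        precision_sum = 0.0
        for i, docid in enumerate(ranked[:1000], start=1):
            if docid in relevant:
                hits += 1
                precision_sum += hits / i      # precision at this relevant document's rank
        # relevant documents not in the top 1000 contribute zero
        return precision_sum / len(relevant) if relevant else 0.0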

(5) Submission requirements

The deadline for submitting results to the TREC 2001 filtering track is September 5, 2001.

All runs should follow the traditional TREC submission format and should be sent to NIST by the date given above. For convenience, we have taken the relevant section from the general TREC guidelines and reprint it verbatim below. Note that columns 4 and 5 (rank and score) are irrelevant for most filtering runs. Please put something in these columns to make sure that your runs pass the NIST validation checks; the columns will only be used for routing runs and for comparison runs based on ranked retrieval (4D).


The format to use when submitting results is as follows, using a space as the delimiter between columns. The width of the columns in the format is not important, but it is important to include all columns and have at least one space between the columns.

   ...
   R25 Q0 285736  0 4238 xyzT10af5
   R25 Q0 644554  1 4223 xyzT10af5
   R25 Q0 12760  2 4207 xyzT10af5
   R25 Q0 111222  3 4194 xyzT10af5
   R25 Q0 8013262  4 4189 xyzT10af5
      etc.

where:

  • the first column is the topic number -- these will be numbered R1-R84.
  • the second column is the query number within that topic. This is currently unused and should always be Q0.
  • the third column is the official document number of the retrieved document -- the itemid as described in (3)(iv).
  • the fourth column is the rank at which the document is retrieved, and the fifth column shows the score (integer or floating point) that generated the ranking. This score MUST be in descending (non-increasing) order and is important to include so that we can handle tied scores (for a given run) in a uniform fashion (the evaluation routines rank documents from these scores, not from your ranks). Note: for filtering runs where these two columns are not meaningful, please nevertheless generate entries consistent with the rules -- e.g. for each topic, count up in column 4 and down from some large number in column 5 (see the sketch after this list). For routing runs, use actual ranks and scores.
  • the sixth column is called the "run tag" and should be a unique identifier for your group AND for the method used. That is, each run should have a different tag that identifies the group and the method that produced the run. Please change the tag from year to year, since often we compare across years (for graphs and such) and having the same name show up for both years is confusing. Also please use 12 or fewer letters and numbers, and NO punctuation, to facilitate labeling graphs and such with the tags.
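
The sketch below shows one way to generate entries for a filtering run that satisfy these rules, counting up in column 4 and down from a large number in column 5. The topic ids, run tag, and output file name are illustrative only; routing runs would instead write the actual scores produced by the system and the ranks derived from them.

    # Sketch of writing result lines in the required six-column format for a
    # filtering run, where rank and score are placeholders consistent with the
    # rules above (column 4 counts up, column 5 strictly decreases).

    def write_filtering_run(retrieved_by_topic, run_tag, out_path):
        """retrieved_by_topic maps a topic id such as 'R25' to a list of itemids."""
        with open(out_path, "w") as out:
            for topic, itemids in sorted(retrieved_by_topic.items()):
                for rank, itemid in enumerate(itemids):
                    score = 100000 - rank          # dummy score, strictly decreasing
                    out.write(f"{topic} Q0 {itemid} {rank} {score} {run_tag}\n")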

NIST has a routine that checks for common errors in the result files, including duplicate document numbers for the same topic, invalid document numbers, wrong format, and duplicate tags across runs. This routine will be made available to participants to check their runs for errors prior to submitting them. Submitting runs is an automatic process done through a web form, and runs that contain errors cannot be processed.

As a safeguard, please send in a sample of results from at least two topics as soon as possible if you are new to TREC. These will not be used for anything other than to make sure your formats are correct.


In addition to the retrieved document file, each run should be tagged with the following information. This should be supplied to NIST via the usual web form:

(o) Document File Name - the name of the retrieved document file to which these tags apply

(i) Sub-Task: (ADAPTIVE, BATCH, ROUTING)

Exactly one of these

  • ADAPTIVE - Adaptive Filtering Run (see 2A)
  • BATCH - Batch Filtering Run (see 2B)
  • ROUTING - Routing Run (see 2C)

(iv) Optimization measure: (LINEAR-UTILITY, F-BETA)

Exactly one of these for ADAPTIVE or BATCH. Since experimental filtering systems are generally optimized for a specific performance measure, groups should also specify the measure for which each filtering run has been optimized. The only two choices this year are:

  • LINEAR-UTILITY: utility using the T10U function
  • F-BETA: F measure using the T10F function

All routing runs will be automatically evaluated using average uninterpolated precision.

(v) Additional Tags: (RESOURCE-TREC, RESOURCE-REUTERS, RESOURCE-OTHER)

Any, all or none of these.

  • RESOURCE-TREC - Did this run make any use of any part(s) of the TREC collection for training, term collection statistics, or building auxiliary data structures? If yes, include this tag.
  • RESOURCE-REUTERS - Did this run make any use of any Reuters data, for example the 1987 Reuters corpus, for training, term collection statistics, or building auxiliary data structures? If yes, include this tag.
  • RESOURCE-OTHER - Did your system make any use of external resources (other than TREC documents, topics, and relevance judgements, or Reuters data) for training, term collection statistics, or building auxiliary data structures? If yes, include this tag, and please describe them.

(vi) Number of runs allowed

Here are the limits on the number of runs allowed for submission:

(A) Adaptive filtering runs      4
(B) Batch filtering runs         2
(C) Routing runs                 2

Therefore, each group may submit between 1 and 8 runs. However, the limits are defined per category (e.g. submitting 8 batch filtering runs will not be allowed). As always, groups are encouraged to generate and evaluate unofficial runs for comparative purposes.

(vii) Example Results File
--- Start Results File (xyzT10af5) ---

R1 Q0 576243   0 999 xyzT10af5
R1 Q0 213721   1 998 xyzT10af5
R1 Q0 617777   2 997 xyzT10af5
R1 Q0 619543   3 996 xyzT10af5
R1 Q0 20775   4 995 xyzT10af5
R1 Q0 323344   5 994 xyzT10af5
R1 Q0 230506   6 993 xyzT10af5
R1 Q0 131197   7 992 xyzT10af5
R1 Q0 54321   8 991 xyzT10af5
R1 Q0 766778   9 990 xyzT10af5
R1 Q0 200331  10 989 xyzT10af5
...

--- End Results File (xyzT10af5) ---

Last updated: Monday, 18-June-01
trec@nist.gov