- (1) Introduction
-
Definition of the Task:
Given a topic description and some example relevant documents,
build a filtering profile which will select the most relevant documents
from an incoming stream.
In the TREC 2002 filtering task we will continue to stress adaptive
filtering. However, the batch filtering and routing tasks
will also be available. The details of the requirements
of each sub-task are provided below.
- (2) Detailed Description of Sub-Tasks
-
The Filtering Track will consist of three separate sub-tasks:
Adaptive Filtering, Batch Filtering, and Routing.
- (A) Adaptive Filtering
-
The system starts with a set of topics and a document stream,
and a small number of examples of relevant documents. For each
topic, it must create an initial filtering profile and then analyze
the incoming documents and make a binary decision to retrieve or not
to retrieve each document on a profile-by-profile basis. If a document
is retrieved for a given profile and a relevance judgement exists for
that document/topic pair, this judgement is provided to the system and
the information can be used to update the profile. This step is
designed to simulate interactive user
feedback. Systems are specifically prohibited from using relevance
judgements for documents which are not retrieved by the appropriate
filtering profile, or any relevance judgements (including training data)
for any other topics.
Documents must be processed in the specified time order; once a document
has been selected or rejected for a specific topic it is no longer
available for subsequent reconsideration for that topic.
(The accumulating collection of processed documents may however be
used for such information as term statistics or score distributions.)
The final
output is an unranked set of documents for each topic. Evaluation will
be based on the measures defined in section (4).
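For concreteness, a minimal sketch of this loop is given below in Python. The
profile object, its scoring and update methods, and the retrieval threshold are
placeholders invented for illustration; the track does not prescribe any
particular profile representation or adaptation method.

# Minimal sketch of the adaptive filtering loop described above.
# The profile object, its score/update methods, and the threshold are
# hypothetical placeholders -- the track does not prescribe any of them.

def adaptive_filter(profile, document_stream, judgements):
    """Process documents in time order for one topic, adapting on feedback.

    judgements maps (doc_id, topic_id) -> True/False where a relevance
    judgement exists; unjudged pairs are simply absent.
    """
    retrieved = []
    for doc in document_stream:                 # strict time order
        score = profile.score(doc)              # placeholder scoring function
        if score >= profile.threshold:          # binary retrieval decision
            retrieved.append(doc.doc_id)
            key = (doc.doc_id, profile.topic_id)
            if key in judgements:               # feedback only on retrieved docs
                profile.update(doc, relevant=judgements[key])
        # Retrieved or not, the document text may still be used for
        # collection statistics (term counts, score distributions).
        profile.update_statistics(doc)
    return retrieved                            # unranked set for this topic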
- (B) Batch Filtering
-
The system starts with a set of topics and a set of training documents
for each topic which have been labelled as relevant or not relevant.
For each topic, it must use this information to create a filtering
profile and a binary classification rule which will be applied to an
incoming stream of new documents. Systems are specifically prohibited
from using any relevance judgements (including training data)
for any other topics.
Note that this year's definition does not
include a batch-adaptive task -- a fixed binary classification rule
should be applied to the complete test set.
Thus it may be regarded as similar to routing, except that the final
output is an unranked set of documents for
each topic. Evaluation will be based on the measures defined
in section (4).
- (C) Routing
-
The system starts with a set of topics and a set of training documents
for each topic which have been labelled as relevant or not relevant.
For each topic, it must use this information to create a routing
profile which assigns retrieval scores to the incoming documents.
Systems are specifically prohibited
from using any relevance judgements (including training data)
for any other topics.
The final output is the top 1000 ranked documents
for each topic. The primary evaluation measure will be
uninterpolated average precision as defined in section (4).
Each task will use slightly
different resources and evaluation strategies, as described below.
There are no mandatory task requirements this year: groups may
participate in any or all of these tasks. The
Adaptive Filtering sub-task will be the primary task in the
Filtering track; all groups are strongly encouraged to participate in this task.
The Batch Filtering and Routing tasks are included mainly for historical
continuity. However, it is recognised that some groups have systems
which are not suitable for Adaptive Filtering.
These tasks can also be used as a testing ground
for text categorization and batch-style machine learning systems.
- (3) System Training
-
Since the Filtering Track is working with old data, it is
important to be very clear about what resources may be used to train
the systems. In general, participating groups are free to take
advantage of topics, documents, and relevance judgements included in
the TREC collection or in other collections
which are not being used for the Filtering Track
this year (i.e. any resource not covered in the rest of this
section). Furthermore, other kinds of external resources are also permitted
(dictionaries, thesauri, ontologies,
etc...).
Specific rules apply to the test corpus. The only data to be
used from the test collection is specified in 4(iv) below; specifically,
systems may not use the Reuters categories assigned to the documents.
Furthermore, although the distributed relevance judgements for the
present test collection would allow participants to evaluate their own
runs and choose which runs to submit, participants should take care to
avoid doing this. Ideally, decisions about all aspects of the system
(including the settings of any tuning parameters) should be made on the
basis of external data, including for example
experiments on other test corpora. Participants should decide in
advance exactly what the parameters of their submitted test runs are
to be, and make only those runs. (It is recognised that a run may
fail for technical reasons; under these conditions, of course a new
run may be made and submitted.)
The use of any other Reuters data, including the 1987
Reuters corpus, is also restricted. Participants should avoid training
their systems in any way on the old Reuters corpus.
Notes: (a) We do not anticipate that any significant benefits are likely to be
obtained from the use of the 1987 Reuters data,
since the training set described below
provides more closely related term statistics.
Similarly (although this is more debatable),
we do not anticipate any great benefits from training on the TREC-10 Filtering
Track, since the present topics have very different characteristics.
(b) Nevertheless, we would like participants, as far as possible, to
minimise the effects of such training. We recognise that, in previous
experiments (including the TREC-10 filtering track), some participants may
have trained or tested their systems on the 1987 Reuters corpus or on the
Reuters Corpus Volume 1 to be used in the present experiments.
We understand that it may be difficult
to disentangle the effects of such experiments. The fairest way would be to
retrain on some different dataset those system parameters that were trained or
modified as a result of experiments on Reuters data. We ask participants who
have used Reuters data in the past,
and who may be unclear about what is required, to contact one of the
coordinators.
Participants will be asked to provide
information about the
resources which were used, as detailed in (5)(v).
- (4) Topics, documents and relevance judgements
- (i) Documents
The TREC 2002 filtering track will use the new Reuters Corpus.
This corpus (RCV1 and RCV2) can be obtained from NIST; see the detailed
instructions at Reuters Corpora.
The document collection is divided into training and test sets.
The training set may be used, in the limited fashion specified below
for each task, as part of the construction of the initial profiles.
The training set consists of all documents with dates up to and
including 30 September 1996 (83,650 documents). The test set
consists of all remaining documents (723,141).
- (A) Adaptive Filtering
For each topic, the system begins with a very small number of training
documents (positive examples) from the training set, and the topic
statement. For TREC 2002 the number of training examples per topic has
been set at three. The system may also use any non-relevance-related information
from the training set. During processing of the test set, the system
may use additional
relevance information on documents retrieved by the profile, as specified
in (iii) below. Output is a set of documents from the test set.
- (B) Batch filtering
For each topic, the system may use the full training set
(all relevance information for that topic and any non-relevance-related
information), and the topic
statement, for building the filtering profile. The test set is processed
in its entirety against the profile thus constructed. Output is a
set of documents from the test set.
- (C) Routing
For each topic, the system may use the full training set
(all relevance information for that topic and any non-relevance-related
information), and the topic
statement, for building the routing profile. The test set is processed
in its entirety against the profile thus constructed. Output is a ranked
list of documents from the test set.
- (ii) Topics
The topics have been specially constructed for this TREC, and are of
two types. A set of 50 are of the usual TREC type, developed by the NIST
assessors (with assessor relevance judgements, see below). A second set of
50 have been constructed artificially from intersections of pairs of Reuters
categories. The relevant documents are taken to be those to which both
category labels have been assigned. This second set is included as an
experiment, to help assess whether such a method of construction produces
useful topics for experimental purposes. Participants are expected to process
both sets of topics together.
The topics will be in standard TREC format, and are expected to
be ready in early June. The two sets will form a single sequence.
- (iii) Relevance Judgements
As part of the process of preparing the assessor topics, the NIST assessors
have been making extensive relevance judgements. These judgements will be
made available for the purposes of training and adaptation, to be used in
the manner specified. In addition, some further judgements will be made,
after the submissions are in, on documents submitted by participants. The final
evaluation will be based on all the available judgements.
The file of judgements provided will indicate which documents have been
judged for a particular topic, and which are judged relevant or not relevant
(positive or negative).
Any document not in this list will not have been judged by the assessors.
Within the rules given, participants are free to use these judgements in
any way. For example, unjudged documents may be treated as non-relevant
or ignored or used in some other way.
For the intersection topics, documents which have been assigned both
category labels in the Reuters database are taken to be the (complete)
relevance judgements, and the final evaluation will be based on them.
As there is no natural equivalent to the "judged negative" / "unjudged"
distinction, one has been artificially created for these topics.
For each intersection topic, a sample of documents has been taken from those
that have been assigned one but not both of the category labels. This sample
is taken as the explicit negative judgements; all other documents are to be
regarded as unjudged. This method is intended to imitate the fact that
negative judgements are more likely to have been made on documents "close"
in some sense to the relevant ones, and also to provide a similar number of
negative examples to topics in the assessor set. As before, participants
are free to use the "judged negative" and "unjudged" documents in any way,
in any combination.
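The following sketch illustrates one way such judgements could be derived from
category assignments. The function and its parameters are illustrative only;
the official judgement files distributed for the track are authoritative.

import random

# Illustrative sketch of the sampling scheme described above (not the
# official procedure): for an intersection topic defined by categories
# cat_a and cat_b, documents carrying both labels are relevant, and a
# sample of documents carrying exactly one of the two labels stands in
# for the explicit negative judgements.

def intersection_judgements(doc_categories, cat_a, cat_b, n_negatives, seed=0):
    """doc_categories maps doc_id -> set of Reuters category codes."""
    relevant = [d for d, cats in doc_categories.items()
                if cat_a in cats and cat_b in cats]
    near_misses = [d for d, cats in doc_categories.items()
                   if (cat_a in cats) != (cat_b in cats)]   # exactly one label
    rng = random.Random(seed)
    judged_negative = rng.sample(near_misses, min(n_negatives, len(near_misses)))
    return relevant, judged_negative   # everything else is treated as unjudged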
Please see (3) above for the rules on system training.
- (A) Adaptive Filtering
For initial profile
construction, a small number of positive examples (documents
judged relevant) may be used -- the specified three per profile.
Statistics not relating to relevance from the training data
may also be used (e.g. term statistics). However, no further relevance
information from the training set should be used at any time.
As the system is processing the test
collection, it may use the relevance judgement from any document
that is retrieved to update the profile for filtering future
documents. That is, each adaptive filtering topic can make use of
any training documents (for it or any other adaptive filtering topics)
and the relevance judgements for the appropriate topics on any documents
already retrieved (for
it or any other adaptive filtering topics). No information concerning
any future documents, and no relevance information about any
document-topic pair for which the document was not retrieved, may be used.
In particular, the percentage of documents in the entire test set that are
relevant to a particular topic cannot be used (though of course such
statistics accumulated over previously retrieved documents may be used).
The text
of any document processed (retrieved or unretrieved) may be used to update
term frequency statistics or auxiliary data structures.
- (B) Batch filtering and Routing
The documents and relevance judgements from the training set may
be used to construct the initial filtering profiles. Since neither task is
adaptive, no information from any documents in the test set may be used
in any way in profile construction. Thus neither term statistics, nor
relevance judgements, nor the percentage judged relevant over the test
set, may be used.
For Routing, the top 1000 documents should be returned in score order.
For batch filtering, the binary retrieval decision should not make any
use of information about the test set. Specifically, thresholds cannot be
adjusted according to the distribution of scores in the test set.
- (iv) Processing of documents
The only fields (elements) of the document records
that may be used are: "newsitem" (as indicated above), "headline",
"text", "dateline" and "byline". The category
code fields may not be used: all relevance judgements derived from those fields
(for training or test) will be provided as separate files.
Note: the same restrictions apply to the use of the 1987 Reuters data,
as discussed at the beginning of (3).
For Adaptive Filtering, the test documents should be processed in
order of date, and within date
in order of itemid. Both date and itemid appear in the "newsitem"
element at the start of each document, as the "date" and "itemid" attributes.
They also appear embedded in filenames in the Reuters distribution:
the date (in shortened form, without the hyphens) as the name of
the zipfile containing all documents for that day, and the itemid
in the name of the zipped file containing the document.
(Note that
itemids are unique, but not necessarily appropriately ordered between
days. Within one
day's zipfile the individual document files may not be in itemid order.
Itemids range from 2286 to 86967 (training set) and 86968 to 810936
(test set), but there are some gaps. A file
specifying the required processing order will be available from the NIST
website.)
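For illustration, the required ordering can be reproduced roughly as follows,
assuming (date, itemid) pairs have already been extracted from the documents;
the ordering file on the NIST website remains the authoritative source.

# Sketch of the required processing order for adaptive filtering:
# by date, and within date by itemid.  Assumes (date, itemid) pairs
# have been extracted from each document's "newsitem" attributes.

def processing_order(docs):
    """docs is an iterable of (date, itemid) pairs, e.g. ("1996-10-01", 86968)."""
    return sorted(docs, key=lambda d: (d[0], int(d[1])))

# Example: itemids alone are not a safe sort key across days.
docs = [("1996-10-02", 87010), ("1996-10-01", 87200), ("1996-10-01", 86968)]
assert processing_order(docs) == [
    ("1996-10-01", 86968), ("1996-10-01", 87200), ("1996-10-02", 87010)]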
Documents may be processed individually
or in batches, but in the batched case no information derived
from the batch (e.g. statistics or relevance judgements) can
be used to modify the profiles applied to the batch.
- (v) Summary
To summarize, participants may use any material (for example
TREC documents, topics and relevance judgements) other than Reuters
for initially developing their system. Restrictions
concerning the use of the Reuters document collection, other Reuters
data, and the relevance
judgements for the test topics are defined on a task specific basis, as
described above.
- (5) Evaluation
For each sub-task, a system will return a set of documents for
evaluation. For the filtering tasks the retrieved set is assumed to be
unordered and can be of arbitrary size. The retrieved set of the
routing task is the top 1000 documents. Evaluation will be based on
the relevance judgements already provided for adaptive filtering,
together with a small number of new relevance judgements, made by
NIST assessors on the basis of the sampled and pooled output from
the returned search results for each of the assessor topics.
The filtering sub-tasks will be evaluated primarily
according to a utility measure and a version of the F measure,
and the routing task will be
evaluated primarily according to average uninterpolated precision.
The exact method of pooling and sampling output for the additional
relevance assessments has yet to be decided. It is likely to involve a
selection of runs from the different subtasks, and sampling to give a
fixed number of documents per run selected. These documents will then
be pooled, and any documents already judged will be removed from the
pool, before they are passed to the assessors.
All runs will be evaluated based on the full document test set.
Adaptive filtering systems will also be evaluated at
earlier points to test the learning rates of the systems. This will be
done automatically by the evaluation program. There will be no need to
submit separate retrieved sets for each time point used for adaptive
filtering evaluation.
- (A) Utility
-
Linear utility assigns a positive worth or negative cost to each element in
the contingency table defined below.
                    Relevant      Not Relevant
  Retrieved         R+ / A        N+ / B
  Not Retrieved     R- / C        N- / D

Linear utility = A*R+ + B*N+ + C*R- + D*N-
The variables R+/R-/N+/N- refer to the number of documents in each
category. The utility parameters (A,B,C,D) determine the relative
value of each possible category. The larger the utility score, the
better the filtering system is performing for that query. For this TREC
we will use a single specific linear function:
T11U = 2*R+ - N+
(that is, A=2, B=-1, C=D=0).
This is the same linear utility function as for TRECs 9 and 10
(but see below concerning averaging).
Filtering according to a linear utility function is equivalent to filtering
by estimated probability of relevance: the expected utility of retrieving a
document with relevance probability p is 2p - (1-p), which is positive when
p > 1/3. T11U is therefore equivalent to the retrieval rule:
retrieve if P(rel) > 0.33
For the purpose of averaging across topics, the method used will be that
proposed by Ault, a variant on the scaled utilities used in TRECs 8 and 10.
Topic utilities are first normalised by the maximum for the topic, then scaled
according to some minimum acceptable level,
and the scaled normalised values are averaged. The maximum utility for
the topic is
MaxU = 2*(Total relevant)
so the normalised utility is
T11NU = T11U / MaxU
The lower limit is some negative normalised utility, MinNU,
which may be thought of as the minimum (maximum negative) utility
that a user would tolerate, over the lifetime of the profile.
If the T11NU value falls below this minimum, it will be assumed that the
user stops looking at documents, and therefore the minimum is used.
T11SU = (max(T11NU, MinNU) - MinNU) / (1 - MinNU)   for each topic, and
MeanT11SU = Mean T11SU over topics
Different values of MinNU may be chosen. The primary evaluation measure will use
MinNU = -0.5
but results will also be presented using other values of MinNU.
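The following sketch shows how these quantities might be computed from
per-topic contingency counts; the function and variable names are ours, not
part of any official evaluation software.

# Sketch of the utility measures defined above, computed from per-topic
# contingency counts.  r_plus = relevant retrieved (R+), n_plus =
# non-relevant retrieved (N+), total_relevant = all judged-relevant
# documents for the topic (assumed to be at least one).

def t11u(r_plus, n_plus):
    return 2 * r_plus - n_plus                  # A=2, B=-1, C=D=0

def t11su(r_plus, n_plus, total_relevant, min_nu=-0.5):
    max_u = 2 * total_relevant                  # MaxU
    t11nu = t11u(r_plus, n_plus) / max_u        # normalised utility
    return (max(t11nu, min_nu) - min_nu) / (1 - min_nu)

def mean_t11su(per_topic_counts, min_nu=-0.5):
    """per_topic_counts: list of (R+, N+, total_relevant) tuples."""
    scores = [t11su(r, n, tot, min_nu) for r, n, tot in per_topic_counts]
    return sum(scores) / len(scores)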
- (B) F-beta
-
This measure, used in TREC-10, is a function of recall
and precision, together with a free parameter beta which determines the relative
weighting of recall and precision. For any beta, the measure lies in the range
zero (bad) to 1 (good). For this TREC, as for TREC-10,
a value of beta=0.5 has been chosen,
corresponding to an emphasis on precision (beta=1 is neutral). The measure (with this
choice of beta) may be expressed in terms of the quantities above as follows:
T11F = 0                                         if R+ = N+ = 0
T11F = 1.25*R+ / (0.25*R- + N+ + 1.25*R+)        otherwise
This measure will be calculated for each topic and averaged
across topics.
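A corresponding sketch of T11F, again using our own variable names rather than
any official evaluation code:

# Sketch of the T11F measure defined above (beta = 0.5).
# r_plus = relevant retrieved (R+), n_plus = non-relevant retrieved (N+),
# r_minus = relevant not retrieved (R-).

def t11f(r_plus, n_plus, r_minus):
    if r_plus == 0 and n_plus == 0:             # nothing retrieved
        return 0.0
    return 1.25 * r_plus / (0.25 * r_minus + n_plus + 1.25 * r_plus)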
- (C) Range of measures for filtering
-
Both the above measures, and recall and precision, will be
calculated for all adaptive and batch filtering
runs across the full test set. The reason for this is to
provide rich information about each run, and also to investigate the
behaviour of the different measures.
For adaptive filtering, further analyses will be performed on
different time periods. We may also report variations on the
chosen measures (e.g. for different values of MinNU or beta), again
for the purpose of investigating the behaviour of the measures.
All batch or adaptive filtering runs should be declared as being
optimized for one particular measure, namely either T11U or T11F.
Specifically, participants are encouraged to submit one adaptive
filtering run optimized for the T11U measure.
- (D) Average Uninterpolated Precision
Average uninterpolated precision is the primary measure of evaluation
for the routing sub-task and is defined over a ranked list of
documents. For each relevant document, compute the precision at its
position in the ranked list. Add these numbers up and divide by the
total number of relevant documents. Relevant documents which do not
appear in the top 1000 receive a precision score of zero.
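For illustration, a sketch of this computation, assuming a ranked list of
document ids (the top 1000) and the set of judged-relevant ids for a topic:

# Sketch of average uninterpolated precision as described above.
# ranked is the top-1000 document ids in rank order; relevant is the set
# of all judged-relevant document ids for the topic.

def average_precision(ranked, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank        # precision at this position
    # relevant documents outside the top 1000 contribute zero
    return precision_sum / len(relevant) if relevant else 0.0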
(6) Submission requirements
The deadline for submitting results to the TREC 2002 filtering track is
September 3, 2002.
All runs should follow the traditional TREC submission format and
should be sent to NIST by the date given above.
The format to use when submitting results is as follows, using a space
as the delimiter between columns. The width of the columns in the format
is not important, but it is important to include all columns and have at
least one space between the columns.
...
R125 Q0 285736 0 4238 xyzT11af5
R125 Q0 644554 1 4223 xyzT11af5
R125 Q0 92760 2 4207 xyzT11af5
R125 Q0 111222 3 4194 xyzT11af5
R125 Q0 801326 4 4189 xyzT11af5
etc.
where:
- the first column is the topic number. Topics are numbered R101-R200.
- the second column is the query number within that topic. This is
currently unused and should always be Q0.
- the third column is the official document number of the retrieved
document -- the itemid as described in (4)(iv).
- the fourth column is the rank at which the document was retrieved, and the
fifth column shows the score (integer or floating point) that generated the
ranking. For routing runs the scores must be in descending
(non-increasing) order; they must be included so that we can handle tied
scores (for a given run) in a uniform fashion (the evaluation routines rank
documents from these scores, not from your ranks).
For batch and adaptive filtering runs these columns will be
ignored in the evaluation, but should nevertheless be present. The fourth
column is arbitrary and could be 0. It is suggested that the fifth column
could still contain the score which the document achieved, for the
possible benefit of future researchers. It is understood that some
systems may not base the retrieval decision on a single score for each
document, and that scores may not be comparable over the lifetime of an
adaptive filtering profile.
- the sixth column is called the "run tag" and should be a unique
identifier for your group AND for the method used. That is, each
run should have a different tag that identifies the group and
the method that produced the run. Please change the tag from year
to year, since often we compare across years (for graphs and such)
and having the same name show up for both years is confusing.
Also please use 12 or fewer letters and numbers, and NO
punctuation, to facilitate labeling graphs and such with the tags.
NIST has a routine that checks for common errors in the result files
including duplicate document numbers for the same topic, invalid document
numbers, wrong format, and duplicate tags across runs. This routine will
be made available to participants to check their runs for errors
prior to submitting them. Submitting runs is an automatic process
done through a web form, and runs that contain errors cannot be processed.
As a safeguard, please send in a sample of results from at least two topics
as soon as possible if you are new to TREC. These will not be used for
anything other than to make sure your formats are correct.
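For illustration only (this is not the official NIST checking routine), a
minimal sketch that writes result lines in the six-column format above and
flags duplicate document numbers within a topic:

# Minimal local sketch (not the official NIST checker): write result lines
# in the six-column format shown above and flag duplicate document numbers
# within a topic.

def write_run(path, rows):
    """rows: iterable of (topic, itemid, rank, score, tag) tuples,
    e.g. ("R125", 285736, 0, 4238, "xyzT11af5")."""
    seen = set()
    with open(path, "w") as out:
        for topic, itemid, rank, score, tag in rows:
            if (topic, itemid) in seen:
                raise ValueError(f"duplicate document {itemid} for topic {topic}")
            seen.add((topic, itemid))
            out.write(f"{topic} Q0 {itemid} {rank} {score} {tag}\n")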
In addition to the retrieved document file, each run should be tagged
with the following information. This should be supplied to NIST via the
usual web form:
- (i) Document File Name - the name of the retrieved document file to
which these tags apply
- (ii) Sub-Task: (ADAPTIVE, BATCH, ROUTING)
-
Exactly one of these
- ADAPTIVE - Adaptive Filtering Run (see 2A)
- BATCH - Batch Filtering Run (see 2B)
- ROUTING - Routing Run (see 2C)
- (iv) Optimization measure: (LINEAR-UTILITY, F-BETA)
Exactly one of these for ADAPTIVE or BATCH.
Since experimental filtering systems are generally optimized for
a specific performance measure, groups should also
specify the measure for which each filtering run has been optimized.
The only two choices this year are:
- LINEAR-UTILITY: utility using the T11U function
- F-BETA: F measure using the T11F function
All routing runs will be automatically evaluated using average
uninterpolated precision.
- (v) Additional Tags: (RESOURCE-TREC, RESOURCE-REUTERS, RESOURCE-OTHER)
Any, all or none of these.
- RESOURCE-TREC - Did this run make any use of any part(s) of the TREC collection
for training, term collection statistics, or building auxiliary data
structures? If yes, include this tag.
- RESOURCE-REUTERS - Did this run make any use of any Reuters data,
for example the 1987 Reuters corpus for training, term collection statistics,
or building auxiliary data
structures? If yes, include this tag.
- RESOURCE-OTHER - Did your system make any use of external resources (other than
TREC documents, topics, and relevance judgements, or Reuters data)
for training, term collection statistics, or building auxiliary data
structures? If yes, include this tag, and please describe them.
- (vi) Number of runs allowed
-
Here are the limits on the number of runs allowed for submission:
(A) Adaptive filtering runs 4
(B) Batch filtering runs 2
(C) Routing runs 2
Therefore, each group may submit between 1 and 8 runs. However, the
limits are defined per category (e.g. submitting 8 batch filtering
runs will not be allowed). As always, groups are encouraged to
generate and evaluate unofficial runs for comparative purposes.
It is likely that not all submitted runs will be sampled for additional
relevance judgements. A final decision on this procedure has not yet been
taken, but if it involves selection of runs, participants may be asked
to nominate which runs they would prefer to have sampled.