The relevance assessments are in the "qrels" files. They have the following format:
There are 3 qrels files for the 2009 Batch task:
The .eval files (such as refL09B.eval) and .evalH files (such as refL09B.evalH) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.
Note: in the 2009 Batch task:
The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (but the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.
The next 41 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)
Of the remaining measures, those which start with the "est_" prefix use estimation. For individual topics:
Miscellaneous notes for other measures:
The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: H=highly relevant, R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.
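As an illustration of the legend above, the following sketch tallies the judgment codes in a ":relstring:" value. (The helper name and the sample string are hypothetical; only the H/R/N/U/- legend comes from this document.)

```python
from collections import Counter

def tally_relstring(relstring):
    """Count the judgment codes over the first 100 retrieved documents."""
    legend = {"H": "highly relevant", "R": "relevant",
              "N": "non-relevant", "U": "gray", "-": "not shown"}
    counts = Counter(relstring)
    return {legend[c]: counts.get(c, 0) for c in legend}

# A made-up 100-character relstring with 20 of each code:
print(tally_relstring("HRN-U" * 20))
```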
At the end of the .eval or .evalH file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).
The evaluation software (l07_eval.c) is described below.
Median scores for each measure over the 10 submitted runs of the 2009 Batch task are provided in the medL09_all10.eval and medL09_all10.evalH files. (Low and high scores respectively are provided in the loL09* and hiL09* files.)
The median, low and high scores do not include the reference runs (below).
For the 2009 Batch task, results for 4 reference runs are provided:
The source code of the evaluation software is included in l07_sort.c (version 2.1) and l07_eval.c (version 2.3). Pre-compiled Win32 executables are in l07_sort.exe and l07_eval.exe. For Unix, the compilation syntax is typically something like this:
gcc -o l07_eval.exe l07_eval.c -lm
As a usage example, here are the steps for evaluating the refL09B run, assuming it is in the submitted form (i.e., the K and Kh values are appended to the end of the run, as specified by the guidelines for submission).
The first step is to sort the run in canonical order. The canonical order used is based on the trec_eval canonical order:
Note: the specified rank (column 4 of the retrieval set) is not actually a factor in the canonical order.
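The canonical ordering can be sketched as follows, assuming the usual trec_eval tie-breaking (within each topic, descending score, with ties broken by descending docno); this is a hypothetical reimplementation, not the l07_sort code itself.

```python
def canonical_sort(rows):
    """Sort (topic, 'Q0', docno, rank, score, runtag) tuples into
    trec_eval-style canonical order.

    The specified rank (column 4) is not a factor, per the note above.
    """
    # Tie-break first: docno descending (Python's sort is stable,
    # so this order survives the second pass for equal-score ties).
    rows = sorted(rows, key=lambda r: r[2], reverse=True)
    # Primary keys: topic ascending, then score descending.
    rows.sort(key=lambda r: (int(r[0]), -float(r[4])))
    return rows

rows = [("102", "Q0", "aaa10d00", "1", "1.0", "run"),
        ("102", "Q0", "zzz10d00", "2", "1.0", "run"),
        ("102", "Q0", "mmm10d00", "3", "2.0", "run")]
print([r[2] for r in canonical_sort(rows)])
# Highest score first; the score-1.0 tie is broken by descending docno.
```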
To sort the run in canonical order:
1. Save the retrieval set (e.g. refL09B) in a subdirectory named 'unsorted'.
2. Make a subdirectory named 'sorted'.
3. Make a text file named 'sortlist.txt' which contains a line specifying each run's name, such as follows:
refL09B
4. Run l07_sort.exe to sort the runs and output them to the subdirectory named 'sorted' as follows:
l07_sort.exe in=sortlist.txt inDir=unsorted\ outDir=sorted\ trecevalOrder kExtract
Note: on Unix platforms, you may need to use forward slashes instead of backslashes.
The sorting can take several seconds to run.
5. Verify that the sorted run appears in the 'sorted' subdirectory, along with the .K and .Kh files. For refL09B, the first 3 lines of the sorted version should be as follows:
102 Q0 zzz10d00 1 1 refL09B
102 Q0 zzy76d00 1 1 refL09B
102 Q0 zzy42f00 1 1 refL09B
l07_eval assumes the input run is already in evaluation order.
1. To produce the refL09B.eval file from the sorted refL09B file, use the following syntax:
l07_eval.exe run=sorted\refL09B q=qrelsL09.probs out=ignore1 out2=ignore2 out5=refL09B.eval stringDisplay=100 M1000=6910192 probD=6910192 precB=B09.txt estopt=0 Kfile=sorted\refL09B.K
Note: on Unix platforms, you may need to use forward slashes instead of backslashes.
The command can take a few seconds to run. Once it has completed, you should have a refL09B.eval file of scores.
2. To produce the .evalH file (evaluation with just Highly relevant judgments), add 'MinRelLevel=2' to the command-line, read from the .Kh file and write to a .evalH file, as follows:
l07_eval.exe run=sorted\refL09B q=qrelsL09.probs out=ignore1 out2=ignore2 out5=refL09B.evalH stringDisplay=100 M1000=6910192 probD=6910192 precB=B09.txt estopt=0 Kfile=sorted\refL09B.Kh MinRelLevel=2
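Once the .eval or .evalH file exists, its per-topic scores can be pulled out with a short helper. This sketch assumes the trec_eval 8.0-style layout of one whitespace-separated "measure topic value" triple per line; the sample lines and values below are made up for illustration.

```python
def read_eval(lines, measure):
    """Return {topic: value} for one measure from .eval-style lines."""
    scores = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 3 and parts[0] == measure:
            scores[parts[1]] = float(parts[2])
    return scores

# Hypothetical fragment of an .eval file:
sample = ["num_ret      102   1000",
          "est_RB       102   0.7800",
          "est_RB       all   0.7800"]
print(read_eval(sample, "est_RB"))  # per-topic plus the "all" summary
```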
The main measure for the track this year is F1@K (":est_K-F1:").
A secondary (rank-based) measure for the track this year is F1@R (":est_R-F1:").
In 2007, the main measure was Recall@B (":est_RB:").
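For reference, F1 is the harmonic mean of precision and recall; the estimated measures (est_) apply this to precision and recall estimated from the probabilities. A minimal sketch, with made-up numbers:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: retrieving 50 of 100 relevant documents at depth
# K=125 gives precision 50/125 = 0.4 and recall 50/100 = 0.5:
print(f1(0.4, 0.5))  # 0.4444...
```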
In the case of the reference Boolean run (refL09B), the ordering is descending alphabetical by docno. Hence only measures at depth B (or depth K, which is the same as B for the Boolean run) are applicable to the reference Boolean run, because at those depths the ordering does not matter. For example, F1@K and Recall@B are fair measures for comparison with the Boolean run, but F1@R and P@5 are not.
For the other 3 reference runs (fullset09, oldrel09, oldnon09) only measures at depth K are applicable.
Measures that just count highly relevant documents may be considered as important as, or more important than, those based on general relevance (to be discussed at the conference).
The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page.