The relevance assessments are in the "qrels" files. They have the following format:
There are 3 qrels files for the 2008 Ad Hoc task:
For the 2008 Relevance Feedback task, the corresponding qrels files are qrelsRF08.normal, qrelsRF08.probs and qrelsRF08.runids (respectively).
Note: for the Relevance Feedback task, "residual" evaluation was used, i.e. documents judged for the topics in previous years (as listed in the qrelsRF08.pass1 file) were removed from the retrieval sets before pooling and evaluation this year.
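The official residual runs are produced with l07_eval's outResid option (see the Relevance Feedback steps below), but the idea of residual filtering can be illustrated with a rough sketch. The sketch below is only an illustration; it assumes a standard TREC qrels layout (topic, iteration, docno, judgment) for qrelsRF08.pass1 and the usual six-column run format, and it uses the refRF08B run (discussed later) and a made-up output name as placeholders.

# Illustrative sketch only (not the official tool): drop from a run every
# (topic, docno) pair that was already judged in qrelsRF08.pass1.
judged = set()
with open("qrelsRF08.pass1") as f:          # assumed layout: topic iteration docno judgment
    for line in f:
        parts = line.split()
        if len(parts) >= 4:
            judged.add((parts[0], parts[2]))

with open("refRF08B") as run, open("refRF08B-resid-sketch", "w") as out:
    for line in run:                        # assumed layout: topic Q0 docno rank score runid
        parts = line.split()
        if len(parts) >= 6 and (parts[0], parts[2]) in judged:
            continue                        # judged in a previous year: remove from residual set
        out.write(line)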
The .eval files (such as refL08B.eval) and .evalH files (such as refL08B.evalH) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.
Note: in the 2008 Ad Hoc task:
Likewise, in the 2008 Relevance Feedback task:
The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (but the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.
The next 41 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)
Of the remaining measures, those which start with the "est_" prefix use estimation. For individual topics:
Miscellaneous notes for other measures:
The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: H=highly relevant, R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.
At the end of the .eval or .evalH file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).
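For example, assuming the .eval files keep the trec_eval-style three-column layout (measure, topic, value), the per-topic values of a measure can be pulled out and averaged with a short sketch like the following (the file name and the choice of measures are just placeholders):

from collections import defaultdict

# Sketch: recompute the arithmetic mean of selected measures over topics,
# assuming each line is "measure <whitespace> topic <whitespace> value".
values = defaultdict(list)
with open("refL08B.eval") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 3 or parts[1] == "all":
            continue                  # skip malformed lines and the "all" summary rows
        try:
            values[parts[0]].append(float(parts[2]))
        except ValueError:
            pass                      # non-numeric values such as :relstring:
for measure in (":est_K-F1:", ":est_RB:"):
    if values[measure]:
        print(measure, sum(values[measure]) / len(values[measure]))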
The evaluation software (l07_eval.c) is described below.
Median scores for each measure over the 64 submitted runs of the 2008 Ad Hoc task are provided in medL08_all64.eval and medL08_all64.evalH files. (Low and high scores respectively are provided in the loL08* and hiL08* files.)
We will provide medians over the standard condition runs soon (i.e. for runs just using the request text field).
The medians do not include the 5 reference runs (below).
Median scores for each measure over the 29 submitted runs of the 2008 Relevance Feedback task are provided in medRF08_all29.eval and medRF08_all29.evalH files. (Low and high scores respectively are provided in the loRF08* and hiRF08* files.)
We will provide medians over the runs using Feedback techniques soon.
The medians do not include the 2 reference runs (below).
For the 2008 Ad Hoc task, 5 reference runs were included in the pools:
The .eval and .evalH files for each reference run are provided (more analysis will appear in the track overview paper). Note that the xrefL08P and xrefL08C reference runs sometimes included more than 100,000 documents for a topic, which was not allowed for official submitted runs.
For the 2008 Relevance Feedback task, 2 reference runs were included in the pools:
The source code of the evaluation software is included in l07_sort.c and l07_eval.c (versions 2.1). Pre-compiled Win32 executables are in l07_sort.exe and l07_eval.exe. For Unix, the compilation syntax is typically something like this:
gcc -o l07_eval.exe l07_eval.c -lm
As a usage example, here are the steps for evaluating the refL08B run, assuming it is in the submitted form (i.e. the K and Kh values are appended to the end of the run as specified by the submission guidelines).
The first step is to sort the run in canonical order. The canonical order used is based on the trec_eval canonical order:
Note: the specified rank (column 4 of the retrieval set) is not actually a factor in the canonical order.
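l07_sort.exe (used in the steps below) is what actually produces the sorted runs; purely as an illustration of the ordering convention, here is a sketch assuming the usual trec_eval rule of decreasing score within each topic, with ties broken by decreasing docno.

# Illustrative sketch only; l07_sort.exe produces the official sorted runs.
def canonical_sort(lines):
    # assumed layout per line: topic Q0 docno rank score runid
    rows = [line.split() for line in lines if line.strip()]
    rows.sort(key=lambda r: r[2], reverse=True)          # tie-break: decreasing docno
    rows.sort(key=lambda r: float(r[4]), reverse=True)   # primary: decreasing score
    rows.sort(key=lambda r: int(r[0]))                   # group by topic number
    # the submitted rank (column 4) is ignored, per the note above
    return [" ".join(r) for r in rows]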
To sort the run in canonical order:
1. Save the retrieval set (e.g. refL08B) in a subdirectory named 'unsorted'.
2. Make a subdirectory named 'sorted'.
3. Make a text file named 'sortlist.txt' which contains one line per run, giving the run's name, as follows:
refL08B
4. Run l07_sort.exe to sort the runs and output them to the subdirectory named 'sorted' as follows:
l07_sort.exe in=sortlist.txt inDir=unsorted\ outDir=sorted\ trecevalOrder kExtract
Note: on Unix platforms, you may need to use forward slashes instead of backslashes.
The sorting can take several seconds to run.
5. Verify that the sorted run appears in the 'sorted' subdirectory, along with the .K and .Kh files. For refL08B, the first 3 lines of the sorted version should be as follows:
102 Q0 zzz10d00 1 1 refL08B
102 Q0 zzy76d00 1 1 refL08B
102 Q0 zzy42f00 1 1 refL08B
l07_eval assumes the input run is already in evaluation order.
1. To produce the refL08B.eval file from the sorted refL08B file, use the following syntax:
l07_eval.exe run=sorted\refL08B q=qrelsL08.probs out=ignore1 out2=ignore2 out5=refL08B.eval stringDisplay=100 M1000=7000000 probD=6910912 precB=B08.txt estopt=0 Kfile=sorted\refL08B.K
Note: on Unix platforms, you may need to use forward slashes instead of backslashes.
The command can take a few seconds to run. Once it has completed, you should have a refL08B.eval file of scores.
2. To produce the .evalH file (evaluation with just Highly relevant judgments), add 'MinRelLevel=2' to the command-line, read from the .Kh file and write to a .evalH file, as follows:
l07_eval.exe run=sorted\refL08B q=qrelsL08.probs out=ignore1 out2=ignore2 out5=refL08B.evalH stringDisplay=100 M1000=7000000 probD=6910912 precB=B08.txt estopt=0 Kfile=sorted\refL08B.Kh MinRelLevel=2
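As a quick sanity check on the output, and again assuming the trec_eval-style three-column layout (measure, topic, value) with an "all" summary row, the summary value of the track's main measure F1@K (":est_K-F1:", see the recommended measures below) can be pulled out of refL08B.eval like this sketch:

# Sketch: print the "all"-topics summary of :est_K-F1: from refL08B.eval.
with open("refL08B.eval") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 3 and parts[0] == ":est_K-F1:" and parts[1] == "all":
            print("F1@K over all topics:", parts[2])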
Relevance Feedback evaluation has the extra step of producing a residual run. As a usage example, here are the steps for evaluating the refRF08B run, assuming it is in the submitted form (i.e. the K and Kh values are appended to the end of the run as specified by the submission guidelines).
Step 1: Produce a canonically ordered refRF08B run and extract the refRF08B.K and refRF08B.Kh files. (For details, follow the steps above for sorting the refL08B run.)
Step 2: Produce the residual retrieval set refRF08B-resid and the residual refRF08B.Kr file using the following syntax:
l07_eval run=sorted\refRF08B q=qrelsRF08.pass1 out=ignore1 out2=ignore2 out5=ignore5 stringDisplay=100 M1000=7000000 residCap=100000 Kfile=sorted\refRF08B.K outResid=sorted\refRF08B-resid outResidK=sorted\refRF08B.Kr
Furthermore, produce the residual refRF08B.Khr file as follows:
l07_eval run=sorted\refRF08B q=qrelsRF08.pass1 out=ignore1 out2=ignore2 out5=ignore5 stringDisplay=100 M1000=7000000 residCap=100000 Kfile=sorted\refRF08B.Kh outResid=ignoreR outResidK=sorted\refRF08B.Khr
Step 3: Produce the refRF08B-resid.eval file from the sorted refRF08B-resid file and the refRF08B.Kr file using the following syntax:
l07_eval run=sorted\refRF08B-resid q=qrelsRF08.probs out=ignore1 out2=ignore2 out5=refRF08B-resid.eval stringDisplay=100 M1000=100000 probD=6910912 precB=Br08.txt Kfile=sorted\refRF08B.Kr estopt=0
Furthermore, produce the refRF08B-resid.evalH file as follows:
l07_eval run=sorted\refRF08B-resid q=qrelsRF08.probs out=ignore1 out2=ignore2 out5=refRF08B-resid.evalH stringDisplay=100 M1000=100000 probD=6910912 precB=Br08.txt Kfile=sorted\refRF08B.Khr estopt=0 MinRelLevel=2
The main measure for the track this year is F1@K (":est_K-F1:").
A new secondary measure for the track this year is F1@R (":est_R-F1:").
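F1 here is the usual harmonic mean of precision and recall, evaluated at the run's chosen cutoff K (or at R for F1@R). The official ":est_" scores use l07_eval's estimated precision and recall rather than raw counts, but a plain-count sketch of the formula, with hypothetical numbers, is:

# Sketch of the F1 formula at a cutoff (official scores use estimated values).
def f1_at_cutoff(relevant_retrieved, cutoff, total_relevant):
    precision = relevant_retrieved / cutoff if cutoff else 0.0
    recall = relevant_retrieved / total_relevant if total_relevant else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical topic: 60 of the top K=100 documents are relevant, 300 relevant exist.
print(f1_at_cutoff(60, 100, 300))   # precision 0.6, recall 0.2 -> F1 = 0.3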
Last year's main measure was Recall@B (":est_RB:").
The sampling approach favored depths 5 and 100000, so P@5 (":est_P5:") may be a good choice of an early cutoff measure and R@100000 (":est_R100000:") may be a good choice of a deep cutoff measure.
Note that even at the favored depths, the estimates may only have the accuracy of 5 sample points, hence substantial errors are possible on individual topics. Mean scores (over 24 or 26 topics) should be somewhat more reliable than the estimates for individual topics.
In the case of the reference Boolean run (refL08B), the ordering is descending alphabetical by docno. Hence only measures at depth B (or depth K, which is the same as B for the Boolean run) are applicable to the reference Boolean run, i.e. measures for which the ordering does not matter. E.g. F1@K and Recall@B are fair measures for comparing to the Boolean run, but F1@R and P@5 are not.
Measures that just count highly relevant documents may be considered as important as, or more important than, those based on general relevance (to be discussed at the conference).
For the Relevance Feedback task, the recommended measures are the same. The main measure of F1@Kr still appears as ":est_K-F1:" in the .eval file. The Kr values are listed in the ":K:" entries of the .eval file. The Khr values are listed in the ":K:" entries of the .evalH file.
The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page.