TREC 2009 Legal Track: Batch Task Glossary

Revision History

Relevance Assessments

The relevance assessments are in the "qrels" files. They have the following format (a short parsing sketch follows the list):

  1. The 1st column is the topic number (from 7 to 145).
  2. The 2nd column is always a 0.
  3. The 3rd column is the document identifier (e.g. tmw65c00).
  4. The 4th column is the relevance judgment: 2 for "highly relevant", 1 for "relevant", 0 for "non-relevant", and -1 or -2 for "gray". (Note: A "gray" document is one that was presented to the assessor but for which an assessment could not be determined, for example because the document was too long to review (more than 300 pages) or there was a technical problem displaying it. In the assessor system, -1 was "unsure" (the default setting for all documents) and -2 was "unjudged" (the intended label for gray documents).)
  5. The 5th column is the probability the document had of being selected for assessment from the pool of all submitted documents.
  6. The 6th column is the highest rank at which any submitted run retrieved the document (where 1 is the highest possible rank).
  7. The 7th column is one of the systems that retrieved the document at that rank.
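
As an illustration, here is a minimal Python sketch that parses one qrels line according to the column layout above. The record and field names are our own and are not part of the track's tooling:

from typing import NamedTuple

class QrelsRecord(NamedTuple):
    topic: int       # column 1: topic number (7 to 145)
    docno: str       # column 3: document identifier, e.g. tmw65c00
    judgment: int    # column 4: 2, 1, 0, -1 or -2
    prob: float      # column 5: probability of selection for assessment
    best_rank: int   # column 6: highest rank at which any run retrieved it
    system: str      # column 7: one system retrieving it at that rank

def parse_qrels_line(line: str) -> QrelsRecord:
    # Column 2 is always 0 and is discarded here.
    topic, _zero, docno, judgment, prob, rank, system = line.split()
    return QrelsRecord(int(topic), docno, int(judgment),
                       float(prob), int(rank), system)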

There are 3 qrels files for the 2009 Batch task:

.eval and .evalH file format

The .eval files (such as refL09B.eval) and .evalH files (such as refL09B.evalH) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.
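
The per-line layout is not restated here, but given the stated similarity to trec_eval 8.0 output, a minimal Python sketch for loading such a file of scores might look like the following (three whitespace-separated columns assumed: measure, topic or "all", value; the filename is just an example):

def read_eval(path="refL09B.eval"):
    scores = {}
    for line in open(path):
        parts = line.split()
        if len(parts) == 3:
            measure, topic, value = parts
            # keep values as strings: some, such as ":relstring:", are not numeric
            scores.setdefault(topic, {})[measure] = value
    return scores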

Note: in the 2009 Batch task:

.eval and .evalH measures:

The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (but the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.

The next 41 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)

Of the remaining measures, those that start with the "est_" prefix use estimation, based on the selection probabilities in the qrels. For individual topics:

Miscellaneous notes for other measures:

The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: H=highly relevant, R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.

At the end of the .eval or .evalH file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).
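
As background on how the selection probabilities behind the "est_" measures are typically used, here is a rough Python sketch of inverse-probability (Horvitz-Thompson style) estimation of the number of relevant documents for one topic; the exact estimators implemented in l07_eval may differ in detail:

def estimate_num_relevant(judged):
    # judged: (judgment, selection_probability) pairs for one topic's
    # sampled documents; each judged-relevant document (judgment >= 1)
    # stands in for 1/p documents of the pool it was sampled from.
    return sum(1.0 / p for j, p in judged if j >= 1 and p > 0.0)

# e.g. two relevant documents, each sampled with probability 0.1, plus one
# non-relevant document give an estimate of about 20 relevant documents:
print(estimate_num_relevant([(2, 0.1), (1, 0.1), (0, 0.5)]))  # 20.0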

The evaluation software (l07_eval.c) is described below.

Median Scores

Median scores for each measure over the 10 submitted runs of the 2009 Batch task are provided in the medL09_all10.eval and medL09_all10.evalH files. (Low and high scores are provided in the loL09* and hiL09* files, respectively.)

The median, low and high scores do not include the reference runs (described below).

Reference Runs

For the 2009 Batch task, results for 4 reference runs are provided: refL09B (the reference Boolean run), fullset09, oldrel09 and oldnon09.

Evaluation Software

The source code of the evaluation software is included in l07_sort.c (version 2.1) and l07_eval.c (version 2.3). Pre-compiled Win32 executables are in l07_sort.exe and l07_eval.exe. For Unix, the compilation syntax is typically something like this:

gcc -o l07_eval.exe l07_eval.c -lm

Batch Task Evaluation

As a usage example, here are the steps for evaluating the refL09B run, assuming it is in the submitted form (i.e. the K and Kh values are appended to the end of the run, as specified by the guidelines for submission).

Sorting the run

The first step is to sort the run in canonical order. The canonical order used is based on the trec_eval canonical order: topics are sorted in ascending order and, within each topic, documents are ordered by descending score, with ties broken by descending (reverse alphabetical) docno (a sketch follows the note below).

Note: the specified rank (column 4 of the retrieval set) is not actually a factor in the canonical order.
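
A minimal Python sketch of this ordering (our illustration; l07_sort itself is the authoritative implementation):

def canonical_sort(rows):
    # rows: (topic, "Q0", docno, rank, score, runtag) tuples.
    # The rank column is ignored, per the note above.
    rows = sorted(rows, key=lambda r: r[2], reverse=True)         # docno descending
    # Python's sort is stable, so the docno order survives score ties:
    return sorted(rows, key=lambda r: (int(r[0]), -float(r[4])))  # topic asc, score desc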

To sort the run in canonical order:

1. Save the retrieval set (e.g. refL09B) in a subdirectory named 'unsorted'.

2. Make a subdirectory named 'sorted'.

3. Make a text file named 'sortlist.txt' containing one line per run name, for example:

refL09B

4. Run l07_sort.exe to sort the runs and output them to the subdirectory named 'sorted' as follows:

l07_sort.exe in=sortlist.txt inDir=unsorted\ outDir=sorted\ trecevalOrder kExtract

Note: on Unix platforms, you may need to use forward slashes instead of backslashes.

The sorting can take several seconds to run.

5. Verify that the sorted run appears in the 'sorted' subdirectory, along with the .K and .Kh files. For refL09B, the first 3 lines of the sorted version should be as follows (a sketch for spot-checking the full ordering follows the example):

102     Q0      zzz10d00        1       1       refL09B
102     Q0      zzy76d00        1       1       refL09B
102     Q0      zzy42f00        1       1       refL09B
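
If you want to spot-check the whole file rather than just its first lines, a small Python sketch along these lines verifies the canonical ordering (the path is just an example):

def check_canonical(path="sorted/refL09B"):
    rows = [line.split() for line in open(path)]
    def before(a, b):  # True if row a may precede row b in canonical order
        if int(a[0]) != int(b[0]):
            return int(a[0]) < int(b[0])      # topics ascending
        if float(a[4]) != float(b[4]):
            return float(a[4]) > float(b[4])  # scores descending
        return a[2] > b[2]                    # ties: docno descending
    assert all(before(rows[i], rows[i + 1]) for i in range(len(rows) - 1))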

l07_eval command-line

l07_eval assumes the input run is already in evaluation order (i.e. the canonical sorted order produced above).

1. To produce the refL09B.eval file from the sorted refL09B file, use the following syntax:

l07_eval.exe run=sorted\refL09B q=qrelsL09.probs out=ignore1 out2=ignore2 out5=refL09B.eval stringDisplay=100 M1000=6910192 probD=6910192 precB=B09.txt estopt=0 Kfile=sorted\refL09B.K

Note: on Unix platforms, you may need to use forward slashes instead of backslashes.

The command can take a few seconds to run. Once it has completed, you should have a refL09B.eval file of scores.

2. To produce the .evalH file (evaluation using only the Highly relevant judgments), add 'MinRelLevel=2' to the command line, read from the .Kh file, and write to a .evalH file, as follows:

l07_eval.exe run=sorted\refL09B q=qrelsL09.probs out=ignore1 out2=ignore2 out5=refL09B.evalH stringDisplay=100 M1000=6910192 probD=6910192 precB=B09.txt estopt=0 Kfile=sorted\refL09B.Kh MinRelLevel=2
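
If you are scripting the evaluation, a small Python wrapper around the two command lines above might look like this (Unix-style paths; the executable name is the one compiled earlier):

import subprocess

base = ["./l07_eval.exe", "run=sorted/refL09B", "q=qrelsL09.probs",
        "out=ignore1", "out2=ignore2", "stringDisplay=100",
        "M1000=6910192", "probD=6910192", "precB=B09.txt", "estopt=0"]
# all relevance levels -> .eval
subprocess.run(base + ["out5=refL09B.eval", "Kfile=sorted/refL09B.K"], check=True)
# highly relevant only -> .evalH
subprocess.run(base + ["out5=refL09B.evalH", "Kfile=sorted/refL09B.Kh",
                       "MinRelLevel=2"], check=True)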

Recommended Measures

The main measure for the track this year is F1@K (":est_K-F1:").

A secondary (rank-based) measure for the track this year is F1@R (":est_R-F1:").

In 2007, the main measure was Recall@B (":est_RB:").
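
For reference, F1 is the harmonic mean of precision and recall; ":est_K-F1:" is the estimated version of this at depth K. A one-function Python sketch:

def f1(precision: float, recall: float) -> float:
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

print(f1(0.25, 0.60))  # e.g. P@K = 0.25, R@K = 0.60 gives F1@K of about 0.353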

In the case of the reference Boolean run (refL09B), the ordering is descending alphabetical by docno, i.e. arbitrary with respect to relevance. Hence only measures at depth B (or depth K, which is the same as B for the Boolean run) are applicable to the reference Boolean run, because at that depth the ordering does not matter. For example, F1@K and Recall@B are fair measures for comparing to the Boolean run, but F1@R and P@5 are not.

For the other 3 reference runs (fullset09, oldrel09, oldnon09), only measures at depth K are applicable.

Measures that count only highly relevant documents may be considered as important as, or more important than, those based on general relevance (to be discussed at the conference).

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page.