The qrelsL07.normal file uses the common trec_eval qrels format. The 4th column may be 1 (judged relevant), 0 (judged non-relevant), -1 (gray) or -2 (gray).
Note: A "gray" document is a document that was presented to the assessor, but an assessment could not be determined, such as because the document was too long to review (more than 300 pages) or there was a technical problem displaying the document.
The qrelsL07.probs file is the same as qrelsL07.normal except that it has a 5th column listing the probability that the document had of being selected for assessment from the pool of all submitted documents. qrelsL07.probs can be used with the experimental l07_eval utility to estimate precision and recall to depth 25000 for runs that contributed to the pool. (All submitted main task runs were included in the pool.)
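For illustration, here is a minimal Python sketch of reading qrelsL07.probs into memory. The column layout is as described above; the file handling and variable names are just an example, not part of the released software:

    # Each line of qrelsL07.probs is: topic iteration docno rel prob
    # (qrelsL07.normal is the same format without the final prob column).
    judgments = {}  # maps (topic, docno) -> (rel, prob)
    with open("qrelsL07.probs") as f:
        for line in f:
            topic, _iteration, docno, rel, prob = line.split()
            judgments[(topic, docno)] = (int(rel), float(prob))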
The .eval files (such as refL07B.eval) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.
The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (but the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.
The next 37 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)
Of the remaining measures, those which start with the "est_" prefix use the selection probabilities to estimate scores for individual topics.
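As a sketch only (the exact estimator implemented in l07_eval is not documented here), the 5th-column probabilities admit a standard inverse-probability (Horvitz-Thompson style) estimate of the number of relevant documents among the first k retrieved. The function below builds on the judgments dictionary from the earlier sketch and is hypothetical, not the l07_eval code:

    # Hypothetical inverse-probability estimate: each sampled relevant
    # document counts as 1/prob; documents never sampled for judging
    # contribute nothing.
    def est_relevant_at_k(ranked_docnos, topic, judgments, k):
        total = 0.0
        for docno in ranked_docnos[:k]:
            entry = judgments.get((topic, docno))
            if entry is not None:
                rel, prob = entry
                if rel == 1 and prob > 0:
                    total += 1.0 / prob
        return total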
Miscellaneous notes for other measures:
The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.
At the end of the .eval file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).
The evaluation software (l07_eval.c and l07_eval.exe) will be made available soon, once sufficient usage documentation has been written. (For example, unlike trec_eval, l07_eval requires a run to have its documents in evaluation order; a separate utility was used to sort the runs for pooling, and l07_eval used these sorted runs.)
The sampling approach favored depths 5, B and 25000, so estimated measures at those depths should be favored. The primary measure for the track is Estimated Recall@B (":est_RB:").
Note that even at the favored depths, the estimates may only have the accuracy of 5 sample points, hence substantial errors are possible on individual topics. Mean scores (over 43 topics) should be somewhat more reliable than the estimates for individual topics.
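Continuing the hypothetical inverse-probability sketch above, Estimated Recall@B for a topic would take the following general form, where B is the topic's depth and the denominator estimates the total number of relevant documents for the topic. This is a sketch of the general technique, not the exact l07_eval computation:

    # Sketch: est_RB = (estimated relevant in the top B retrieved) /
    # (estimated relevant documents overall for the topic).
    def est_recall_at_B(ranked_docnos, topic, judgments, B):
        est_rel_ret = est_relevant_at_k(ranked_docnos, topic, judgments, B)
        est_rel_total = sum(1.0 / prob
                            for (t, _docno), (rel, prob) in judgments.items()
                            if t == topic and rel == 1 and prob > 0)
        return est_rel_ret / est_rel_total if est_rel_total else 0.0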
In the case of the reference boolean run (refL07B), the ordering is descending alphabetical by docno. Hence only measures at depth B, for which the ordering does not matter, are applicable to the reference boolean run.
The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators:
Jason R. Baron, jason.baron (at) nara.gov
Douglas W. Oard, oard (at) umd.edu
Paul Thompson, paul.thompson (at) dartmouth.edu
Stephen Tomlinson, stephent (at) magma.ca