TREC 2007 Legal Track: Main Task Glossary

Revision History

qrelsL07.normal

The qrelsL07.normal file uses the common trec_eval qrels format. The 4th column may be 1 (judged relevant), 0 (judged non-relevant), or -1 or -2 (both codes mark "gray" documents).

Note: A "gray" document is a document that was presented to the assessor but for which an assessment could not be determined, for example because the document was too long to review (more than 300 pages) or because a technical problem prevented it from being displayed.
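
For concreteness, here is a minimal sketch in Python of reading the file, assuming the standard four-column trec_eval qrels layout (topic, iteration, docno, judgment); the function name is illustrative:

    from collections import Counter

    def read_qrels(path):
        """Read a trec_eval-format qrels file: topic iteration docno judgment."""
        judgments = {}
        with open(path) as f:
            for line in f:
                topic, _iteration, docno, judgment = line.split()
                judgments[(topic, docno)] = int(judgment)
        return judgments

    # Tally the judgment codes: 1=relevant, 0=non-relevant, -1/-2=gray
    qrels = read_qrels("qrelsL07.normal")
    print(Counter(qrels.values()))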

qrelsL07.probs

The qrelsL07.probs file is the same as qrelsL07.normal except that it has a 5th column listing the probability with which the document was selected for assessment from the pool of all submitted documents. qrelsL07.probs can be used with the experimental l07_eval utility to estimate precision and recall to depth 25000 for runs that contributed to the pool. (All submitted main task runs were included in the pool.)
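
As an illustration of how such inclusion probabilities are typically used, here is a sketch of a Horvitz-Thompson-style estimate of the number of relevant documents for a topic; this shows the general technique only, and the exact arithmetic inside l07_eval may differ:

    def read_qrels_probs(path):
        """Read qrelsL07.probs: topic iteration docno judgment probability."""
        rows = []
        with open(path) as f:
            for line in f:
                topic, _iteration, docno, judgment, prob = line.split()
                rows.append((topic, docno, int(judgment), float(prob)))
        return rows

    def estimate_num_relevant(rows, topic):
        """Each sampled relevant document stands in for 1/probability
        documents of the pool (a Horvitz-Thompson estimator; an assumption
        here -- the actual l07_eval computation may differ)."""
        return sum(1.0 / p for t, _d, j, p in rows if t == topic and j == 1)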

.eval file format

The .eval files (such as refL07B.eval) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.

The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (though the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.
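
For example, the traditional (unestimated) precision@k can be computed directly from the ranked list and the judgments; a sketch, treating unjudged documents as non-relevant in the usual trec_eval manner:

    def precision_at_k(ranked_docnos, qrels, topic, k):
        """Fraction of the top k retrieved documents judged relevant
        (judgment > 0); unjudged and gray documents count as non-relevant,
        matching the usual trec_eval convention."""
        top = ranked_docnos[:k]
        return sum(1 for d in top if qrels.get((topic, d), 0) > 0) / k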

The next 37 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)

Of the remaining measures, those which start with the "est_" prefix use the probabilities for estimation.

Miscellaneous notes on individual measures:

  • Even though the first 5 documents of all submitted runs for each topic were presented to the assessor, ":est_P5:" (estimated precision@5) can differ from the traditional "P5" (precision@5) because of gray documents.
  • The probabilities of document selection are only guaranteed to be at least 0.0002 (1 in 5000); hence 5000 was chosen as the basis for the marginal precision measures (":est_MP2nd5000:", ":est_MP3rd5000:", ":est_MP4th5000:", ":est_MP5th5000:"), which are the estimated precisions over the rank ranges 5001-10000, 10001-15000, 15001-20000 and 20001-25000. (Use ":est_P5000:" for ranks 1-5000.) A sketch of this computation follows this list.
  • The MJ measures (":MJ1st5000:", etc.) specify the number of judged documents in each range of 5000, and the est_MJ measures (":est_MJ1st5000:", etc.) specify the estimated number of judged documents in each range (which, if the run retrieved to depth 25000, would be 5000 for each range if not for gray documents and sampling error).
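
To make the marginal precision idea concrete (as referenced in the list above), here is a sketch of estimating precision over one rank range from the sampled judgments; the helper names and the choice of denominator are assumptions about the general approach, not a description of the exact l07_eval arithmetic:

    def est_marginal_precision(ranked_docnos, probs_by_doc, topic, lo, hi):
        """Estimated precision over retrieval ranks lo..hi (e.g. 5001..10000).

        probs_by_doc maps (topic, docno) to (judgment, probability) from
        qrelsL07.probs.  Each sampled document is weighted by 1/probability;
        documents never selected for assessment contribute nothing.
        """
        est_rel = est_judged = 0.0
        for docno in ranked_docnos[lo - 1:hi]:
            row = probs_by_doc.get((topic, docno))
            if row is None:
                continue  # never selected for assessment
            judgment, prob = row
            est_judged += 1.0 / prob
            if judgment == 1:
                est_rel += 1.0 / prob
        # Dividing by the estimated judged count discounts gray documents;
        # dividing by the range size (hi - lo + 1) would be the alternative.
        return est_rel / est_judged if est_judged else 0.0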

    The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.

At the end of the .eval file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).

The evaluation software (l07_eval.c and l07_eval.exe) will be made available soon, once sufficient usage documentation has been written. (For example, l07_eval requires a run to have its documents in evaluation order, unlike trec_eval; a separate utility was used to sort the runs for pooling, and l07_eval used these sorted runs.)
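
As a rough illustration of what "evaluation order" involves, the sketch below sorts a standard TREC-format run (topic, Q0, docno, rank, score, tag) by topic and then by descending score; this is one common convention and only an assumption about how the actual sorting utility behaved:

    def sort_run(path_in, path_out):
        """Sort a run file into per-topic descending-score order.
        Ties are broken by docno; the actual pooling utility may have
        used a different tie-breaking rule."""
        entries = []
        with open(path_in) as f:
            for line in f:
                topic, _q0, docno, _rank, score, _tag = line.split()
                entries.append((int(topic), -float(score), docno, line))
        entries.sort()
        with open(path_out, "w") as out:
            for _, _, _, line in entries:
                out.write(line)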

Recommended Measures

The sampling approach favored depths 5, B and 25000, so the estimated measures at those depths should be preferred. (For each topic, B is the number of documents retrieved by the reference Boolean run for that topic.) The primary measure for the track is Estimated Recall@B (":est_RB:").

Note that even at the favored depths, the estimates may have only the accuracy of 5 sample points, so substantial errors are possible on individual topics. Mean scores (over the 43 topics) should be somewhat more reliable than the estimates for individual topics.

In the case of the reference Boolean run (refL07B), the ordering is descending alphabetical by docno; hence only measures at depth B, for which the ordering does not matter, are applicable to the reference Boolean run.

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page. For additional questions, please contact one of the track coordinators:

Jason R. Baron      jason.baron (at) nara.gov
Douglas W. Oard     oard (at) umd.edu
Paul Thompson       paul.thompson (at) dartmouth.edu
Stephen Tomlinson   stephent (at) magma.ca