TREC 2008 Legal Track: Ad Hoc and Relevance Feedback Task Glossary

Revision History

Relevance Assessments

The relevance assessments are in the "qrels" files. They have the following format (a minimal parsing sketch follows the list):

  1. The 1st column is the topic number (from 105 to 150).
  2. The 2nd column is always a 0.
  3. The 3rd column is the document identifier (e.g. tmw65c00).
  4. The 4th column is the relevance judgment: 2 for "highly relevant", 1 for "relevant", 0 for "non-relevant", and -1 or -2 for "gray". (Note: A "gray" document is a document that was presented to the assessor but for which an assessment could not be determined, for example because the document was too long to review (more than 300 pages) or there was a technical problem displaying the document. In the assessor system, -1 was "unsure" (the default setting for all documents) and -2 was "unjudged" (the intended label for gray documents).)
  5. The 5th column is the probability the document had of being selected for assessment from the pool of all submitted documents.
  6. The 6th column is the highest rank at which any submitted run retrieved the document (where 1 is the highest possible rank).
  7. The 7th column is one of the systems that retrieved the document at that rank.
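
As an illustration of the layout above, a minimal Python sketch for reading one line in the full seven-column form described here might look as follows. The function and field names are ours, not part of the track's software, and some qrels variants may carry fewer columns:

    # Sketch: parse one whitespace-separated qrels line into its seven fields.
    # Field meanings follow the column descriptions above; names are illustrative.
    def parse_qrels_line(line):
        topic, _zero, docno, judgment, prob, rank, runid = line.split()
        return {
            "topic": int(topic),        # 105-150
            "docno": docno,             # e.g. tmw65c00
            "judgment": int(judgment),  # 2, 1, 0, -1 or -2 (gray)
            "prob": float(prob),        # probability of selection for assessment
            "rank": int(rank),          # highest rank at which any run retrieved it
            "runid": runid,             # one system that retrieved it at that rank
        }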

There are 3 qrels files for the 2008 Ad Hoc task: qrelsL08.normal, qrelsL08.probs and qrelsL08.runids.

For the 2008 Relevance Feedback task, the corresponding qrels files are qrelsRF08.normal, qrelsRF08.probs and qrelsRF08.runids (respectively).

Note: for the Relevance Feedback task, "residual" evaluation was used; i.e., documents judged for the topics in previous years (as listed in the qrelsRF08.pass1 file) were removed from the retrieval sets before pooling and evaluation this year.
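
As a rough illustration of this residual principle only (the official residual sets are produced by l07_eval's outResid options described later in this document), removing previously judged documents from a run might be sketched as below. The file names mirror the examples later in this document, and the sketch ignores details such as the 100,000-document residual cap:

    # Sketch: drop (topic, docno) pairs already judged in previous years
    # (as listed in qrelsRF08.pass1) from a submitted run before evaluation.
    def load_judged(pass1_path):
        judged = set()
        with open(pass1_path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3:
                    judged.add((fields[0], fields[2]))  # (topic, docno)
        return judged

    def write_residual(run_path, judged, out_path):
        with open(run_path) as fin, open(out_path, "w") as fout:
            for line in fin:
                fields = line.split()
                if len(fields) >= 3 and (fields[0], fields[2]) not in judged:
                    fout.write(line)

    # Example (hypothetical paths):
    # write_residual("refRF08B", load_judged("qrelsRF08.pass1"), "refRF08B-resid")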

.eval and .evalH file format

The .eval files (such as refL08B.eval) and .evalH files (such as refL08B.evalH) produced by l07_eval use a format similar to that of trec_eval 8.0, with several additional measures.

Note: in the 2008 Ad Hoc task:

Likewise, in the 2008 Relevance Feedback task:

.eval and .evalH measures:

The first 27 measures for each topic ("num_ret" through "P1000") are the same as in trec_eval 8.0 (but the calculation was done with the l07_eval software instead of trec_eval). These measures do not use the probabilities for estimation. They are defined in http://trec.nist.gov/pubs/trec15/appendices/CE.MEASURES06.pdf.

The next 41 measures for each topic (":S1:" through ":SLastJudgedB:") also do not use the probabilities for estimation. (We will provide definitions in a later version of this document.)

Of the remaining measures, those which start with the "est_" prefix use estimation. For individual topics:

Miscellaneous notes for other measures:

  • Even though the first 5 documents of all submitted runs for each topic were presented to the assessor, ":est_P5:" (estimated precision@5) can differ from the traditional "P5" (precision@5) because of gray documents.
  • Marginal precision measures using 25000 as a basis have been added this year (":est_MP2nd25000:", ":est_MP3rd25000:", ":est_MP4th25000:") which are the estimated precisions over the ranges 25001-50000, 50001-75000 and 75001-100000 respectively. (Use ":est_P25000:" for 1-25000.)
  • The MJ measures (":MJ1st25000:", etc.) specify the number of judged documents in each range of 25000, and the est_MJ measures (":est_MJ1st25000:", etc.) specify the estimated number of judged documents in each range (which, if the run retrieved to depth 100000, would be 25000 for each range if not for gray documents and sampling error). (A rough sketch of this kind of sampling-based estimation follows this list.)
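
The exact estimators are implemented in l07_eval. As rough intuition only, a sampling-based precision estimate over a rank range might weight each sampled relevant document by the inverse of its selection probability (the 5th qrels column), as sketched below. This illustrates the general idea under that assumption; it is not the official formula:

    # Rough intuition only: estimate precision over ranks lo..hi of one topic's
    # retrieval list by weighting each judged, sampled document by 1/p, where p
    # is the selection probability from the qrels.  Gray documents are skipped.
    def estimated_precision(ranked_docs, judgments, probs, lo, hi):
        """ranked_docs: docnos in retrieval order; judgments: docno -> judgment;
        probs: docno -> selection probability.  Not the official l07_eval estimator."""
        est_rel = 0.0
        for docno in ranked_docs[lo - 1:hi]:
            j = judgments.get(docno)
            if j is None or j < 0:      # unjudged or gray: contributes no evidence
                continue
            if j >= 1:                  # relevant or highly relevant
                est_rel += 1.0 / probs[docno]
        return est_rel / (hi - lo + 1)  # divide by the size of the rank range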

The ":relstring:" listing for each topic uses the following codes for the first 100 retrieved documents: H=highly relevant, R=relevant, N=non-relevant, U=gray. A hyphen (-) is used for documents which were not shown to the assessor.
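
A minimal sketch of how such a string could be assembled from the judgments, using the codes above (the names are ours):

    # Sketch: build a relstring for the first 100 retrieved documents of a topic.
    # H = highly relevant, R = relevant, N = non-relevant, U = gray,
    # '-' = not shown to the assessor.
    CODES = {2: "H", 1: "R", 0: "N", -1: "U", -2: "U"}

    def relstring(ranked_docs, judgments, depth=100):
        """judgments maps docno -> judgment for documents shown to the assessor."""
        return "".join(
            CODES[judgments[d]] if d in judgments else "-"
            for d in ranked_docs[:depth]
        )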

At the end of the .eval or .evalH file, the summary over "all" topics is given for each measure. For all of the estimated ("est_") measures, the arithmetic mean is used (all topics are weighted equally). The mean is also used for most other measures (except a few of the trec_eval measures).

The evaluation software (l07_eval.c) is described below.

Median Scores

Ad Hoc task: L08_all64

Median scores for each measure over the 64 submitted runs of the 2008 Ad Hoc task are provided in the medL08_all64.eval and medL08_all64.evalH files. (Low and high scores respectively are provided in the loL08* and hiL08* files.)

We will provide medians over the standard condition runs soon (i.e. for runs just using the request text field).

The medians do not include the 5 reference runs (below).

Relevance Feedback task: RF08_all29

Median scores for each measure over the 29 submitted runs of the 2008 Relevance Feedback task are provided in the medRF08_all29.eval and medRF08_all29.evalH files. (Low and high scores respectively are provided in the loRF08* and hiRF08* files.)

We will provide medians over the runs using feedback techniques soon.

The medians do not include the 2 reference runs (below).

Reference Runs

For the 2008 Ad Hoc task, 5 reference runs were included in the pools:

The .eval and .evalH files for each reference run are provided (more analysis will appear in the track overview paper). Note that the xrefL08P and xrefL08C reference runs sometimes included more than 100,000 documents for a topic, which was not allowed for official submitted runs.

For the 2008 Relevance Feedback task, 2 reference runs were included in the pools:

Evaluation Software

The source code of the evaluation software is included in l07_sort.c and l07_eval.c (versions 2.1). Pre-compiled Win32 executables are in l07_sort.exe and l07_eval.exe. For Unix, the compilation syntax is typically something like this:

    gcc -o l07_eval.exe l07_eval.c -lm
    

Ad Hoc Task Evaluation

As a usage example, here are the steps for evaluating the refL08B run, assuming it is in the submitted form (i.e. the K and Kh values are appended to the end of the run as specified by the guidelines for submission).

Sorting the run

The first step is to sort the run in canonical order. The canonical order used is based on the trec_eval canonical order:

Note: the specified rank (column 4 of the retrieval set) is not actually a factor in the canonical order.
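
For orientation only: the trec_eval-style canonical order is commonly taken to be topics in ascending order, then score descending, with ties broken by document identifier in descending order (consistent with the sorted example further below). The official tool is l07_sort, used in the steps that follow; the sketch here just illustrates that assumed ordering:

    # Sketch of the assumed canonical ordering for a run in the standard
    # six-column format: topic, Q0, docno, rank, score, runid.
    # Sort key: topic ascending, then score descending, then docno descending.
    # The rank column (column 4) is not used.  l07_sort is the official tool.
    def canonical_sort(lines):
        # Stable sorts: apply the least significant key first.
        lines = sorted(lines, key=lambda l: l.split()[2], reverse=True)        # docno desc
        lines = sorted(lines, key=lambda l: float(l.split()[4]), reverse=True) # score desc
        lines = sorted(lines, key=lambda l: int(l.split()[0]))                 # topic asc
        return lines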

To sort the run in canonical order:

1. Save the retrieval set (e.g. refL08B) in a subdirectory named 'unsorted'.

2. Make a subdirectory named 'sorted'.

3. Make a text file named 'sortlist.txt' which contains a line specifying each run's name, as follows:

    refL08B
    

4. Run l07_sort.exe to sort the runs and output them to the subdirectory named 'sorted' as follows:

    l07_sort.exe in=sortlist.txt inDir=unsorted\ outDir=sorted\ trecevalOrder kExtract
    

Note: on Unix platforms, you may need to use forward slashes instead of backslashes.

The sorting can take several seconds to run.

5. Verify that the sorted run appears in the 'sorted' subdirectory, along with the .K and .Kh files. For refL08B, the first 3 lines of the sorted version should be as follows:

    102     Q0      zzz10d00        1       1       refL08B
    102     Q0      zzy76d00        1       1       refL08B
    102     Q0      zzy42f00        1       1       refL08B
    

l07_eval command-line

l07_eval assumes the input run is already in evaluation order.

1. To produce the refL08B.eval file from the sorted refL08B file, use the following syntax:

    l07_eval.exe run=sorted\refL08B q=qrelsL08.probs out=ignore1 out2=ignore2 out5=refL08B.eval stringDisplay=100 M1000=7000000 probD=6910912 precB=B08.txt estopt=0 Kfile=sorted\refL08B.K

Note: on Unix platforms, you may need to use forward slashes instead of backslashes.

The command can take a few seconds to run. Once it has completed, you should have a refL08B.eval file of scores.

2. To produce the .evalH file (evaluation with just Highly relevant judgments), add 'MinRelLevel=2' to the command-line, read from the .Kh file and write to a .evalH file, as follows:

    l07_eval.exe run=sorted\refL08B q=qrelsL08.probs out=ignore1 out2=ignore2 out5=refL08B.evalH stringDisplay=100 M1000=7000000 probD=6910912 precB=B08.txt estopt=0 Kfile=sorted\refL08B.Kh MinRelLevel=2
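
Once the .eval and .evalH files exist, it can be handy to pull out a single measure. A small sketch, assuming trec_eval-style lines of whitespace-separated measure, topic and value columns (verify this layout against your own output before relying on it):

    # Sketch: collect one measure (e.g. ":est_K-F1:") per topic from an .eval file,
    # assuming trec_eval-style lines of the form: <measure> <topic> <value>.
    def read_measure(eval_path, measure=":est_K-F1:"):
        scores = {}
        with open(eval_path) as f:
            for line in f:
                fields = line.split()
                if len(fields) == 3 and fields[0] == measure:
                    scores[fields[1]] = float(fields[2])  # topic "all" holds the mean
        return scores

    # Example (hypothetical): read_measure("refL08B.eval")["all"]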

Relevance Feedback Task Evaluation

Relevance Feedback evaluation has the extra step of producing a residual run. As a usage example, here are the steps for evaluating the refRF08B run, assuming it is in the submitted form (i.e. the K and Kh values are appended to the end of the run as specified by the guidelines for submission).

Step 1: Produce a canonically ordered refRF08B run and extract the refRF08B.K and refRF08B.Kh files. (For details, follow the steps above for sorting the refL08B run.)

Step 2: Produce the residual retrieval set refRF08B-resid and the residual refRF08B.Kr file using the following syntax:

    l07_eval run=sorted\refRF08B q=qrelsRF08.pass1 out=ignore1 out2=ignore2 out5=ignore5 stringDisplay=100 M1000=7000000 residCap=100000 Kfile=sorted\refRF08B.K outResid=sorted\refRF08B-resid outResidK=sorted\refRF08B.Kr

Furthermore, produce the residual refRF08B.Khr file as follows:

    l07_eval run=sorted\refRF08B q=qrelsRF08.pass1 out=ignore1 out2=ignore2 out5=ignore5 stringDisplay=100 M1000=7000000 residCap=100000 Kfile=sorted\refRF08B.Kh outResid=ignoreR outResidK=sorted\refRF08B.Khr

Step 3: Produce the refRF08B-resid.eval file from the sorted refRF08B-resid file and the refRF08B.Kr file using the following syntax:

    l07_eval run=sorted\refRF08B-resid q=qrelsRF08.probs out=ignore1 out2=ignore2 out5=refRF08B-resid.eval stringDisplay=100 M1000=100000 probD=6910912 precB=Br08.txt Kfile=sorted\refRF08B.Kr estopt=0

Furthermore, produce the refRF08B-resid.evalH file as follows:

    l07_eval run=sorted\refRF08B-resid q=qrelsRF08.probs out=ignore1 out2=ignore2 out5=refRF08B-resid.evalH stringDisplay=100 M1000=100000 probD=6910912 precB=Br08.txt Kfile=sorted\refRF08B.Khr estopt=0 MinRelLevel=2
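
As a quick sanity check on the residual output (a sketch under the same column-layout assumptions as the earlier residual illustration; the authoritative residual sets remain those produced by l07_eval):

    # Sketch: sanity-check a residual run against qrelsRF08.pass1, i.e. confirm
    # that no (topic, docno) pair judged in a previous year remains in the set.
    def check_residual(resid_path, pass1_path):
        judged = set()
        with open(pass1_path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3:
                    judged.add((fields[0], fields[2]))   # (topic, docno)
        with open(resid_path) as f:
            for line in f:
                fields = line.split()
                if len(fields) >= 3 and (fields[0], fields[2]) in judged:
                    return False                         # previously judged doc remained
        return True

    # Example (hypothetical paths): check_residual("sorted/refRF08B-resid", "qrelsRF08.pass1")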

Recommended Measures

The main measure for the track this year is F1@K (":est_K-F1:").

A new secondary measure for the track this year is F1@R (":est_R-F1:").

Last year's main measure was Recall@B (":est_RB:").
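
For reference, F1 at a cutoff is the harmonic mean of precision and recall at that cutoff; the arithmetic is sketched below (the .eval files remain the authoritative source of the scores):

    # Sketch: F1 at a cutoff (e.g. the run's K value) from precision and recall.
    def f1(precision, recall):
        if precision + recall == 0.0:
            return 0.0
        return 2.0 * precision * recall / (precision + recall)

    # Example: if est_P@K = 0.30 and est_R@K = 0.60, then F1@K = 0.40.
    print(f1(0.30, 0.60))  # 0.4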

The sampling approach favored depths 5 and 100000, so P@5 (":est_P5:") may be a good choice of an early cutoff measure and R@100000 (":est_R100000:") may be a good choice of a deep cutoff measure.

Note that even at the favored depths, the estimates may only have the accuracy of 5 sample points, hence substantial errors are possible on individual topics. Mean scores (over 24 or 26 topics) should be somewhat more reliable than the estimates for individual topics.

In the case of the reference Boolean run (refL08B), the ordering is descending alphabetical by docno. Hence only measures at depth B (or depth K, which is the same as B for the Boolean run) are applicable to the reference Boolean run, since for those the ordering does not matter; e.g. F1@K and Recall@B are fair measures for comparing to the Boolean run, but F1@R and P@5 are not.

Measures that count just highly relevant documents may be considered as important as, or more important than, those based on general relevance (to be discussed at the conference).

For the Relevance Feedback task, the recommended measures are the same. The main measure of F1@Kr still appears as ":est_K-F1:" in the .eval file. The Kr values are listed in the ":K:" entries of the .eval file. The Khr values are listed in the ":K:" entries of the .evalH file.

For Additional Information

The track Web site at http://trec-legal.umiacs.umd.edu/ contains links to resources and background information. The track mailing list archives can be reached through a link from that page.