The evaluation scripts are real-time-filtering-modelA-eval.py and
real-time-filtering-modelB-eval.py for Tasks A and B respectively. Each
script takes three arguments: the judgments file, the clusters file, and
the run file. For example, to invoke the Task A script on run 'runA', use:

  python real-time-filtering-modelA-eval.py -q qrels.txt -c clusters-2015.json -r runA

The judgments file is qrels.txt. Judgment sets were built as follows,
with all runs contributing to the pools.

* For Task A runs, for each topic, the first 10 (by delivery decision
  time) tweets per day are added to the pool.

* For Task B runs, for each topic, up to 85 (the pool depth) tweets are
  added to the pool, doing a round-robin by rank across days: first all
  rank 1 tweets from all days with tweets are added, then all rank 2
  tweets, and so on. If a day runs out of tweets before the limit is
  reached, additional tweets are taken from the days that still have
  tweets until either no tweets remain or the 85-tweet limit is reached.
  (A sketch of this procedure appears at the end of this section.)

NIST assessors judged these pools for 51 topics, assigning relevance
labels of 0 for not relevant, 1 for relevant, and 2 for highly relevant.
The 51 topics judged are:

  226,227,228,236,242,243,246,248,249,253,
  254,255,260,262,265,267,278,284,287,298,
  305,324,326,331,339,344,348,353,354,357,
  359,362,366,371,377,379,383,384,389,391,
  392,400,401,405,409,416,419,432,434,439,448

Tweets were then manually clustered into equivalence classes. An unjudged
retweet encountered in a run was mapped to its corresponding cluster and
assigned a relevance judgment based on the label of the majority of the
judged tweets in that cluster. These 'propagated' judgments were added to
the qrels file using different labels to distinguish them from
assessor-judged tweets: -1 for not relevant, 3 for relevant, and 4 for
highly relevant.

The clusters file is clusters-2015.json. As noted above, this file
defines the manually-produced equivalence classes of tweets.
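
The following is a minimal sketch of the Task B round-robin pooling
described above, for one topic of one run. The data layout (a dict mapping
each day to that day's ranked list of tweet IDs) is assumed for
illustration only; it is not the format consumed by the official scripts.

from itertools import zip_longest


def pool_task_b(ranked_by_day, depth=85):
    """Round-robin by rank across days: all rank-1 tweets from every day,
    then all rank-2 tweets, and so on, until `depth` tweets are pooled or
    no tweets remain. Days that have run out of tweets are skipped."""
    pool = []
    # zip_longest walks rank positions in lock-step across all days,
    # padding exhausted days with None so the others keep contributing.
    for rank_slice in zip_longest(*ranked_by_day.values()):
        for tweet_id in rank_slice:
            if tweet_id is None:       # this day has no tweet at this rank
                continue
            pool.append(tweet_id)
            if len(pool) == depth:
                return pool
    return pool


# Hypothetical three-day run for one topic, pooled to depth 6:
ranked_by_day = {
    "20150720": ["t1", "t4", "t7"],
    "20150721": ["t2", "t5"],
    "20150722": ["t3", "t6", "t8", "t9"],
}
print(pool_task_b(ranked_by_day, depth=6))  # ['t1', 't2', 't3', 't4', 't5', 't6']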
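
And here is a sketch of how a consumer of qrels.txt might collapse the
propagated labels (-1, 3, 4) onto the assessor scale (0, 1, 2) while
remembering which judgments were propagated. It assumes the standard
four-column TREC qrels layout (topic, iteration, tweet ID, label); check
the file itself before relying on that assumption.

# label -> (graded relevance, was the judgment propagated from a cluster?)
LABEL_MAP = {
    0: (0, False), 1: (1, False), 2: (2, False),   # assessor-judged
    -1: (0, True), 3: (1, True), 4: (2, True),     # propagated via clusters
}


def read_qrels(path):
    """Return {(topic, tweet_id): (relevance, propagated)} from a qrels file."""
    judgments = {}
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            topic, _iteration, tweet_id, label = line.split()
            judgments[(topic, tweet_id)] = LABEL_MAP[int(label)]
    return judgments


# Usage: count judged-relevant tweets per topic, propagated judgments included.
if __name__ == "__main__":
    from collections import Counter
    relevant_per_topic = Counter(
        topic
        for (topic, _tweet), (rel, _prop) in read_qrels("qrels.txt").items()
        if rel > 0
    )
    print(relevant_per_topic.most_common(5))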