The Session Track released 133 query sessions for 69 topics (some topics had more than one session corresponding to them), with only the first 87 sessions, covering 49 topics, used for evaluation. NIST provided judgments for all 49 topics. All submitted runs were pooled to depth 10, and the pooled URLs were judged against the overall topic. Judging was conducted on a six-grade scale: spam (-2), not relevant (0), relevant (1), highly relevant (4), key (2), and navigational (3). When computing the different measures, these relevance judgments were mapped as follows: spam -> 0, not relevant -> 0, relevant -> 1, highly relevant -> 2, key -> 3, and navigational -> 4. Unlike Session 2011, and as in Session 2012, relevance was defined against the entire topic rather than against separate subtopics (due to the nature of this year's topics). Note that we did not apply any special treatment to duplicate documents, i.e. documents in the ranked lists for the current query that had been returned (and clicked by users) earlier in the session.

Based on the qrels provided by NIST, we evaluated runs using eight measures: (a) Average Precision (average_precision), (b) Expected Reciprocal Rank (err), as defined by Chapelle et al. (CIKM 2009), (c) ERR@10 (err_at_k), (d) nDCG (ndcg), (e) nDCG@10 (ndcg_at_k), (f) ERR normalised by the maximum ERR per query (nerr), (g) nERR@10 (nerr_at_k), and (h) Precision@10 (precision_at_k).
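To make the grade mapping and the gain-based measures concrete, the sketch below (a minimal Python illustration, not the official track evaluation code) applies the raw-label-to-gain mapping described above and computes ERR following the Chapelle et al. definition, together with nDCG@k and nERR@k for a single ranked list. The function names, and the use of an ideal ranking over all judged documents as the normaliser for nDCG and nERR, are illustrative assumptions.

    # Sketch only: maps raw qrel labels to the gains described above and computes
    # ERR@k, nDCG@k and nERR@k for one ranked list. Not the official evaluation script.
    from math import log2

    # raw qrel label -> mapped gain (per the mapping in the text; spam and
    # not relevant both map to 0)
    GRADE_MAP = {
        -2: 0,  # spam
         0: 0,  # not relevant
         1: 1,  # relevant
         4: 2,  # highly relevant
         2: 3,  # key
         3: 4,  # navigational
    }
    G_MAX = 4  # highest mapped gain (navigational)

    def err_at_k(gains, k=None):
        """Expected Reciprocal Rank (Chapelle et al., CIKM 2009) over mapped gains."""
        gains = gains if k is None else gains[:k]
        err, p_not_stopped = 0.0, 1.0
        for rank, g in enumerate(gains, start=1):
            r = (2 ** g - 1) / 2 ** G_MAX      # probability the user is satisfied here
            err += p_not_stopped * r / rank
            p_not_stopped *= 1 - r
        return err

    def dcg_at_k(gains, k=None):
        gains = gains if k is None else gains[:k]
        return sum((2 ** g - 1) / log2(rank + 1) for rank, g in enumerate(gains, start=1))

    def ndcg_at_k(ranked_gains, judged_gains, k=10):
        # assumption: normalise by an ideal ordering of all judged documents for the topic
        ideal = sorted(judged_gains, reverse=True)
        denom = dcg_at_k(ideal, k)
        return dcg_at_k(ranked_gains, k) / denom if denom > 0 else 0.0

    def nerr_at_k(ranked_gains, judged_gains, k=10):
        # nERR: ERR divided by the maximum achievable ERR for the query
        ideal = sorted(judged_gains, reverse=True)
        denom = err_at_k(ideal, k)
        return err_at_k(ranked_gains, k) / denom if denom > 0 else 0.0

    # Usage with hypothetical raw labels: a ranked list and the full judged pool for a topic.
    ranked = [GRADE_MAP[g] for g in (3, 1, -2, 4, 0, 2)]
    pool   = [GRADE_MAP[g] for g in (3, 4, 2, 2, 1, 1, 0, 0, -2)]
    print(err_at_k(ranked, 10), ndcg_at_k(ranked, pool, 10), nerr_at_k(ranked, pool, 10))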