----------------------------------------------
FedWeb 2014 Evaluation: Results Merging task
----------------------------------------------

Each submitted run was evaluated using the following measures:

(1) nDCG@20: nDCG@20 based on the duplicate-filtered qrels file (qrels-nodups, see below), normalized by the nDCG over all relevant results (i.e., from all resources).

(2) nDCG@100: analogous to (1), with cut-off 100.

(3) nDCG@20_withdup: analogous to nDCG@20, but based on the original qrels file, without penalizing duplicates.

(4) nDCG@20_local: again an nDCG@20 measure with duplicates removed, but this time normalized on the qrels file qrels-top20RS (see below), under the assumption that the only relevant results are those in the top 20 resources of the Resource Selection run that the submitted run was based on. A perfect ordering therefore yields nDCG@20_local = 1, which is not the case for measure (1), because relevant results may also occur outside the 20 selected resources. nDCG@20_local makes it possible to evaluate the merging behavior of systems in isolation, by comparing merging runs that are based on the same Resource Selection run.

(5) nDCG-IA@20: based on the duplicate-filtered qrels file (qrels-nodups) and, as for (1), on the relevance of results from all resources, obtained by weighting the nDCG@20 for each intent separately with the corresponding intent probability.

NOTES

Each runfile was first verified and filtered so that, for each query, it only contains results from the top 20 resources of the corresponding Resource Selection run.

By 'duplicate-filtered' we mean that, after the first occurrence of a particular result in the merged list for a query, all subsequent results that refer to the same web page as that first result are set to non-relevant in the filtered qrels file. (Note that the filtered qrels are therefore run-dependent.)

Manually checked duplicates are listed in the file 'merging-duplicates.txt'. Each line of this file contains a set of snippet IDs denoting a set of duplicates. The first column indicates how these duplicates were detected (0: the URL is the same; 1: the MD5 hashes of the retrieved documents match and the documents are not empty; 2: duplicate status manually verified for pages with similar size and simhash). A sketch of this filtering step is given below, before the references.

nDCG values can be calculated with trec_eval (measure 'ndcg_cut.20'). The nDCG-IA measure can be calculated by combining the individual vertical nDCG values with the intent probabilities. To this end, the files resources_fedweb2014.txt (containing information on vertical_ID and resource_ID), fedweb-50test.vr4rm (with the intent probabilities per topic), and calc-nDCG-IA.py are provided. The latter script calls trec_eval and merges the individual nDCG@k values into the final nDCG-IA (see the combination sketch below).

The vertical intent probabilities (in the file fedweb-50test.vr4rm) were calculated for each test topic as follows (sketched below):
a. the quality of each vertical was quantified by the highest graded precision among all resources within that vertical (based on their top-10 returned documents);
b. the vertical intent probability was then obtained by normalizing this vertical score across all verticals.

The original qrels file (merging-qrels.txt) contains the relevance weight per result, scaled to the range 0-1000, calculated from the UDM relevance level weights for the individual results [2]:
weights = {'Non': 0.0, 'Rel': 0.158, 'HRel': 0.546, 'Key': 1.0, 'Nav': 1.0}
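For concreteness, a minimal sketch of the run-dependent duplicate filtering described above follows. It assumes whitespace-separated TREC-style columns for the run and qrels files and the merging-duplicates.txt layout described in the notes; it is illustrative only and is not the filtering code used for the official evaluation.

    from collections import defaultdict

    def load_duplicate_groups(path):
        # Each line of merging-duplicates.txt: detection type (0/1/2) followed by
        # the snippet IDs forming one duplicate group.
        group_of = {}
        with open(path) as f:
            for gid, line in enumerate(f):
                fields = line.split()
                for snippet_id in fields[1:]:
                    group_of[snippet_id] = gid
        return group_of

    def filter_qrels(run_path, qrels_path, group_of):
        # qrels assumed as: topic  0  snippet_id  relevance
        qrels = defaultdict(dict)
        with open(qrels_path) as f:
            for line in f:
                topic, _, snippet_id, rel = line.split()
                qrels[topic][snippet_id] = rel
        # Walk each merged list in rank order; once a duplicate group has been
        # seen for a topic, later members of that group are set to non-relevant.
        seen = defaultdict(set)
        with open(run_path) as f:  # run assumed as: topic Q0 snippet_id rank score tag, sorted by rank
            for line in f:
                topic, _, snippet_id = line.split()[:3]
                gid = group_of.get(snippet_id)
                if gid is None:
                    continue
                if gid in seen[topic] and snippet_id in qrels[topic]:
                    qrels[topic][snippet_id] = '0'
                seen[topic].add(gid)
        return qrels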
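The combination step behind nDCG-IA@20 can be sketched as follows. This is not the provided calc-nDCG-IA.py: it assumes that per-vertical qrels/run slices are available as files, that trec_eval is on the PATH, and that the intent probabilities have already been loaded into a dict (parsing fedweb-50test.vr4rm is not shown here).

    import subprocess
    from collections import defaultdict

    def ndcg20_per_topic(qrels_path, run_path):
        # Per-topic nDCG@20 from trec_eval; '-q' prints one line per query.
        out = subprocess.run(
            ['trec_eval', '-q', '-m', 'ndcg_cut.20', qrels_path, run_path],
            capture_output=True, text=True, check=True).stdout
        scores = {}
        for line in out.splitlines():
            measure, topic, value = line.split()
            if measure == 'ndcg_cut_20' and topic != 'all':
                scores[topic] = float(value)
        return scores

    def ndcg_ia_20(vertical_runs, intent_probs):
        # nDCG-IA@20 per topic: sum over verticals of p(vertical|topic) * nDCG@20.
        # vertical_runs: {vertical_id: (qrels_path, run_path)} per-vertical slices
        # intent_probs:  {topic: {vertical_id: probability}}
        per_vertical = {v: ndcg20_per_topic(q, r) for v, (q, r) in vertical_runs.items()}
        ia = defaultdict(float)
        for topic, probs in intent_probs.items():
            for vertical, p in probs.items():
                ia[topic] += p * per_vertical.get(vertical, {}).get(topic, 0.0)
        return dict(ia)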
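Steps a and b above amount to a max-then-normalize computation; a minimal sketch, assuming the graded precisions of the top-10 results of each resource have already been computed and grouped per vertical (a hypothetical input structure):

    def intent_probabilities(graded_precision):
        # graded_precision: {vertical_id: [graded P@10 of each resource in that vertical]}
        # a. vertical quality = best graded precision among its resources
        quality = {v: max(vals) for v, vals in graded_precision.items()}
        # b. normalize across verticals so the scores form a probability distribution
        total = sum(quality.values())
        return {v: (q / total if total > 0 else 0.0) for v, q in quality.items()}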
REFERENCES

[1] https://sites.google.com/site/trecfedweb/
[2] T. Demeester, R. Aly, D. Hiemstra, D. Nguyen, D. Trieschnigg, and C. Develder. Exploiting User Disagreement for Web Search Evaluation: An Experimental Approach. In 7th ACM International Conference on Web Search and Data Mining (WSDM 2014), pages 33-42, 2014.