Combination of Multiple Searches

Edward A. Fox and Joseph A. Shaw
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061-0106

Abstract

The TREC-2 project at Virginia Tech focused on methods for combining the evidence from multiple retrieval runs to improve retrieval performance over any single retrieval method. This paper describes one such method that has been shown to increase performance by combining the similarity values from five different retrieval runs using both vector space and P-norm extended boolean retrieval methods.

1 Overview

The primary focus of our experiments at Virginia Tech involved methods of combining the results from various divergent search schemes and document collections. We performed both routing and ad-hoc retrieval experiments on the provided test collections. The results from both vector and P-norm type queries were considered in determining the probability of relevance for each document in an individual collection. The results for each collection were then merged to create a single final set of documents that would be presented to the user.

2 Index Creation

This section outlines the indexing done with the document collection provided by NIST. Each of the individual collections was indexed separately as document vector files; limitations in disk space prohibited the use of inverted files and the creation of a single combined document vector file. All processing was performed on a DECstation 5000/25 with 40 MB of RAM using the 1985 release of the SMART Information Retrieval System [2], with enhancements from previous experiments as well as a new modification for our TREC-2 experiments.

The index files were created from the source text via the following process. First, the source document text provided by NIST was passed through a preparser to convert the SGML-like format to the proper format for the 1985 version of SMART. The extraneous sections of the documents were filtered out at this point. The TEXT sections of the documents, as well as the various HEADLINE, TITLE, SUMMARY, and ABSTRACT sections of the collections, were indexed; all of the other sections were ignored. The subsections of the TEXT fields, where they existed, were considered as part of the TEXT field, with the subsection delimiters removed.

The resulting filtered text was tokenized, stop words were deleted using the standard 418-word stop list provided with SMART, and the remaining non-noise words were included in the term dictionary along with their occurrence frequencies. Each term in the dictionary has a unique identification number. A document vector file was created during indexing which contains, for each document, its unique ID and a vector of term IDs and term weights. The initially recorded weights can be changed based on one of several schemes after the indexing is complete. The various SMART weighting schemes referred to within this paper are summarized in Table 1. The dictionary size for each collection was approximately 16 MB, while the document vector files ranged from 31 MB to 124 MB (see Table 2).

Table 1: SMART weighting schemes used for TREC-2.

  SMART label   term weight
  ann           0.5 + 0.5 * tf / max_tf
  bnn           1
  mnn           tf / max_tf
  atn           (0.5 + tf / (2 * max_tf)) * log(num_docs / coll_freq)
  nnn           tf
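To make the Table 1 notation concrete, the sketch below shows how the ann and atn weights could be computed from a single document's raw term frequencies. It is an illustration rather than the SMART implementation; the function and variable names (ann_weights, atn_weights, coll_freq) are hypothetical, with max_tf taken as the largest term frequency in the document and coll_freq as the number of documents containing each term.

import math

def ann_weights(tf):
    # ann: augmented term frequency, 0.5 + 0.5 * tf / max_tf; uses no collection statistics.
    max_tf = max(tf.values())
    return {term: 0.5 + 0.5 * f / max_tf for term, f in tf.items()}

def atn_weights(tf, num_docs, coll_freq):
    # atn: augmented term frequency scaled by the idf factor log(num_docs / coll_freq).
    max_tf = max(tf.values())
    return {term: (0.5 + f / (2.0 * max_tf)) * math.log(num_docs / coll_freq[term])
            for term, f in tf.items()}

# Raw term frequencies for one document:
doc_tf = {"retrieval": 4, "boolean": 2, "vector": 1}
print(ann_weights(doc_tf))   # {'retrieval': 1.0, 'boolean': 0.75, 'vector': 0.625}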
Table 2: Collection statistics summary. Text, Dictionary, and Document Vector sizes in megabytes.

  Collection   Text   Dict.   Doc. Vectors   Total Docs.
  AP-1          266   16.0        120.2          84678
  DOE-1         190   15.9         97.9         226087
  FR-1          258   15.8         53.8          26207
  WSJ-1         295   16.2        124.8          98735
  ZIFF-1        251   15.7         88.4          75180
  D1           1260    N/A        485.1         510887
  AP-2          248   15.9        110.4          79923
  FR-2          211   15.6         42.7          20108
  WSJ-2         255   16.0        105.5          74520
  ZIFF-2        188   15.4         63.6          56920
  D2            902    N/A        322.2         231471
  D1 & D2      2162    N/A        807.3         742358
  AP-3          250   15.9        111.2          78325
  PATN-3        254   15.6         31.3           6711
  SJM-3         319   16.1        114.4          90257
  ZIFF-3        362   16.0        109.8         161021
  D3           1185    N/A        366.7         336314
  Total        3347    N/A       1174.0        1078672

3 Retrieval

3.1 Queries

All of the queries were created from the topic descriptions provided by NIST. Two types of queries were used: P-norm extended boolean queries and natural language vector queries. A single set of P-norm queries was created, but it was interpreted multiple times with different operator weights (P-values). Two different sets of vector queries were created from the topics, one containing information from fewer sections of a topic description. The Title, Description, and Concepts sections of the topic descriptions were used in the creation of all three query sets, and the Definitions section was also used in both sets of vector queries, while the P-norm query set and one of the vector query sets also contained information from the Narrative section of the topic descriptions. The vector query set that included the Narrative section of the topic is referred to as the long vector query set, for obvious reasons, while the other is referred to as the short vector query set.

The P-norm queries were written as complex boolean expressions using AND and OR operators. Phrases were simulated using AND operators, since the queries were intended only for soft-boolean evaluation. The query terms were not specifically weighted; uniform operator weights (P-values) of 1.0, 1.5, and 2.0 were used on different evaluations of the query set.
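As background for how those P-values enter the scoring, the sketch below evaluates P-norm AND and OR operators with uniform operator weights, following the standard extended boolean formulas of the P-norm model [3]. It is an illustration only (the query compiler actually used with SMART is not shown, and the names are hypothetical); document term weights are assumed to lie in [0, 1].

def pnorm_or(weights, p):
    # P-norm OR with uniform operator weights: rewards any high operand.
    n = len(weights)
    return (sum(w ** p for w in weights) / n) ** (1.0 / p)

def pnorm_and(weights, p):
    # P-norm AND with uniform operator weights: penalizes low operands.
    n = len(weights)
    return 1.0 - (sum((1.0 - w) ** p for w in weights) / n) ** (1.0 / p)

# (information AND retrieval) OR (boolean AND query) evaluated at P = 1.5,
# given ann-style document weights for the four query terms:
d = {"information": 0.9, "retrieval": 0.7, "boolean": 0.0, "query": 0.6}
clause1 = pnorm_and([d["information"], d["retrieval"]], 1.5)
clause2 = pnorm_and([d["boolean"], d["query"]], 1.5)
print(pnorm_or([clause1, clause2], 1.5))

As P grows, the operators approach strict boolean MIN/MAX behavior, while P = 1 reduces both to a simple average; the three P-values used here trade off that strictness against forgiveness.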
3.2 Individual Retrieval Runs

The first step in our TREC-2 experiments involved determining what weighting schemes would be most effective for P-norm queries. Our TREC-1 experiments with P-norm queries had obtained mixed results, performing poorly based on binary document term weights in our Phase I experiments, and performing well for a P-value of 1.0 but very poorly with larger P-values in our Phase II experiments using a tf-idf weighting scheme [4]. We performed several P-norm retrieval runs on the two AP and two WSJ training collections with topics 51 to 100 to determine the most effective term weighting scheme for P-norm queries with large test collections. The results from these experiments are shown in Table 3, using the standard TREC-2 average non-interpolated precision and the exact R-precision measures. The most effective weighting scheme turned out to be the SMART ann weighting scheme, which confirmed the result obtained originally by Fox for the much smaller classical document collections [3].

Table 3: Average Precision and Exact R-Precision for P-norm experiments on weighting with the AP and WSJ collections (Ad-hoc Topics 51-100).

                       Average Precision             R-Precision
  Coll.   P-value    ann      bnn      mnn         ann      bnn      mnn
  AP-1    1.0        0.2810   0.2419   0.1419      0.2688   0.2660   0.1689
          1.5        0.3122   0.2581   0.1444      0.2976   0.2732   0.1757
          2.0        0.3027   0.2510   0.1457      0.2968   0.2775   0.1707
  AP-2    1.0        0.3004   0.2672   0.1826      0.3165   0.2864   0.2046
          1.5        0.3332   0.2999   0.1831      0.3412   0.3118   0.2161
          2.0        0.3300   0.2922   0.1847      0.3339   0.3057   0.2284
  WSJ-1   1.0        0.2941   0.2485   0.1742      0.3221   0.2830   0.2181
          1.5        0.3199   0.2753   0.1774      0.3443   0.2994   0.2225
          2.0        0.3217   0.2752   0.1776      0.3470   0.3013   0.2277
  WSJ-2   1.0        0.2206   0.1881   0.1356      0.2367   0.2094   0.1722
          1.5        0.2327   0.2013   0.1174      0.2511   0.2234   0.1549
          2.0        0.2325   0.1970   0.1098      0.2442   0.2158   0.1445

The two sets of vector queries were evaluated using the standard cosine correlation similarity method as implemented by SMART. The same SMART ann weighting scheme used for the P-norm queries was used on the vector queries for several reasons. First, a weighting scheme that did not use any collection statistics was needed for the routing experiments. Second, the methods used in combining runs, described in the next section, required a similar range of possible similarity values produced by each run. Finally, the necessity of merging results from each collection into a single set of results was simplified, since the resulting similarity values were not based on collection statistics, which would have differed for each collection. The P-norm queries were evaluated using three different P-values, again using the SMART ann weighting scheme based on the specific P-norm experiments described above. The five individual runs are summarized in Table 4.

Table 4: Summary of the five individual runs.

  Title    Query Type     Similarity Measure
  SV       Short vector   Cosine similarity
  LV       Long vector    Cosine similarity
  Pn1.0    P-norm         P-norm, P = 1.0
  Pn1.5    P-norm         P-norm, P = 1.5
  Pn2.0    P-norm         P-norm, P = 2.0

The five individual runs were performed and evaluated for each of the nine training collections on topics 51 to 100. The results for these experiments are given in Table 5. In general, the P-norm queries performed better than the vector queries. The most effective P-value, however, differed between the collections: the AP runs performed better with a P-value of 1.5, while a P-value of 2.0 performed better for the WSJ collections.

Table 5: Average Precision and Exact R-Precision for the five individual runs (Ad-hoc Topics 51-100).

  Average non-interpolated Precision
                             Disk 1                                 Disk 2                   Both
  Run        AP      DOE     FR      WSJ     ZF        AP      FR      WSJ     ZF        Disks
  SV         0.2387  0.0605  0.0222  0.2203  0.1026    0.2543  0.0330  0.1503  0.0770    0.1418
  LV         0.2435  0.0586  0.0302  0.2414  0.0864    0.2664  0.0324  0.1633  0.0753    0.1555
  Pn1.0      0.2605  0.0658  0.0611  0.2941  0.1110    0.3004  0.0879  0.2206  0.1003    0.1988
  Pn1.5      0.2939  0.0771  0.0639  0.3199  0.1278    0.3332  0.0878  0.2327  0.1065    0.2242
  Pn2.0      0.2849  0.0847  0.0706  0.3217  0.1278    0.3300  0.0865  0.2325  0.1136    0.2250
  CombSUM    0.3493  0.1001  0.0741  0.3605  0.1475    0.3748  0.0842  0.2752  0.1273    0.2620
  Chg/Max    18.84%  18.18%  4.95%   12.06%  15.41%    12.48%  -4.20%  18.26%  12.05%    16.44%

  Exact R-Precision
                             Disk 1                                 Disk 2                   Both
  Run        AP      DOE     FR      WSJ     ZF        AP      FR      WSJ     ZF        Disks
  SV         0.2624  0.0564  0.0183  0.2616  0.1180    0.2649  0.0202  0.1744  0.0922    0.2169
  LV         0.2672  0.0493  0.0274  0.2800  0.0802    0.2704  0.0176  0.1860  0.0843    0.2311
  Pn1.0      0.2688  0.0661  0.0533  0.3221  0.1123    0.3165  0.0971  0.2367  0.0969    0.2708
  Pn1.5      0.2976  0.0762  0.0572  0.3443  0.1218    0.3412  0.1016  0.2511  0.1068    0.2962
  Pn2.0      0.2968  0.0765  0.0654  0.3470  0.1254    0.3339  0.0820  0.2442  0.1158    0.3008
  CombSUM    0.3590  0.0950  0.0619  0.3767  0.1357    0.3732  0.0887  0.2851  0.1216    0.3292
  Chg/Max    20.63%  24.18%  -5.35%  8.55%   8.21%     9.37%   -12.69% 13.54%  5.00%     9.44%
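The SV and LV similarity values in Table 5 are cosine correlations between ann-weighted query and document vectors; a minimal sketch of that scoring step follows (illustrative code only, assuming sparse vectors held as {term_id: weight} dictionaries rather than SMART's vector files).

import math

def cosine(q, d):
    # Cosine correlation between two sparse term-weight vectors.
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0

def rank(query, docvecs):
    # Score every document vector against the query and sort by similarity.
    return sorted(((cosine(query, d), doc_id) for doc_id, d in docvecs.items()),
                  reverse=True)

Because ann weights use no collection statistics, a document receives the same weights regardless of which collection it belongs to, which is what later makes the per-collection similarity values directly comparable.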
3.3 Combination Retrieval Runs

Our experiments in TREC-1 involved combining the results from several different retrieval runs for a given collection by either simply taking the top N documents retrieved for each run, or modifying the value of N for each run based on the eleven-point average precision for that run. We felt these efforts suffered from considering only the rank of a retrieved document and not the actual similarity value itself. In TREC-2, our experiments concentrated on methods of combining runs based on the similarity values of a document to each query for each of the runs. Additionally, combining the similarities at retrieval time had the advantage of extra evidence over combining separate results files, since the similarity of every document for each run was available instead of just the similarities for the top 1000 documents for each run. While our results for four of the training collections indicated that the P-norm queries performed better than the vector queries, this result was likely specific to the actual queries involved and not necessarily true in general. This led to a decision to weight each of the separate runs equally and not favor any individual run or method. In general, it may be desirable or necessary to weight a single run more, or less, depending on its overall performance; this could be especially useful in a routing situation.

For any given information retrieval ranking method, there are two primary types of errors that can occur: assigning a relatively high rank to a non-relevant document, and assigning a relatively low rank to a relevant document. It has been shown that different retrieval paradigms will perform differently on the same set of data, often with little overlap in the set of retrieved documents [5]. For instance, when one retrieval method assigns a high rank to a non-relevant document, a different retrieval method is likely to assign that document a much lower rank. Similarly, when one retrieval method fails to assign a high rank to a relevant document, a different retrieval method is likely to assign that document a high rank. This characteristic of information retrieval methods indicates that some method of considering both retrieval methods together should help to decrease the probability of this happening; of course, it is also possible for both methods to highly rank a non-relevant document or to poorly rank a relevant document.

Six methods of combining the similarity values were tested in our TREC-2 experiments; they are summarized in Table 6.

Table 6: Formulas for combining similarity values.

  Name       Combined Similarity =
  CombMAX    MAX(individual similarities)
  CombMIN    MIN(individual similarities)
  CombSUM    SUM(individual similarities)
  CombANZ    SUM(individual similarities) / number of nonzero similarities
  CombMNZ    SUM(individual similarities) * number of nonzero similarities
  CombMED    MED(individual similarities)

The rationale behind the CombMIN combination method was to minimize the probability that a non-relevant document would be highly ranked, while the purpose of the CombMAX combination method was to minimize the number of relevant documents being poorly ranked. There is an inherent flaw with both of these methods; namely, they are specialized to handle specific problems without regard to their effect on the other retrieved documents: for example, the CombMIN combination method will promote the type of error that the CombMAX method is designed to minimize, and vice versa. The CombMED combination method is a simplistic approach to handling this, using the median similarity value to avoid both scenarios. What is clearly needed is some method of considering the documents' relative ranks, or similarity values, instead of simply attempting to select a single similarity value from a set of runs. To this end, we tried three other methods of combining retrieval methods: CombSUM, the summation of the set of similarity values, or, equivalently, the numerical mean of the set of similarity values; CombANZ, the average of the non-zero similarity values, which ignores the effect of a single run or query failing to retrieve a relevant document; and CombMNZ, which provides higher weights to documents retrieved by multiple retrieval methods. Clearly, there are more possibilities to consider; the advantages of those chosen are simplicity, in terms of both execution efficiency and implementation, and generality, in terms of not being specific to a given method or retrieval run.
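The six Table 6 rules can be written directly as a small function over one document's similarity values from the individual runs; the sketch below is illustrative only, with zeros standing in for runs that assign a document no similarity and Python's statistics.median playing the role of MED.

from statistics import median

def combine(sims, method):
    # sims: similarity of one document to one query under each individual run,
    # with 0.0 for any run that assigns it no similarity.
    nonzero = [s for s in sims if s > 0.0]
    if method == "CombMAX":
        return max(sims)
    if method == "CombMIN":
        return min(sims)
    if method == "CombSUM":
        return sum(sims)
    if method == "CombANZ":
        return sum(sims) / len(nonzero) if nonzero else 0.0
    if method == "CombMNZ":
        return sum(sims) * len(nonzero)
    if method == "CombMED":
        return median(sims)
    raise ValueError(method)

# Hypothetical similarities of one document under SV, LV, Pn1.0, Pn1.5, Pn2.0:
sims = [0.31, 0.28, 0.40, 0.37, 0.35]
print({m: round(combine(sims, m), 3)
       for m in ("CombMAX", "CombMIN", "CombSUM", "CombANZ", "CombMNZ", "CombMED")})

Note that when every run retrieves the document, CombSUM, CombANZ, and CombMNZ differ only by a constant factor and therefore induce the same ranking, which is the effect noted below for documents retrieved by all five runs.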
These six methods were evaluated against the AP and WSJ test collections for topics 51 through 100, combining the similarity values of each of the five individual runs specified above. The results are shown in Table 7, below the results of each of the corresponding individual runs from Table 5. Note that while the CombMAX runs performed well compared with most of the individual runs, they did not do as well as the best of the individual runs in most cases. The CombMIN runs performed similarly for the AP collection, but performed worse than every individual run for the WSJ collection. The CombANZ runs and the CombMNZ runs both performed better than the best of the individual runs, with the CombMNZ runs performing only slightly better than the CombANZ runs for three of the four collections, and performing basically the same for the fourth. The primary reason for the similar performance of the two runs is that the two methods produce the same ranked sequence for all the documents retrieved by all five individual runs.

Table 7: Comparison of combination runs and the five individual runs (Ad-hoc Topics 51-100).

                     Average Precision                    R-Precision
  Run        AP-1    WSJ-1   AP-2    WSJ-2      AP-1    WSJ-1   AP-2    WSJ-2
  SV         0.2387  0.2203  0.2543  0.1503     0.2624  0.2616  0.2649  0.1744
  LV         0.2435  0.2414  0.2664  0.1633     0.2672  0.2800  0.2704  0.1860
  Pn1.0      0.2810  0.2941  0.3004  0.2206     0.2688  0.3221  0.3165  0.2367
  Pn1.5      0.3122  0.3199  0.3332  0.2327     0.2976  0.3443  0.3412  0.2511
  Pn2.0      0.3027  0.3217  0.3300  0.2325     0.2968  0.3470  0.3339  0.2442
  CombMAX    0.2856  0.3205  0.3337  0.2343     0.3013  0.3484  0.3431  0.2449
  CombMIN    0.2863  0.1924  0.3047  0.1308     0.3036  0.2214  0.2980  0.1395
  CombSUM    0.3493  0.3605  0.3748  0.2752     0.3590  0.3767  0.3732  0.2851
  CombANZ    0.3493  0.3367  0.3748  0.2465     0.3590  0.3517  0.3732  0.2590
  CombMNZ    0.3059  0.3368  0.3516  0.2467     0.3175  0.3517  0.3578  0.2590
  CombMED    0.2943  0.3204  0.3335  0.2328     0.2977  0.3444  0.3414  0.2518

The CombSUM retrieval run was performed for each of the nine collections on the two training CD-ROMs. The results are shown in Table 5. Breaking this analysis down to a per-topic basis in Table 11, it can be seen that the CombSUM method performs significantly better than the best single individual run, Pn2.0; a two-tailed paired t test on the CombSUM and Pn2.0 average precisions results in a p value of 3.1e-05, which indicates these results are conclusive. However, comparing the CombSUM results with the best individual run on a per-query basis results in a p value of approximately 0.16, indicating that there is a 16 percent chance that the CombSUM method is no better than the best individual run for any specific query. Performing the same calculation on the R-Precision values yields similar significance findings.
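The significance test used above is a standard two-tailed paired t test over per-topic scores, and can be reproduced as sketched below with scipy.stats.ttest_rel; the per-topic lists shown here are short placeholders, not the actual 50 values underlying Table 11.

from scipy.stats import ttest_rel

# Placeholder per-topic average precision values for the best single run and CombSUM;
# in the paper these would be the 50 per-topic scores of Table 11.
pn20    = [0.30, 0.12, 0.44, 0.07, 0.51, 0.22]
combsum = [0.35, 0.15, 0.47, 0.06, 0.55, 0.26]

t_stat, p_value = ttest_rel(combsum, pn20)   # two-tailed by default
print(t_stat, p_value)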
While combining all five runs produced an overall improvement in retrieval effectiveness over each of the runs, the same does not always hold true when combining only two or three runs. Each of the ten CombSUM combinations of two runs was performed for the AP and WSJ test collections, as was a run combining all three of the P-norm runs. The results of these are given in Table 8. Most of the combinations of two runs performed worse than the better of the two runs while performing better than the poorer of the two runs. One notable exception to this is the combination of the two vector runs, which performed noticeably poorer than either of the two runs.

Table 8: Average Precision and Exact R-Precision for CombSUM runs combining two or three individual runs (Ad-hoc Topics 51-100).

                                   Average Precision                    R-Precision
  Run                      AP-1    WSJ-1   AP-2    WSJ-2      AP-1    WSJ-1   AP-2    WSJ-2
  SV                       0.2387  0.2203  0.2543  0.1503     0.2624  0.2616  0.2649  0.1744
  LV                       0.2435  0.2414  0.2664  0.1633     0.2672  0.2800  0.2704  0.1860
  Pn1.0                    0.2810  0.2941  0.3004  0.2206     0.2688  0.3221  0.3165  0.2367
  Pn1.5                    0.3122  0.3199  0.3332  0.2327     0.2976  0.3443  0.3412  0.2511
  Pn2.0                    0.3027  0.3217  0.3300  0.2325     0.2968  0.3470  0.3339  0.2442
  SV and LV                0.1457  0.1657  0.1611  0.1100     0.1492  0.1887  0.1524  0.1124
  SV and Pn1.0             0.2774  0.3111  0.3257  0.2362     0.2874  0.3332  0.3360  0.2554
  SV and Pn1.5             0.3117  0.3389  0.3614  0.2463     0.3198  0.3575  0.3675  0.2637
  SV and Pn2.0             0.3012  0.3395  0.3584  0.2467     0.3153  0.3580  0.3636  0.2551
  LV and Pn1.0             0.2744  0.3136  0.3197  0.2353     0.2867  0.3357  0.3269  0.2518
  LV and Pn1.5             0.3057  0.3413  0.3536  0.2442     0.3181  0.3596  0.3624  0.2568
  LV and Pn2.0             0.2950  0.3408  0.3518  0.2458     0.3141  0.3608  0.3596  0.2615
  Pn1.0 and Pn1.5          0.2817  0.3109  0.3243  0.2324     0.2898  0.3412  0.3395  0.2476
  Pn1.0 and Pn2.0          0.2935  0.3191  0.3330  0.2367     0.2944  0.3426  0.3397  0.2507
  Pn1.5 and Pn2.0          0.2928  0.3233  0.3336  0.2328     0.2935  0.3478  0.3351  0.2489
  Pn1.0, Pn1.5 and Pn2.0   0.2943  0.3192  0.3345  0.2339     0.2953  0.3421  0.3386  0.2497

3.4 Collection Merging

The retrieval results for each of the collections were combined by simply merging the results based solely on the combined similarity values. Since the retrieval runs were based on term weights without collection statistics such as inverse document frequency, the similarity values were directly comparable across collections. The result of merging the CombSUM results by summed similarity value for both disks is shown in the last column of Table 5.
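Because the combined scores carry no collection-dependent statistics, the merge step reduces to a single sort over all per-collection result lists; a minimal sketch, with illustrative structures and an assumed cutoff of 1000 documents per topic:

def merge_collections(results_per_collection, cutoff=1000):
    # results_per_collection: {collection_name: [(doc_id, combined_similarity), ...]}
    merged = [(sim, coll, doc_id)
              for coll, docs in results_per_collection.items()
              for doc_id, sim in docs]
    merged.sort(reverse=True)   # scores are directly comparable across collections
    return [(coll, doc_id, sim) for sim, coll, doc_id in merged[:cutoff]]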
4 TREC-2 Results

The procedure described above was used for both our official TREC-2 routing and ad-hoc results. The exact queries for ad-hoc topics 51 to 100 used for testing our method above were used as the routing queries against the new collections on disk 3. The results obtained from performing the CombSUM retrieval runs for each of the four collections, as well as the merged results, are shown in Table 9. The two CombSUM entries in the last column of the table are the official TREC-2 results. Since we concentrated on the ad-hoc evaluations, these routing results are included primarily for the benefit of other groups, for purposes of comparison. The ad-hoc queries for topics 101 to 150 were evaluated in the same manner, and are reported in Table 10. Again, the official results are the two CombSUM entries in the last column of the table.

As can be seen from Table 12, the CombSUM method performs quite poorly for certain topics while performing very well for others, compared to the best single run's results for that topic. Comparing the CombSUM results to the single best individual run (Pn2.0) shows an improvement for 46 out of the 50 topics, which indicates that the CombSUM run performs much better than any single individual run. Performing a two-tailed paired t test on the Pn2.0 and CombSUM precisions results in a p value of about 1.1e-11, which indicates these results are very conclusive. However, comparing the CombSUM results with the best individual runs on a per-query basis results in a p value of about 0.2, indicating that there is a 20 percent chance that the CombSUM method is no better than the best individual run for each specific query. Again, performing the same calculation on the R-Precision values yields similar results.

4.1 The CEO Model

The Combination of Expert Opinion (CEO) model [6, 7] of Thompson can be used to treat the different retrieval methods as experts, and allows combining their weighted probability distributions to improve performance. This could be used in a variety of ways to combine results from a variety of runs and indexing schemes (which could include stemming and/or morphological analysis). For TREC-2, the CEO experiments completed consisted of combining seven individual runs: the three P-norm extended boolean retrieval runs described above, and retrieval runs based on the long vector queries using both cosine correlation and inner product similarity measures for two SMART system term weighting schemes. Further discussion of this process and its results is given elsewhere in these proceedings.

4.2 Evaluation

Improving retrieval effectiveness by combining the evidence from multiple sources has been attempted before in various incarnations, most recently by Belkin et al. [1], who evaluated the progressive effect of considering multiple soft boolean representations to improve on a base INQUERY natural language retrieval run. In their experiments, the base INQUERY natural language run performed better than any of the boolean representations, and they report that combining the results from the natural language representation and the combined boolean representations with equal weights performed worse than the best single run. Not until weighting the natural language run four times more than the combined boolean schemes did they experience improved retrieval performance when combining different query methods.

This differs from our results in several ways. Most importantly, the stage at which we combine the different methods differed: Belkin et al. combined the query representations before performing the actual retrieval, while we combined the similarity values produced from retrieval on each method individually. The difference between the two methodologies can best be demonstrated using the standard vector space model: Belkin et al. combined by summing the vector representations of each query, while our method is analogous to summing the cosines of the angles between each query vector and a document. It is easily shown that the cosine of the angle between a document vector and a combined query vector (that is, the sum of two query vectors, as in the Belkin et al. approach) is not equal to the sum of the cosines between a document vector and the two separate query vectors.
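A two-dimensional check of that last claim, using hypothetical vectors: with the document aligned with one query vector and orthogonal to the other, the cosine against the summed query vector is about 0.707 while the sum of the separate cosines is 1.

import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

d, q1, q2 = (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)
q_sum = tuple(a + b for a, b in zip(q1, q2))
print(cos(d, q_sum))             # 0.7071... : similarity to the summed query vector
print(cos(d, q1) + cos(d, q2))   # 1.0       : sum of the separate similarities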
Other differences between the two methodologies include the fact that our P-norm queries performed better on average than our natural language vector queries, with exceptions on a per-query basis. We used only one P-norm query and modified the operator weights, while Belkin et al. used five different boolean queries. Finally, combining five runs with equal weights actually improved performance over each individual run. However, one common trend emerges from both experiments: the more query representations considered, the better the results.

4.3 Future Exploration

Planned future work includes studying the following:

* Individually weighting various methods' similarity values when performing combination runs.

* Normalization methods to allow combination of runs made with different weighting schemes.

* Extending the analysis to all combinations of three and four retrieval runs.

* Considering more/different query types.

5 Acknowledgements

This research was supported in part by DARPA and by PRC Inc. We also thank Russell Modlin, M. Prabhakar Koushik, and Durgesh Rao for their collaboration during TREC-1.

References

[1] Belkin, N.J., Cool, C., Croft, W.B., Callan, J.P. (1993, June). The Effect of Multiple Query Representations on Information Retrieval Performance. Proc. 15th Int'l Conf. on R&D in IR (SIGIR '93), Pittsburgh, 339-346.

[2] Buckley, C. (1985, May). Implementation of the SMART information retrieval system. Technical Report 85-686, Cornell University, Department of Computer Science.

[3] Fox, E.A. (1983, August). Extending the Boolean and Vector Space Models of Information Retrieval with P-Norm Queries and Multiple Concept Types. Cornell University Department of Computer Science dissertation.

[4] Fox, E.A., Koushik, M.P., Shaw, J., Modlin, R., Rao, D. (1993). Combining Evidence from Multiple Searches. In The First Text REtrieval Conference (TREC-1), D.K. Harman (Ed.), National Institute of Standards and Technology Special Publication 500-207, Gaithersburg, MD.

[5] Katzer, J., McGill, M.J., Tessier, J.A., Frakes, W., Dasgupta, P. (1982). A Study of the Overlap among Document Representations. Information Technology: Research and Development, 1(2):261-274.

[6] Thompson, P. (1990). A Combination of Expert Opinion approach to probabilistic information retrieval, Part 1: The conceptual model. Information Processing & Management, 26(3):371-382.

[7] Thompson, P. (1990). A Combination of Expert Opinion approach to probabilistic information retrieval, Part 2: Mathematical treatment of CEO Model 3. Information Processing & Management, 26(3):383-394.
Table 9: Average Precision and Exact R-Precision for the five individual runs compared with the combined CombSUM runs (Routing Topics 51-100).

  Average non-interpolated Precision
  Run        AP-3    PATN-3  SJM-3   ZF-3       Disk 3
  SV         0.1347  0.0189  0.1139  0.0593     0.0589
  LV         0.1189  0.0156  0.1056  0.0587     0.0494
  Pn1.0      0.2519  0.0257  0.2128  0.1141     0.2039
  Pn1.5      0.2869  0.0239  0.2411  0.1189     0.2279
  Pn2.0      0.2852  0.0221  0.2390  0.1303     0.2225
  CombSUM    0.3196  0.0260  0.2696  0.1304     0.2681
  Chg/Max    11.4%   1.2%    11.8%   0.07%      17.6%

  Exact R-Precision
  Run        AP-3    PATN-3  SJM-3   ZF-3       Disk 3
  SV         0.1703  0.0171  0.1337  0.0595     0.0595
  LV         0.1444  0.0156  0.1098  0.0547     0.1002
  Pn1.0      0.2790  0.0325  0.2185  0.1224     0.2594
  Pn1.5      0.3082  0.0322  0.2579  0.1248     0.2786
  Pn2.0      0.3062  0.0310  0.2531  0.1462     0.2809
  CombSUM    0.3319  0.0319  0.2900  0.1260     0.3143
  Chg/Max    7.7%    -1.8%   12.4%   -13.8%     11.9%

Table 10: Average Precision and Exact R-Precision for the five individual runs (Ad-hoc Topics 101-150).

  Average non-interpolated Precision
                             Disk 1                                 Disk 2                   Both
  Run        AP      DOE     FR      WSJ     ZF        AP      FR      WSJ     ZF        Disks
  SV         0.3237  0.0949  0.0630  0.2740  0.0936    0.3068  0.0650  0.2259  0.1166    0.2035
  LV         0.3326  0.0697  0.1018  0.2848  0.0997    0.2981  0.0602  0.2483  0.1045    0.2159
  Pn1.0      0.3340  0.0831  0.1777  0.3153  0.1292    0.3133  0.1927  0.2838  0.1722    0.2205
  Pn1.5      0.3682  0.0814  0.1874  0.3332  0.1430    0.3438  0.1982  0.2941  0.1964    0.2543
  Pn2.0      0.3647  0.0750  0.1761  0.3290  0.1307    0.3419  0.1995  0.2828  0.2018    0.2573
  CombSUM    0.4153  0.1038  0.2133  0.3778  0.1657    0.3959  0.2000  0.3561  0.2200    0.3206
  Chg/Max    12.8%   9.4%    13.8%   13.4%   15.9%     15.1%   0.2%    21.0%   9%        24.6%

  Exact R-Precision
                             Disk 1                                 Disk 2                   Both
  Run        AP      DOE     FR      WSJ     ZF        AP      FR      WSJ     ZF        Disks
  SV         0.3351  0.0947  0.0567  0.2881  0.1086    0.3052  0.0718  0.2502  0.1130    0.2649
  LV         0.3385  0.0714  0.1006  0.3022  0.1055    0.3019  0.0540  0.2712  0.1036    0.2661
  Pn1.0      0.3495  0.0989  0.1434  0.3322  0.1190    0.3139  0.1653  0.2934  0.1577    0.2867
  Pn1.5      0.3721  0.0899  0.1582  0.3440  0.1277    0.3465  0.1716  0.3044  0.1796    0.3203
  Pn2.0      0.3735  0.0925  0.1579  0.3389  0.1154    0.3401  0.1640  0.2953  0.1979    0.3233
  CombSUM    0.4137  0.1156  0.1938  0.3757  0.1389    0.3712  0.1741  0.3385  0.1909    0.3711
  Chg/Max    10.8%   16.9%   22.5%   9.2%    8.8%      7.1%    1.4%    11.2%   -3.5%     14.8%
Table 11: Per-Topic comparison of Average Precisions (Ad-hoc Topics 51-100).

  Topic  #Rel   SV      LV      Pn1.0   Pn1.5   Pn2.0   CombSUM   Chg/Max   Chg/Pn2.0
  74     273    0.0000  0.0099  0.0001  0.0002  0.0002  0.0002    -97.98%   0.00%
  90     206    0.1248  0.1134  0.0082  0.0036  0.0015  0.0089    -92.87%   493.33%
  77     138    0.1100  0.2241  0.0093  0.0196  0.0246  0.0411    -81.66%   67.07%
  91     40     0.1648  0.1039  0.0012  0.0038  0.0068  0.0308    -81.31%   352.94%
  72     119    0.1099  0.1173  0.0076  0.0116  0.0138  0.0323    -72.46%   134.06%
  73     183    0.0231  0.0124  0.0017  0.0052  0.0120  0.0098    -57.58%   -18.33%
  94     310    0.0176  0.0185  0.0191  0.1094  0.2100  0.1043    -50.33%   -50.33%
  85     894    0.1587  0.1779  0.0239  0.0433  0.0518  0.1018    -42.78%   96.53%
  67     92     0.0011  0.0012  0.0435  0.0487  0.0502  0.0290    -42.23%   -42.23%
  55     810    0.2362  0.4535  0.1153  0.1266  0.1256  0.2874    -36.63%   128.82%
  64     375    0.1493  0.1641  0.0740  0.0673  0.0706  0.1097    -33.15%   55.38%
  57     461    0.0282  0.0605  0.5464  0.4518  0.3447  0.3786    -30.71%   9.83%
  80     374    0.0218  0.0266  0.1007  0.1125  0.1025  0.0899    -20.09%   -12.29%
  82     602    0.2544  0.3139  0.2603  0.1746  0.1278  0.2521    -19.69%   97.26%
  97     352    0.0513  0.0513  0.1133  0.1376  0.1405  0.1190    -15.30%   -15.30%
  76     294    0.1481  0.1677  0.1221  0.0966  0.0755  0.1440    -14.13%   90.73%
  89     174    0.0365  0.0762  0.0481  0.1159  0.1515  0.1306    -13.80%   -13.80%
  98     666    0.0822  0.0677  0.0421  0.0579  0.0703  0.0711    -13.50%   1.14%
  84     396    0.1659  0.0985  0.0787  0.1251  0.1559  0.1471    -11.33%   -5.64%
  99     288    0.4834  0.4796  0.3529  0.3551  0.3224  0.4480    -7.32%    38.96%
  58     159    0.0971  0.0620  0.3459  0.4198  0.4072  0.3937    -6.22%    -3.32%
  92     88     0.0253  0.0405  0.0130  0.0154  0.0176  0.0381    -5.93%    116.48%
  71     380    0.0595  0.0613  0.1396  0.1978  0.2210  0.2156    -2.44%    -2.44%
  88     165    0.1813  0.2341  0.3629  0.3635  0.3542  0.3550    -2.34%    0.23%
  59     579    0.2368  0.2272  0.1546  0.3827  0.4370  0.4282    -2.01%    -2.01%
  93     171    0.0344  0.0385  0.5233  0.4831  0.3779  0.5141    -1.76%    36.04%
  95     263    0.0149  0.0228  0.1450  0.2072  0.2392  0.2375    -0.71%    -0.71%
  52     535    0.4918  0.4462  0.4896  0.6623  0.6882  0.6864    -0.26%    -0.26%
  78     162    0.2202  0.2632  0.7696  0.7484  0.7250  0.7705    0.12%     6.28%
  61     206    0.4106  0.5130  0.3646  0.3786  0.3621  0.5167    0.72%     42.70%
  51     138    0.2799  0.3957  0.4751  0.5282  0.5428  0.5476    0.88%     0.88%
  70     55     0.1156  0.1198  0.7700  0.7919  0.7982  0.8080    1.23%     1.23%
  54     171    0.3063  0.3045  0.3950  0.3207  0.2673  0.4048    2.48%     51.44%
  62     298    0.1674  0.1907  0.1519  0.2087  0.2121  0.2198    3.63%     3.63%
  81     62     0.1759  0.1410  0.2287  0.2370  0.2321  0.2467    4.09%     6.29%
  68     195    0.0960  0.0920  0.1040  0.2098  0.2540  0.2651    4.37%     4.37%
  69     52     0.0956  0.1629  0.5382  0.5873  0.5833  0.6227    6.03%     6.75%
  53     571    0.1821  0.1874  0.1543  0.2928  0.3241  0.3461    6.79%     6.79%
  83     633    0.2673  0.2931  0.1753  0.2412  0.2666  0.3317    13.17%    24.42%
  56     878    0.3011  0.3277  0.3391  0.3089  0.2691  0.3955    16.63%    46.97%
  100    317    0.1904  0.1972  0.1423  0.1815  0.2094  0.2516    20.15%    20.15%
  65     386    0.0100  0.0078  0.1111  0.1190  0.1236  0.1529    23.71%    23.71%
  86     213    0.5146  0.4242  0.5216  0.5234  0.4891  0.6624    26.56%    35.43%
  60     60     0.0547  0.0887  0.0866  0.0960  0.0992  0.1259    26.92%    26.92%
  79     232    0.1376  0.1220  0.2057  0.2784  0.2719  0.3690    32.54%    35.71%
  75     365    0.0016  0.0037  0.0443  0.0499  0.0481  0.0684    37.07%    42.20%
  63     208    0.0032  0.0028  0.0631  0.0597  0.0688  0.0972    41.28%    41.28%
  66     197    0.0001  0.0000  0.0320  0.0388  0.0478  0.0695    45.40%    45.40%
  87     188    0.0436  0.0559  0.0410  0.0914  0.1256  0.1867    48.65%    48.65%
  96     693    0.0095  0.0084  0.0833  0.1177  0.1302  0.2356    80.95%    80.95%
  Avg    15667  0.1418  0.1555  0.1988  0.2242  0.2250  0.2620    16.44%    16.44%
Table 12: Per-Topic comparison of Average Precisions (Ad-hoc Topics 101-150).

  Topic  #Rel   SV      LV      Pn1.0   Pn1.5   Pn2.0   CombSUM   Chg/Max   Chg/Pn2.0
  103    94     0.2400  0.2379  0.0130  0.0225  0.0309  0.0532    -77.83%   72.17%
  122    114    0.1467  0.1059  0.0547  0.0490  0.0544  0.0898    -38.79%   65.07%
  101    57     0.2232  0.1932  0.0900  0.1088  0.1175  0.1482    -33.60%   26.13%
  140    25     0.0150  0.0520  0.0111  0.0207  0.0315  0.0356    -31.54%   13.02%
  135    400    0.4072  0.4449  0.1103  0.1121  0.0835  0.3122    -29.83%   273.89%
  104    75     0.1867  0.2214  0.0715  0.0642  0.0828  0.1671    -24.53%   101.81%
  121    55     0.0017  0.0028  0.0283  0.0542  0.0628  0.0475    -24.36%   -24.36%
  127    223    0.2411  0.2476  0.0543  0.0942  0.1028  0.1911    -22.82%   85.89%
  143    397    0.4069  0.3978  0.1628  0.1968  0.2022  0.3257    -19.96%   61.08%
  124    173    0.0201  0.0165  0.1711  0.1360  0.1068  0.1501    -12.27%   40.54%
  146    358    0.6795  0.7017  0.4394  0.4775  0.4907  0.6255    -10.86%   27.47%
  114    138    0.0518  0.0756  0.2381  0.2559  0.2615  0.2525    -3.44%    -3.44%
  112    291    0.0041  0.0188  0.3876  0.3515  0.3194  0.3779    -2.50%    18.32%
  130    286    0.2301  0.2937  0.4184  0.5477  0.5749  0.5632    -2.04%    -2.04%
  102    64     0.2127  0.2308  0.1119  0.1373  0.1505  0.2287    -0.91%    51.96%
  128    381    0.0915  0.0751  0.2583  0.3710  0.4190  0.4156    -0.81%    -0.81%
  150    458    0.4618  0.4994  0.3362  0.4432  0.4566  0.4985    -0.18%    9.18%
  115    165    0.4053  0.4383  0.3262  0.3633  0.3720  0.4407    0.55%     18.47%
  113    206    0.0531  0.0768  0.2700  0.3304  0.3030  0.3367    1.91%     11.12%
  119    326    0.0644  0.0843  0.2442  0.2252  0.2018  0.2497    2.25%     23.74%
  107    98     0.1101  0.1580  0.3156  0.4062  0.4451  0.4592    3.17%     3.17%
  116    49     0.2635  0.1886  0.3008  0.2803  0.2487  0.3125    3.89%     25.65%
  118    273    0.0259  0.0164  0.1551  0.1736  0.1725  0.1806    4.03%     4.70%
  145    162    0.3218  0.2983  0.2169  0.2180  0.1961  0.3359    4.38%     71.29%
  138    52     0.0895  0.0992  0.1169  0.1321  0.1322  0.1386    4.84%     4.84%
  148    250    0.7256  0.7300  0.7316  0.7904  0.8010  0.8416    5.07%     5.07%
  108    294    0.0994  0.1266  0.3089  0.2586  0.1820  0.3264    5.67%     79.34%
  136    206    0.1713  0.1976  0.3943  0.5899  0.5894  0.6319    7.12%     7.21%
  134    188    0.4033  0.4132  0.5091  0.4897  0.4675  0.5500    8.03%     17.65%
  132    201    0.1979  0.3246  0.5671  0.5759  0.5625  0.6222    8.04%     10.61%
  133    80     0.2396  0.2793  0.1541  0.2322  0.2425  0.3034    8.63%     25.11%
  147    315    0.1371  0.1266  0.2312  0.2820  0.3001  0.3317    10.53%    10.53%
  106    201    0.0354  0.0458  0.2784  0.3463  0.3653  0.4039    10.57%    10.57%
  126    240    0.2766  0.2990  0.1997  0.3132  0.3548  0.3971    11.92%    11.92%
  125    169    0.1632  0.1719  0.2580  0.2224  0.1842  0.2920    13.18%    58.52%
  137    158    0.3365  0.4135  0.2339  0.3128  0.3555  0.4715    14.03%    32.63%
  131    28     0.0659  0.0648  0.0671  0.0984  0.1061  0.1255    18.28%    18.28%
  110    496    0.4909  0.4968  0.4845  0.4768  0.4336  0.6001    20.79%    38.40%
  142    660    0.4269  0.4479  0.4060  0.4694  0.4629  0.5721    21.88%    23.59%
  117    275    0.1165  0.1030  0.1296  0.1596  0.1637  0.2007    22.60%    22.60%
  149    133    0.0309  0.0836  0.0490  0.0732  0.0769  0.1028    22.97%    33.68%
  129    207    0.3391  0.2338  0.2402  0.3348  0.3477  0.4369    25.65%    25.65%
  144    49     0.2490  0.2275  0.0832  0.1327  0.1958  0.3130    25.70%    59.86%
  105    54     0.0656  0.1744  0.1304  0.1439  0.1441  0.2262    29.70%    56.97%
  139    55     0.0911  0.1139  0.0592  0.0705  0.0776  0.1481    30.03%    90.85%
  111    285    0.3795  0.3540  0.1991  0.3025  0.3756  0.5025    32.41%    33.79%
  123    435    0.0697  0.0800  0.2254  0.2252  0.2043  0.3014    33.72%    47.53%
  120    83     0.0165  0.0136  0.0490  0.0556  0.0538  0.0746    34.17%    38.66%
  109    742    0.0875  0.0879  0.0811  0.1344  0.1500  0.2290    52.67%    52.67%
  141    36     0.0084  0.0125  0.0538  0.0517  0.0514  0.0873    62.27%    69.84%
  Avg    10760  0.2035  0.2159  0.2205  0.2543  0.2573  0.3206    24.60%    24.60%