Description of the PRC CEO Algorithm for TREC-2

Paul Thompson
PRC Inc., Mail Stop 5S3
1500 PRC Drive
McLean, VA 22102
Phone: 612/687-4650
E-mail: thompson@research.westlaw.com
(current affiliation: West Publishing Company)

Abstract

This paper describes an application of the Combination of Expert Opinion technique to combine the results of multiple retrieval methods used on the TREC-2 collection. The methods being combined were weighted by their TREC-1 performance.

1. Introduction

This paper describes work done on the TREC-2 project at PRC Inc. in collaboration with Professor Edward Fox and his colleagues at Virginia Polytechnic Institute and State University (VPI&SU). The reader should refer to the description of their system included in these working notes for further details on the common processing of the TREC-2 data shared by PRC and VPI&SU (Fox et al. 1993). PRC used its algorithm, the Combination of Expert Opinion (CEO), to combine the results of VPI&SU's runs; VPI&SU used a different combination technique for their final results.

Originally the intent was that the CEO algorithm would be integrated with the SMART system used by VPI&SU, and that combination of results would take place at both levels: at the lower level, among individual document features within a particular retrieval method, and at the upper level, among the outputs of the individual methods themselves, i.e., the various cosine and p-norm methods used by VPI&SU. For TREC-1 we were not able to train the CEO algorithm so that the weighting of the various methods would be optimized based on relevance judgments. For TREC-2 we instead weighted the various methods by the 11-point average scores they obtained for TREC-1. Again we used only the upper level of CEO. For TREC-1 we found that combining all methods resulted in lower performance than using the single best method. This year our first version combined the top two methods, based on TREC-1 performance, while the second version used the top five methods.

2. Combination of Expert Opinion

The statistical technique of CEO provides a solution to the problem of combining different probabilistic models of document retrieval. This technique is expected to result in improved precision and recall over that provided by any one model, or method, since research has shown that various retrieval models retrieve different, but more or less equally relevant, sets of documents (Katzer et al. 1982, Fox et al. 1988).

In the Bayesian formulation of the CEO problem (Lindley 1983) a decision maker is interested in some parameter or event for which he/she has a prior, or initial, distribution or probability. The decision maker revises this distribution upon consulting several experts, each with his/her own distribution or probability for the parameter or event. To effect this revision, the decision maker must assess the relative expertise of the experts and their interdependence, both with each other and with the decision maker. The experts' distributions are treated as data by the decision maker and used to update the prior distribution. For automatic document retrieval, the retrieval system is the decision maker, and the different retrieval algorithms, or models, are the experts (Thompson 1990a,b, 1991). This is referred to as upper-level CEO.
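As a rough illustration of this upper-level combination, the sketch below merges several methods' relevance estimates for a single document by precision-weighted averaging in logit space, with the better-performing method given a tighter, and therefore more influential, distribution. It is a minimal sketch in C++ (the language of the CEO code) under a normal approximation; the example values, the fixed uncertainties, and the simple precision-weighted update are illustrative assumptions and omit the treatment of inter-expert dependence in the full CEO model (Lindley 1983).

// Minimal illustrative sketch (not the actual CEO implementation):
// an upper-level combination of several retrieval methods' estimates of
// P(document is relevant), done by precision-weighted averaging in logit
// space under a normal approximation. Inter-expert dependence is ignored.
#include <cmath>
#include <iostream>
#include <vector>

struct Expert {
    double mean;   // the method's estimate, as a logit of P(relevant)
    double sigma;  // assumed uncertainty of that estimate (logit scale)
};

// Combine a prior with the experts' estimates; each contribution is
// weighted by its precision (1 / sigma^2).
double combine(double priorMean, double priorSigma,
               const std::vector<Expert>& experts) {
    double precisionSum = 1.0 / (priorSigma * priorSigma);
    double weightedSum  = priorMean * precisionSum;
    for (const Expert& e : experts) {
        double p = 1.0 / (e.sigma * e.sigma);
        precisionSum += p;
        weightedSum  += e.mean * p;
    }
    return weightedSum / precisionSum;   // posterior mean in logit space
}

int main() {
    // Two hypothetical experts, e.g., two retrieval methods; the method
    // that performed better on TREC-1 is given the smaller sigma.
    std::vector<Expert> experts = { {1.2, 0.3}, {0.4, 0.6} };
    double combinedLogit = combine(0.0, 1.0, experts);
    double probability = 1.0 / (1.0 + std::exp(-combinedLogit));
    std::cout << "combined P(relevant) = " << probability << "\n";
    return 0;
}

Weighting the methods by their TREC-1 11-point average precision can be viewed as one way of choosing these uncertainties: a method with a better TREC-1 score receives a smaller sigma and hence more weight in the combined estimate.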
At the lower level, the probabilities of individual features, e.g., terms, within a particular retrieval model can also be combined using CEO. In lower-level CEO the retrieval model is the decision maker and the term probabilities are viewed as lower-level experts. The probability distributions supplied by these lower-level experts can be updated, according to Bayes' theorem, by user relevance judgments on retrieved documents. These same relevance judgments also give the system a way to evaluate the performance of each model, both in the context of a single search consisting of several iterations and over all searches to date. These results can be used in a statistically sound way to weight the contributions of the models in the combined probability distribution used to rank the retrieved documents.

Since various algorithms, such as p-norm, are expressed in terms of correlations rather than probability distributions, it was necessary to extend the CEO algorithm to handle correlations. So far this extension has been handled in a heuristic fashion. If a retrieval method, e.g., one of the cosine methods, returned a value between 0 and 1 as a retrieval status value, the logistic transformation of this weight was interpreted as an estimate of the mean of a logistically transformed beta distribution, which was provided as evidence to the decision maker. Since there was no basis on which to assign a standard deviation to this distribution, as called for by the CEO methodology, an assumption was made that all standard deviations were 0.4045, a value corresponding to a standard deviation of 0.1 in terms of probabilities. The CEO code was written in g++.

For TREC-1 we used the CEO algorithm to combine all of the VPI&SU retrieval methods except for the Boolean, i.e., the weighted and unweighted cosine and inner product measures as well as the p-norm measures with p values of 1.0, 1.5, and 2.0. For measures not giving a retrieval status value in the 0 to 1 range, such as the inner product and some of the p-norm measures, the result was mapped to this interval by scaling the highest score of the method in question for a given topic to the highest score given by one of the cosine measures. Default scores halfway between 0 and the lowest score achieved by a particular method were used for documents not retrieved in that method's top 200 in response to a given topic, since the actual scores of these documents were unknown.

For TREC-2 we followed the same approach, except that only the results of the methods with better TREC-1 performance were combined. Our first version used Cosine.atn and Cosine.nnn, the two best VPI&SU methods from TREC-1, weighted by their performance on TREC-1. The second version used these two methods and the next best three, Inner.atn, Inner.nnn, and Pnorm 1.0, also weighted by their TREC-1 performance (see the VPI&SU report for details on these methods).
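The following sketch restates these score-mapping heuristics in code. It is a minimal illustration in C++, not the CEO implementation itself; the function names are hypothetical, and only the logistic transformation, the fixed standard deviation of 0.4045, the scaling against a cosine method's top score, and the halfway default score are taken from the description above.

// Illustrative helpers for the score-mapping heuristics described above
// (hypothetical names; only the constants and rules come from the text).
#include <algorithm>
#include <cmath>

// Fixed spread assumed for every expert's distribution: 0.4045 on the
// logit scale, corresponding to a standard deviation of about 0.1 in
// terms of probabilities.
constexpr double kSigma = 0.4045;

// Logistic (logit) transformation of a retrieval status value already in
// (0, 1); interpreted as the mean of a logistically transformed beta
// distribution supplied as evidence to the decision maker.
double logitMean(double rsv) {
    const double eps = 1e-6;                      // guard against 0 and 1
    double p = std::clamp(rsv, eps, 1.0 - eps);
    return std::log(p / (1.0 - p));
}

// Map a score from a method whose values do not fall in the 0 to 1 range
// (e.g., inner product or some p-norm runs) by scaling that method's top
// score for the topic to the top score of one of the cosine measures.
double rescale(double score, double methodTopScore, double cosineTopScore) {
    return score * (cosineTopScore / methodTopScore);
}

// Default score for a document a method did not return in its top 200:
// halfway between 0 and that method's lowest observed score for the topic.
double defaultScore(double methodLowestScore) {
    return 0.5 * methodLowestScore;
}

Each transformed mean, paired with kSigma, would then be passed as an expert's distribution to the upper-level combination sketched earlier.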
Figures 1 and 2 show our official TREC-2 summary results for versions 1 and 2, respectively.

Queryid (Num):      all (prceo1)
Total number of documents over all queries
    Retrieved:      50000
    Relevant:       10785
    Rel_ret:         3561
Interpolated Recall - Precision Averages:
    at 0.00         0.4768
    at 0.10         0.2613
    at 0.20         0.1807
    at 0.30         0.1202
    at 0.40         0.0875
    at 0.50         0.0504
    at 0.60         0.0328
    at 0.70         0.0169
    at 0.80         0.0142
    at 0.90         0.0077
    at 1.00         0.0031
Average precision (non-interpolated) over all rel docs
                    0.0904
Precision:
    At    5 docs:   0.2120
    At   10 docs:   0.2480
    At   15 docs:   0.2707
    At   20 docs:   0.2760
    At   30 docs:   0.2753
    At  100 docs:   0.2418
    At  200 docs:   0.1756
    At  500 docs:   0.1086
    At 1000 docs:   0.0712
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:          0.1703

Figure 1: Summary scores for PRC version 1 using two best experts

Queryid (Num):      all (prceo2)
Total number of documents over all queries
    Retrieved:      50000
    Relevant:       10785
    Rel_ret:         3323
Interpolated Recall - Precision Averages:
    at 0.00         0.5963
    at 0.10         0.3270
    at 0.20         0.2306
    at 0.30         0.1430
    at 0.40         0.0866
    at 0.50         0.0425
    at 0.60         0.0323
    at 0.70         0.0246
    at 0.80         0.0209
    at 0.90         0.0131
    at 1.00         0.0041
Average precision (non-interpolated) over all rel docs
                    0.1120
Precision:
    At    5 docs:   0.4120
    At   10 docs:   0.4000
    At   15 docs:   0.3920
    At   20 docs:   0.3740
    At   30 docs:   0.3527
    At  100 docs:   0.2722
    At  200 docs:   0.1974
    At  500 docs:   0.1090
    At 1000 docs:   0.0665
R-Precision (precision after R (= num_rel for a query) docs retrieved):
    Exact:          0.1809

Figure 2: Summary scores for PRC version 2 using five best experts

3. Conclusion

Although selecting the best individual TREC-1 VPI&SU retrieval methods for combination and weighting them by their TREC-1 performance seemed a reasonable strategy, one that should have yielded better retrieval results than an unweighted combination of all methods, CEO performance was in fact worse on TREC-2. This was due, in part, to changes made to the individual methods, which left the methods that had been best in TREC-1 less effective than other methods for TREC-2. These results suggest that selecting and weighting methods for combination based on the performance of earlier versions of those methods is unwarranted.

References

Fox, E.A.; Koushik, P.; Shaw, J.; Modlin, R.; Rao, D. 1993. "Combining Evidence from Multiple Searches," this proceedings.

Fox, E.A.; Nunn, G.L.; Lee, W.C. 1988. "Coefficients for Combining Concept Classes in a Collection," Proceedings of the 11th International Conference on Research and Development in Information Retrieval, June 13-15, Grenoble, France, p. 291-307.

Katzer, J.; McGill, M.J.; Tessier, J.A.; Frakes, W.; DasGupta, P. 1982. "A study of the overlap among document representations," Information Technology: Research and Development, vol. 2, p. 261-274.

Lindley, D.V. 1983. "Reconciliation of probability distributions," Operations Research, vol. 31, no. 5, p. 866-880.

Thompson, P. 1990a. "A Combination of Expert Opinion Approach to Probabilistic Information Retrieval, Part 1: The Conceptual Model," Information Processing and Management, vol. 26, no. 3, p. 371-382.

Thompson, P. 1990b. "A Combination of Expert Opinion Approach to Probabilistic Information Retrieval, Part 2: Mathematical Treatment of CEO Model 3," Information Processing and Management, vol. 26, no. 3, p. 383-394.

Thompson, P. 1991. "Machine Learning in the Combination of Expert Opinion Approach to IR," in Birnbaum, L. and Collins, G. (eds.), Machine Learning: Proceedings of the Eighth International Workshop (ML91), San Mateo: Morgan Kaufmann, p. 270-274.