Description of the PRC CEO Algorithm for TREC

Paul Thompson
PRC Inc., Mail Stop 5S3
1500 Planning Research Drive
McLean, VA 22102
Phone: 703/556-1923
Email: thompson_paul@po.gis.prc.com

This paper describes work done on the TREC project at PRC Inc. in collaboration with Professor Edward Fox and his colleagues at Virginia Polytechnic Institute and State University (VPI&SU). The reader should refer to the description of their system included in this proceedings for further details on the common processing of the TREC data shared by PRC and VPI&SU (Fox et al. 1993). PRC developed an algorithm, the Combination of Expert Opinion (CEO), which combined the results of VPI&SU's runs; VPI&SU used a different combination technique for their final results. Originally the intent was that the CEO algorithm would be integrated with the SMART system used by VPI&SU. Combination of results would then take place at two levels: a lower level, combining individual document features within a particular retrieval method, and an upper level, combining the output of the individual methods themselves, i.e., the various cosine and p-norm methods used by VPI&SU. We had also hoped to train the CEO algorithm, so that the weighting of the various methods would be optimized based on relevance judgments. For the official TREC results we were able to use only the upper level of CEO, without any training. Since then we have carried out additional retrospective experiments in which the different methods are weighted in the CEO algorithm by one of several measures of their performance on TREC.

Combination of Expert Opinion

The statistical technique of CEO provides a solution to the problem of combining different probabilistic models of document retrieval. This technique is expected to improve precision and recall over that provided by any one model or method, since research has shown that different retrieval models retrieve different sets of relevant documents (Katzer et al. 1982, Fox et al. 1988). In the Bayesian formulation of the CEO problem (Lindley 1983), a decision maker is interested in some parameter or event and has a prior, or initial, distribution or probability for that parameter or event. The decision maker revises this distribution upon consulting several experts, each with his or her own distribution or probability for the parameter or event. To effect this revision, the decision maker must assess the relative expertise of the experts and their interdependence, both with each other and with the decision maker. The experts' distributions are treated as data by the decision maker and are used to update the prior distribution. For automatic document retrieval, the retrieval system is the decision maker and the different retrieval algorithms, or models, are the experts (Thompson 1990a,b, 1991). This is referred to as upper-level CEO. At the lower level, the probabilities of individual features, e.g., terms, within a particular retrieval model can be combined using CEO. In lower-level CEO the retrieval model is the decision maker and the term probabilities are viewed as lower-level experts. The probability distributions supplied by these lower-level experts can be updated, according to Bayes' theorem, by user relevance judgments for retrieved documents. These same relevance judgments also give the system a way to evaluate the performance of each model, both in the context of a single search of several iterations and over all searches to date.
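As a concrete illustration of the upper-level combination, the following minimal sketch treats each retrieval method's opinion as a normal distribution on the log-odds that a document is relevant and folds those opinions into the decision maker's prior by a precision-weighted (conjugate normal) update. The particular distributions, the numerical values, and the independence assumption are illustrative simplifications for this sketch, not the exact CEO model described in Thompson (1990a,b).

// Sketch: combine expert log-odds estimates by precision weighting,
// assuming independent experts with normal opinions on the logit scale.
#include <cmath>
#include <cstdio>
#include <vector>

struct Expert { double mean; double sd; };  // log-odds estimate and its spread

int main() {
    // Hypothetical prior of the decision maker (the retrieval system) on the
    // log-odds that a given document is relevant.
    Expert prior{0.0, 2.0};
    // Hypothetical opinions from three retrieval methods (the "experts").
    std::vector<Expert> experts = {{1.2, 0.5}, {0.4, 0.8}, {2.0, 1.0}};

    // Treat each expert's mean as an observation of the true log-odds with
    // variance sd^2; under independence the posterior precision is the sum
    // of the individual precisions.
    double post_prec = 1.0 / (prior.sd * prior.sd);
    double post_num  = prior.mean * post_prec;
    for (const Expert& e : experts) {
        double prec = 1.0 / (e.sd * e.sd);
        post_prec += prec;
        post_num  += e.mean * prec;
    }
    double post_mean = post_num / post_prec;
    double post_sd   = std::sqrt(1.0 / post_prec);

    // Convert the posterior mean log-odds back to a probability of relevance,
    // which can then be used to rank documents.
    double p = 1.0 / (1.0 + std::exp(-post_mean));
    std::printf("posterior log-odds %.4f (sd %.4f), P(relevant) = %.4f\n",
                post_mean, post_sd, p);
    return 0;
}

In this simplified reading, an expert with a smaller spread (greater assumed expertise) pulls the posterior more strongly toward its own estimate; weighting the methods by a performance measure plays an analogous role in the experiments reported below.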
These results can be used in a statistically sound way to weight the contributions of the models in the combined probability distribution used to rank the retrieved documents. Since various algorithms, such as p-norm, are expressed in terms of correlations rather than probability distributions, it was necessary to extend the CEO algorithm to handle correlations. So far this extension has been handled in a heuristic fashion. If a retrieval method, e.g., one of the cosine methods, returned a retrieval status value between 0 and 1, the logistic transformation of this weight was interpreted as an estimate of the mean of a logistically transformed beta distribution, which was provided as evidence to the decision maker. Since there was no basis on which to assign a standard deviation to this distribution, as called for by the CEO methodology, all standard deviations were assumed to be 0.4045, a value corresponding to a standard deviation of 0.1 in terms of probabilities. All of the retrieval methods used by VPI&SU were combined with the CEO algorithm except for the Boolean method. That is, we used weighted and unweighted cosine and inner product measures as well as p-norm measures with p values of 1.0, 1.5, and 2.0. For measures, such as the inner product and some of the p-norm runs, that did not give a retrieval status value in the 0 to 1 range, the result was mapped to this interval by scaling the highest score of the method in question for a given topic to the highest score given by one of the cosine measures. For documents not retrieved in the top 200 for a given topic, whose actual scores were therefore unknown, a default score halfway between 0 and the lowest score achieved by the method in question was used. The Boolean model was not included because it is not a ranked retrieval method; in the future we plan to extend our normalization techniques to use the Boolean results as well.

Figure 1 shows our summary official TREC results for topics 51-100 on the Wall Street Journal collection from the first CD-ROM. Since TREC, we have experimented with weighting the combined methods based on their performance on the TREC data. In other words, we have attempted to determine an upper bound for performance based on knowledge of each method's performance on the actual test data. We used four different weighting schemes: the 11-point average, precision at 0.00 recall, precision at 0.10 recall, and unweighted (i.e., our official TREC results). We also tried using only the five best methods, rather than the seven used for our official results, i.e., we excluded Pnorm1.5 and Pnorm2.0. Surprisingly, none of the weighted schemes produced better results than the unweighted scheme. Figure 2 shows our summary results for CEO based on all seven methods, using the 11-point average weights and the additional relevance judgments made by VPI&SU; Figure 3 shows the same weighting scheme using only the NIST relevance judgments. Two immediate explanations suggest themselves. First, weighting by overall averages may be too coarse to be useful. Second, our simple implementation of CEO assumes independence among the methods. To examine the first problem we intend to weight the methods on a topic-by-topic basis rather than by overall averages; again, this would be a retrospective upper-bound experiment.
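The following sketch illustrates this upper-level combination step: retrieval status values already mapped into (0, 1) are logit-transformed, assigned the fixed spread of 0.4045, and averaged with precision weights to produce a combined ranking. The run data, document identifiers, and the weight vector are invented for illustration; performance weights such as the 11-point average would simply replace the 1.0 entries, and the prior score scaling and default-score handling described above are assumed to have been done already.

// Sketch: map retrieval status values to log-odds and combine across methods.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Run = std::map<std::string, double>;   // document id -> retrieval status value in (0,1)

double logit(double p) { return std::log(p / (1.0 - p)); }

int main() {
    const double kSd = 0.4045;               // fixed spread assumed for every method
    // Two hypothetical method runs over three documents, scores already scaled to (0,1).
    std::vector<Run> runs = {
        {{"WSJ-1", 0.82}, {"WSJ-2", 0.40}, {"WSJ-3", 0.15}},
        {{"WSJ-1", 0.55}, {"WSJ-2", 0.70}, {"WSJ-3", 0.05}},
    };
    std::vector<double> weight = {1.0, 1.0};  // e.g. each method's 11-point average

    // Precision-weighted average of the logit scores; with equal spreads and
    // equal weights this reduces to a simple mean of the log-odds.
    std::map<std::string, double> combined;
    std::map<std::string, double> wsum;
    for (size_t m = 0; m < runs.size(); ++m) {
        double prec = weight[m] / (kSd * kSd);
        for (const auto& kv : runs[m]) {
            combined[kv.first] += logit(kv.second) * prec;
            wsum[kv.first]     += prec;
        }
    }

    // Rank documents by their combined log-odds, highest first.
    std::vector<std::pair<double, std::string>> ranked;
    for (const auto& kv : combined)
        ranked.push_back({kv.second / wsum[kv.first], kv.first});
    std::sort(ranked.rbegin(), ranked.rend());
    for (const auto& r : ranked)
        std::printf("%s  combined logit %.4f\n", r.second.c_str(), r.first);
    return 0;
}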
In terms of the CEO approach (Thompson 1990a,b), using only overall averages is analogous to using only feedback from past searches, while using topic-specific weights corresponds to receiving feedback over several iterations of the same search. We propose to investigate the second problem by analyzing the overlap of pairs of runs of the various methods in order to determine their dependence, and thus to perform CEO without the independence assumption. The PRC portion of the experiments was run on a Sun SPARCstation 2 with 16 megabytes of RAM; the CEO code was written in C++ and compiled with g++.
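One simple way to begin such an analysis is to measure how much of the top of two ranked runs is shared; the sketch below computes this overlap for the top-k documents of two hypothetical runs. The function name, the choice of k, and the data are illustrative only, and a full treatment of dependence in CEO would require more than a single overlap statistic.

// Sketch: overlap between the top-k documents of two runs as a rough
// indicator of dependence between retrieval methods.
#include <algorithm>
#include <cstdio>
#include <set>
#include <string>
#include <vector>

double topk_overlap(const std::vector<std::string>& a,
                    const std::vector<std::string>& b, size_t k) {
    std::set<std::string> sa(a.begin(), a.begin() + std::min(k, a.size()));
    std::set<std::string> sb(b.begin(), b.begin() + std::min(k, b.size()));
    size_t common = 0;
    for (const auto& d : sa) common += sb.count(d);
    // Fraction of the smaller top-k set that the two runs share.
    return static_cast<double>(common) / std::min(sa.size(), sb.size());
}

int main() {
    std::vector<std::string> cosine = {"WSJ-1", "WSJ-2", "WSJ-3", "WSJ-4"};
    std::vector<std::string> pnorm  = {"WSJ-2", "WSJ-5", "WSJ-1", "WSJ-6"};
    std::printf("overlap@4 = %.2f\n", topk_overlap(cosine, pnorm, 4));
    return 0;
}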
Top ranked evaluation
Run number: prceo1
Num_queries: 50
Total number of documents over all queries
  Retrieved: 10000
  Relevant:   4056
  Rel_ret:    1666
Recall - Precision Averages:
  at 0.00   0.6598
  at 0.10   0.4382
  at 0.20   0.3451
  at 0.30   0.2536
  at 0.40   0.1753
  at 0.50   0.1400
  at 0.60   0.1016
  at 0.70   0.0489
  at 0.80   0.0185
  at 0.90   0.0015
  at 1.00   0.0015
Average precision for all points
  11-pt Avg: 0.1985
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
  3-pt Avg:  0.1679
Recall:
  at 5 docs:    0.0379
  at 15 docs:   0.1193
  at 30 docs:   0.1714
  at 100 docs:  0.3438
  at 200 docs:  0.4557
Precision:
  At 5 docs:    0.4040
  At 15 docs:   0.3653
  At 30 docs:   0.3033
  At 100 docs:  0.2322
  At 200 docs:  0.1666

Figure 1. TREC summary results without weights

Top ranked evaluation
Run number: 1
Num_queries: 50
Total number of documents over all queries
  Retrieved: 10000
  Relevant:   3901
  Rel_ret:    1867
Recall - Precision Averages:
  at 0.00   0.5419
  at 0.10   0.3684
  at 0.20   0.3239
  at 0.30   0.2848
  at 0.40   0.1975
  at 0.50   0.1490
  at 0.60   0.0933
  at 0.70   0.0575
  at 0.80   0.0147
  at 0.90   0.0029
  at 1.00   0.0006
Average precision for all points
  11-pt Avg: 0.1850
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
  3-pt Avg:  0.1625
Recall:
  Exact:        0.5400
  at 5 docs:    0.0270
  at 15 docs:   0.0890
  at 30 docs:   0.1592
  at 100 docs:  0.3603
  at 200 docs:  0.5400
Precision:
  Exact:        0.1867
  At 5 docs:    0.2360
  At 15 docs:   0.2920
  At 30 docs:   0.2853
  At 100 docs:  0.2234
  At 200 docs:  0.1867

Figure 2. Post-TREC summary results with 11-point average weights and additional relevance judgments

Top ranked evaluation
Run number: 1
Num_queries: 47
Total number of documents over all queries
  Retrieved: 9400
  Relevant:   3697
  Rel_ret:    1366
Recall - Precision Averages:
  at 0.00   0.3801
  at 0.10   0.2588
  at 0.20   0.2379
  at 0.30   0.1638
  at 0.40   0.1144
  at 0.50   0.0831
  at 0.60   0.0551
  at 0.70   0.0363
  at 0.80   0.0101
  at 0.90   0.0002
  at 1.00   0.0002
Average precision for all points
  11-pt Avg: 0.1218
Average precision for 3 intermediate points (0.20, 0.50, 0.80)
  3-pt Avg:  0.1104
Recall:
  Exact:        0.4114
  at 5 docs:    0.0137
  at 15 docs:   0.0497
  at 30 docs:   0.0800
  at 100 docs:  0.2249
  at 200 docs:  0.4114
Precision:
  Exact:        0.1453
  At 5 docs:    0.1915
  At 15 docs:   0.2156
  At 30 docs:   0.1879
  At 100 docs:  0.1683
  At 200 docs:  0.1453

Figure 3. Post-TREC summary results with 11-point average weights and only NIST relevance judgments

References

Fox, E. A.; Koushik, P.; Shaw, J.; Modlin, R.; and Rao, D. 1993. "Combining Evidence from Multiple Searches." This proceedings.

Fox, E. A.; Nunn, G. L.; and Lee, W. C. 1988. "Coefficients for Combining Concept Classes in a Collection." Proceedings of the 11th International Conference on Research and Development in Information Retrieval, June 13-15, Grenoble, France, p. 291-307.

Katzer, J.; McGill, M. J.; Tessier, J. A.; Frakes, W.; and DasGupta, P. 1982. "A study of the overlap among document representations." Information Technology: Research and Development, vol. 2, p. 261-274.

Lindley, D. V. 1983. "Reconciliation of probability distributions." Operations Research, vol. 31, no. 5, p. 866-880.

Thompson, P. 1990a. "A Combination of Expert Opinion Approach to Probabilistic Information Retrieval, Part 1: The Conceptual Model." Information Processing and Management, vol. 26, no. 3, p. 371-382.

Thompson, P. 1990b. "A Combination of Expert Opinion Approach to Probabilistic Information Retrieval, Part 2: Mathematical Treatment of CEO Model 3." Information Processing and Management, vol. 26, no. 3, p. 383-394.

Thompson, P. 1991. "Machine Learning in the Combination of Expert Opinion Approach to IR." In Birnbaum, L. and Collins, G. (eds.), Machine Learning: Proceedings of the Eighth International Workshop (ML91), San Mateo, CA: Morgan Kaufmann, p. 270-274.