Combining Evidence from Multiple Searches

Edward A. Fox, M. Prabhakar Koushik, Joseph Shaw, Russell Modlin and Durgesh Rao
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061-0106

Abstract

At Virginia Tech and PRC Inc., investigations with TREC data have focused on developing and comparing mechanisms for combining evidence from a number of search schemes. Our work with the first CD-ROM has included various indexing, weighting, retrieval, combination, evaluation, and failure analysis efforts. Related work reported elsewhere in these proceedings by Paul Thompson discusses extensions undertaken by PRC Inc. and an evaluation of those results. Future work will develop our ideas further, try them out with additional data, and, we hope, be evaluated in connection with other work on TREC and TIPSTER.

1 Overview

The 1992 TREC effort at Virginia Tech was carried out largely on a DECstation 5000 Model 25 with 40 MB of RAM. The 1985 version of the SMART retrieval system, with numerous enhancements of our own, was used for indexing, retrieval, and evaluation.

Our efforts were divided into two main phases. Prior to the TREC meeting we worked solely with the Wall Street Journal (WSJ) on the first CD-ROM. Thus, in Phase 1, we made eight different types of runs employing three different retrieval models (the Boolean model, the p-norm model, and the vector space model) and three different weighting schemes. The queries for the Boolean and p-norm runs were manually generated by project team members. Vector queries were generated automatically from the topic descriptions. The results from these runs were then merged to provide combined result sets. Relevance judgements were also performed on a subset of the retrieved documents to help with training studies.

In Phase 2, after the TREC meeting, we experimented with all five collections on the first CD-ROM. However, based on our work with the WSJ, we restricted our investigation to five of the original eight cases. We also explored the use of limited training information in carrying out the merging of results. Work is continuing at Virginia Tech to incorporate the results of our many runs into merged selections, to improve performance. Subsequent sections of this paper describe these activities in greater detail.

2 Indexing and Data Structures

This section outlines the indexing done with the document collection provided by NIST. Due to limitations of available disk space, only Disc 1 was used during the experimental runs. The documents were indexed on a DECstation 5000 Model 25 using an enhanced version of the 1985 release of the SMART Information Retrieval System [1]. The following specifications were used during the indexing process:

1. No stemming was done.
2. A stop word list of 418 words was used.
3. The Heading, Text, and Summary sections were included.
4. A controlled vocabulary was not included.

Briefly, the document text is tokenized, stop words are deleted, and the remaining non-noise words are entered into the term dictionary along with their occurrence frequencies. Each term in the dictionary has a unique identification number. A document vector file is also created during indexing; it contains, for each document, its unique ID and a vector of term IDs and term weights. The weighting scheme itself is fairly flexible and can be changed to one of several schemes after the indexing is complete. Indexing the WSJ created a dictionary of approximately 15 MB and a document vector file of 121 MB. The other four collections take up space proportional to their sizes.
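A minimal sketch of this indexing pipeline is given below, for illustration only; the tokenizer, the function name, and the data layout are assumptions, not SMART's actual implementation.

    import re

    def index_documents(documents, stop_words):
        """Indexing sketch: tokenize each document, drop stop words, give each
        remaining term a unique ID in a dictionary (with a count of how many
        documents contain it), and emit a document vector of
        (term ID, within-document frequency) pairs.

        documents  -- mapping of document ID to raw text
        stop_words -- set of words to discard
        """
        dictionary = {}      # term -> (term ID, number of documents containing it)
        doc_vectors = {}     # doc ID -> list of (term ID, term frequency)

        for doc_id, text in documents.items():
            counts = {}
            for token in re.findall(r"[a-z0-9]+", text.lower()):   # simple tokenizer
                if token in stop_words:
                    continue
                counts[token] = counts.get(token, 0) + 1
            vector = []
            for term, tf in counts.items():
                term_id, doc_freq = dictionary.get(term, (len(dictionary), 0))
                dictionary[term] = (term_id, doc_freq + 1)
                vector.append((term_id, tf))
            doc_vectors[doc_id] = vector
        return dictionary, doc_vectors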
3 Retrieval Approach

3.1 Retrieval Runs

Several retrieval runs were then made, as outlined below.

* Retrieval based on the vector model. The topics were indexed considering the Description, Narrative, and Concepts sections to form vector queries. Retrieval was performed by varying the weighting on the document and query vectors. Three different methods were tried:

  1. Weighting with tf
  2. Weighting with (tf / max(tf)) * idf
  3. Weighting with (0.5 + 0.5 * tf / max(tf)) * idf, called aug_norm

  where tf = term frequency and idf = inverse document frequency. During retrieval, the query-document similarities were computed in two different ways for each of these weighting schemes: cosine and inner product.

* Retrieval based on the Boolean model. Boolean queries were manually generated by team members (with a background in computer science), and a retrieval run was made using these queries. The queries were composed, for the most part, using the entire text of the topic descriptions provided, and occasionally broader/narrower terms were obtained from general domain knowledge. The Boolean operators used were AND and OR.

* Retrieval based on the p-norm model. The Boolean queries described above were also used for the p-norm runs. Retrieval runs were made with three different p-values: 1.0, 1.5, and 2.0. No query term or clause weights were used during the p-norm runs.

The different runs are summarized in Table 1. Note that in Phase 1 of our efforts, we used all eight runs listed. In Phase 2, however, we focused on the pnorm1.0 case, with document weighting, and the four vector runs.

Table 1: Summary of Retrieval Runs

  Name        Model    Sim. Function   Weighting Scheme
  bool        Boolean  Boolean         binary term weights
  pnorm1.0    p-norm   p-norm          binary term weights, p=1.0
  pnorm1.5    p-norm   p-norm          binary term weights, p=1.5
  pnorm2.0    p-norm   p-norm          binary term weights, p=2.0
  cosine.atn  vector   cosine          aug_norm * idf
  cosine.nnn  vector   cosine          tf
  inner.atn   vector   inner product   aug_norm * idf
  inner.nnn   vector   inner product   tf

3.2 Weighting Schemes

The weighting schemes mentioned above are detailed in Table 2. Each weight is formed from three components: a term frequency normalization, a document weight, and a document vector normalization.

Table 2: Weighting Scheme Options

1. Term frequency normalization. This has the following choices:
     (n)one      new_tf = tf
     (b)inary    new_tf = 1
     (m)ax_norm  new_tf = tf / max_tf
     (a)ug_norm  new_tf = 0.5 + 0.5 * (tf / max_tf)

2. Document weights. This has the following choices:
     (n)one      new_wt = new_tf
     (t)fidf     new_wt = new_tf * idf, where idf = log(num_docs / collection_freq)
     (p)rob      new_wt = new_tf * log((num_docs - collection_freq) / collection_freq)

3. Document vector normalization. This can be either of:
     (n)one      new_wt = new_wt
     (c)osine    new_wt = new_wt / (Euclidean length of the document vector)

This allows for a very flexible approach to changing the document vector weights, as can be seen from Table 3.

Table 3: Matrix of weighting options

  Combination  Term frequency  Doc. weights  Doc. vector norm.
  nnn          tf              none          none
  ntn          tf              tfidf         none
  npn          tf              prob          none
  bnn          binary          none          none
  btn          binary          idf           none
  bpn          binary          prob          none
  mnn          max_norm        none          none
  mtn          max_norm        idf           none
  mpn          max_norm        prob          none
  ann          aug_norm        none          none
  atn          aug_norm        idf           none
  apn          aug_norm        prob          none

3.3 CPU time

The retrieval runs required approximately 4 minutes of CPU time for each topic. We did a full sequential pass through the document vector file for this, since we did not have enough disk space for the inverted file.
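To make the atn weighting and the two similarity functions concrete, the following minimal sketch scores a query against document vectors with a sequential pass, as in Section 3.3. The function names and the sparse-vector layout are illustrative assumptions, not the SMART code we used.

    import math

    def atn_weights(vector, num_docs, doc_freq):
        """Apply aug_norm * idf ("atn") weights to a raw term-frequency vector.

        vector   -- list of (term ID, raw term frequency) pairs
        num_docs -- number of documents in the collection
        doc_freq -- mapping from term ID to the number of documents
                    containing that term (for the idf factor)
        """
        max_tf = max(tf for _, tf in vector)
        return {t: (0.5 + 0.5 * tf / max_tf) * math.log(num_docs / doc_freq[t])
                for t, tf in vector}

    def inner_product(query, doc):
        """Inner-product similarity between two sparse weighted vectors."""
        return sum(w * doc[t] for t, w in query.items() if t in doc)

    def cosine(query, doc):
        """Cosine similarity: inner product divided by both vector lengths."""
        norm = (math.sqrt(sum(w * w for w in query.values())) *
                math.sqrt(sum(w * w for w in doc.values())))
        return inner_product(query, doc) / norm if norm else 0.0

    def rank(query, doc_vectors, similarity, top=200):
        """Sequential pass over all document vectors; return the best `top` docs."""
        scores = [(doc_id, similarity(query, doc))
                  for doc_id, doc in doc_vectors.items()]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:top]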
3.4 Combination of Run Results

Our original plan was to compare several schemes for combining the results of a number of runs. The results of using one scheme, the CEO Model, are reported elsewhere in these proceedings by Paul Thompson.

One of our goals was to use an artificial intelligence methodology, Decision Trees, which should reflect the most effective evaluation methods for each query. Our Decision Trees were produced by a commercial software package called KnowledgeSEEKER [2], by FirstMark Technologies Limited. The input to KnowledgeSEEKER is a training set of documents for which the results of the various evaluation runs are known, along with the relevance values of the documents. The output of KnowledgeSEEKER is a Decision Tree that indicates in what order to apply the evaluation runs in order to determine the relevance of documents in the collection. The highest level of the tree is the test that most effectively evaluates documents for a given query. Based on the results of the most effective test, documents may be assigned a relevance score, or additional tests that further evaluate the documents may be suggested. The Decision Tree partitions the collection into disjoint document groups, with relevance values for the documents in each group. The results produced by KnowledgeSEEKER for the training set of documents must be parsed in order to apply the Decision Trees to other documents in the collection.

In Phase 1, training data was fed into the Decision Tree system so that the ranges of values of independent variables (e.g., similarity for a particular type of search) could be categorized into sets or intervals that predict the dependent variable values (e.g., relevance value of 0 or 1). For example, one Decision Tree developed during Phase 1 is given in Figure 1.

Figure 1: Decision Tree example for query 52

  RULE 1: IF COSINE-2 in [0.032, 0.07)  THEN RELEVANCE = 0 (73.3%), RELEVANCE = 1 (26.7%)
  RULE 2: IF COSINE-2 in [0.07, 0.195]  THEN RELEVANCE = 0 (9.5%),  RELEVANCE = 1 (90.5%)

This indicates that the likelihood of relevance is about .27 for very low values from the second cosine run, and about .91 for higher values from that same run. When this tree is selected, the Decision Tree method suggests that ranking solely based on this cosine run would be wise. More complicated Decision Trees resulted for a number of queries, where several of the base runs' values had to be consulted. Unfortunately, no full ranking of run results using the Decision Trees could be completed in time for this report, so other, simpler methods were applied.

In Phase 1, a simple scheme was used for the results that were turned in. Essentially, the best results from each of the runs were included, until 200 distinct documents were found for each query. This scheme is referred to as Ad Hoc Merge in discussions below; a sketch of the idea is given at the end of this section. In Phase 2, a more complex system was explored, called Recall-Precision (R-P) Merge. Details and results are given in Section 6.
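A minimal sketch of the Ad Hoc Merge just described, assuming each run provides a ranked list of document IDs for a query (best first); the function name and the exact interleaving order are illustrative assumptions.

    def ad_hoc_merge(runs, limit=200):
        """Ad Hoc Merge sketch: walk down the ranked lists in parallel, taking
        the next document from each run in turn, until `limit` distinct
        documents have been collected for the query.

        runs  -- list of ranked lists of document IDs, best first
        limit -- number of distinct documents to return
        """
        merged, seen = [], set()
        rank = 0
        while len(merged) < limit and any(rank < len(run) for run in runs):
            for run in runs:
                if rank < len(run) and run[rank] not in seen:   # skip duplicates
                    seen.add(run[rank])
                    merged.append(run[rank])
                    if len(merged) == limit:
                        break
            rank += 1
        return merged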
4 Systems description

The main machine used for the indexing and retrieval runs was a DECstation 5000 Model 25 with 40 MB of RAM. This is a MIPS R3000 CPU running at 25 MHz. The total disk space used for the project was on the order of 3 GB.

5 Results of Phase 1

Due to limitations of disk space, only a subset of the collection, comprising the Wall Street Journal portion of Disc 1, was used during the Phase 1 experimental runs. Relevance judgements were performed on a subset of this data by team members, in order to obtain a large set of training information. These were compared with the NIST judgement data and showed very high correlation. (Almost 90% of the documents we judged relevant were judged relevant by NIST.) In any case, the NIST judgments were used in the official (November 18, 1992) evaluation of our Phase 1 system, shown in Figure 2.

Figure 2: Phase 1, Disc 1, WSJ, queries 50-100, Ad Hoc Merge

  Run number:            vpidt2
  Num_queries:           50
  Total number of documents over all queries
    Retrieved:           10000
    Relevant:            4056
    Rel_ret:             1371
  Recall - Precision Averages:
    at 0.00              0.5153
    at 0.10              0.2841
    at 0.20              0.1898
    at 0.30              0.1285
    at 0.40              0.0988
    at 0.50              0.0746
    at 0.60              0.0527
    at 0.70              0.0338
    at 0.80              0.0271
    at 0.90              0.0222
    at 1.00              0.0222
  Average precision for all points (11-pt Avg):                              0.1317
  Average precision for 3 intermediate points (0.20, 0.50, 0.80) (3-pt Avg): 0.0972
  Recall:
    at 5 docs            0.0476
    at 15 docs           0.0640
    at 30 docs           0.0843
    at 100 docs          0.2050
    at 200 docs          0.4481
  Precision:
    at 5 docs            0.3160
    at 15 docs           0.2147
    at 30 docs           0.1760
    at 100 docs          0.1246
    at 200 docs          0.1371

6 Results of Phase 2

6.1 Base Runs

In Phase 2, twenty-five base runs were made on Disc 1: 5 different retrieval methods were used for each of the 5 sub-collections. Based on our evaluation, the 11-point averages are given in Table 4. Note that in the p-norm case, document weights were utilized, in contrast to the binary weighting used in Phase 1.

Table 4: Base Runs, Phase 2, Disc 1, 11-point averages

  Run/Collection  AP      DOE     FR      WSJ     ZF
  cosine.atn      0.1138  0.0543  0.0259  0.2740  0.0813
  cosine.nnn      0.1890  0.0330  0.0504  0.2184  0.0946
  inner.atn       0.1241  0.0609  0.0405  0.3224  0.0888
  inner.nnn       0.1478  0.0252  0.0108  0.1329  0.0101
  pnorm1.0        0.3006  0.0876  0.0727  0.3085  0.1448

Table 5: Base Runs + Similarity Merge

  Run/Collection  AP      DOE     FR      WSJ     ZF      Sim-Merge
  cosine.atn      0.1138  0.0543  0.0259  0.2740  0.0813  0.1149
  cosine.nnn      0.1890  0.0330  0.0504  0.2184  0.0946  0.1513
  inner.atn       0.1241  0.0609  0.0405  0.3224  0.0888  0.1717
  inner.nnn       0.1478  0.0252  0.0108  0.1329  0.0101  0.0075
  pnorm1.0        0.3006  0.0876  0.0727  0.3085  0.1448  0.1831

It should be noted that the p-norm run results were best in almost all situations for a given collection (except for WSJ, where inner.atn was slightly better).

6.2 Similarity Merge

TREC evaluations were to be done for the entire Disc 1 contents, so it is necessary to combine the results of the 5 sub-collections into an overall Disc 1 evaluation. This is a difficult matter, since each collection has a different number of relevant documents, and each was indexed separately. In 1993 we expect to consider this problem in more detail.

As a first solution to the problem we decided on the simplest possible approach: combine the results based on similarity. Thus, for a particular retrieval approach, we merged all of the documents retrieved from the 5 sub-collection runs, sorted the 1000 documents found for each query based on similarity, and returned the 200 with the highest similarity values. For convenience, Table 5 shows the data from Table 4, with an added column to show the Similarity Merge results.

Clearly, some improvement is needed in this collection merging process. One might use training based on the number of relevant documents in a collection to predict a prior probability of finding a relevant document in that collection, and then use that to temper the similarity values.
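A minimal sketch of the Similarity Merge, assuming each sub-collection run supplies (document ID, similarity) pairs for a query; the names are illustrative.

    def similarity_merge(subcollection_runs, limit=200):
        """Similarity Merge sketch: pool one retrieval method's results across
        the five sub-collection runs, sort the pooled documents by their raw
        similarity values, and keep the `limit` highest for the query.

        subcollection_runs -- list of result lists, one per sub-collection
                              (AP, DOE, FR, WSJ, ZF), each a list of
                              (doc_id, similarity) pairs
        """
        pooled = [hit for run in subcollection_runs for hit in run]
        pooled.sort(key=lambda hit: hit[1], reverse=True)   # highest similarity first
        return pooled[:limit]

Since the raw similarity values from separately indexed sub-collections are not strictly comparable, a per-collection prior probability, as suggested above, could be used to rescale them before pooling.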
6.3 Recall-Precision Merge

Another type of merger involves combining the several retrieval runs for a given sub-collection. To improve upon the Ad Hoc Merge used in Phase 1, we elected to train the merging process using recall-precision results. Thus, we considered the retrospective case of using the recall-precision tables from our evaluations to help determine which runs to draw from when merging. In particular, we use the following Recall-Precision Merge algorithm (a sketch of the procedure is given at the end of this section):

1. For each run to be merged, store the top 200 items on a stack, with the best-ranked item on top.

2. Associate with each run an estimate of the probability that the top item on its stack is relevant. Initially, this value is the precision value at recall 0.0 from our evaluation.

3. To draw the next item for the merged result, identify the stack with the highest estimated probability of relevance, pop the top of that stack, and update the probability estimate.

4. To update the probability estimate, use interpolation based on the number of popped items (i.e., the number "retrieved") and the rest of the recall-precision results.

5. Continue the process, starting again with step 3, as long as fewer than 200 items have been drawn.

The results of applying this "R-P Merge" algorithm are given in Table 6, on the last line. Note that the first 5 columns of that line show the results of merging for a particular collection, and the very last value reflects Similarity Merge followed by R-P Merge.

Table 6: Base Runs + Similarity Merge + R-P Merge

  Run/Collection  AP      DOE     FR      WSJ     ZF      Sim-Merge
  cosine.atn      0.1138  0.0543  0.0259  0.2740  0.0813  0.1149
  cosine.nnn      0.1890  0.0330  0.0504  0.2184  0.0946  0.1513
  inner.atn       0.1241  0.0609  0.0405  0.3224  0.0888  0.1717
  inner.nnn       0.1478  0.0252  0.0108  0.1329  0.0101  0.0075
  pnorm1.0        0.3006  0.0876  0.0727  0.3085  0.1448  0.1831
  R-P Merge       0.2268  0.0554  0.0521  0.2555  0.1003  0.1523

From Table 6 we see that the R-P Merge method does not yield results that are as good as the best individual run. In particular, we could simply use the pnorm1.0 results uniformly and do better than with merging. Further improvement of the above algorithm, possibly yielding more accurate estimates in step 4, will be investigated in 1993. Other studies will consider whether recall-precision data for each query could be used in a similar training situation, for subsequent testing on Disc 2.
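A minimal sketch of the R-P Merge procedure, for illustration only: each run is assumed to contribute its top 200 documents plus an 11-point recall-precision table from a prior evaluation, and step 4 is simplified by treating the fraction of a run's output already consumed as the recall point at which to interpolate. The actual estimate we used may have differed; the names are illustrative.

    def interpolate(rp_table, recall):
        """Linear interpolation on an 11-point precision table,
        where rp_table[i] is the precision at recall i/10."""
        pos = min(max(recall, 0.0), 1.0) * 10
        lo, frac = int(pos), pos - int(pos)
        hi = min(lo + 1, 10)
        return rp_table[lo] * (1 - frac) + rp_table[hi] * frac

    def rp_merge(runs, limit=200):
        """R-P Merge sketch.

        runs  -- list of (ranked_docs, rp_table) pairs; ranked_docs is the run's
                 top documents (best first), rp_table its 11 precision values
                 at recall 0.0, 0.1, ..., 1.0 from a prior evaluation.
        limit -- number of documents to draw for the merged result.
        """
        # Step 1: one "stack" per run, best-ranked document on top.
        stacks = [list(reversed(docs[:limit])) for docs, _ in runs]
        tables = [rp for _, rp in runs]
        # Step 2: initial estimate = precision at recall 0.0.
        estimates = [rp[0] for rp in tables]
        drawn = [0] * len(runs)              # items popped ("retrieved") per run

        merged, seen = [], set()
        while len(merged) < limit and any(stacks):
            # Step 3: pop from the stack with the highest estimated P(relevant).
            best = max((i for i, s in enumerate(stacks) if s),
                       key=lambda i: estimates[i])
            doc = stacks[best].pop()
            drawn[best] += 1
            if doc not in seen:              # keep only distinct documents
                seen.add(doc)
                merged.append(doc)
            # Step 4 (simplified): move the estimate along the run's interpolated
            # recall-precision curve as more of its output is consumed.
            estimates[best] = interpolate(tables[best], drawn[best] / float(limit))
        return merged                        # Step 5: stop at `limit` items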
7 Evaluation

7.1 Software engineering

We began with the 1985 version of SMART and have enhanced it. We tried for a long period of time to use the new version of SMART on an RS/6000, but found its use of disk space to be excessive. Since we could not get reliable results, we went back to the older version. We have undergone extensive software development since May. This included writing C programs and Unix shell scripts to partially automate indexing, retrieval, relevance judgement making, merging, evaluation, and tabulation of results.

7.2 Problems and Failure Analysis

Problems encountered in the project were partially identified. A failure analysis was performed on a subset of documents that failed to be included in our result set in Phase 1. Observations regarding our merging methods came from further studies in Phase 2. The following observations summarize our findings.

* Disk Space Limitations: We had problems acquiring the required disk space in time for the submission of results, due to lateness of the promised award from DARPA. A great deal of time was wasted in making partial runs using a hodgepodge of computers, with NFS mounting of remote files slowing processing by almost an order of magnitude. Ultimately we started over when told that work on the WSJ would suffice. This forced us to submit a subset of results in Phase 1, and to complete work only on Disc 1 during Phase 2. With more disk space we could have used inverted files for indexing the data, which would have made things much faster. It would have allowed real-time interactive searching, which would have sped up and improved the quality of our relevance judgement operation. Also, with more disk space, we could have used an RS/6000 to get runs done more quickly, assuming SMART could be ported and made fully operational on that platform.

* Merging of Retrieval Runs: Ideally, an effective method to predict relevance, like the Decision Tree or CEO model, should be used to take the results from the various runs and determine the relevant documents based upon the training set and the relevance judgements. However, in Phase 1 we did not have enough time to properly implement these, so we settled on a less effective approach: using the ranks of the documents from each of the runs. The top N documents from each run were included in the result set, ordered based upon the ranks. The value of N was determined by the number of document number repetitions across the various runs; the average value of N was near 40. This method of combining the results from the various runs was flawed. Using the top N documents from each run assumes that each of the runs was of equal quality, which is usually not the case. Each poor run would place many non-relevant documents in the final list, while a quality run could have many relevant documents miss the cutoff of N. The similarity values should be used to determine the top-ranked documents to pull from each run, but the combinations of runs with various incompatible similarity measures made this a non-trivial task. In Phase 2, we used additional training data, namely the recall-precision average values from the evaluation of each run whose results were to be merged. This approach, too, is flawed, since the averages reflect general trends, while query-specific trends would be much more appropriate to use.

* Faulty Queries: No feedback modification could be performed to improve retrieval performance. The Boolean queries proved to be too general in many cases. The p-norm runs were made with the Boolean queries; separate p-norm queries (i.e., ones that made use of user-assigned p-values and query weights) would have improved retrieval effectiveness. The vector queries were generally good, but also had some faults. In particular, shorter vector queries might have led to better Similarity Merge behavior by decreasing the likelihood of spurious matches in sub-collections with few relevant documents. Also, the NOT clauses from the topic descriptions were included in the vector queries, which may have contributed to retrieval of non-relevant documents. The effect of bad queries and an inadequate combination method was lower retrieval quality than desired.

7.3 Possible enhancements

Because of the disk space problem we were not able to do many of the tests we planned, even by the end of Phase 2. Work will continue in 1993 now that new disks have been received. Among the planned tasks are:

* Phrase Identification and Matching: In the current system, phrases are handled by using an AND query term. For example, Information AND Retrieval was used for the phrase "Information Retrieval". Due to the absence of a proximity operator, this leads to retrieval of non-relevant documents in which these words occur widely apart. The retrieval results can be improved by providing a mechanism for dealing with phrases explicitly, and/or by the use of proximity operators.
* Better Base Methods: In addition to considering the use of phrases, further study of the base runs, considering query construction, indexing, weighting schemes, and lexical information, will be undertaken. Of particular interest is the use of p-norm queries, which, if tailor-made, might well outperform the vector queries in all collections. Contrasts between stemming and morphological analysis can also be made.

* Merging Methods: While the Ad Hoc and R-P Merge methods are not based on elaborate theory, they do provide insight into the effects of combining results. Further refinement of the approaches, and additional testing to obtain upper-bound performance values, will be undertaken.

* The CEO Model: The Combination of Expert Opinion (CEO) model [3, 4] of Thompson can be used to treat the different retrieval runs as experts, combining their weighted probability distributions to improve performance. This could be applied in a variety of ways to combine results from a variety of runs and indexing schemes.

References

[1] C. Buckley. Implementation of the SMART Information Retrieval System. Technical Report 85-686, Cornell University, Department of Computer Science, May 1985.

[2] FirstMark Technologies Limited. KnowledgeSEEKER User's Guide. FirstMark Technologies, 14 Concourse Gate, Suite 680, Ottawa, Ontario, Canada, 1990.

[3] P. Thompson. A Combination of Expert Opinion approach to probabilistic information retrieval, Part 1: The conceptual model. Information Processing & Management, 26(3):371-382, 1990.

[4] P. Thompson. A Combination of Expert Opinion approach to probabilistic information retrieval, Part 2: Mathematical treatment of CEO Model 3. Information Processing & Management, 26(3):383-394, 1990.