Combining Evidence from Multiple Searches

Edward A. Fox, M. Prabhakar Koushik, Joseph Shaw, Russell Modlin and Durgesh Rao
Department of Computer Science
Virginia Tech, Blacksburg, VA 24061-0106

Abstract

At Virginia Tech and PRC Inc., investigations with TREC data have focused on developing and comparing mechanisms for combining evidence from a number of search schemes. Our work with the first CD-ROM has included various indexing, weighting, retrieval, combination, evaluation, and failure analysis efforts. Related work reported elsewhere in these proceedings by Paul Thompson discusses extensions undertaken by PRC Inc. and an evaluation of those results. Future work will develop our ideas further, try them out with additional data, and, we hope, be evaluated in connection with other work on TREC and TIPSTER.

1 Overview

The 1992 TREC effort at Virginia Tech was carried out largely on a DECstation 5000 Model 25 with 40 MB of RAM. The 1985 version of the SMART retrieval system, with numerous enhancements of our own, was used for indexing, retrieval, and evaluation.

Our efforts were divided into two main phases. Prior to the TREC meeting we worked solely with the Wall Street Journal (WSJ) on the first CD-ROM. Thus, in Phase 1, we made eight different types of runs employing three different retrieval models (the Boolean model, the p-norm model, and the vector space model) and three different weighting schemes. The queries for the Boolean and p-norm runs were manually generated by project team members. Vector queries were generated automatically from the topic descriptions. The results from these runs were then merged to provide combined result sets. Relevance judgements were also performed on a subset of the retrieved documents to help with training studies.

In Phase 2, after the TREC meeting, we experimented with all five collections on the first CD-ROM. However, based on our work with the WSJ, we restricted our investigation to five of the original eight cases. We also explored the use of limited training information in carrying out the merging of results. Work is continuing at Virginia Tech to incorporate the results of our many runs into merged selections, to improve performance. Subsequent sections of this paper describe these activities in greater detail.

2 Indexing and Data Structures

This section outlines the indexing done with the document collection provided by NIST. Due to limitations of available disk space, only Disc 1 was used during the experimental runs. The documents were indexed on a DECstation 5000 Model 25 using an enhanced version of the 1985 release of the SMART Information Retrieval System [1]. The following specifications were used during the indexing process:

1. No stemming was done.
2. A stop word list of 418 words was used.
3. The Heading, Text, and Summary sections were included.
4. A controlled vocabulary was not included.

Briefly, the document text is tokenized, stop words are deleted, and the remaining non-noise words are entered into the term dictionary along with their occurrence frequencies. Each term in the dictionary has a unique identification number. A document vector file is also created during indexing; it contains, for each document, its unique ID and a vector of term IDs and term weights. The weighting scheme itself is fairly flexible and can be changed to one of several schemes after the indexing is complete. Indexing the WSJ created a dictionary of approximately 15 MB and a document vector file of 121 MB. The other four collections take up space proportional to their sizes.
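A minimal sketch of this indexing pipeline is given below, for illustration only; the tokenizer, the function name, and the data layout are assumptions, not SMART's actual implementation.

    import re

    def index_documents(documents, stop_words):
        """Indexing sketch: tokenize each document, drop stop words, give each
        remaining term a unique ID in a dictionary (with a count of how many
        documents contain it), and emit a document vector of
        (term ID, within-document frequency) pairs.

        documents  -- mapping of document ID to raw text
        stop_words -- set of words to discard
        """
        dictionary = {}      # term -> (term ID, number of documents containing it)
        doc_vectors = {}     # doc ID -> list of (term ID, term frequency)

        for doc_id, text in documents.items():
            counts = {}
            for token in re.findall(r"[a-z0-9]+", text.lower()):   # simple tokenizer
                if token in stop_words:
                    continue
                counts[token] = counts.get(token, 0) + 1
            vector = []
            for term, tf in counts.items():
                term_id, doc_freq = dictionary.get(term, (len(dictionary), 0))
                dictionary[term] = (term_id, doc_freq + 1)
                vector.append((term_id, tf))
            doc_vectors[doc_id] = vector
        return dictionary, doc_vectors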
3 Retrieval Approach

3.1 Retrieval Runs

Several retrieval runs were then made, as outlined below.

* Retrieval based on the vector model. The topics were indexed considering the Description, Narrative, and Concepts sections to form vector queries. Retrieval was performed by varying the weighting on the document and query vectors. Three different methods were tried:

  1. Weighting with tf
  2. Weighting with (tf / max(tf)) * idf
  3. Weighting with (0.5 + 0.5 * tf / max(tf)) * idf, called aug_norm

  where tf = term frequency and idf = inverse document frequency. During retrieval, the query-document similarities were computed in two different ways for each of these weighting schemes: cosine and inner product.

* Retrieval based on the Boolean model. Boolean queries were manually generated by team members (with a background in computer science), and a retrieval run was made using these queries. The queries were composed, for the most part, using the entire text of the topic descriptions provided, and occasionally broader/narrower terms were obtained from general domain knowledge. The Boolean operators used were AND and OR.

* Retrieval based on the p-norm model. The Boolean queries described above were also used for the p-norm runs. Retrieval runs were made with three different p-values: 1.0, 1.5, and 2.0. No query term or clause weights were used during the p-norm runs.

The different runs are summarized in Table 1. Note that in Phase 1 of our efforts, we used all eight runs listed. In Phase 2, however, we focused on the pnorm1.0 case, with document weighting, and the four vector runs.

Table 1: Summary of Retrieval Runs

  Name        Model    Sim. Function   Weighting Scheme
  bool        Boolean  Boolean         binary term weights
  pnorm1.0    p-norm   p-norm          binary term weights, p=1.0
  pnorm1.5    p-norm   p-norm          binary term weights, p=1.5
  pnorm2.0    p-norm   p-norm          binary term weights, p=2.0
  cosine.atn  vector   cosine          aug_norm * idf
  cosine.nnn  vector   cosine          tf
  inner.atn   vector   inner product   aug_norm * idf
  inner.nnn   vector   inner product   tf

3.2 Weighting Schemes

The weighting schemes mentioned above are detailed in Table 2. Each weight is formed from three components: a term frequency normalization, a document weight, and a document vector normalization.

Table 2: Weighting Scheme Options

1. Term frequency normalization. This has the following choices:
     (n)one      new_tf = tf
     (b)inary    new_tf = 1
     (m)ax_norm  new_tf = tf / max_tf
     (a)ug_norm  new_tf = 0.5 + 0.5 * (tf / max_tf)

2. Document weights. This has the following choices:
     (n)one      new_wt = new_tf
     (t)fidf     new_wt = new_tf * idf, where idf = log(num_docs / collection_freq)
     (p)rob      new_wt = new_tf * log((num_docs - collection_freq) / collection_freq)

3. Document vector normalization. This can be either of:
     (n)one      new_wt = new_wt
     (c)osine    new_wt = new_wt / (Euclidean length of the document vector)

This allows for a very flexible approach to changing the document vector weights, as can be seen from Table 3.

Table 3: Matrix of weighting options

  Combination  Term frequency  Doc. weights  Doc. vector norm.
  nnn          tf              none          none
  ntn          tf              tfidf         none
  npn          tf              prob          none
  bnn          binary          none          none
  btn          binary          idf           none
  bpn          binary          prob          none
  mnn          max_norm        none          none
  mtn          max_norm        idf           none
  mpn          max_norm        prob          none
  ann          aug_norm        none          none
  atn          aug_norm        idf           none
  apn          aug_norm        prob          none

3.3 CPU time

The retrieval runs required approximately 4 minutes of CPU time for each topic. We did a full sequential pass through the document vector file for this, since we did not have enough disk space for the inverted file.
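To make the atn weighting and the two similarity functions concrete, the following minimal sketch scores a query against document vectors with a sequential pass, as in Section 3.3. The function names and the sparse-vector layout are illustrative assumptions, not the SMART code we used.

    import math

    def atn_weights(vector, num_docs, doc_freq):
        """Apply aug_norm * idf ("atn") weights to a raw term-frequency vector.

        vector   -- list of (term ID, raw term frequency) pairs
        num_docs -- number of documents in the collection
        doc_freq -- mapping from term ID to the number of documents
                    containing that term (for the idf factor)
        """
        max_tf = max(tf for _, tf in vector)
        return {t: (0.5 + 0.5 * tf / max_tf) * math.log(num_docs / doc_freq[t])
                for t, tf in vector}

    def inner_product(query, doc):
        """Inner-product similarity between two sparse weighted vectors."""
        return sum(w * doc[t] for t, w in query.items() if t in doc)

    def cosine(query, doc):
        """Cosine similarity: inner product divided by both vector lengths."""
        norm = (math.sqrt(sum(w * w for w in query.values())) *
                math.sqrt(sum(w * w for w in doc.values())))
        return inner_product(query, doc) / norm if norm else 0.0

    def rank(query, doc_vectors, similarity, top=200):
        """Sequential pass over all document vectors; return the best `top` docs."""
        scores = [(doc_id, similarity(query, doc))
                  for doc_id, doc in doc_vectors.items()]
        scores.sort(key=lambda pair: pair[1], reverse=True)
        return scores[:top]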
3.4 Combination of Run Results

Our original plan was to compare several schemes for combining the results of a number of runs. The results of using one scheme, the CEO Model, are reported elsewhere in these proceedings by Paul Thompson.

One of our goals was to use an artificial intelligence methodology, Decision Trees, which should reflect the most effective evaluation methods for each query. Our Decision Trees were produced by a commercial software package called KnowledgeSEEKER [2], by FirstMark Technologies Limited. The input to KnowledgeSEEKER is a training set of documents for which the results of the various evaluation runs are known, along with the relevance values of the documents. The output of KnowledgeSEEKER is a Decision Tree that indicates in what order to apply the evaluation runs in order to determine the relevance of documents in the collection. The highest level of the tree is the test that most effectively evaluates documents for a given query. Based on the results of the most effective test, documents may be assigned a relevance score, or additional tests that further evaluate the documents may be suggested. The Decision Tree partitions the collection into disjoint document groups, with relevance values for the documents in each group. The results produced by KnowledgeSEEKER for the training set of documents must be parsed in order to apply the Decision Trees to other documents in the collection.

In Phase 1, training data was fed into the Decision Tree system so that the ranges of values of independent variables (e.g., similarity for a particular type of search) could be categorized into sets or intervals that predict the dependent variable values (e.g., relevance value of 0 or 1). For example, one Decision Tree developed during Phase 1 is given in Figure 1.

Figure 1: Decision Tree example for query 52

  RULE 1: IF COSINE-2 in [0.032, 0.07)  THEN RELEVANCE = 0 (73.3%), RELEVANCE = 1 (26.7%)
  RULE 2: IF COSINE-2 in [0.07, 0.195]  THEN RELEVANCE = 0 (9.5%),  RELEVANCE = 1 (90.5%)

This indicates that the likelihood of relevance is about .27 for very low values from the second cosine run, and about .91 for higher values from that same run. When this tree is selected, the Decision Tree method suggests that ranking solely based on this cosine run would be wise. More complicated Decision Trees resulted for a number of queries, where several of the base runs' values had to be consulted. Unfortunately, no full ranking of run results using the Decision Trees could be completed in time for this report, so other, simpler methods were applied.

In Phase 1, a simple scheme was used for the results that were turned in. Essentially, the best results from each of the runs were included, until 200 distinct documents were found for each query. This scheme is referred to as Ad Hoc Merge in discussions below; a sketch of the idea is given at the end of this section. In Phase 2, a more complex system was explored, called Recall-Precision (R-P) Merge. Details and results are given in Section 6.
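A minimal sketch of the Ad Hoc Merge just described, assuming each run provides a ranked list of document IDs for a query (best first); the function name and the exact interleaving order are illustrative assumptions.

    def ad_hoc_merge(runs, limit=200):
        """Ad Hoc Merge sketch: walk down the ranked lists in parallel, taking
        the next document from each run in turn, until `limit` distinct
        documents have been collected for the query.

        runs  -- list of ranked lists of document IDs, best first
        limit -- number of distinct documents to return
        """
        merged, seen = [], set()
        rank = 0
        while len(merged) < limit and any(rank < len(run) for run in runs):
            for run in runs:
                if rank < len(run) and run[rank] not in seen:   # skip duplicates
                    seen.add(run[rank])
                    merged.append(run[rank])
                    if len(merged) == limit:
                        break
            rank += 1
        return merged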
4 Systems description

The main machine used for the indexing and retrieval runs was a DECstation 5000 Model 25 with 40 MB of RAM. This is a MIPS R3000 CPU running at 25 MHz. The total disk space used for the project was on the order of 3 GB.

5 Results of Phase 1

Due to limitations of disk space, only a subset of the collection, comprising the Wall Street Journal portion of Disc 1, was used during the Phase 1 experimental runs. Relevance judgements were performed on a subset of this data by team members, in order to obtain a large set of training information. These were compared with the NIST judgement data and showed very high correlation. (Almost 90% of the documents we judged relevant were judged relevant by NIST.) In any case, the NIST judgments were used in the official (November 18, 1992) evaluation of our Phase 1 system, shown in Figure 2.

Figure 2: Phase 1, Disc 1, WSJ, queries 50-100, Ad Hoc Merge

  Run number:            vpidt2
  Num_queries:           50
  Total number of documents over all queries
    Retrieved:           10000
    Relevant:            4056
    Rel_ret:             1371
  Recall - Precision Averages:
    at 0.00              0.5153
    at 0.10              0.2841
    at 0.20              0.1898
    at 0.30              0.1285
    at 0.40              0.0988
    at 0.50              0.0746
    at 0.60              0.0527
    at 0.70              0.0338
    at 0.80              0.0271
    at 0.90              0.0222
    at 1.00              0.0222
  Average precision for all points (11-pt Avg):                              0.1317
  Average precision for 3 intermediate points (0.20, 0.50, 0.80) (3-pt Avg): 0.0972
  Recall:
    at 5 docs            0.0476
    at 15 docs           0.0640
    at 30 docs           0.0843
    at 100 docs          0.2050
    at 200 docs          0.4481
  Precision:
    at 5 docs            0.3160
    at 15 docs           0.2147
    at 30 docs           0.1760
    at 100 docs          0.1246
    at 200 docs          0.1371

6 Results of Phase 2

6.1 Base Runs

In Phase 2, twenty-five base runs were made on Disc 1: 5 different retrieval methods were used for each of the 5 sub-collections. Based on our evaluation, the 11-point averages are given in Table 4. Note that in the p-norm case, document weights were utilized, in contrast to the binary weighting used in Phase 1.

Table 4: Base Runs, Phase 2, Disc 1, 11-point averages

  Run/Collection  AP      DOE     FR      WSJ     ZF
  cosine.atn      0.1138  0.0543  0.0259  0.2740  0.0813
  cosine.nnn      0.1890  0.0330  0.0504  0.2184  0.0946
  inner.atn       0.1241  0.0609  0.0405  0.3224  0.0888
  inner.nnn       0.1478  0.0252  0.0108  0.1329  0.0101
  pnorm1.0        0.3006  0.0876  0.0727  0.3085  0.1448

Table 5: Base Runs + Similarity Merge

  Run/Collection  AP      DOE     FR      WSJ     ZF      Sim-Merge
  cosine.atn      0.1138  0.0543  0.0259  0.2740  0.0813  0.1149
  cosine.nnn      0.1890  0.0330  0.0504  0.2184  0.0946  0.1513
  inner.atn       0.1241  0.0609  0.0405  0.3224  0.0888  0.1717
  inner.nnn       0.1478  0.0252  0.0108  0.1329  0.0101  0.0075
  pnorm1.0        0.3006  0.0876  0.0727  0.3085  0.1448  0.1831

It should be noted that the p-norm run results were best in almost all situations for a given collection (except for WSJ, where inner.atn was slightly better).

6.2 Similarity Merge

TREC evaluations were to be done for the entire Disc 1 contents, so it is necessary to combine the results of the 5 sub-collections into an overall Disc 1 evaluation. This is a difficult matter, since each collection has a different number of relevant documents, and each was indexed separately. In 1993 we expect to consider this problem in more detail.

As a first solution to the problem we decided on the simplest possible approach: combine the results based on similarity. Thus, for a particular retrieval approach, we merged all of the documents retrieved from the 5 sub-collection runs, sorted the 1000 documents found for each query based on similarity, and returned the 200 with the highest similarity values. For convenience, Table 5 shows the data from Table 4, with an added column to show the Similarity Merge results.

Clearly, some improvement is needed in this collection merging process. One might use training based on the number of relevant documents in a collection to predict a prior probability of finding a relevant document in that collection, and then use that to temper the similarity values.
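A minimal sketch of the Similarity Merge, assuming each sub-collection run supplies (document ID, similarity) pairs for a query; the names are illustrative.

    def similarity_merge(subcollection_runs, limit=200):
        """Similarity Merge sketch: pool one retrieval method's results across
        the five sub-collection runs, sort the pooled documents by their raw
        similarity values, and keep the `limit` highest for the query.

        subcollection_runs -- list of result lists, one per sub-collection
                              (AP, DOE, FR, WSJ, ZF), each a list of
                              (doc_id, similarity) pairs
        """
        pooled = [hit for run in subcollection_runs for hit in run]
        pooled.sort(key=lambda hit: hit[1], reverse=True)   # highest similarity first
        return pooled[:limit]

Since the raw similarity values from separately indexed sub-collections are not strictly comparable, a per-collection prior probability, as suggested above, could be used to rescale them before pooling.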
6.3 Recall-Precision Merge

Another type of merger involves combining the several retrieval runs for a given sub-collection. To improve upon the Ad Hoc Merge used in Phase 1, we elected to train the merging process using recall-precision results. Thus, we considered the retrospective case of using the recall-precision tables from our evaluations to help determine which runs to draw from when merging. In particular, we use the following Recall-Precision Merge algorithm (a sketch of the procedure is given at the end of this section):

1. For each run to be merged, store the top 200 items on a stack, with the best-ranked item on top.

2. Associate with each run an estimate of the probability that the top item on its stack is relevant. Initially, this value is the precision value at recall 0.0 from our evaluation.

3. To draw the next item for the merged result, identify the stack with the highest estimated probability of relevance, pop the top of that stack, and update the probability estimate.

4. To update the probability estimate, use interpolation based on the number of popped items (i.e., the number "retrieved") and the rest of the recall-precision results.

5. Continue the process, starting again with step 3, as long as fewer than 200 items have been drawn.

The results of applying this "R-P Merge" algorithm are given in Table 6, on the last line. Note that the first 5 columns of that line show the results of merging for a particular collection, and the very last value reflects Similarity Merge followed by R-P Merge.

Table 6: Base Runs + Similarity Merge + R-P Merge

  Run/Collection  AP      DOE     FR      WSJ     ZF      Sim-Merge
  cosine.atn      0.1138  0.0543  0.0259  0.2740  0.0813  0.1149
  cosine.nnn      0.1890  0.0330  0.0504  0.2184  0.0946  0.1513
  inner.atn       0.1241  0.0609  0.0405  0.3224  0.0888  0.1717
  inner.nnn       0.1478  0.0252  0.0108  0.1329  0.0101  0.0075
  pnorm1.0        0.3006  0.0876  0.0727  0.3085  0.1448  0.1831
  R-P Merge       0.2268  0.0554  0.0521  0.2555  0.1003  0.1523

From Table 6 we see that the R-P Merge method does not yield results that are as good as the best individual run. In particular, we could simply use the pnorm1.0 results uniformly and do better than with merging. Further improvement of the above algorithm, possibly yielding more accurate estimates in step 4, will be investigated in 1993. Other studies will consider whether recall-precision data for each query could be used in a similar training situation, for subsequent testing on Disc 2.
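A minimal sketch of the R-P Merge procedure, for illustration only: each run is assumed to contribute its top 200 documents plus an 11-point recall-precision table from a prior evaluation, and step 4 is simplified by treating the fraction of a run's output already consumed as the recall point at which to interpolate. The actual estimate we used may have differed; the names are illustrative.

    def interpolate(rp_table, recall):
        """Linear interpolation on an 11-point precision table,
        where rp_table[i] is the precision at recall i/10."""
        pos = min(max(recall, 0.0), 1.0) * 10
        lo, frac = int(pos), pos - int(pos)
        hi = min(lo + 1, 10)
        return rp_table[lo] * (1 - frac) + rp_table[hi] * frac

    def rp_merge(runs, limit=200):
        """R-P Merge sketch.

        runs  -- list of (ranked_docs, rp_table) pairs; ranked_docs is the run's
                 top documents (best first), rp_table its 11 precision values
                 at recall 0.0, 0.1, ..., 1.0 from a prior evaluation.
        limit -- number of documents to draw for the merged result.
        """
        # Step 1: one "stack" per run, best-ranked document on top.
        stacks = [list(reversed(docs[:limit])) for docs, _ in runs]
        tables = [rp for _, rp in runs]
        # Step 2: initial estimate = precision at recall 0.0.
        estimates = [rp[0] for rp in tables]
        drawn = [0] * len(runs)              # items popped ("retrieved") per run

        merged, seen = [], set()
        while len(merged) < limit and any(stacks):
            # Step 3: pop from the stack with the highest estimated P(relevant).
            best = max((i for i, s in enumerate(stacks) if s),
                       key=lambda i: estimates[i])
            doc = stacks[best].pop()
            drawn[best] += 1
            if doc not in seen:              # keep only distinct documents
                seen.add(doc)
                merged.append(doc)
            # Step 4 (simplified): move the estimate along the run's interpolated
            # recall-precision curve as more of its output is consumed.
            estimates[best] = interpolate(tables[best], drawn[best] / float(limit))
        return merged                        # Step 5: stop at `limit` items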
7 Evaluation

7.1 Software engineering

We began with the 1985 version of SMART and have enhanced it. We tried for a long period of time to use the new version of SMART on an RS/6000, but found its use of disk space to be excessive. Since we could not get reliable results, we went back to the older version. We have undergone extensive software development since May. This included writing C programs and Unix shell scripts to partially automate indexing, retrieval, relevance judgement making, merging, evaluation, and tabulation of results.

7.2 Problems and Failure Analysis

Problems encountered in the project were partially identified. A failure analysis was performed on a subset of documents that failed to be included in our result set in Phase 1. Observations regarding our merging methods came from further studies in Phase 2. The following observations summarize our findings.

* Disk Space Limitations: We had problems acquiring the required disk space in time for the submission of results, due to lateness of the promised award from DARPA. A great deal of time was wasted in making partial runs using a hodgepodge of computers, with NFS mounting of remote files slowing processing by almost an order of magnitude. Ultimately we started over when told that work on the WSJ would suffice. This forced us to submit a subset of results in Phase 1, and to complete work only on Disc 1 during Phase 2. With more disk space we could have used inverted files for indexing the data, which would have made things much faster. It would have allowed real-time interactive searching, which would have sped up and improved the quality of our relevance judgement operation. Also, with more disk space, we could have used an RS/6000 to get runs done more quickly, assuming SMART could be ported and made fully operational on that platform.

* Merging of Retrieval Runs: Ideally, an effective method to predict relevance, like the Decision Tree or CEO model, should be used to take the results from the various runs and determine the relevant documents based upon the training set and the relevance judgements. However, in Phase 1 we did not have enough time to properly implement these, so we settled on a less effective approach: using the ranks of the documents from each of the runs. The top N documents from each run were included in the result set, ordered based upon the ranks. The value of N was determined by the number of document number repetitions across the various runs; the average value of N was near 40. This method of combining the results from the various runs was flawed. Using the top N documents from each run assumes that each of the runs was of equal quality, which is usually not the case. Each poor run would place many non-relevant documents in the final list, while a quality run could have many relevant documents miss the cutoff of N. The similarity values should be used to determine the top-ranked documents to pull from each run, but the combinations of runs with various incompatible similarity measures made this a non-trivial task. In Phase 2, we used additional training data, namely the recall-precision average values from the evaluation of each run whose results were to be merged. This approach, too, is flawed, since the averages reflect general trends, while query-specific trends would be much more appropriate to use.

* Faulty Queries: No feedback modification could be performed to improve retrieval performance. The Boolean queries proved to be too general in many cases. The p-norm runs were made with the Boolean queries; separate p-norm queries (i.e., ones that made use of user-assigned p-values and query weights) would have improved retrieval effectiveness. The vector queries were generally good, but also had some faults. In particular, shorter vector queries might have led to better Similarity Merge behavior by decreasing the likelihood of spurious matches in sub-collections with few relevant documents. Also, the NOT clauses from the topic descriptions were included in the vector queries, which may have contributed to retrieval of non-relevant documents. The effect of bad queries and an inadequate combination method was lower retrieval quality than desired.

7.3 Possible enhancements

Because of the disk space problem we were not able to do many of the tests we planned, even by the end of Phase 2. Work will continue in 1993 now that new disks have been received. Among the planned tasks are:

* Phrase Identification and Matching: In the current system, phrases are handled by using an AND query term. For example, Information AND Retrieval was used for the phrase "Information Retrieval". Due to the absence of a proximity operator, this leads to retrieval of non-relevant documents in which these words occur widely apart. The retrieval results can be improved by providing a mechanism for dealing with phrases explicitly, and/or by the use of proximity operators.
* Better Base Methods: In addition to considering the use of phrases, further study of the base runs, considering query construction, indexing, weighting schemes, and lexical information, will be undertaken. Of particular interest is the use of p-norm queries, which, if tailor-made, might well outperform the vector queries in all collections. Contrasts between stemming and morphological analysis can also be made.

* Merging Methods: While the Ad Hoc and R-P Merge methods are not based on elaborate theory, they do provide insight into the effects of combining results. Further refinement of the approaches, and additional testing to obtain upper-bound performance values, will be undertaken.

* The CEO Model: The Combination of Expert Opinion (CEO) model [3, 4] of Thompson can be used to treat the different retrieval runs as experts, combining their weighted probability distributions to improve performance. This could be applied in a variety of ways to combine results from a variety of runs and indexing schemes.

References

[1] C. Buckley. Implementation of the SMART Information Retrieval System. Technical Report 85-686, Cornell University, Department of Computer Science, May 1985.

[2] FirstMark Technologies Limited. KnowledgeSEEKER User's Guide. FirstMark Technologies, 14 Concourse Gate, Suite 680, Ottawa, Ontario, Canada, 1990.

[3] P. Thompson. A Combination of Expert Opinion approach to probabilistic information retrieval, Part 1: The conceptual model. Information Processing & Management, 26(3):371-382, 1990.

[4] P. Thompson. A Combination of Expert Opinion approach to probabilistic information retrieval, Part 2: Mathematical treatment of CEO Model 3. Information Processing & Management, 26(3):383-394, 1990.