UCLA-Okapi at TREC-2: Query Expansion Experiments

Efthimis N. Efthimiadis* and Paul V. Biron
Graduate School of Library and Information Science
University of California at Los Angeles

1 Introduction

This is the first participation of the Graduate School of Library and Information Science, University of California at Los Angeles, in the TREC conference. For TREC-2, Category B, UCLA used a version of the Okapi text retrieval system that was made available to UCLA by City University, London, UK. Okapi has been described in TREC-1 (Robertson, Walker, Hancock-Beaulieu, Gull & Lau, 1993a) as well as in this conference (Robertson, Walker, Jones, Hancock-Beaulieu & Gatford, 1994). Okapi is a simple set-oriented system based on a generalized probabilistic model with facilities for relevance feedback. In addition, Okapi supports a full range of deterministic Boolean and quasi-Boolean operations.

1.1 Objectives

The main research objective of the UCLA participation in TREC-2 was to investigate query expansion within the framework provided by Okapi. More specifically, the objectives were to:

* use an enhanced version of the Go-See-List (GSL) and evaluate its effect on retrieval performance;
* investigate the performance of query expansion with and without relevance information, by varying the number of documents that are treated as relevant and the number of terms that are included in the expansion;
* compare the performance of different ranking algorithms for the ranking of terms for term selection during query expansion;
* compare the retrieval effectiveness of user-assigned relevance judgements against hypothetically assumed relevance judgements based on the top X documents.

*To whom all correspondence should be addressed: Graduate School of Library and Information Science, University of California at Los Angeles, 405 Hilgard Avenue, Los Angeles, CA 90024-1520; e-mail: iacxene@mvs.oac.ucla.edu

1.2 The Okapi version at UCLA and the WSJ database

The Okapi system consists of a low-level search engine or basic search system (BSS), a user interface for the manual search experiments, and data conversion and inversion utilities. The UCLA hardware consisted of a Sun SPARC-2 machine with 32 MB of memory and 1 GB of disk storage.

The Wall Street Journal (WSJ) database was used for both the routing and ad hoc searches. Because of the lack of adequate disk space on the UCLA machine, the database was indexed at City University by Stephen Walker and then transferred (FTP-ed) to UCLA.

For TREC-2 the Okapi databases were built by indexing mainly the DOCNO and TEXT fields of the records. Inverted indexes included complete within-document positional information, enabling term frequency and term proximity to be used. Okapi's typical index size overhead is around 80% of the textfile size. The elapsed time for inversion of the WSJ database was about 12 hours.
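To make the inversion step concrete, the sketch below builds a toy positional inverted index of the kind described above: for each term it records, per document, the list of token positions, from which both term frequency and term proximity can be derived. This is our illustration only, not Okapi's code; the record format and the tokenizer (lowercase alphanumeric words, no stemming or GSL processing) are simplifying assumptions.

```python
from collections import defaultdict
import re

def build_positional_index(docs):
    """Map each term to {docno: [token positions]}.

    Within-document positions give term frequency (the length of a
    position list) and support proximity matching between terms.
    """
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term][docno].append(pos)
    return index

docs = {"WSJ870101-0001": "Oil prices fell as oil analysts cut forecasts."}
idx = build_positional_index(docs)
tf = len(idx["oil"]["WSJ870101-0001"])  # within-document term frequency: 2
```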
At this point it is worth noting (a) the nature of the WSJ records and (b) a limitation of Okapi due to its indexing.

(a) The WSJ records consist of documents that do not have the same kind of structure found in bibliographic databases, such as INSPEC or ERIC. The records contain the full text of stories and are of varied length, mostly longer than the average abstract in a bibliographic database. In addition, the language and style are mostly 'journalistic' as opposed to 'scientific', i.e. less structured. One important issue is that some WSJ records contain short multi-story articles which are completely unrelated to one another. This type of record is usually a compilation of a number of one- or two-paragraph news stories. The stories share no content relation; their only common feature is their co-existence in the same record. This has implications for retrieval effectiveness, especially when such records are included in the pool of documents that provide terms for query expansion, because of the noise introduced by the terms taken from the irrelevant stories.

(b) This last issue relates to a limitation of Okapi. The version of Okapi used at UCLA retrieves documents at the record level only. Retrieval at the paragraph level, which would have facilitated a better handling of issues like the above, is not currently available.

2 The weighting functions

The weighting of search terms can be said to involve two levels:

level 1: a weighting function is used to weight the terms of the initial query, as well as the terms of subsequent search iterations of the same query or some modified version of it.

level 2: a weighting function is used for the weighting of candidate terms for query expansion.

Sections 2.1 and 2.2 discuss the functions used at level 1 and level 2 respectively.

2.1 Search term weighting

The theory of relevance weights (Robertson & Sparck Jones, 1976) provides the basic probabilistic model. The binary independence or relevance weight model assigns a weight to each term, and the matching function for each document is given by the 'simple sum-of-weights' over all of the terms in the query. The weight of a term is calculated by the following function, also known as the F4 point-5 formula:

    w_{F4} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}    (1)

where N is the total number of documents in the collection; R is the sample of relevant documents as defined by the user's feedback; n is the number of documents indexed by term t; and r is the number of relevant documents (from the sample R) indexed by term t. When relevance information is not available, the above weight reduces to approximately the inverse document frequency (IDF).

For calculating the total weight of a document, a function was used which is based on the binary independence model and takes into consideration the 2-Poisson model for within-document term frequency (tf) and the document length. These are described in detail in Robertson et al. (1993b). The purpose of the UCLA Okapi system was to evaluate the existing Okapi models, and it therefore did not allow modifications of the existing functions. For compatibility and comparison purposes it was decided to use the BM15 (best match) function for the runs. The BM15 best-match weighting function is:

    docweight_{BM15} = \sum_{t \in Q} \frac{tf_t}{k_1 + tf_t} \, w_{F4}(t) + k_2 \cdot nq \cdot \frac{avedl - dl}{avedl + dl}    (2)

where tf_t is the within-document frequency of term t; nq is the number of terms in the query Q; dl and avedl are the document length and the average document length; and k_1 and k_2 are unknown constants. In the UCLA-Okapi implementation the values for these constants are k_1 = 1 and k_2 = 1.
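As an illustration of equations (1) and (2), the following is a minimal sketch of the two functions as we read them, not a transcription of Okapi's BSS code; the dictionary-based inputs are our own convention.

```python
import math

def w_f4(r, n, R, N):
    """F4 'point-5' relevance weight (eq. 1). With no relevance
    information (r = R = 0) it approximates IDF."""
    return math.log((r + 0.5) * (N - n - R + r + 0.5) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def bm15_score(doc_tf, weights, dl, avedl, k1=1.0, k2=1.0):
    """BM15 document weight (eq. 2).

    doc_tf:  {query term: within-document frequency}
    weights: {query term: its w_F4 weight}
    The second addend is the document-length correction factor,
    which rewards shorter-than-average documents.
    """
    score = sum(weights[t] * tf / (k1 + tf) for t, tf in doc_tf.items())
    nq = len(weights)  # number of query terms
    return score + k2 * nq * (avedl - dl) / (avedl + dl)
```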
2.2 Query expansion term weighting

The ranking algorithms that were considered for the ranking of terms for query expansion were wpq, emim, porter, r-lohi and r-hilo. These algorithms are described briefly below; a combined sketch of all five follows section 2.2.5.

2.2.1 The wpq algorithm

This algorithm is based on an independence assumption that holds between a query expansion term and the terms in the entire previous search formulation (Robertson, 1990). According to the relevance weighting theory, the inclusion of term t in the search formulation with weight w_t will increase the effectiveness of retrieval by

    wpq_t = w_t (p_t - q_t)    (3)

where w_t is a weighting function, in this case w_{F4}; p_t is the probability of term t occurring in a relevant document; and q_t is the probability of term t occurring in a non-relevant document. This means that, irrespective of the weighting function w_t used, the rule for deciding on the inclusion of a term in a query expansion search should be based on the ranking of wpq rather than of w_t alone. Substituting the weighting function and the two probabilities in wpq with expressions in r, R, n, N we get:

    wpq_t = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \cdot \left( \frac{r}{R} - \frac{n - r}{N - R} \right)

The wpq algorithm thus combines the effect of the relevance weighting theory, as expressed by the w_{F4} component, which assigns greater importance to infrequent terms, with the frequency of occurrence of a term in the relevant document set.

2.2.2 The emim algorithm

The expected mutual information measure (emim) is a term weighting model incorporating relevance information in which it is assumed that index terms may not be distributed independently of each other (van Rijsbergen, 1977; Harper & van Rijsbergen, 1978; van Rijsbergen, Harper & Porter, 1981). The emim weight reduces to the F4 weight when the "degree of involvement", i.e. the joint probabilities, are all unity. Assuming the same definitions for n, N, r, R as above, the emim weight of a term is calculated as follows:

    emim_t = p_{11} i_{11} - p_{12} i_{12} - p_{21} i_{21} + p_{22} i_{22}
           = \frac{r}{N} \log \frac{rN}{nR}
             - \frac{n-r}{N} \log \frac{(n-r)N}{n(N-R)}
             - \frac{R-r}{N} \log \frac{(R-r)N}{(N-n)R}
             + \frac{N-n-R+r}{N} \log \frac{(N-n-R+r)N}{(N-n)(N-R)}    (4)

2.2.3 The porter algorithm

Porter and Galpin (1988) describe a ranking formula used in the MUSCAT online catalogue:

    porter_t = \frac{r}{R} - \frac{n}{N}    (5)

where r, R, n, N are defined as in the F4 weight (eq. 1).

2.2.4 The r-lohi algorithm

The r-lohi algorithm has been proposed by Efthimiadis (1993a) as the result of observing the ranking behavior of six algorithms used for ranking terms for query expansion. The r-lohi ranking algorithm:

* ranks terms according to r, i.e. their frequency of occurrence in the relevant document set, in descending order, and
* resolves ties according to n, the number of documents indexed by the term, from low to high frequency.

It was hypothesized that the r-lohi algorithm would have an almost identical ranking to porter and a performance approaching that of wpq and emim. More differences between the algorithms may occur as the size of the set of relevant documents (R) gets larger. Conclusions about the algorithm, however, could not be drawn before it was evaluated against the other algorithms. The results of that evaluation are reported in Efthimiadis (in press), where the r-lohi algorithm demonstrated better performance when compared to the other algorithms.

2.2.5 The r-hilo sort

A variant of the r-lohi algorithm is to rank candidate terms for query expansion using the r-hilo rank, which:

* ranks terms according to r, i.e. their frequency of occurrence in the relevant document set, and
* resolves ties according to n, from high to low frequency.

Since the r-hilo algorithm breaks ties in exactly the opposite way of the r-lohi algorithm, it was included as a control for the study.
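For concreteness, here is the combined sketch of the five rankings under the definitions of r, n, R, N given above. It is our paraphrase for illustration (the UCLA-Okapi runs used the system's own implementations); the tuple convention for candidate terms is an assumption.

```python
import math

def f4(r, n, R, N):  # eq. (1)
    return math.log((r + 0.5) * (N - n - R + r + 0.5) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def wpq(r, n, R, N):  # eq. (3), with p and q estimated from r, R, n, N
    return f4(r, n, R, N) * (r / R - (n - r) / (N - R))

def emim(r, n, R, N):  # eq. (4); empty contingency cells contribute nothing
    cells = [(+1, r, n * R), (-1, n - r, n * (N - R)),
             (-1, R - r, (N - n) * R), (+1, N - n - R + r, (N - n) * (N - R))]
    return sum(s * (c / N) * math.log(c * N / d) for s, c, d in cells if c > 0)

def porter(r, n, R, N):  # eq. (5)
    return r / R - n / N

# r-lohi and r-hilo are sort rules rather than weights: rank candidate
# terms (term, r, n) by r descending, breaking ties on n ascending
# (r-lohi) or n descending (r-hilo).
def rank_r_lohi(terms):
    return sorted(terms, key=lambda t: (-t[1], t[2]))

def rank_r_hilo(terms):
    return sorted(terms, key=lambda t: (-t[1], -t[2]))
```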
3 Methodology

3.1 Runs

Initial tests were performed on Topics 1-50, where the dependent variables were the weighting function and the query processing of terms. From the results obtained it was established that the function to use would be BM15 and that the parsing of the Topics would include both single terms and "phrases", as defined by comma-delimited text in the Topics. Table 3.1 gives all the variables used in constructing the runs, together with the options available for each variable.

Table 3.1: Methodology for the Routing Runs on Topics 51-100

Weighting Function | Phrases | Query Expansion | QE Algorithm | Terms Expanded | Docs for Auto Rel Fbk | UCLA GSL
bm15               | no      | no              | wpq          | 0              | 0                     | no
                   | yes     | yes             | emim         | 10             | 5                     | yes
                   | both    |                 | porter       | 20             | 10                    |
                   |         |                 | r-lohi       | 30             | 15                    |
                   |         |                 | r-hilo       |                | 20                    |

Weighting function: the best-match function BM15 (see equation 2).

Phrases: choice of YES, NO, or BOTH. This determines the type of parsing of the "Concepts" and "Title" fields of the Topics, which were the source of the search terms. NO means that the terms extracted from the Concepts and Title fields are single terms only. YES means that phrases are extracted by a simple routine, where a phrase is identified using the punctuation found in the Concepts and Title fields. BOTH is the combination of the two methods: the terms are searched as single terms as well as phrases.

Query expansion (QE): the choice of query expansion algorithm, one of wpq, emim, porter, r-lohi, r-hilo.

Terms expanded: the number of terms to include in the expansion. When the number of terms expanded is zero, only the initial query is run.

Feedback documents: the number of top-ranked documents to be treated as relevant and to provide the source of the terms for query expansion.

UCLA GSL: whether the standard Okapi GSL or the UCLA enhanced version of the GSL is used.

Because of the many parameters involved in each run, the names of the runs have been deliberately made explicit, which however resulted in rather long names. For example, bm15.phb.qey:r-lohi-10-5.uclagsly means that for this run the weighting function used was BM15, phrases were set to BOTH, query expansion took place, the r-lohi algorithm was used for the ranking of terms for query expansion, 10 terms were added in the expansion, 5 documents provided the source of the terms for the expansion, and the UCLA enhanced GSL was used. (A small parser for this naming convention is sketched at the end of Appendix A.)

3.2 Go-See-List

The Go-See-List (GSL) is a look-up table that contains stopwords, semi-stopwords, prefixes, go-phrases and synonym classes. The GSL is used during the indexing of a database as well as during searching.

Stopwords are terms that are thought to have no or little value for retrieval. These include contractions, prepositions, adverbs, etc.

Semi-stopwords are terms that are thought to have low value for retrieval purposes. Therefore, a semi-stopword is searched during the initial search only if it has been part of the user's search statement. If, however, the term has emerged as the result of a query expansion, it is stopped, i.e. excluded from the pool of candidate terms for query expansion.

Go-phrases are mostly noun phrases that need to be searched as one word, or else precision will be very low, e.g. New York. The GSL contains a small number of selected go-phrases.

Synonym entries contain a mix of terms/concepts that are treated as synonyms for retrieval purposes. These may be true synonyms, quasi-synonyms, or semantically unrelated terms which are grouped together because of some common properties that have value for retrieval.
Finally, the synonym entries also contain term variants that are known to "escape" from the conflation algorithm. The structure of the UCLA GSL is given in the table below.

The Go-See-List (GSL) | City | Added by UCLA | UCLA total
stopwords             | 411  | 72            | 483
semi-stopwords        |      | 58            | 58
prefixes              | 18   |               | 18
go-phrases            | 43   | 84            | 127
synonyms              | 359  | 604           | 963

For the UCLA GSL, the Titles and Concepts of Topics 1-100 were analyzed and synonym classes were generated from the data. The list includes 40 personal names and 250 synonym classes. In addition, a list of organizations and a list of common business acronyms and abbreviations were compiled.

3.3 Query term selection

Query terms were selected from the Title and Concepts fields of the records. The processing of these fields was very simple. Programs written in awk and perl were used to isolate the required fields, which were then parsed, and the resulting terms were stemmed in accordance with the indexing procedures followed for building the WSJ database. This process resulted in one-word query terms. When appropriate, the procedure also output phrases, by treating the punctuation available in these fields as the phrase delimiter. Queries were then generated automatically from the Title and Concepts fields. Exactly the same queries were used in the Routing and Ad hoc searches.

3.4 Term selection for query expansion

a) Routing searches: Query expansion in the routing searches was performed through query modification without relevance information. As indicated in Table 3.1, which describes the construction of the runs, the number of documents used could range from the top 0-20 documents, in increments of 5. These top-ranked documents were treated as relevant and were analyzed in order to provide terms for the expansion. Expansion terms were selected by pooling all the terms and then weighting them with one of the five ranking algorithms, as specified by the run. The top 10, 20 or 30 terms were then added to the original query terms and searched.

b) Ad hoc searches: The term pool consisted of all the terms of the documents judged as relevant. For the Ad hoc searches with feedback of the official results, the top 10 terms as determined by wpq were chosen for expansion and were searched together with the initial query terms.

c) Rules for term selection: The following rules were applied for the inclusion or exclusion of a term during selection for query expansion: (i) numbers were excluded as terms; (ii) all terms whose document frequency n did not exceed the number of relevant documents seen (R), i.e. terms with n <= R, were excluded. A sketch of the whole selection step is given below.
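Putting sections 3.4(a)-(c) together, the expansion step can be sketched as follows. This is a minimal illustration of the published rules, with simplified pooling and a hypothetical doc_freq helper standing in for an index lookup; it is not the UCLA-Okapi code.

```python
def select_expansion_terms(top_docs, rank_fn, N, k, doc_freq):
    """Pick k expansion terms from documents treated as relevant.

    top_docs: a list of R token lists; rank_fn: one of the section 2.2
    weights, called as rank_fn(r, n, R, N); doc_freq(term) -> n is an
    assumed helper returning collection document frequency.
    """
    R = len(top_docs)
    pool = {}
    for doc in top_docs:
        for term in set(doc):          # r counts documents, not tokens
            pool[term] = pool.get(term, 0) + 1
    ranked = []
    for term, r in pool.items():
        if term.isdigit():             # rule (i): numbers excluded
            continue
        n = doc_freq(term)
        if n <= R:                     # rule (ii): terms with n <= R excluded
            continue
        ranked.append((rank_fn(r, n, R, N), term))
    ranked.sort(reverse=True)
    return [term for _, term in ranked[:k]]
```

The selected terms are then added to the original query terms and the search is rerun.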
3.5 Search procedure

All searches, Routing and Ad hoc, were automatic and determined by the specifications made for each run. There were no manual searches.

3.5.1 Ad hoc searches and searchers

For the Ad hoc searches with relevance feedback, i.e. uclaf1 (official results), relevance assessments were provided by two searchers. The odd-numbered topics were assessed by one searcher and the even-numbered topics by the other.

3.5.2 Relevance assessments

During the Ad hoc searches, the guidelines for relevance judgements were:

a) Review the entire document when judging relevance, even if it seems to be peripheral or not relevant. The reason is that many of the articles were found to be collections of brief news stories, with the relevant part of the text hidden in the middle or at the end of the text.

b) Target 10 relevant documents: stop as soon as 10 are found, or at the 20th document. However, if 3 relevant documents have not been found by then, continue until 3 are found (this is because Okapi will not do an expansion with fewer than 3 relevant documents).

3.5.3 Ad hoc additional runs

Following the TREC conference, a set of runs was conducted on the Ad hoc queries in order to complete the evaluation of the five ranking algorithms for query expansion that were studied. The relevance judgements made in the Ad hoc run uclaf1 (fdbk.bm15.phb.qey:wpq-10-10.uclagsly) were extracted and used in the subsequent runs. The process followed in these additional runs is described below:

* Four new Ad hoc runs were done, one for each of the remaining algorithms used for the ranking of terms for query expansion, i.e. emim, porter, r-hilo, r-lohi.
* The same initial query, which was generated automatically, was used for all searches.
* The relevance judgements made in the initially retrieved set of the official Ad hoc run were extracted and then simulated in the additional runs.
* Query expansion terms were ranked using the algorithm designated by each run. The 10 top-ranked terms from the pool were added to the query.

3.6 Problems & Limitations

Lack of equipment was a major problem in our participation. In order to participate in TREC, SUN Microsystems provided an equipment grant (SUN SPARC-2) in March; however, no disk was initially available, and a 1-gigabyte disk was acquired in June. Consequently, only the Ad hoc runs were included in the official results. A limitation of the UCLA version of Okapi is that it does not allow modifications of the basic retrieval functions (i.e., the BMs or best-match functions).

4 Results and Discussion

The results of the Routing runs, the Ad hoc runs and the Ad hoc additional runs are given in Table 1, Table 2 and Table 3 respectively.

Routing runs

The 35 Routing runs given in Table 1 are presented in descending Recall values. The runs bm15.ph[ynb].qen.uclagsl[yn], i.e. the runs without query expansion, were used as baseline runs in order to facilitate comparisons. All other runs reported in the table include query expansion. The results indicate that the runs with query expansion where the r-lohi or the r-hilo algorithm was used performed better than all other runs in terms of Recall, Average Precision, and R-Precision.

Ad hoc runs

Of the three official Ad hoc runs, uclaa1 was the automatic run that did not include query expansion and was used as the baseline run; uclaa2 was an automatic run that included query expansion without any relevance information; and uclaf1 was a run with user-supplied relevance feedback and query expansion. In terms of R-Precision and Average Precision, the run with feedback and query expansion (uclaf1) did better than the automatic run with query expansion (uclaa2), but the baseline was slightly better than either.

Ad hoc additional runs

The results of the Ad hoc additional runs are given in Table 3. The official run with feedback (uclaf1), which used wpq for the expansion, is compared to the runs which used the r-lohi, r-hilo, emim and porter algorithms respectively for the expansion. The results indicate that r-lohi and r-hilo performed better than the other algorithms. These results further corroborate the results obtained from the routing runs. In order to further validate the results, the sign test as well as the t-test were performed on the data.
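Both tests operate on paired per-topic scores. The sketch below shows our reading of the procedure, not the scripts actually used: the sign test reduces to a two-sided binomial test on the win counts, and scipy.stats.ttest_rel is one standard implementation of the paired t-test.

```python
from math import comb
from scipy.stats import ttest_rel

def sign_test(a, b):
    """Two-sided sign test on paired per-topic scores; ties are dropped.

    Returns (wins_a, wins_b, p): the win counts, as in Tables 4-15,
    and the probability of a split at least this uneven under the
    null hypothesis that each run wins a topic with probability 0.5.
    """
    wins_a = sum(x > y for x, y in zip(a, b))
    wins_b = sum(x < y for x, y in zip(a, b))
    m, k = wins_a + wins_b, min(wins_a, wins_b)
    p = min(1.0, 2 * sum(comb(m, i) for i in range(k + 1)) / 2 ** m)
    return wins_a, wins_b, p

# Paired t-test over the same per-topic scores (as in Tables 16-21):
# t_stat, p_value = ttest_rel(a, b)
```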
The results from the sign test are given in Tables 4-15. The tables are arranged in sequence: Precision at 15, 30 and 100 documents, Average Precision, Recall, and Recall/Precision. In each case two tables are given; the first gives the differences and the second the probabilities. As expected, there are no differences at Precision at 5 documents or at Precision at 10 documents, because these were the same for all five runs; the corresponding pairs of tables have therefore not been included in the paper. The results also show no significant differences at Precision at 15 documents and at 30 documents. Significant results appear at Precision at 100 documents, where r-lohi ≈ r-hilo > emim ≈ wpq ≈ porter. The sign test results on Average Precision demonstrate that r-lohi ≈ r-hilo > wpq ≈ emim ≈ porter, where emim > porter. The results on Recall show some grouping between the algorithms, so that r-lohi ≈ r-hilo > emim ≈ wpq > porter. The results on Recall/Precision indicate that r-lohi ≈ r-hilo ≈ emim > wpq ≈ porter, with r-lohi better than emim but not significantly so, and with wpq slightly better than porter.

From the study of the sign test results certain overall comments emerge about the performance of the five algorithms. The results seem consistent throughout, with r-lohi performing better than the other algorithms. Differences between emim, wpq and porter are not consistent, but it seems that emim is slightly better than wpq, which is better than porter.

To further strengthen the validity of the results, the t-test was performed on the data. The t-test results are given in Tables 16-21. The tables are arranged in sequence: Precision at 15, 30 and 100 documents, Average Precision, Recall, and Recall/Precision. Each table gives the mean difference, the standard deviation of the difference, the t-statistic and the probability. As with the sign test, there were no differences for Precision at 5 documents or Precision at 10 documents, and the corresponding tables have therefore not been included in the paper. Similarly, there are no significant differences at Precision at 15 documents or at Precision at 30 documents. The results at Precision at 100 documents show that r-lohi ≈ r-hilo > emim ≈ wpq ≈ porter; this result is the same as for the sign test. The results for Average Precision demonstrate that r-lohi ≈ r-hilo > emim ≈ wpq ≈ porter, with emim better than porter. For Recall the results are that r-lohi ≈ r-hilo > emim ≈ wpq > porter. Finally, the Recall/Precision results demonstrate that r-lohi ≈ r-hilo ≈ emim > wpq > porter, where r-hilo is better than emim. The results of the t-tests are consistent across the algorithms and corroborate the results obtained from the sign tests. The two tests indicate that r-lohi and r-hilo performed consistently better than the other algorithms.

5 Conclusions

* The results obtained from the use of the standard and enhanced versions of the GSL indicate that further research is needed in order to determine the effectiveness of the GSL synonym list in retrieval.

* The combination of adding 10 terms from the 5 or 10 top-ranked documents contributed to better retrieval performance. The other term/document combinations, i.e. adding 20, 30, or 40 terms from 15 or 20 documents, etc., had a negative effect on retrieval performance.
* The results from the routing searches indicate that query expansion (i.e., feedback searches without relevance information, where X terms are extracted from Y top-ranked documents that are treated as relevant to the query) improved retrieval performance, depending on the algorithm used.

* The r-lohi algorithm (Efthimiadis, 1993a) improved retrieval performance in the routing runs when compared to the initial (baseline) search, which involved neither a feedback search nor query expansion.

* In the Ad hoc searches, the results of the evaluation of the five ranking algorithms indicate that r-lohi performed better than the other algorithms. These results were further validated by the results obtained from the sign test and the t-test.

* Although query expansion seems to work, the retrieval performance achieved was less than expected. There are several reasons that may account for these results, briefly addressed below.

1. Completeness of the TREC queries: The major factor to which these results are attributed is that the queries, i.e. the TREC Topic Descriptions, are almost complete, i.e. they contain all the important words required for the search. Query expansion is the process of supplementing the original query terms and is particularly effective when only incomplete queries are available. Query expansion on these rather complete queries seems to have contributed a small or even detrimental effect on overall retrieval performance.

2. Size of the TREC collection: The large size of the TREC collection raises the issue of scalability and effectiveness of retrieval algorithms. The TREC collection is very different from the standard IR test collections, such as ADI, Cranfield, CACM, NPL. TREC is 1-4 gigabytes of text, whereas the other collections are small, i.e. only a few (1-50) megabytes. The behavior and effectiveness of algorithms in information retrieval have been studied in small collections, and TREC provides the challenge of scalability.

3. Nature of documents: The documents in the WSJ database are mostly long documents: full text as opposed to short bibliographic records; less structured when compared to bibliographic records; and with language and presentation that are less structured (journalistic style compared to scientific style).

4. Length of documents: The records are long and often contain short multi-story, usually unrelated, items. When such documents contain relevant information for a topic, i.e. when one of the stories is relevant but all the others are not, they increase noise and interfere with the selection of terms for query expansion. This is because all the terms of such a document will be included in the pool of terms for query expansion, and a number of terms from the other stories in that document may be ranked higher than the terms from the relevant story. This reinforces the need to be able to retrieve at the paragraph level rather than at the document level.

6 Future Research

* evaluate in detail the effect of the GSL synonym list on retrieval performance;
* evaluate the differing effects of a local versus a global thesaurus for query expansion;
* evaluate the effect of variable bias in query expansion term weighting;
* investigate the retrieval overlap between different approaches; and
* explore data fusion techniques for output integration.

Table 1: Runs with and without query expansion for topics 51-100. (Runs are presented in descending Recall values.)
Run name                                | Avg Prec | Prec[5] | Prec[10] | Prec[15] | Prec[30] | Prec[100] | R-Prec | Recall
bm15.phb.qey:r-lohi-10-5.uclagsln       | 0.2960 | 0.5640 | 0.5160 | 0.4893 | 0.4400 | 0.3288 | 0.3465 | 0.7322
bm15.phb.qey:r-lohi-10-10.uclagsln      | 0.2960 | 0.5640 | 0.5160 | 0.4907 | 0.4400 | 0.3288 | 0.3472 | 0.7320
bm15.phb.qey:r-lohi-10-15.uclagsln      | 0.2961 | 0.5680 | 0.5160 | 0.4907 | 0.4400 | 0.3288 | 0.3472 | 0.7320
bm15.phb.qey:r-lohi-10-10.uclagsly      | 0.2960 | 0.5640 | 0.5160 | 0.4907 | 0.4400 | 0.3288 | 0.3472 | 0.7320
bm15.phb.qey:r-hilo-10-10.uclagsln      | 0.2860 | 0.5480 | 0.4900 | 0.4733 | 0.4167 | 0.3148 | 0.3327 | 0.7283
bm15.phb.qey:r-hilo-10-10.uclagsly      | 0.2860 | 0.5480 | 0.4900 | 0.4733 | 0.4167 | 0.3148 | 0.3327 | 0.7283
bm15.phn.qen.uclagsly                   | 0.2849 | 0.5560 | 0.5040 | 0.4720 | 0.4200 | 0.3164 | 0.3328 | 0.7264
bm15.phn.qen.uclagsln                   | 0.2849 | 0.5560 | 0.5040 | 0.4720 | 0.4200 | 0.3164 | 0.3328 | 0.7264
bm15.phb.qen.uclagsly                   | 0.2846 | 0.5560 | 0.5040 | 0.4720 | 0.4200 | 0.3166 | 0.3325 | 0.7262
bm15.phb.qey:emim-10-10.uclagsly        | 0.2746 | 0.5240 | 0.4960 | 0.4573 | 0.4027 | 0.2962 | 0.3106 | 0.7113
bm15.phb.qey:emim-10-10.uclagsln        | 0.2746 | 0.5240 | 0.4960 | 0.4573 | 0.4027 | 0.2962 | 0.3106 | 0.7113
bm15.phb.qey:r-lohi-20-5.uclagsln       | 0.2968 | 0.5760 | 0.5240 | 0.5027 | 0.4527 | 0.3362 | 0.3508 | 0.7060
bm15.phb.qey:r-lohi-20-15.uclagsln      | 0.2969 | 0.5800 | 0.5280 | 0.5067 | 0.4520 | 0.3362 | 0.3512 | 0.7060
bm15.phb.qey:r-lohi-20-10.uclagsln      | 0.2967 | 0.5800 | 0.5260 | 0.5067 | 0.4520 | 0.3364 | 0.3512 | 0.7060
bm15.phy.qen.uclagsly                   | 0.2406 | 0.4600 | 0.4680 | 0.4360 | 0.3800 | 0.2878 | 0.2967 | 0.6777
bm15.phb.qey:wpq-10-10.uclagsln         | 0.2590 | 0.5360 | 0.4980 | 0.4440 | 0.3900 | 0.2884 | 0.3017 | 0.6768
bm15.phb.qey:wpq-10-10.uclagsly         | 0.2590 | 0.5360 | 0.4980 | 0.4440 | 0.3900 | 0.2884 | 0.3017 | 0.6768
bm15.phb.qey:wpq-10-15.uclagsln         | 0.2570 | 0.5160 | 0.4780 | 0.4520 | 0.3953 | 0.2882 | 0.2963 | 0.6750
bm15.phb.qey:r-lohi-30-15.uclagsln      | 0.2896 | 0.5760 | 0.5320 | 0.5173 | 0.4573 | 0.3388 | 0.3456 | 0.6691
bm15.phb.qey:r-lohi-30-10.uclagsln      | 0.2890 | 0.5800 | 0.5420 | 0.5240 | 0.4520 | 0.3370 | 0.3445 | 0.6683
bm15.phb.qey:r-lohi-30-5.uclagsln       | 0.2894 | 0.5840 | 0.5380 | 0.5133 | 0.4520 | 0.3370 | 0.3450 | 0.6678
bm15.phb.qey:emim-20-10.uclagsly        | 0.2458 | 0.5280 | 0.4880 | 0.4547 | 0.3873 | 0.2696 | 0.2875 | 0.6599
bm15.phb.qey:wpq-20-10.uclagsln         | 0.2443 | 0.5120 | 0.4660 | 0.4413 | 0.3847 | 0.2706 | 0.2838 | 0.6437
bm15.phb.qey:wpq-10-5.uclagsln          | 0.2438 | 0.5440 | 0.4900 | 0.4533 | 0.3800 | 0.2740 | 0.2849 | 0.6423
bm15.phb.qey:wpq-10-5.uclagsly          | 0.2438 | 0.5440 | 0.4900 | 0.4533 | 0.3800 | 0.2740 | 0.2849 | 0.6423
bm15.phb.qey:port-10-10.uclagsly        | 0.2509 | 0.5320 | 0.4980 | 0.4560 | 0.4047 | 0.2862 | 0.2989 | 0.6362
bm15.phb.qey:port-10-10.uclagsln        | 0.2509 | 0.5320 | 0.4980 | 0.4560 | 0.4047 | 0.2862 | 0.2989 | 0.6362
bm15.phb.qey:emim-30-10.uclagsly        | 0.2377 | 0.5280 | 0.4860 | 0.4480 | 0.3780 | 0.2666 | 0.2809 | 0.6331
bm15.phb.qey:port-20-10.uclagsly        | 0.2558 | 0.5320 | 0.5080 | 0.4760 | 0.4113 | 0.2856 | 0.3030 | 0.6323
bm15.phb.qey:wpq-20-15.uclagsln         | 0.2342 | 0.5240 | 0.4940 | 0.4333 | 0.3827 | 0.2690 | 0.2701 | 0.6288
bm15.phb.qey:port-30-10.uclagsly        | 0.2494 | 0.5000 | 0.5080 | 0.4733 | 0.4167 | 0.2926 | 0.3088 | 0.6270
bm15.phb.qey:wpq-30-10.uclagsln         | 0.2318 | 0.5280 | 0.4920 | 0.4387 | 0.3753 | 0.2620 | 0.2744 | 0.6121
bm15.phb.qey:wpq-30-15.uclagsln         | 0.2271 | 0.5040 | 0.4860 | 0.4467 | 0.3700 | 0.2646 | 0.2668 | 0.6079
bm15.phb.qey:wpq-20-5.uclagsln          | 0.2266 | 0.5360 | 0.4820 | 0.4467 | 0.3707 | 0.2570 | 0.2699 | 0.6016
bm15.phb.qey:wpq-30-5.uclagsln          | 0.2173 | 0.5280 | 0.4680 | 0.4347 | 0.3660 | 0.2480 | 0.2651 | 0.5790

Table 2: Ad hoc results for topics 101-150: runs with and without query expansion. (Runs are presented in descending Recall values.)
Run name                                      | Avg Prec | Prec[5] | Prec[10] | Prec[15] | Prec[30] | Prec[100] | R-Prec | Recall
uclaa1: auto.bm15.phb.qen.uclagsly            | 0.3345 | 0.5840 | 0.5380 | 0.4973 | 0.4333 | 0.3098 | 0.3629 | 0.8155
uclaa2: auto.bm15.phb.qey:wpq-10-10.uclagsly  | 0.2957 | 0.5440 | 0.5180 | 0.4920 | 0.4207 | 0.2760 | 0.3289 | 0.7786
uclaf1: fdbk.bm15.phb.qey:wpq-10-10.uclagsly  | 0.3090 | 0.5880 | 0.5220 | 0.4893 | 0.4360 | 0.2884 | 0.3459 | 0.7745

Table 3: Performance averages over all topics of the Ad hoc runs with query expansion. (Runs are named after the algorithm used in the expansion.)

Run name     | Avg Prec | Prec[5] | Prec[10] | Prec[15] | Prec[30] | Prec[100] | R-Prec | Recall
r-lohi       | 0.3414 | 0.5880 | 0.5240 | 0.4947 | 0.4427 | 0.3152 | 0.3688 | 0.8290
r-hilo       | 0.3388 | 0.5880 | 0.5240 | 0.4960 | 0.4347 | 0.3160 | 0.3692 | 0.8333
emim         | 0.3176 | 0.5880 | 0.5240 | 0.4920 | 0.4433 | 0.2938 | 0.3554 | 0.7989
uclaf1: wpq  | 0.3087 | 0.5880 | 0.5240 | 0.4893 | 0.4360 | 0.2882 | 0.3460 | 0.7753
porter       | 0.2990 | 0.5880 | 0.5240 | 0.4893 | 0.4280 | 0.2798 | 0.3323 | 0.7457

In the sign test difference tables (Tables 4, 6, 8, 10, 12 and 14), the entry in row i, column j gives the number of topics on which algorithm i scored higher than algorithm j; the remaining topics are ties. Cells marked '?' are unreadable in the original.

Table 4: Sign Test Differences, Precision at 15 Documents

          wpq  porter  emim  r-lohi  r-hilo
wpq        0     3      1      3       2
porter     ?     0      4      3       1
emim       2     4      0      3       4
r-lohi     5     6      4      0       ?
r-hilo     5     5      ?      ?       0

Table 5: Sign Test Probabilities, Precision at 15 Documents

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    1.0000  1.0000
emim      1.0000  1.0000  1.0000
r-lohi    0.7266  0.5078  1.0000  1.0000
r-hilo    0.4531  0.2188  1.0000  1.0000  1.0000

Table 6: Sign Test Differences, Precision at 30 Documents

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    18     12     13      15
porter    12     0     15     11      12
emim      14    18      0     16      19
r-lohi    18    18     18      0      15
r-hilo    13    18     16      7       0

Table 7: Sign Test Probabilities, Precision at 30 Documents

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.3613  1.0000
emim      0.8445  0.7277  1.0000
r-lohi    0.4725  0.2652  0.8638  1.0000
r-hilo    0.8501  0.3613  0.7353  0.1338  1.0000

Table 8: Sign Test Differences, Precision at 100 Documents

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    22      8      8       8
porter    15     0     15      8       8
emim      19    24      0     12       9
r-lohi    32    31     27      0      15
r-hilo    32    33     27     14       0

Table 9: Sign Test Probabilities, Precision at 100 Documents

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.3239  1.0000
emim      0.0543  0.2002  1.0000
r-lohi    0.0003  0.0004  0.0250  1.0000
r-hilo    0.0003  0.0002  0.0046  1.0000  1.0000

Table 10: Sign Test Differences, Average Precision

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    29     15     10      11
porter    20     0     17      9      10
emim      25    32      0     13      16
r-lohi     ?    40     36      0      31
r-hilo    38    39     33     17       0
Table 11: Sign Test Probabilities, Average Precision

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.2531  1.0000
emim      0.1547  0.0455  1.0000
r-lohi    0.0001  0.0000  0.0017  1.0000
r-hilo    0.0002  0.0001  0.0223  0.0606  1.0000

Table 12: Sign Test Differences, Recall

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    24      8     10       7
porter     8     0     10      4       3
emim      15    25      0      9       6
r-lohi    27    28     25      0      14
r-hilo    29    32     26     14       0

Table 13: Sign Test Probabilities, Recall

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.0080  1.0000
emim      0.2100  0.0180  1.0000
r-lohi    0.0085  0.0000  0.0101  1.0000
r-hilo    0.0005  0.0000  0.0008  1.0000  1.0000

Table 14: Sign Test Differences, Recall/Precision

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    19      5     10       8
porter    11     0      7      6       7
emim      17    23      0     11      11
r-lohi    23    26     19      0      13
r-hilo    27    26     20     12       0

Table 15: Sign Test Probabilities, Recall/Precision

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.2012  1.0000
emim      0.0169  0.0062  1.0000
r-lohi    0.0367  0.0008  0.2012  1.0000
r-hilo    0.0023  0.0017  0.1508  1.0000  1.0000

Table 16: t-test, Precision at 15 Documents

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  0.0013 | 0.0285 |  0.3305 | 0.7424
r-hilo/emim    |  0.0040 | 0.0367 |  0.7712 | 0.4443
r-lohi/emim    |  0.0027 | 0.0300 |  0.6282 | 0.5328
r-hilo/porter  |  0.0067 | 0.0278 |  1.6973 | 0.0960
r-lohi/porter  |  0.0053 | 0.0325 |  1.1579 | 0.2525
emim/porter    |  0.0027 | 0.0380 |  0.4957 | 0.6224
r-hilo/wpq     |  0.0067 | 0.0337 |  1.3999 | 0.1678
r-lohi/wpq     |  0.0053 | 0.0352 |  1.0703 | 0.2897
emim/wpq       |  0.0027 | 0.0232 |  0.8137 | 0.4197
porter/wpq     |  0.0000 | 0.0404 |  0.0007 | 0.9994

Table 17: t-test, Precision at 30 Documents

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  | -0.0080 | 0.0298 | -1.8984 | 0.0635
r-hilo/emim    | -0.0087 | 0.0583 | -1.0521 | 0.2979
r-lohi/emim    | -0.0007 | 0.0585 | -0.0810 | 0.9358
r-hilo/porter  |  0.0067 | 0.0539 |  0.8748 | 0.3860
r-lohi/porter  |  0.0147 | 0.0518 |  2.0019 | 0.0508
emim/porter    |  0.0153 | 0.0694 |  1.5624 | 0.1246
r-hilo/wpq     | -0.0013 | 0.0534 | -0.1770 | 0.8602
r-lohi/wpq     |  0.0067 | 0.0530 |  0.8882 | 0.3788
emim/wpq       |  0.0073 | 0.0458 |  1.1312 | 0.2635
porter/wpq     | -0.0080 | 0.0450 |  1.2590 | 0.2140

Table 18: t-test, Precision at 100 Documents

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  ?      | ?      |  ?      | ?
r-hilo/emim    |  ?      | ?      |  ?      | ?
r-lohi/emim    |  ?      | ?      |  ?      | ?
r-hilo/porter  |  ?      | ?      |  ?      | ?
r-lohi/porter  |  ?      | ?      |  ?      | ?
emim/porter    |  0.0140 | 0.0535 |  1.8494 | 0.0704
r-hilo/wpq     |  0.0278 | 0.0557 |  3.5311 | 0.0009
r-lohi/wpq     |  0.0270 | 0.0600 |  3.1815 | 0.0025
emim/wpq       |  0.0056 | 0.0301 |  1.3150 | 0.1946
porter/wpq     | -0.0084 | 0.0410 |  1.4496 | 0.1536

Table 19: t-test, Average Precision

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  | -0.0026 | 0.0153 | -1.1935 | 0.2384
r-hilo/emim    |  0.0212 | 0.0421 |  3.5637 | 0.0008
r-lohi/emim    |  0.0238 | 0.0470 |  3.5786 | 0.0008
r-hilo/porter  |  0.0398 | 0.0615 |  4.5727 | 0.0000
r-lohi/porter  |  0.0424 | 0.0658 |  4.5547 | 0.0000
emim/porter    |  0.0186 | 0.0592 |  2.2172 | 0.0313
r-hilo/wpq     |  0.0301 | 0.0537 |  3.9615 | 0.0002
r-lohi/wpq     |  0.0327 | 0.0603 |  3.8291 | 0.0004
emim/wpq       |  0.0089 | 0.0335 |  1.8726 | 0.0671
porter/wpq     | -0.0097 | 0.0358 | -1.9107 | 0.0619

Table 20: t-test, Recall

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  0.0005 | 0.0312 |  0.1052 | 0.9166
r-hilo/emim    |  0.0363 | 0.0743 |  3.4503 | 0.0012
r-lohi/emim    |  0.0358 | 0.0819 |  3.0935 | 0.0033
r-hilo/porter  |  0.0731 | 0.1094 |  4.7267 | 0.0000
r-lohi/porter  |  0.0727 | 0.1108 |  4.6356 | 0.0000
emim/porter    |  0.0369 | 0.0965 |  2.7010 | 0.0095
r-hilo/wpq     |  0.0506 | 0.0835 |  4.2805 | 0.0001
r-lohi/wpq     |  0.0501 | 0.0908 |  3.9007 | 0.0003
emim/wpq       |  0.0143 | 0.0529 |  1.9106 | 0.0619
porter/wpq     | -0.0226 | 0.0739 | -2.1583 | 0.0358

Table 21: t-test, Recall/Precision

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  0.0003 | 0.0246 |  0.0978 | 0.9225
r-hilo/emim    |  0.0138 | 0.0450 |  2.1635 | 0.0354
r-lohi/emim    |  0.0134 | 0.0517 |  1.8396 | 0.0719
r-hilo/porter  |  0.0369 | 0.0636 |  4.1048 | 0.0002
r-lohi/porter  |  0.0366 | 0.0669 |  3.8626 | 0.0003
emim/porter    |  0.0231 | 0.0576 |  2.8398 | 0.0066
r-hilo/wpq     |  0.0231 | 0.0532 |  3.0729 | 0.0035
r-lohi/wpq     |  0.0228 | 0.0635 |  2.5401 | 0.0143
emim/wpq       |  0.0094 | 0.0313 |  2.1135 | 0.0397
porter/wpq     | -0.0138 | 0.0441 | -2.2100 | 0.0318

Acknowledgements

Thanks to Stephen Robertson and Stephen Walker for making Okapi available to UCLA. We are especially thankful to Stephen Walker for all his continuing help, over and above that indicated in the text. We wish to thank Donna Harman of NIST for her support during TREC-2. This research was supported by a UCLA Academic Senate grant that also provided financial support for the graduate research assistantship of PVB. Finally, ENE is grateful to SUN Microsystems Inc. for the equipment grant of the SPARC-2 that made this research possible.

References

Efthimiadis, E.N. (1993a) A user-centered evaluation of ranking algorithms for interactive query expansion. In: Korfhage, R., Rasmussen, E. & Willett, P. (eds.) Proceedings of the 16th International Conference of the Association for Computing Machinery, Special Interest Group on Information Retrieval (ACM SIGIR), June 1993. Pittsburgh, PA: ACM Press. pp. 146-159.

Efthimiadis, E.N. (in press) User choices: As the yardstick for the evaluation of ranking algorithms for interactive query expansion. Information Processing & Management. In press.

Efthimiadis, E.N. Interactive query expansion: A user-based evaluation in a relevance feedback environment. Manuscript submitted for publication.

Harper, D.J. & van Rijsbergen, C.J. (1978) An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34(3), 189-216.

Porter, M.F. & Galpin, V. (1988) Relevance feedback in a public access catalogue for a research library: Muscat at the Scott Polar Research Institute. Program, 22(1), 1-20.

Robertson, S.E. (1990) On term selection for query expansion. Journal of Documentation, 46(4), 359-364.

Robertson, S.E. & Sparck Jones, K. (1976) Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.

Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gull, A. & Lau, M. (1993a) Okapi at TREC. In: Harman, D.K. (ed.) The First Text REtrieval Conference (TREC-1). Proceedings. NIST Special Publication 500-207. Gaithersburg, MD: NIST. pp. 21-30.

Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M. & Gatford, M. (1993b) Okapi at TREC-2. In: Harman, D.K. (ed.) The Second Text REtrieval Conference (TREC-2). Gaithersburg, MD: NIST, August 30-September 1, 1993. Proceedings, to appear.

van Rijsbergen, C.J. (1977) A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, 33, 106-119.

van Rijsbergen, C.J., Harper, D.J. & Porter, M.F. (1981) The selection of good search terms. Information Processing and Management, 17(2), 77-91.

A Runs

The runs presented here are grouped by the algorithm used for the expansion. The list starts with the runs that did not include query expansion.
bm15.phn.qen.uclagsln
bm15.phn.qen.uclagsly
bm15.phy.qen.uclagsly
bm15.phb.qen.uclagsly
bm15.phb.qey:wpq-10-5.uclagsln
bm15.phb.qey:wpq-10-5.uclagsly
bm15.phb.qey:wpq-10-10.uclagsln
bm15.phb.qey:wpq-10-10.uclagsly
bm15.phb.qey:wpq-10-15.uclagsln
bm15.phb.qey:wpq-20-5.uclagsln
bm15.phb.qey:wpq-20-10.uclagsln
bm15.phb.qey:wpq-20-15.uclagsln
bm15.phb.qey:wpq-30-5.uclagsln
bm15.phb.qey:wpq-30-10.uclagsln
bm15.phb.qey:wpq-30-15.uclagsln
bm15.phb.qey:emim-10-10.uclagsln
bm15.phb.qey:emim-10-10.uclagsly
bm15.phb.qey:emim-20-10.uclagsly
bm15.phb.qey:emim-30-10.uclagsly
bm15.phb.qey:port-10-10.uclagsln
bm15.phb.qey:port-10-10.uclagsly
bm15.phb.qey:port-20-10.uclagsly
bm15.phb.qey:port-30-10.uclagsly
bm15.phb.qey:r-hilo-10-10.uclagsln
bm15.phb.qey:r-hilo-10-10.uclagsly
bm15.phb.qey:r-lohi-10-5.uclagsln
bm15.phb.qey:r-lohi-10-10.uclagsln
bm15.phb.qey:r-lohi-10-10.uclagsly
bm15.phb.qey:r-lohi-10-15.uclagsln
bm15.phb.qey:r-lohi-20-5.uclagsln
bm15.phb.qey:r-lohi-20-10.uclagsln
bm15.phb.qey:r-lohi-20-15.uclagsln
bm15.phb.qey:r-lohi-30-5.uclagsln
bm15.phb.qey:r-lohi-30-10.uclagsln
bm15.phb.qey:r-lohi-30-15.uclagsln
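As an aid to reading the list above, here is a minimal sketch of a parser for the run naming convention described in section 3.1. It is our illustration only; the dictionary of decoded fields is a convention we introduce, not part of the Okapi tooling.

```python
def parse_run_name(name):
    """Decode run names such as 'bm15.phb.qey:r-lohi-10-5.uclagsly'.

    Fields (section 3.1): weighting function; phrases (phn/phy/phb =
    no/yes/both); query expansion (qen = none, qey = yes); for qey runs,
    the ranking algorithm, number of terms added, and number of feedback
    documents; uclagsln/uclagsly = standard vs UCLA-enhanced GSL.
    """
    parts = name.split(".")
    run = {"weighting": parts[0],
           "phrases": {"phn": "no", "phy": "yes", "phb": "both"}[parts[1]],
           "ucla_gsl": parts[-1] == "uclagsly"}
    qe = parts[2]
    if qe.startswith("qey:"):
        # rsplit keeps hyphenated algorithm names like 'r-lohi' intact
        algo, n_terms, n_docs = qe[4:].rsplit("-", 2)
        run.update(expansion=algo, terms=int(n_terms), docs=int(n_docs))
    else:
        run.update(expansion=None, terms=0, docs=0)
    return run

print(parse_run_name("bm15.phb.qey:r-lohi-10-5.uclagsly"))
```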