UCLA-Okapi at TREC-2: Query Expansion Experiments

Efthimis N. Efthimiadis* and Paul V. Biron
Graduate School of Library and Information Science
University of California at Los Angeles

1 Introduction

This is the first participation of the Graduate School of Library and Information Science, University of California at Los Angeles, in the TREC conference. For TREC-2, Category B, UCLA used a version of the Okapi text retrieval system that was made available to UCLA by City University, London, UK. Okapi has been described in TREC-1 (Robertson, Walker, Hancock-Beaulieu, Gull & Lau, 1993a) as well as in this conference (Robertson, Walker, Jones, Hancock-Beaulieu & Gatford, 1994). Okapi is a simple set-oriented system based on a generalized probabilistic model with facilities for relevance feedback. In addition, Okapi supports a full range of deterministic Boolean and quasi-Boolean operations.

1.1 Objectives

The main research objective of the UCLA participation in TREC-2 was to investigate query expansion within the framework provided by Okapi. More specifically, the objectives were to:

* use an enhanced version of the Go-See-List (GSL) and evaluate its effect on retrieval performance;
* investigate the performance of query expansion with and without relevance information, by varying the number of documents that are treated as relevant and the number of terms that are included in the expansion;
* compare the performance of different ranking algorithms for the ranking of terms for term selection during query expansion;
* compare the retrieval effectiveness of user-assigned relevance judgements against hypothetically assumed relevance judgements based on the top X documents.

*To whom all correspondence should be addressed: Graduate School of Library and Information Science, University of California at Los Angeles, 405 Hilgard Avenue, Los Angeles, CA 90024-1520; e-mail: iacxene@mvs.oac.ucla.edu

1.2 The Okapi version at UCLA and the WSJ database

The Okapi system consists of a low-level search engine or basic search system (BSS), a user interface for the manual search experiments, and data conversion and inversion utilities. The UCLA hardware consisted of a Sun SPARC-2 machine with 32 MB of memory and 1 GB of disk storage.

The Wall Street Journal (WSJ) database was used for both the routing and ad hoc searches. Because of the lack of adequate disk space on the UCLA machine, the database was indexed at City University by Stephen Walker and then transferred (FTP-ed) to UCLA.

For TREC-2 the Okapi databases were built by indexing mainly the DOCNO and TEXT fields of the records. Inverted indexes included complete within-document positional information, enabling term frequency and term proximity to be used. Okapi's typical index size overhead is around 80% of the textfile size. The elapsed time for inversion of the WSJ database was about 12 hours.
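To make the inversion step concrete, the sketch below builds a toy positional inverted index of the kind described above: for each term it records, per document, the list of token positions, from which both term frequency and term proximity can be derived. This is our illustration only, not Okapi's code; the record format and the tokenizer (lowercase alphanumeric words, no stemming or GSL processing) are simplifying assumptions.

```python
from collections import defaultdict
import re

def build_positional_index(docs):
    """Map each term to {docno: [token positions]}.

    Within-document positions give term frequency (the length of a
    position list) and support proximity matching between terms.
    """
    index = defaultdict(lambda: defaultdict(list))
    for docno, text in docs.items():
        for pos, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term][docno].append(pos)
    return index

docs = {"WSJ870101-0001": "Oil prices fell as oil analysts cut forecasts."}
idx = build_positional_index(docs)
tf = len(idx["oil"]["WSJ870101-0001"])  # within-document term frequency: 2
```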
At this point it is worth noting (a) the nature of the WSJ records and (b) a limitation of Okapi due to its indexing.

(a) The WSJ records consist of documents that do not have the same kind of structure found in bibliographic databases, such as INSPEC or ERIC. The records contain the full text of stories and are of varied length, mostly longer than the average abstract in a bibliographic database. In addition, the language and style are mostly 'journalistic' as opposed to 'scientific', i.e. less structured. One important issue is that some WSJ records contain short multi-story articles which are completely unrelated to one another. This type of record is usually a compilation of a number of one- or two-paragraph news stories. The stories share no content relation; their only common feature is their co-existence in the same record. This has implications for retrieval effectiveness, especially when such records are included in the pool of documents that provide terms for query expansion, because of the noise introduced by the terms taken from the irrelevant stories.

(b) This last issue relates to a limitation of Okapi. The version of Okapi used at UCLA retrieves documents at the record level only. Retrieval at the paragraph level, which would have facilitated a better handling of issues like the above, is not currently available.

2 The weighting functions

The weighting of search terms can be said to involve two levels:

level 1: a weighting function is used to weight the terms of the initial query, as well as the terms of subsequent search iterations of the same query or some modified version of it.

level 2: a weighting function is used for the weighting of candidate terms for query expansion.

Sections 2.1 and 2.2 discuss the functions used at level 1 and level 2 respectively.

2.1 Search term weighting

The theory of relevance weights (Robertson & Sparck Jones, 1976) provides the basic probabilistic model. The binary independence or relevance weight model assigns a weight to each term, and the matching function for each document is given by the 'simple sum-of-weights' over all of the terms in the query. The weight of a term is calculated by the following function, also known as the F4 point-5 formula:

    w_{F4} = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}    (1)

where N is the total number of documents in the collection; R is the sample of relevant documents as defined by the user's feedback; n is the number of documents indexed by term t; and r is the number of relevant documents (from the sample R) indexed by term t. When relevance information is not available, the above weight reduces to approximately the inverse document frequency (IDF).

For calculating the total weight of a document, a function was used which is based on the binary independence model and takes into consideration the 2-Poisson model for within-document term frequency (tf) and the document length. These are described in detail in Robertson et al. (1993b). The purpose of the UCLA Okapi system was to evaluate the existing Okapi models, and it therefore did not allow modifications of the existing functions. For compatibility and comparison purposes it was decided to use the BM15 (best match) function for the runs. The BM15 best-match weighting function is:

    docweight_{BM15} = \sum_{t \in Q} \frac{tf_t}{k_1 + tf_t} \, w_{F4}(t) + k_2 \cdot nq \cdot \frac{avedl - dl}{avedl + dl}    (2)

where tf_t is the within-document frequency of term t; nq is the number of terms in the query Q; dl and avedl are the document length and the average document length; and k_1 and k_2 are unknown constants. In the UCLA-Okapi implementation the values for these constants are k_1 = 1 and k_2 = 1.
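As an illustration of equations (1) and (2), the following is a minimal sketch of the two functions as we read them, not a transcription of Okapi's BSS code; the dictionary-based inputs are our own convention.

```python
import math

def w_f4(r, n, R, N):
    """F4 'point-5' relevance weight (eq. 1). With no relevance
    information (r = R = 0) it approximates IDF."""
    return math.log((r + 0.5) * (N - n - R + r + 0.5) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def bm15_score(doc_tf, weights, dl, avedl, k1=1.0, k2=1.0):
    """BM15 document weight (eq. 2).

    doc_tf:  {query term: within-document frequency}
    weights: {query term: its w_F4 weight}
    The second addend is the document-length correction factor,
    which rewards shorter-than-average documents.
    """
    score = sum(weights[t] * tf / (k1 + tf) for t, tf in doc_tf.items())
    nq = len(weights)  # number of query terms
    return score + k2 * nq * (avedl - dl) / (avedl + dl)
```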
2.2 Query expansion term weighting

The ranking algorithms that were considered for the ranking of terms for query expansion were wpq, emim, porter, r-lohi and r-hilo. These algorithms are described briefly below; a combined sketch of all five follows section 2.2.5.

2.2.1 The wpq algorithm

This algorithm is based on an independence assumption that holds between a query expansion term and the terms in the entire previous search formulation (Robertson, 1990). According to the relevance weighting theory, the inclusion of term t in the search formulation with weight w_t will increase the effectiveness of retrieval by

    wpq_t = w_t (p_t - q_t)    (3)

where w_t is a weighting function, in this case w_{F4}; p_t is the probability of term t occurring in a relevant document; and q_t is the probability of term t occurring in a non-relevant document. This means that, irrespective of the weighting function w_t used, the rule for deciding on the inclusion of a term in a query expansion search should be based on the ranking of wpq rather than of w_t alone. Substituting the weighting function and the two probabilities in wpq with expressions in r, R, n, N we get:

    wpq_t = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)} \cdot \left( \frac{r}{R} - \frac{n - r}{N - R} \right)

The wpq algorithm thus combines the effect of the relevance weighting theory, as expressed by the w_{F4} component, which assigns greater importance to infrequent terms, with the frequency of occurrence of a term in the relevant document set.

2.2.2 The emim algorithm

The expected mutual information measure (emim) is a term weighting model incorporating relevance information in which it is assumed that index terms may not be distributed independently of each other (van Rijsbergen, 1977; Harper & van Rijsbergen, 1978; van Rijsbergen, Harper & Porter, 1981). The emim weight reduces to the F4 weight when the "degree of involvement", i.e. the joint probabilities, are all unity. Assuming the same definitions for n, N, r, R as above, the emim weight of a term is calculated as follows:

    emim_t = p_{11} i_{11} - p_{12} i_{12} - p_{21} i_{21} + p_{22} i_{22}
           = \frac{r}{N} \log \frac{rN}{nR}
             - \frac{n-r}{N} \log \frac{(n-r)N}{n(N-R)}
             - \frac{R-r}{N} \log \frac{(R-r)N}{(N-n)R}
             + \frac{N-n-R+r}{N} \log \frac{(N-n-R+r)N}{(N-n)(N-R)}    (4)

2.2.3 The porter algorithm

Porter and Galpin (1988) describe a ranking formula used in the MUSCAT online catalogue:

    porter_t = \frac{r}{R} - \frac{n}{N}    (5)

where r, R, n, N are defined as in the F4 weight (eq. 1).

2.2.4 The r-lohi algorithm

The r-lohi algorithm has been proposed by Efthimiadis (1993a) as the result of observing the ranking behavior of six algorithms used for ranking terms for query expansion. The r-lohi ranking algorithm:

* ranks terms according to r, i.e. their frequency of occurrence in the relevant document set, in descending order, and
* resolves ties according to n, the number of documents indexed by the term, from low to high frequency.

It was hypothesized that the r-lohi algorithm would have an almost identical ranking to porter and a performance approaching that of wpq and emim. More differences between the algorithms may occur as the size of the set of relevant documents (R) gets larger. Conclusions about the algorithm, however, could not be drawn before it was evaluated against the other algorithms. The results of that evaluation are reported in Efthimiadis (in press), where the r-lohi algorithm demonstrated better performance when compared to the other algorithms.

2.2.5 The r-hilo sort

A variant of the r-lohi algorithm is to rank candidate terms for query expansion using the r-hilo rank, which:

* ranks terms according to r, i.e. their frequency of occurrence in the relevant document set, and
* resolves ties according to n, from high to low frequency.

Since the r-hilo algorithm breaks ties in exactly the opposite way of the r-lohi algorithm, it was included as a control for the study.
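For concreteness, here is the combined sketch of the five rankings under the definitions of r, n, R, N given above. It is our paraphrase for illustration (the UCLA-Okapi runs used the system's own implementations); the tuple convention for candidate terms is an assumption.

```python
import math

def f4(r, n, R, N):  # eq. (1)
    return math.log((r + 0.5) * (N - n - R + r + 0.5) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def wpq(r, n, R, N):  # eq. (3), with p and q estimated from r, R, n, N
    return f4(r, n, R, N) * (r / R - (n - r) / (N - R))

def emim(r, n, R, N):  # eq. (4); empty contingency cells contribute nothing
    cells = [(+1, r, n * R), (-1, n - r, n * (N - R)),
             (-1, R - r, (N - n) * R), (+1, N - n - R + r, (N - n) * (N - R))]
    return sum(s * (c / N) * math.log(c * N / d) for s, c, d in cells if c > 0)

def porter(r, n, R, N):  # eq. (5)
    return r / R - n / N

# r-lohi and r-hilo are sort rules rather than weights: rank candidate
# terms (term, r, n) by r descending, breaking ties on n ascending
# (r-lohi) or n descending (r-hilo).
def rank_r_lohi(terms):
    return sorted(terms, key=lambda t: (-t[1], t[2]))

def rank_r_hilo(terms):
    return sorted(terms, key=lambda t: (-t[1], -t[2]))
```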
3 Methodology

3.1 Runs

Initial tests were performed on Topics 1-50, where the dependent variables were the weighting function and the query processing of terms. From the results obtained it was established that the function to use would be BM15 and that the parsing of the Topics would include both single terms and "phrases", as defined by comma-delimited text in the Topics. Table 3.1 gives all the variables used in constructing the runs, together with the options available for each variable.

Table 3.1: Methodology for the Routing Runs on Topics 51-100

Weighting Function | Phrases | Query Expansion | QE Algorithm | Terms Expanded | Docs for Auto Rel Fbk | UCLA GSL
bm15               | no      | no              | wpq          | 0              | 0                     | no
                   | yes     | yes             | emim         | 10             | 5                     | yes
                   | both    |                 | porter       | 20             | 10                    |
                   |         |                 | r-lohi       | 30             | 15                    |
                   |         |                 | r-hilo       |                | 20                    |

Weighting function: the best-match function BM15 (see equation 2).

Phrases: choice of YES, NO, or BOTH. This determines the type of parsing of the "Concepts" and "Title" fields of the Topics, which were the source of the search terms. NO means that the terms extracted from the Concepts and Title fields are single terms only. YES means that phrases are extracted by a simple routine, where a phrase is identified using the punctuation found in the Concepts and Title fields. BOTH is the combination of the two methods: the terms are searched as single terms as well as phrases.

Query expansion (QE): the choice of query expansion algorithm, one of wpq, emim, porter, r-lohi, r-hilo.

Terms expanded: the number of terms to include in the expansion. When the number of terms expanded is zero, only the initial query is run.

Feedback documents: the number of top-ranked documents to be treated as relevant and to provide the source of the terms for query expansion.

UCLA GSL: whether the standard Okapi GSL or the UCLA enhanced version of the GSL is used.

Because of the many parameters involved in each run, the names of the runs have been deliberately made explicit, which however resulted in rather long names. For example, bm15.phb.qey:r-lohi-10-5.uclagsly means that for this run the weighting function used was BM15, phrases were set to BOTH, query expansion took place, the r-lohi algorithm was used for the ranking of terms for query expansion, 10 terms were added in the expansion, 5 documents provided the source of the terms for the expansion, and the UCLA enhanced GSL was used. (A small parser for this naming convention is sketched at the end of Appendix A.)

3.2 Go-See-List

The Go-See-List (GSL) is a look-up table that contains stopwords, semi-stopwords, prefixes, go-phrases and synonym classes. The GSL is used during the indexing of a database as well as during searching.

Stopwords are terms that are thought to have no or little value for retrieval. These include contractions, prepositions, adverbs, etc.

Semi-stopwords are terms that are thought to have low value for retrieval purposes. Therefore, a semi-stopword is searched during the initial search only if it has been part of the user's search statement. If, however, the term has emerged as the result of a query expansion, it is stopped, i.e. excluded from the pool of candidate terms for query expansion.

Go-phrases are mostly noun phrases that need to be searched as one word, or else precision will be very low, e.g. New York. The GSL contains a small number of selected go-phrases.

Synonym entries contain a mix of terms/concepts that are treated as synonyms for retrieval purposes. These may be true synonyms, quasi-synonyms, or semantically unrelated terms which are grouped together because of some common properties that have value for retrieval.
Finally, the synonym entries also contain term variants that are known to "escape" from the conflation algorithm. The structure of the UCLA GSL is given in the table below.

The Go-See-List (GSL) | City | Added by UCLA | UCLA total
stopwords             | 411  | 72            | 483
semi-stopwords        |      | 58            | 58
prefixes              | 18   |               | 18
go-phrases            | 43   | 84            | 127
synonyms              | 359  | 604           | 963

For the UCLA GSL, the Titles and Concepts of Topics 1-100 were analyzed and synonym classes were generated from the data. The list includes 40 personal names and 250 synonym classes. In addition, a list of organizations and a list of common business acronyms and abbreviations were compiled.

3.3 Query term selection

Query terms were selected from the Title and Concepts fields of the records. The processing of these fields was very simple. Programs written in awk and perl were used to isolate the required fields, which were then parsed, and the resulting terms were stemmed in accordance with the indexing procedures followed for building the WSJ database. This process resulted in one-word query terms. When appropriate, the procedure also output phrases, by treating the punctuation available in these fields as the phrase delimiter. Queries were then generated automatically from the Title and Concepts fields. Exactly the same queries were used in the Routing and Ad hoc searches.

3.4 Term selection for query expansion

a) Routing searches: Query expansion in the routing searches was performed through query modification without relevance information. As indicated in Table 3.1, which describes the construction of the runs, the number of documents used could range from the top 0-20 documents, in increments of 5. These top-ranked documents were treated as relevant and were analyzed in order to provide terms for the expansion. Expansion terms were selected by pooling all the terms and then weighting them with one of the five ranking algorithms, as specified by the run. The top 10, 20 or 30 terms were then added to the original query terms and searched.

b) Ad hoc searches: The term pool consisted of all the terms of the documents judged as relevant. For the Ad hoc searches with feedback of the official results, the top 10 terms as determined by wpq were chosen for expansion and were searched together with the initial query terms.

c) Rules for term selection: The following rules were applied for the inclusion or exclusion of a term during selection for query expansion: (i) numbers were excluded as terms; (ii) all terms whose document frequency n did not exceed the number of relevant documents seen (R), i.e. terms with n <= R, were excluded. A sketch of the whole selection step is given below.
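Putting sections 3.4(a)-(c) together, the expansion step can be sketched as follows. This is a minimal illustration of the published rules, with simplified pooling and a hypothetical doc_freq helper standing in for an index lookup; it is not the UCLA-Okapi code.

```python
def select_expansion_terms(top_docs, rank_fn, N, k, doc_freq):
    """Pick k expansion terms from documents treated as relevant.

    top_docs: a list of R token lists; rank_fn: one of the section 2.2
    weights, called as rank_fn(r, n, R, N); doc_freq(term) -> n is an
    assumed helper returning collection document frequency.
    """
    R = len(top_docs)
    pool = {}
    for doc in top_docs:
        for term in set(doc):          # r counts documents, not tokens
            pool[term] = pool.get(term, 0) + 1
    ranked = []
    for term, r in pool.items():
        if term.isdigit():             # rule (i): numbers excluded
            continue
        n = doc_freq(term)
        if n <= R:                     # rule (ii): terms with n <= R excluded
            continue
        ranked.append((rank_fn(r, n, R, N), term))
    ranked.sort(reverse=True)
    return [term for _, term in ranked[:k]]
```

The selected terms are then added to the original query terms and the search is rerun.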
3.5 Search procedure

All searches, Routing and Ad hoc, were automatic and determined by the specifications made for each run. There were no manual searches.

3.5.1 Ad hoc searches and searchers

For the Ad hoc searches with relevance feedback, i.e. uclaf1 (official results), relevance assessments were provided by two searchers. The odd-numbered topics were assessed by one searcher and the even-numbered topics by the other.

3.5.2 Relevance assessments

During the Ad hoc searches, the guidelines for relevance judgements were:

a) Review the entire document when judging relevance, even if it seems to be peripheral or not relevant. The reason is that many of the articles were found to be collections of brief news stories, with the relevant part of the text hidden in the middle or at the end of the text.

b) Target 10 relevant documents: stop as soon as 10 are found, or at the 20th document. However, if 3 relevant documents have not been found by then, continue until 3 are found (this is because Okapi will not do an expansion with fewer than 3 relevant documents).

3.5.3 Ad hoc additional runs

Following the TREC conference, a set of runs was conducted on the Ad hoc queries in order to complete the evaluation of the five ranking algorithms for query expansion that were studied. The relevance judgements made in the Ad hoc run uclaf1 (fdbk.bm15.phb.qey:wpq-10-10.uclagsly) were extracted and used in the subsequent runs. The process followed in these additional runs is described below:

* Four new Ad hoc runs were done, one for each of the remaining algorithms used for the ranking of terms for query expansion, i.e. emim, porter, r-hilo, r-lohi.
* The same initial query, which was generated automatically, was used for all searches.
* The relevance judgements made in the initially retrieved set of the official Ad hoc run were extracted and then simulated in the additional runs.
* Query expansion terms were ranked using the algorithm designated by each run. The 10 top-ranked terms from the pool were added to the query.

3.6 Problems & Limitations

Lack of equipment was a major problem in our participation. In order to participate in TREC, SUN Microsystems provided an equipment grant (SUN SPARC-2) in March; however, no disk was initially available, and a 1-gigabyte disk was acquired in June. Consequently, only the Ad hoc runs were included in the official results. A limitation of the UCLA version of Okapi is that it does not allow modifications of the basic retrieval functions (i.e., the BMs or best-match functions).

4 Results and Discussion

The results of the Routing runs, the Ad hoc runs and the Ad hoc additional runs are given in Table 1, Table 2 and Table 3 respectively.

Routing runs

The 35 Routing runs given in Table 1 are presented in descending Recall values. The runs bm15.ph[ynb].qen.uclagsl[yn], i.e. the runs without query expansion, were used as baseline runs in order to facilitate comparisons. All other runs reported in the table include query expansion. The results indicate that the runs with query expansion where the r-lohi or the r-hilo algorithm was used performed better than all other runs in terms of Recall, Average Precision, and R-Precision.

Ad hoc runs

Of the three official Ad hoc runs, uclaa1 was the automatic run that did not include query expansion and was used as the baseline run; uclaa2 was an automatic run that included query expansion without any relevance information; and uclaf1 was a run with user-supplied relevance feedback and query expansion. In terms of R-Precision and Average Precision, the run with feedback and query expansion (uclaf1) did better than the automatic run with query expansion (uclaa2), but the baseline was slightly better than either.

Ad hoc additional runs

The results of the Ad hoc additional runs are given in Table 3. The official run with feedback (uclaf1), which used wpq for the expansion, is compared to the runs which used the r-lohi, r-hilo, emim and porter algorithms respectively for the expansion. The results indicate that r-lohi and r-hilo performed better than the other algorithms. These results further corroborate the results obtained from the routing runs. In order to further validate the results, the sign test as well as the t-test were performed on the data.
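Both tests operate on paired per-topic scores. The sketch below shows our reading of the procedure, not the scripts actually used: the sign test reduces to a two-sided binomial test on the win counts, and scipy.stats.ttest_rel is one standard implementation of the paired t-test.

```python
from math import comb
from scipy.stats import ttest_rel

def sign_test(a, b):
    """Two-sided sign test on paired per-topic scores; ties are dropped.

    Returns (wins_a, wins_b, p): the win counts, as in Tables 4-15,
    and the probability of a split at least this uneven under the
    null hypothesis that each run wins a topic with probability 0.5.
    """
    wins_a = sum(x > y for x, y in zip(a, b))
    wins_b = sum(x < y for x, y in zip(a, b))
    m, k = wins_a + wins_b, min(wins_a, wins_b)
    p = min(1.0, 2 * sum(comb(m, i) for i in range(k + 1)) / 2 ** m)
    return wins_a, wins_b, p

# Paired t-test over the same per-topic scores (as in Tables 16-21):
# t_stat, p_value = ttest_rel(a, b)
```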
The results from the sign test are given in Tables 4-15. The tables are arranged in sequence: Precision at 15, 30 and 100 documents, Average Precision, Recall, and Recall/Precision. In each case two tables are given; the first gives the differences and the second the probabilities. As expected, there are no differences at Precision at 5 documents or at Precision at 10 documents, because these were the same for all five runs; the corresponding pairs of tables have therefore not been included in the paper. The results also show no significant differences at Precision at 15 documents and at 30 documents. Significant results appear at Precision at 100 documents, where r-lohi ≈ r-hilo > emim ≈ wpq ≈ porter. The sign test results on Average Precision demonstrate that r-lohi ≈ r-hilo > wpq ≈ emim ≈ porter, where emim > porter. The results on Recall show some grouping between the algorithms, so that r-lohi ≈ r-hilo > emim ≈ wpq > porter. The results on Recall/Precision indicate that r-lohi ≈ r-hilo ≈ emim > wpq ≈ porter, with r-lohi better than emim but not significantly so, and with wpq slightly better than porter.

From the study of the sign test results certain overall comments emerge about the performance of the five algorithms. The results seem consistent throughout, with r-lohi performing better than the other algorithms. Differences between emim, wpq and porter are not consistent, but it seems that emim is slightly better than wpq, which is better than porter.

To further strengthen the validity of the results, the t-test was performed on the data. The t-test results are given in Tables 16-21. The tables are arranged in sequence: Precision at 15, 30 and 100 documents, Average Precision, Recall, and Recall/Precision. Each table gives the mean difference, the standard deviation of the difference, the t-statistic and the probability. As with the sign test, there were no differences for Precision at 5 documents or Precision at 10 documents, and the corresponding tables have therefore not been included in the paper. Similarly, there are no significant differences at Precision at 15 documents or at Precision at 30 documents. The results at Precision at 100 documents show that r-lohi ≈ r-hilo > emim ≈ wpq ≈ porter; this result is the same as for the sign test. The results for Average Precision demonstrate that r-lohi ≈ r-hilo > emim ≈ wpq ≈ porter, with emim better than porter. For Recall the results are that r-lohi ≈ r-hilo > emim ≈ wpq > porter. Finally, the Recall/Precision results demonstrate that r-lohi ≈ r-hilo ≈ emim > wpq > porter, where r-hilo is better than emim. The results of the t-tests are consistent across the algorithms and corroborate the results obtained from the sign tests. The two tests indicate that r-lohi and r-hilo performed consistently better than the other algorithms.

5 Conclusions

* The results obtained from the use of the standard and enhanced versions of the GSL indicate that further research is needed in order to determine the effectiveness of the GSL synonym list in retrieval.

* The combination of adding 10 terms from the 5 or 10 top-ranked documents contributed to better retrieval performance. The other term/document combinations, i.e. adding 20, 30, or 40 terms from 15 or 20 documents, etc., had a negative effect on retrieval performance.
* The results from the routing searches indicate that query expansion (i.e., feedback searches without relevance information, where X terms are extracted from Y top-ranked documents that are treated as relevant to the query) improved retrieval performance, depending on the algorithm used.

* The r-lohi algorithm (Efthimiadis, 1993a) improved retrieval performance in the routing runs when compared to the initial (baseline) search, which involved neither a feedback search nor query expansion.

* In the Ad hoc searches, the results of the evaluation of the five ranking algorithms indicate that r-lohi performed better than the other algorithms. These results were further validated by the results obtained from the sign test and the t-test.

* Although query expansion seems to work, the retrieval performance achieved was less than expected. There are several reasons that may account for these results, briefly addressed below.

1. Completeness of the TREC queries: The major factor to which these results are attributed is that the queries, i.e. the TREC Topic Descriptions, are almost complete, i.e. they contain all the important words required for the search. Query expansion is the process of supplementing the original query terms and is particularly effective when only incomplete queries are available. Query expansion on these rather complete queries seems to have contributed a small or even detrimental effect on overall retrieval performance.

2. Size of the TREC collection: The large size of the TREC collection raises the issue of scalability and effectiveness of retrieval algorithms. The TREC collection is very different from the standard IR test collections, such as ADI, Cranfield, CACM, NPL. TREC is 1-4 gigabytes of text, whereas the other collections are small, i.e. only a few (1-50) megabytes. The behavior and effectiveness of algorithms in information retrieval have been studied in small collections, and TREC provides the challenge of scalability.

3. Nature of documents: The documents in the WSJ database are mostly long documents: full text as opposed to short bibliographic records; less structured when compared to bibliographic records; and with language and presentation that are less structured (journalistic style compared to scientific style).

4. Length of documents: The records are long and often contain short multi-story, usually unrelated, items. When such documents contain relevant information for a topic, i.e. when one of the stories is relevant but all the others are not, they increase noise and interfere with the selection of terms for query expansion. This is because all the terms of such a document will be included in the pool of terms for query expansion, and a number of terms from the other stories in that document may be ranked higher than the terms from the relevant story. This reinforces the need to be able to retrieve at the paragraph level rather than at the document level.

6 Future Research

* evaluate in detail the effect of the GSL synonym list on retrieval performance;
* evaluate the differing effects of a local versus a global thesaurus for query expansion;
* evaluate the effect of variable bias in query expansion term weighting;
* investigate the retrieval overlap between different approaches; and
* explore data fusion techniques for output integration.

Table 1: Runs with and without query expansion for topics 51-100. (Runs are presented in descending Recall values.)
Run name                                | Avg Prec | Prec[5] | Prec[10] | Prec[15] | Prec[30] | Prec[100] | R-Prec | Recall
bm15.phb.qey:r-lohi-10-5.uclagsln       | 0.2960 | 0.5640 | 0.5160 | 0.4893 | 0.4400 | 0.3288 | 0.3465 | 0.7322
bm15.phb.qey:r-lohi-10-10.uclagsln      | 0.2960 | 0.5640 | 0.5160 | 0.4907 | 0.4400 | 0.3288 | 0.3472 | 0.7320
bm15.phb.qey:r-lohi-10-15.uclagsln      | 0.2961 | 0.5680 | 0.5160 | 0.4907 | 0.4400 | 0.3288 | 0.3472 | 0.7320
bm15.phb.qey:r-lohi-10-10.uclagsly      | 0.2960 | 0.5640 | 0.5160 | 0.4907 | 0.4400 | 0.3288 | 0.3472 | 0.7320
bm15.phb.qey:r-hilo-10-10.uclagsln      | 0.2860 | 0.5480 | 0.4900 | 0.4733 | 0.4167 | 0.3148 | 0.3327 | 0.7283
bm15.phb.qey:r-hilo-10-10.uclagsly      | 0.2860 | 0.5480 | 0.4900 | 0.4733 | 0.4167 | 0.3148 | 0.3327 | 0.7283
bm15.phn.qen.uclagsly                   | 0.2849 | 0.5560 | 0.5040 | 0.4720 | 0.4200 | 0.3164 | 0.3328 | 0.7264
bm15.phn.qen.uclagsln                   | 0.2849 | 0.5560 | 0.5040 | 0.4720 | 0.4200 | 0.3164 | 0.3328 | 0.7264
bm15.phb.qen.uclagsly                   | 0.2846 | 0.5560 | 0.5040 | 0.4720 | 0.4200 | 0.3166 | 0.3325 | 0.7262
bm15.phb.qey:emim-10-10.uclagsly        | 0.2746 | 0.5240 | 0.4960 | 0.4573 | 0.4027 | 0.2962 | 0.3106 | 0.7113
bm15.phb.qey:emim-10-10.uclagsln        | 0.2746 | 0.5240 | 0.4960 | 0.4573 | 0.4027 | 0.2962 | 0.3106 | 0.7113
bm15.phb.qey:r-lohi-20-5.uclagsln       | 0.2968 | 0.5760 | 0.5240 | 0.5027 | 0.4527 | 0.3362 | 0.3508 | 0.7060
bm15.phb.qey:r-lohi-20-15.uclagsln      | 0.2969 | 0.5800 | 0.5280 | 0.5067 | 0.4520 | 0.3362 | 0.3512 | 0.7060
bm15.phb.qey:r-lohi-20-10.uclagsln      | 0.2967 | 0.5800 | 0.5260 | 0.5067 | 0.4520 | 0.3364 | 0.3512 | 0.7060
bm15.phy.qen.uclagsly                   | 0.2406 | 0.4600 | 0.4680 | 0.4360 | 0.3800 | 0.2878 | 0.2967 | 0.6777
bm15.phb.qey:wpq-10-10.uclagsln         | 0.2590 | 0.5360 | 0.4980 | 0.4440 | 0.3900 | 0.2884 | 0.3017 | 0.6768
bm15.phb.qey:wpq-10-10.uclagsly         | 0.2590 | 0.5360 | 0.4980 | 0.4440 | 0.3900 | 0.2884 | 0.3017 | 0.6768
bm15.phb.qey:wpq-10-15.uclagsln         | 0.2570 | 0.5160 | 0.4780 | 0.4520 | 0.3953 | 0.2882 | 0.2963 | 0.6750
bm15.phb.qey:r-lohi-30-15.uclagsln      | 0.2896 | 0.5760 | 0.5320 | 0.5173 | 0.4573 | 0.3388 | 0.3456 | 0.6691
bm15.phb.qey:r-lohi-30-10.uclagsln      | 0.2890 | 0.5800 | 0.5420 | 0.5240 | 0.4520 | 0.3370 | 0.3445 | 0.6683
bm15.phb.qey:r-lohi-30-5.uclagsln       | 0.2894 | 0.5840 | 0.5380 | 0.5133 | 0.4520 | 0.3370 | 0.3450 | 0.6678
bm15.phb.qey:emim-20-10.uclagsly        | 0.2458 | 0.5280 | 0.4880 | 0.4547 | 0.3873 | 0.2696 | 0.2875 | 0.6599
bm15.phb.qey:wpq-20-10.uclagsln         | 0.2443 | 0.5120 | 0.4660 | 0.4413 | 0.3847 | 0.2706 | 0.2838 | 0.6437
bm15.phb.qey:wpq-10-5.uclagsln          | 0.2438 | 0.5440 | 0.4900 | 0.4533 | 0.3800 | 0.2740 | 0.2849 | 0.6423
bm15.phb.qey:wpq-10-5.uclagsly          | 0.2438 | 0.5440 | 0.4900 | 0.4533 | 0.3800 | 0.2740 | 0.2849 | 0.6423
bm15.phb.qey:port-10-10.uclagsly        | 0.2509 | 0.5320 | 0.4980 | 0.4560 | 0.4047 | 0.2862 | 0.2989 | 0.6362
bm15.phb.qey:port-10-10.uclagsln        | 0.2509 | 0.5320 | 0.4980 | 0.4560 | 0.4047 | 0.2862 | 0.2989 | 0.6362
bm15.phb.qey:emim-30-10.uclagsly        | 0.2377 | 0.5280 | 0.4860 | 0.4480 | 0.3780 | 0.2666 | 0.2809 | 0.6331
bm15.phb.qey:port-20-10.uclagsly        | 0.2558 | 0.5320 | 0.5080 | 0.4760 | 0.4113 | 0.2856 | 0.3030 | 0.6323
bm15.phb.qey:wpq-20-15.uclagsln         | 0.2342 | 0.5240 | 0.4940 | 0.4333 | 0.3827 | 0.2690 | 0.2701 | 0.6288
bm15.phb.qey:port-30-10.uclagsly        | 0.2494 | 0.5000 | 0.5080 | 0.4733 | 0.4167 | 0.2926 | 0.3088 | 0.6270
bm15.phb.qey:wpq-30-10.uclagsln         | 0.2318 | 0.5280 | 0.4920 | 0.4387 | 0.3753 | 0.2620 | 0.2744 | 0.6121
bm15.phb.qey:wpq-30-15.uclagsln         | 0.2271 | 0.5040 | 0.4860 | 0.4467 | 0.3700 | 0.2646 | 0.2668 | 0.6079
bm15.phb.qey:wpq-20-5.uclagsln          | 0.2266 | 0.5360 | 0.4820 | 0.4467 | 0.3707 | 0.2570 | 0.2699 | 0.6016
bm15.phb.qey:wpq-30-5.uclagsln          | 0.2173 | 0.5280 | 0.4680 | 0.4347 | 0.3660 | 0.2480 | 0.2651 | 0.5790

Table 2: Ad hoc results for topics 101-150: runs with and without query expansion. (Runs are presented in descending Recall values.)
Run name                                      | Avg Prec | Prec[5] | Prec[10] | Prec[15] | Prec[30] | Prec[100] | R-Prec | Recall
uclaa1: auto.bm15.phb.qen.uclagsly            | 0.3345 | 0.5840 | 0.5380 | 0.4973 | 0.4333 | 0.3098 | 0.3629 | 0.8155
uclaa2: auto.bm15.phb.qey:wpq-10-10.uclagsly  | 0.2957 | 0.5440 | 0.5180 | 0.4920 | 0.4207 | 0.2760 | 0.3289 | 0.7786
uclaf1: fdbk.bm15.phb.qey:wpq-10-10.uclagsly  | 0.3090 | 0.5880 | 0.5220 | 0.4893 | 0.4360 | 0.2884 | 0.3459 | 0.7745

Table 3: Performance averages over all topics of the Ad hoc runs with query expansion. (Runs are named after the algorithm used in the expansion.)

Run name     | Avg Prec | Prec[5] | Prec[10] | Prec[15] | Prec[30] | Prec[100] | R-Prec | Recall
r-lohi       | 0.3414 | 0.5880 | 0.5240 | 0.4947 | 0.4427 | 0.3152 | 0.3688 | 0.8290
r-hilo       | 0.3388 | 0.5880 | 0.5240 | 0.4960 | 0.4347 | 0.3160 | 0.3692 | 0.8333
emim         | 0.3176 | 0.5880 | 0.5240 | 0.4920 | 0.4433 | 0.2938 | 0.3554 | 0.7989
uclaf1: wpq  | 0.3087 | 0.5880 | 0.5240 | 0.4893 | 0.4360 | 0.2882 | 0.3460 | 0.7753
porter       | 0.2990 | 0.5880 | 0.5240 | 0.4893 | 0.4280 | 0.2798 | 0.3323 | 0.7457

In the sign test difference tables (Tables 4, 6, 8, 10, 12 and 14), the entry in row i, column j gives the number of topics on which algorithm i scored higher than algorithm j; the remaining topics are ties. Cells marked '?' are unreadable in the original.

Table 4: Sign Test Differences, Precision at 15 Documents

          wpq  porter  emim  r-lohi  r-hilo
wpq        0     3      1      3       2
porter     ?     0      4      3       1
emim       2     4      0      3       4
r-lohi     5     6      4      0       ?
r-hilo     5     5      ?      ?       0

Table 5: Sign Test Probabilities, Precision at 15 Documents

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    1.0000  1.0000
emim      1.0000  1.0000  1.0000
r-lohi    0.7266  0.5078  1.0000  1.0000
r-hilo    0.4531  0.2188  1.0000  1.0000  1.0000

Table 6: Sign Test Differences, Precision at 30 Documents

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    18     12     13      15
porter    12     0     15     11      12
emim      14    18      0     16      19
r-lohi    18    18     18      0      15
r-hilo    13    18     16      7       0

Table 7: Sign Test Probabilities, Precision at 30 Documents

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.3613  1.0000
emim      0.8445  0.7277  1.0000
r-lohi    0.4725  0.2652  0.8638  1.0000
r-hilo    0.8501  0.3613  0.7353  0.1338  1.0000

Table 8: Sign Test Differences, Precision at 100 Documents

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    22      8      8       8
porter    15     0     15      8       8
emim      19    24      0     12       9
r-lohi    32    31     27      0      15
r-hilo    32    33     27     14       0

Table 9: Sign Test Probabilities, Precision at 100 Documents

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.3239  1.0000
emim      0.0543  0.2002  1.0000
r-lohi    0.0003  0.0004  0.0250  1.0000
r-hilo    0.0003  0.0002  0.0046  1.0000  1.0000

Table 10: Sign Test Differences, Average Precision

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    29     15     10      11
porter    20     0     17      9      10
emim      25    32      0     13      16
r-lohi     ?    40     36      0      31
r-hilo    38    39     33     17       0
Table 11: Sign Test Probabilities, Average Precision

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.2531  1.0000
emim      0.1547  0.0455  1.0000
r-lohi    0.0001  0.0000  0.0017  1.0000
r-hilo    0.0002  0.0001  0.0223  0.0606  1.0000

Table 12: Sign Test Differences, Recall

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    24      8     10       7
porter     8     0     10      4       3
emim      15    25      0      9       6
r-lohi    27    28     25      0      14
r-hilo    29    32     26     14       0

Table 13: Sign Test Probabilities, Recall

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.0080  1.0000
emim      0.2100  0.0180  1.0000
r-lohi    0.0085  0.0000  0.0101  1.0000
r-hilo    0.0005  0.0000  0.0008  1.0000  1.0000

Table 14: Sign Test Differences, Recall/Precision

          wpq  porter  emim  r-lohi  r-hilo
wpq        0    19      5     10       8
porter    11     0      7      6       7
emim      17    23      0     11      11
r-lohi    23    26     19      0      13
r-hilo    27    26     20     12       0

Table 15: Sign Test Probabilities, Recall/Precision

          wpq     porter  emim    r-lohi  r-hilo
wpq       1.0000
porter    0.2012  1.0000
emim      0.0169  0.0062  1.0000
r-lohi    0.0367  0.0008  0.2012  1.0000
r-hilo    0.0023  0.0017  0.1508  1.0000  1.0000

Table 16: t-test, Precision at 15 Documents

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  0.0013 | 0.0285 |  0.3305 | 0.7424
r-hilo/emim    |  0.0040 | 0.0367 |  0.7712 | 0.4443
r-lohi/emim    |  0.0027 | 0.0300 |  0.6282 | 0.5328
r-hilo/porter  |  0.0067 | 0.0278 |  1.6973 | 0.0960
r-lohi/porter  |  0.0053 | 0.0325 |  1.1579 | 0.2525
emim/porter    |  0.0027 | 0.0380 |  0.4957 | 0.6224
r-hilo/wpq     |  0.0067 | 0.0337 |  1.3999 | 0.1678
r-lohi/wpq     |  0.0053 | 0.0352 |  1.0703 | 0.2897
emim/wpq       |  0.0027 | 0.0232 |  0.8137 | 0.4197
porter/wpq     |  0.0000 | 0.0404 |  0.0007 | 0.9994

Table 17: t-test, Precision at 30 Documents

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  | -0.0080 | 0.0298 | -1.8984 | 0.0635
r-hilo/emim    | -0.0087 | 0.0583 | -1.0521 | 0.2979
r-lohi/emim    | -0.0007 | 0.0585 | -0.0810 | 0.9358
r-hilo/porter  |  0.0067 | 0.0539 |  0.8748 | 0.3860
r-lohi/porter  |  0.0147 | 0.0518 |  2.0019 | 0.0508
emim/porter    |  0.0153 | 0.0694 |  1.5624 | 0.1246
r-hilo/wpq     | -0.0013 | 0.0534 | -0.1770 | 0.8602
r-lohi/wpq     |  0.0067 | 0.0530 |  0.8882 | 0.3788
emim/wpq       |  0.0073 | 0.0458 |  1.1312 | 0.2635
porter/wpq     | -0.0080 | 0.0450 |  1.2590 | 0.2140

Table 18: t-test, Precision at 100 Documents

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  ?      | ?      |  ?      | ?
r-hilo/emim    |  ?      | ?      |  ?      | ?
r-lohi/emim    |  ?      | ?      |  ?      | ?
r-hilo/porter  |  ?      | ?      |  ?      | ?
r-lohi/porter  |  ?      | ?      |  ?      | ?
emim/porter    |  0.0140 | 0.0535 |  1.8494 | 0.0704
r-hilo/wpq     |  0.0278 | 0.0557 |  3.5311 | 0.0009
r-lohi/wpq     |  0.0270 | 0.0600 |  3.1815 | 0.0025
emim/wpq       |  0.0056 | 0.0301 |  1.3150 | 0.1946
porter/wpq     | -0.0084 | 0.0410 |  1.4496 | 0.1536

Table 19: t-test, Average Precision

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  | -0.0026 | 0.0153 | -1.1935 | 0.2384
r-hilo/emim    |  0.0212 | 0.0421 |  3.5637 | 0.0008
r-lohi/emim    |  0.0238 | 0.0470 |  3.5786 | 0.0008
r-hilo/porter  |  0.0398 | 0.0615 |  4.5727 | 0.0000
r-lohi/porter  |  0.0424 | 0.0658 |  4.5547 | 0.0000
emim/porter    |  0.0186 | 0.0592 |  2.2172 | 0.0313
r-hilo/wpq     |  0.0301 | 0.0537 |  3.9615 | 0.0002
r-lohi/wpq     |  0.0327 | 0.0603 |  3.8291 | 0.0004
emim/wpq       |  0.0089 | 0.0335 |  1.8726 | 0.0671
porter/wpq     | -0.0097 | 0.0358 | -1.9107 | 0.0619

Table 20: t-test, Recall

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  0.0005 | 0.0312 |  0.1052 | 0.9166
r-hilo/emim    |  0.0363 | 0.0743 |  3.4503 | 0.0012
r-lohi/emim    |  0.0358 | 0.0819 |  3.0935 | 0.0033
r-hilo/porter  |  0.0731 | 0.1094 |  4.7267 | 0.0000
r-lohi/porter  |  0.0727 | 0.1108 |  4.6356 | 0.0000
emim/porter    |  0.0369 | 0.0965 |  2.7010 | 0.0095
r-hilo/wpq     |  0.0506 | 0.0835 |  4.2805 | 0.0001
r-lohi/wpq     |  0.0501 | 0.0908 |  3.9007 | 0.0003
emim/wpq       |  0.0143 | 0.0529 |  1.9106 | 0.0619
porter/wpq     | -0.0226 | 0.0739 | -2.1583 | 0.0358

Table 21: t-test, Recall/Precision

Runs           | Mean Difference | SD Difference | t       | Probability
r-hilo/r-lohi  |  0.0003 | 0.0246 |  0.0978 | 0.9225
r-hilo/emim    |  0.0138 | 0.0450 |  2.1635 | 0.0354
r-lohi/emim    |  0.0134 | 0.0517 |  1.8396 | 0.0719
r-hilo/porter  |  0.0369 | 0.0636 |  4.1048 | 0.0002
r-lohi/porter  |  0.0366 | 0.0669 |  3.8626 | 0.0003
emim/porter    |  0.0231 | 0.0576 |  2.8398 | 0.0066
r-hilo/wpq     |  0.0231 | 0.0532 |  3.0729 | 0.0035
r-lohi/wpq     |  0.0228 | 0.0635 |  2.5401 | 0.0143
emim/wpq       |  0.0094 | 0.0313 |  2.1135 | 0.0397
porter/wpq     | -0.0138 | 0.0441 | -2.2100 | 0.0318

Acknowledgements

Thanks to Stephen Robertson and Stephen Walker for making Okapi available to UCLA. We are especially thankful to Stephen Walker for all his continuing help, over and above that indicated in the text. We wish to thank Donna Harman of NIST for her support during TREC-2. This research was supported by a UCLA Academic Senate grant that also provided financial support for the graduate research assistantship of PVB. Finally, ENE is grateful to SUN Microsystems Inc. for the equipment grant of the SPARC-2 that made this research possible.

References

Efthimiadis, E.N. (1993a) A user-centered evaluation of ranking algorithms for interactive query expansion. In: Korfhage, R., Rasmussen, E. & Willett, P. (eds.) Proceedings of the 16th International Conference of the Association for Computing Machinery, Special Interest Group on Information Retrieval (ACM SIGIR), June 1993. Pittsburgh, PA: ACM Press. pp. 146-159.

Efthimiadis, E.N. (in press) User choices: As the yardstick for the evaluation of ranking algorithms for interactive query expansion. Information Processing & Management. In press.

Efthimiadis, E.N. Interactive query expansion: A user-based evaluation in a relevance feedback environment. Manuscript submitted for publication.

Harper, D.J. & van Rijsbergen, C.J. (1978) An evaluation of feedback in document retrieval using co-occurrence data. Journal of Documentation, 34(3), 189-216.

Porter, M.F. & Galpin, V. (1988) Relevance feedback in a public access catalogue for a research library: Muscat at the Scott Polar Research Institute. Program, 22(1), 1-20.

Robertson, S.E. (1990) On term selection for query expansion. Journal of Documentation, 46(4), 359-364.

Robertson, S.E. & Sparck Jones, K. (1976) Relevance weighting of search terms. Journal of the American Society for Information Science, 27(3), 129-146.

Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gull, A. & Lau, M. (1993a) Okapi at TREC. In: Harman, D.K. (ed.) The First Text REtrieval Conference (TREC-1). Proceedings. NIST Special Publication 500-207. Gaithersburg, MD: NIST. pp. 21-30.

Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M. & Gatford, M. (1993b) Okapi at TREC-2. In: Harman, D.K. (ed.) The Second Text REtrieval Conference (TREC-2). Gaithersburg, MD: NIST, August 30-September 1, 1993. Proceedings, to appear.

van Rijsbergen, C.J. (1977) A theoretical basis for the use of co-occurrence data in information retrieval. Journal of Documentation, 33, 106-119.

van Rijsbergen, C.J., Harper, D.J. & Porter, M.F. (1981) The selection of good search terms. Information Processing and Management, 17(2), 77-91.

A Runs

The runs presented here are grouped by the algorithm used for the expansion. The list starts with the runs that did not include query expansion.
bm15.phn.qen.uclagsln
bm15.phn.qen.uclagsly
bm15.phy.qen.uclagsly
bm15.phb.qen.uclagsly
bm15.phb.qey:wpq-10-5.uclagsln
bm15.phb.qey:wpq-10-5.uclagsly
bm15.phb.qey:wpq-10-10.uclagsln
bm15.phb.qey:wpq-10-10.uclagsly
bm15.phb.qey:wpq-10-15.uclagsln
bm15.phb.qey:wpq-20-5.uclagsln
bm15.phb.qey:wpq-20-10.uclagsln
bm15.phb.qey:wpq-20-15.uclagsln
bm15.phb.qey:wpq-30-5.uclagsln
bm15.phb.qey:wpq-30-10.uclagsln
bm15.phb.qey:wpq-30-15.uclagsln
bm15.phb.qey:emim-10-10.uclagsln
bm15.phb.qey:emim-10-10.uclagsly
bm15.phb.qey:emim-20-10.uclagsly
bm15.phb.qey:emim-30-10.uclagsly
bm15.phb.qey:port-10-10.uclagsln
bm15.phb.qey:port-10-10.uclagsly
bm15.phb.qey:port-20-10.uclagsly
bm15.phb.qey:port-30-10.uclagsly
bm15.phb.qey:r-hilo-10-10.uclagsln
bm15.phb.qey:r-hilo-10-10.uclagsly
bm15.phb.qey:r-lohi-10-5.uclagsln
bm15.phb.qey:r-lohi-10-10.uclagsln
bm15.phb.qey:r-lohi-10-10.uclagsly
bm15.phb.qey:r-lohi-10-15.uclagsln
bm15.phb.qey:r-lohi-20-5.uclagsln
bm15.phb.qey:r-lohi-20-10.uclagsln
bm15.phb.qey:r-lohi-20-15.uclagsln
bm15.phb.qey:r-lohi-30-5.uclagsln
bm15.phb.qey:r-lohi-30-10.uclagsln
bm15.phb.qey:r-lohi-30-15.uclagsln
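As an aid to reading the list above, here is a minimal sketch of a parser for the run naming convention described in section 3.1. It is our illustration only; the dictionary of decoded fields is a convention we introduce, not part of the Okapi tooling.

```python
def parse_run_name(name):
    """Decode run names such as 'bm15.phb.qey:r-lohi-10-5.uclagsly'.

    Fields (section 3.1): weighting function; phrases (phn/phy/phb =
    no/yes/both); query expansion (qen = none, qey = yes); for qey runs,
    the ranking algorithm, number of terms added, and number of feedback
    documents; uclagsln/uclagsly = standard vs UCLA-enhanced GSL.
    """
    parts = name.split(".")
    run = {"weighting": parts[0],
           "phrases": {"phn": "no", "phy": "yes", "phb": "both"}[parts[1]],
           "ucla_gsl": parts[-1] == "uclagsly"}
    qe = parts[2]
    if qe.startswith("qey:"):
        # rsplit keeps hyphenated algorithm names like 'r-lohi' intact
        algo, n_terms, n_docs = qe[4:].rsplit("-", 2)
        run.update(expansion=algo, terms=int(n_terms), docs=int(n_docs))
    else:
        run.update(expansion=None, terms=0, docs=0)
    return run

print(parse_run_name("bm15.phb.qey:r-lohi-10-5.uclagsly"))
```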