Okapi at TREC-2

S E Robertson*, S Walker*, S Jones*, M M Hancock-Beaulieu*, M Gatford*

Advisers: E Michael Keen (University of Wales, Aberystwyth), Karen Sparck Jones (Cambridge University), Peter Willett (University of Sheffield)

* Centre for Interactive Systems Research, Department of Information Science, City University, Northampton Square, London EC1V 0HB, UK

1 Introduction

This paper reports on City University's work on the TREC-2 project from its commencement up to November 1993. It includes many results which were obtained after the August 1993 deadline for submission of official results.

For TREC-2, as for TREC-1, City University used versions of the Okapi text retrieval system much as described in [2] (see also [3, 4]). Okapi is a simple and robust set-oriented system based on a generalised probabilistic model with facilities for relevance feedback, but also supporting a full range of deterministic Boolean and quasi-Boolean operations.

For TREC-1 [1] the "standard" Robertson-Sparck Jones weighting function was used for all runs (equation 1, see also [5]). City's performance was not outstandingly good among comparable systems, and the intention for TREC-2 was to develop and investigate a number of alternative probabilistic term-weighting functions. Other possibilities included varieties of query expansion, database models enabling paragraph retrieval and the use of phrases obtained by query parsing.

Unfortunately, a prolonged disk failure prevented realistic test runs until almost the deadline for submission of results. A full inversion of the disks 1 and 2 database was only achieved a few hours before the final automatic runs. None of the new weighting functions (Section 2.6) was properly evaluated until after the results had been submitted to NIST; we have since discovered that several of these models perform much better than the weighting functions used for the official runs, and most of the results reported herein are from these later runs.

1.1 The system

The Okapi system comprises a search engine or basic search system (BSS), a low-level interface used mainly for batch runs and a user interface for the manual search experiments (Section 5), together with data conversion and inversion utilities. The hardware consisted of Sun SPARC machines with up to 40 MB of memory and, occasionally, about 8 GB of disk storage. Several databases were used from time to time: full disks 1 and 2, AP (disk 1) and WSJ (disk 1), full disk 3. All inverted indexes included complete within-document positional information, enabling term frequency and term proximity to be used. Typical index size overhead was around 80% of the textfile size. Elapsed time for inversion of disks 1 and 2 was about two days. Running a single topic with evaluation averaged from about one minute to ten minutes, depending strongly on the number of query terms.

All preliminary evaluation used the "old" SMART evaluation program. Runs tabulated in this paper used an early version of the new evaluation program, for which we are grateful to Chris Buckley of Cornell University.

2 Some new probabilistic models

Statistical approaches to information retrieval have traditionally (to over-simplify grossly) taken two forms:

(a) approaches based on formal models, where the model specifies an exact formula;

(b) ad-hoc approaches, where formulae are tried because they seem to be plausible.

Both categories have had some notable successes.
A more recent variant is the regression approach of Fuhr and Cooper (see, for example, [6]), which combines an ad-hoc choice of independent variables and functions of them with a formal model for assessing their value in retrieval, selecting from among them and assigning weights to them.

One problem with the formal model approach is that it is often very difficult to take into account the wide variety of variables that are thought or known to influence retrieval. The difficulty arises either because there is no known basis for a model containing such variables, or because any such model may simply be too complex to give a usable exact formula.

One problem with the ad-hoc approach is that there is little guidance as to how to deal with specific variables: one has to guess at a formula and try it out. This problem is also apparent in the regression approach, although "trying it out" has a somewhat different sense here (the formula is tried in a regression model, rather than in a retrieval test).

The discussions of Sections 2.1 and 2.3 exemplify an approach which may offer some reconciliation of these ideas. Essentially it is to take a formal model which provides an exact but intractable formula, and use it to suggest a much simpler formula. The simpler formula can then be tried in an ad-hoc fashion, or used in turn in a regression approach. Although we have not yet taken this latter step of using regression, we believe that the present suggestion lends itself to such methods.

2.1 The basic model

The basic probabilistic model is the traditional relevance weight model [5], under which each term is given a weight as defined below, and the score (matching value) for each document is the sum of the weights of the matching terms:

    w = log [ (r + 0.5)/(R - r + 0.5) ] / [ (n - r + 0.5)/(N - n - R + r + 0.5) ]        (1)

where
    N is the number of indexed documents;
    n is the number of documents containing the term;
    R is the number of known relevant documents;
    r is the number of relevant documents containing the term.

This approximates to inverse collection frequency (ICF) when there is no relevance information. It will be referred to below (with or without relevance information) as w(1).

2.2 The 2-Poisson model and term frequency

One example of these problems concerns within-document term frequency (tf). This variable figures in a number of ad-hoc formulae, and it seems clear that it can contribute to better retrieval performance. However, there is no obvious reason why any particular function of tf should be used in retrieval. There is not much in the way of formal models which include a tf component; one which does is the 2-Poisson model [7, 8].

The 2-Poisson model postulates that the distribution of within-document frequencies of a content-bearing term is a mixture of two Poisson distributions: one set of documents (the "elite" set for the particular term, which may be interpreted to mean those documents which can be said to be "about" the concept represented by the term) will exhibit a Poisson distribution of a certain mean, while the remainder may also contain the term but much less frequently (a smaller Poisson mean).

Some earlier work in this area [8] attempted to use an exact formula derived from the model, but had limited success, probably partly because of the problem of estimating the required quantities. The approach here is to use the behaviour of the exact formula to suggest a very much simpler function of tf which behaves in a similar way.
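For reference, the following is a minimal sketch of the w(1) weight of equation 1. It is our own illustration rather than the BSS code; the counts N, n, R and r are as defined above, and the example figures in the comment are purely illustrative.

    import math

    def w1(N, n, R=0, r=0):
        """Robertson/Sparck Jones relevance weight of equation 1.

        N: number of indexed documents     n: documents containing the term
        R: known relevant documents        r: relevant documents containing the term
        With R = r = 0 this reduces to an inverse-collection-frequency-like weight.
        """
        return math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))

    # e.g. with no relevance information, a term in 1,000 of 500,000 documents:
    # w1(500000, 1000)  -> about 6.2 (natural logarithm)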
The exact formula, for an additive weight in the style of w(1), of a term t which occurs tf times, is

    w = log { (p' λ^tf e^{-λ} + (1 - p') μ^tf e^{-μ}) (q' e^{-λ} + (1 - q') e^{-μ}) }
          / { (q' λ^tf e^{-λ} + (1 - q') μ^tf e^{-μ}) (p' e^{-λ} + (1 - p') e^{-μ}) }        (2)

where
    λ is the Poisson mean for tf in the elite set for t;
    μ is the Poisson mean for tf in the non-elite set;
    p' is the probability of a document being elite for t given that it is relevant;
    q' is the probability of a document being elite given that it is non-relevant.

As a function of tf, this can be shown to behave as follows: it is zero for tf = 0; it increases monotonically with tf, but at an ever-decreasing rate; and it approaches an asymptotic maximum as tf gets large. The maximum is approximately the binary independence weight that would be assigned to an infallible indicator of eliteness.

A very simple formula which exhibits similar behaviour is tf/(tf + constant). This has an asymptotic limit of unity, so must be multiplied by an appropriate binary independence weight. The regular binary independence weight for the presence/absence of the term may be used for this purpose. Thus the weight becomes

    w = [ tf / (k1 + tf) ] w(1)        (3)

where k1 is an unknown constant.

Several points may be made concerning this argument. It is not by any stretch of the imagination a strong quantitative argument; one may have many reservations about the 2-Poisson model itself, and the transformations sketched above are hardly justifiable in any formal way. However, it results in a modification of the binary independence weight which is at least plausible, and has just slightly more justification than plausibility alone.

The constant k1 in the formula is not in any way determined by the argument. The effect of the choice of constant is to determine the strength of the relationship between weight and tf: a large constant will make for a relation close to proportionality (where tf is relatively small); a small k1 will mean that tf has relatively little effect on the weight (at least when tf > 0, i.e. when the term is present). Our approach has been to try out various values of k1 (around 1 may be about right for the full disks 1 and 2 database). However, in the longer term we hope to use regression methods to determine the constant. It is not, unfortunately, in a form directly susceptible to the methods of Fuhr or Cooper, but we hope to develop suitable methods.

2.3 Document length

The 2-Poisson model in effect assumes that documents (i.e. records) are all of equal length. Document length is a variable which figures in a number of weighting formulae.

We may postulate at least two reasons why documents might vary in length. Some documents may simply cover more material than others; an extreme version of this hypothesis would have a long document consisting of a number of unrelated short documents concatenated together (the "scope hypothesis"). An opposite view would have long documents like short documents, but longer: in other words, a long document covers a similar scope to a short document, but simply uses more words (the "verbosity hypothesis").

It seems likely that real document collections contain a mixture of these effects; individual long documents may be at either extreme or of some hybrid type. All the discussion below assumes the verbosity hypothesis; no progress has yet been made with models based on the scope hypothesis.

The simplest way to deal with this model is to take the formula above, but normalise tf for document length (dl).
If we assume that the value of k1 is appropriate to documents of average length (avdl), then this model can be expressed as

    w = [ tf / (k1 × dl/avdl + tf) ] w(1)        (4)

A more detailed analysis of the effect on the Poisson model of the verbosity hypothesis is given in Appendix A. This shows that the appropriate matching value for a document contains two components. The first component is a conventional sum of term weights, each term weight dependent on both tf and dl; the second is a correction factor dependent on the document length and the number of terms in the query (nq), though not on which terms match. A similar argument to the above for tf suggests the following simple formulation:

    correction factor = k2 × nq × (avdl - dl)/(avdl + dl)        (5)

where k2 is another unknown constant. Again, k2 is not specified by the model, and must (at present, at least) be discovered by trial and error. Values in the range 0.0-0.3 appear about right for the TREC databases (if natural logarithms are used in the term-weighting functions^1), with the lower values being better for equation 4 termweights and the higher values for equation 3.

2.4 Query term frequency and query length

A similar approach may be taken to within-query term frequency (qtf). In this case we postulate an "elite" set of queries for a given term: the occurrence of a term in the query is taken as evidence for the eliteness of the query for that term. This would suggest a similar multiplier for the weight:

    w = [ qtf / (k3 + qtf) ] w(1)        (6)

In this case, experiments suggest a large value of k3 to be effective; indeed the limiting case, which is equivalent to

    w = qtf × w(1)        (7)

appears to be the most effective.

We may combine a formula such as 6 or 7 with a document term frequency formula such as 3. In practice this seems to be a useful device, although the theory requires more work to validate it. (A sketch of the combined weighting appears below.)

1. To obtain weights within a range suitable for storage as 16-bit integers, the Okapi system uses logarithms to base 2^0.1.
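To make the combination of equations 4, 5 and 7 concrete, here is a minimal scoring sketch. It is our own illustration rather than the BSS implementation: the function and variable names are invented, and the default values k1 = 1.0 and k2 = 0.2 are simply points in the ranges discussed above.

    import math

    def w1(N, n, R=0, r=0):
        # Robertson/Sparck Jones weight, equation 1
        return math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))

    def bm11_score(query, doc_tf, dl, avdl, N, df, k1=1.0, k2=0.2):
        """Match one document against a query using BM11-style weights.

        query:    dict term -> qtf (within-query frequency)
        doc_tf:   dict term -> tf  (within-document frequency)
        dl, avdl: this document's length and the average document length
        df:       dict term -> n, the collection document frequency
        """
        score = 0.0
        for term, qtf in query.items():
            tf = doc_tf.get(term, 0)
            if tf == 0:
                continue
            termweight = (tf / (k1 * dl / avdl + tf)) * w1(N, df[term])  # equation 4
            score += qtf * termweight                                    # equation 7
        nq = len(query)
        score += k2 * nq * (avdl - dl) / (avdl + dl)                     # equation 5
        return score

Replacing the equation 4 termweight by tf/(k1 + tf) times w(1), while keeping the equation 5 correction factor, gives the BM15 function defined in Section 2.6.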
2.5 Adjacency

The recent success of weighting schemes involving a term-proximity component [9] has prompted consideration of including some such component in the Okapi weighting. Although this does not yet extend to a full Keen-type weighting, a method allowing for adjacency of some terms has been developed.

Weighting formulae such as w(1) can in principle be applied to any identifiable and searchable entity (such as, for example, a Boolean search expression). An obvious candidate for such a weight is any identifiable phrase. However, the problem lies in identifying suitable phrases. Generally such schemes have been applied only to predetermined phrases (e.g. those given in a dictionary and identified in the documents in the course of indexing). Keen's methods would suggest constructing phrases from all possible pairs (or perhaps larger sets) of query terms at search time; however, for queries of the sort of size found in TREC, that would probably generate far too many phrases. The approach here has been to take pairs of terms which are adjacent in the query as candidate phrases. The present Okapi allows adjacency searches, so a phrase that is not specifically indexed can be searched, and assigned a weight in the usual Okapi fashion as if it had been indexed.

One problem with that approach is that the single words that make up the phrase will probably also be included in the query, which suggests that a document containing the phrase will be overweighted: it will be given the weight assigned to the phrase in addition to the individual term weights. So in the present experiments the weight assigned to the phrase has been adjusted downwards, by deducting the weights of the constituent terms, to allow for the fact that the individual term weights have necessarily been added. Where this correction would give a negative weight to the phrase, it has been adjusted again to an arbitrary small positive number.

2.6 Weighting functions used

More than 20 combinations of the weighting functions discussed above were implemented at one time or another. Those mentioned in this paper are listed here. For brevity, most of the functions are referred to as BMnn (Best Match).

BM0: Flat, or quorum, weighting. Each term is given the same weight.

BM1: w(1) termweights.

BM15: 2-Poisson termweights as equation 3 with document length correction as equation 5.

BM11: 2-Poisson termweights with document length normalisation as equation 4.^2

2. In theory there was also an equation 5 document length correction, but the best value of k2 was found to be zero.

3 Document processing

For TREC-1 City used an elaborate 25-field structure which was intended to make all the disparate datasets on the CDs fit a unified model. It would, for example, have been possible to restrict searches to "title", "headline" etc. In the event only the TEXT was used. For TREC-2, fields which looked useful for searching were simply concatenated into one long field. For most datasets fields other than DOCNO and TEXT were ignored, but the SJM LEAD PARAGRAPH, the Ziff SUMMARY and a few additional fields from the Patents records were included. This was done using a simple perl script (in contrast to the TREC-1 conversion program which used lex, yacc and C). Most of the known data errors were handled satisfactorily, although for some reason there still remained a few duplicate DOCNOs from disks 1 and/or 2.

4 Automatic query processing

4.1 Ad-hoc

A large number of evaluation runs have been done to investigate

* the effect of query term source,
* the use of a query term frequency (qtf) component in term weighting, and
* the use of algorithmically derived term pairs.

4.1.1 Derivation of queries from the topics

Topic processing was very simple. A program (written in awk) was used to isolate the required topic fields, which were then parsed and the resulting terms stemmed in accordance with the indexing procedures of the database to be searched. A small additional stoplist was applied to the NARRATIVE and DESCRIPTION fields only. If required, the procedure also output pairs of adjacent terms which occur in the same subfield of the topic with no intervening punctuation. For example the command

    get_qterms 70 trec12_93 tcd pairs=1

applied to

    Topic: Surrogate Motherhood

    <desc> Description:
    Document will report judicial proceedings and opinions on contracts for
    surrogate motherhood.

    <con> Concept(s):
    1. surrogate, mothers, motherhood
    2. judge, lawyer, court, lawsuit, custody, hearing, opinion, finding

(topic 70) gave

    70:19:desc:1:contract:1
    70:19:con:1:court:1
    70:19:con:1:custodi:1
    70:19:con:1:find:1
    70:19:con:1:hear:1
    70:19:con:1:judg:1
    70:19:desc:1:judici:1
    70:19:con:1:lawsuit:1
    70:19:con:1:lawyer:1
    70:19:con:1:mother:1
    70:19:tit:1:motherhood:3
    70:19:con:1:opinion:2
    70:19:desc:1:proceed:1
    70:19:tit:1:surrog:3
    70:19:desc:2:contract:surrog:1
    70:19:desc:2:judici:proceed:1
    70:19:desc:2:opinion:contract:1
    70:19:desc:2:proceed:opinion:1
    70:19:tit:2:surrog:motherhood:2

where the fields are topic number, topic length (number of terms counting repeats but not pairs), source field (in precedence order TITLE > CONCEPTS > NARRATIVE > DESCRIPTION > DEFINITIONS), number of terms, term ..., and frequency of this term or pair in the topic.

4.1.2 Document and query term weighting

Table 1 shows the effect of varying query term source fields when no account is taken of within-query term frequency. Some tentative conclusions can be drawn: adding TITLE to CONCEPTS improves most measures slightly; TITLE alone works well in a surprising proportion of the topics; the DESCRIPTION field is fairly harmless used in conjunction with CONCEPTS, but NARRATIVE and DEFINITIONS are detrimental. (TIME and NATIONALITY fields, which are occasionally present, were never used.) This really only confirms what may be evident to a human searcher: CONCEPTS consists of search terms, but most of the other fields apart from TITLE are instructions and guidance to relevance assessors. A sentence such as "To be relevant, a document must identify the case, state the issues which are or were being decided and report at least one ethical or legal question which arises from the case." (from the NARRATIVE field of topic 70) can only contribute noise.

However, when a within-query term frequency (qtf) component is used in the term weighting, the information about the relative importance of terms gained from the use of all or most of the topic fields seems to outweigh the detrimental effect of noisy terms such as "identify", "state", "issues", "question". Some results are summarised in Table 2. A number of values of k3 were tried in equation 6, and a large value proved best overall, giving the limiting case (equation 7), in which the term weight is simply multiplied by qtf.

Many combinations of the weighting functions discussed in Section 2.6, as well as others not described here, were first tested on the AP and/or WSJ databases. Some of them were eliminated immediately. The function defined as BM15 gave almost uniformly better results than w(1), after suitable values for the constants had been found. BM11 appeared slightly less good than BM15 on the small databases, but later runs on the large databases showed that, with suitable choice of constants, it was substantially, though not uniformly, better. This may be a consequence of the greater variation in document lengths found in the large databases. Table 3 compares the more elaborate term weighting functions with the standard w(1) weighting and with a baseline coordination level run.

Some work was done on the addition of adjacent pairs of topic terms to the queries (see Section 2.5). A number of runs were done, using several different ways of adjusting the "natural" weights of adjacent pairs. There was little difference between them, and the results are at best only slightly better than those from single terms alone (Table 3). There was also little difference between using all adjacent pairs and using only those pairs which derive from the same sentence of the topic, with no intervening punctuation.
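Returning to the topic processing of Section 4.1.1: the awk program itself is not reproduced in this paper, but the following rough Python equivalent indicates the kind of processing involved. The stoplist and stemmer here are toy stand-ins for the ones actually used, and the field handling is simplified.

    import re
    from collections import Counter

    STOPWORDS = {"a", "an", "the", "and", "or", "of", "on", "for", "to", "will", "be"}

    def stem(word):
        # Crude stand-in for the Okapi stemmer applied at indexing time
        for suffix in ("ings", "ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def topic_terms(fields, pairs=True):
        """fields: dict such as {"tit": "...", "con": "...", "desc": "..."}.
        Returns Counters of single terms and of adjacent pairs; pairs are only
        formed within one field and never across punctuation."""
        singles, adjacent = Counter(), Counter()
        for text in fields.values():
            for run in re.split(r"[^\w\s]+", text.lower()):  # punctuation-free runs
                tokens = [stem(t) for t in run.split() if t not in STOPWORDS]
                singles.update(tokens)
                if pairs:
                    adjacent.update(zip(tokens, tokens[1:]))
        return singles, adjacent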
4.2 Routing

Potential query terms were obtained by "indexing" all the known relevant documents from disks 1 and 2; the topics themselves were not used (nor were known non-relevant documents). These terms were then given w(1) weights and selection values [11] given by (r/R) × w(1), where r and R are as in equation 1.

A large number of retrospective test runs were performed on the complete disks 1 and 2 database, in which the number of terms selected and the weighting function were the independent variables. Overall, there was little difference in the average precision over the range 10-25 terms. This is consistent with the results reported by Harman in [10]. With regard to weighting functions, BM1 was slightly better than BM15. However, looking at individual queries, the optimal number of terms varied between three (several topics) and 31 (topic 89) with a median of 11; and BM15 was better than BM1 for 27 of the topics.

Two sets of official queries and results were produced. For the cityr1 run, the top 20 terms were selected for each topic and the weighting function was BM1. For cityr2 the test runs were sorted for each topic by precision at 30 documents within recall within average precision, and the "best" combination of number of terms and weighting function was chosen. When evaluated retrospectively against the full disks 1 and 2 database the cityr2 queries were about 17% better on average precision and 10% better on recall than the cityr1. The official results (first and second rows of Table 4) show a similar difference. Later, both sets of queries were repeated using BM11 instead of the previous weighting functions (third and fourth rows of the table). These final runs both show substantially better results than either of the official runs.

Table 1: Effect of varying query term sources (no query term frequency component)

    Query    Ave                                                                % of tops where
    source   lgth   Ave Prec  Prec at 5  Prec at 30  Prec at 100  R-Prec  Recall  AveP > median
    TC       30.3   0.300     0.624      0.536       0.440        0.349   0.683   66
    C        26.7   0.296     0.636      0.524       0.436        0.346   0.686   58
    TCD      39.7   0.297     0.592      0.519       0.429        0.340   0.667   62
    TCND     81.0   0.263     0.612      0.485       0.394        0.306   0.605   48
    TCN      71.6   0.262     0.624      0.481       0.397        0.309   0.604   50
    TCNDDef  86.3   0.257     0.580      0.468       0.387        0.303   0.604   46
    TN       44.9   0.181     0.500      0.418       0.320        0.245   0.491   26
    TND      54.4   0.179     0.492      0.403       0.317        0.243   0.491   24
    TD       13.1   0.170     0.428      0.381       0.297        0.244   0.492   28
    T         3.6   0.165     0.380      0.343       0.271        0.233   0.471   32

    Terms: single. Document termweights: BM11. Database: disks 1 and 2. Topics: 101-150.
    Query average length is the average number of terms taking account of repeats.

Table 2: Effect of varying query term sources (with query term frequency component)

    Query    Weight                                                    % of tops where
    source   function  AveP   P5     P30    P100   RP     Rcl    AveP > median
    TCND     BM11      0.360  0.652  0.569  0.479  0.401  0.754  92
    TCN      BM11      0.356  0.644  0.565  0.482  0.399  0.749  92
    TCNDDef  BM11      0.354  0.648  0.559  0.474  0.395  0.751  92
    TCD      BM11      0.353  0.644  0.565  0.481  0.394  0.750  90
    TC       BM11      0.335  0.636  0.560  0.468  0.375  0.723  86
    TC       BM15      0.284  0.560  0.485  0.416  0.336  0.685  56
    TND      BM11      0.283  0.556  0.503  0.414  0.338  0.652  60
    TN       BM11      0.274  0.556  0.497  0.399  0.331  0.643  56
    TC       BM1       0.232  0.504  0.435  0.361  0.289  0.601  28

    Document term weights were multiplied by qtf, equivalent to large k3 in equation 6.
    Terms: single. Database: disks 1 and 2. Topics: 101-150.
Table 3: Effect of different document term weighting functions: single terms and adjacent pairs

    Weight                                                                        % of tops where
    function  Terms                       AveP   P5     P30    P100   RP     Rcl    AveP > median
    BM11      singles + "natural" pairs   0.307  0.628  0.541  0.448  0.358  0.696  62
    BM11      singles + all adj pairs     0.304  0.612  0.544  0.447  0.357  0.694  62
    BM11      singles                     0.300  0.624  0.536  0.440  0.349  0.683  66
    BM15      singles                     0.227  0.500  0.434  0.351  0.285  0.595  38
    BM1       singles                     0.199  0.468  0.416  0.326  0.261  0.542  22
    BM0       singles                     0.142  0.412  0.336  0.270  0.209  0.411  12

    "Natural" means adjacent in the same sentence of the topic with no intervening punctuation.
    Query term source: TC. qtf component: none. Database: disks 1 and 2. Topics: 101-150.

Table 4: Some routing results

    Weight     Number                                                      % of tops where
    function   of terms  AveP   P5     P30    P100   RP     Rcl    AveP > median
    BM1/BM15   variable  0.356  0.692  0.561  0.449  0.388  0.680  78
    BM1        top 20    0.315  0.628  0.533  0.432  0.361  0.648  70
    BM11       variable  0.394  0.700  0.599  0.481  0.429  0.713  92
    BM11       top 20    0.362  0.684  0.605  0.459  0.397  0.707  80
    Best predictive run for comparison
    (BM11, qtf with large k3, source TCD)  0.300  0.612  0.524  0.394  0.345  0.632  68

    Database: disk 3. Topics: 51-100.

5 Manual queries with feedback

5.1 The user interface

The interface allowed the entry of any number of find commands operating on "natural language" search terms. By default, the system would combine the resulting sets using the BM15 function described in Section 2.6, but any operation specified by the searcher would override this. All user-entered terms were added to a pool of terms for potential use in query expansion. Every set produced had any documents previously seen by the user removed from it.

The show (document display) command displayed the full text of a single document (or as much as the user wished to see) with the retrieval terms highlighted (sometimes inaccurately). Unless specified by the user this would be the highest-weighted remaining document from the most recent set. At the end of a document display the relevance question "Is this relevant (y/n/?)" appeared; the system counted documents eliciting the "?" response as relevant.^3 The DOCNO was then output to a results file, together with the iteration number.

Once some documents had been judged relevant the extract command would produce a list of terms drawn from the pool consisting of user-entered terms and terms extracted from all relevant documents. Terms in the pool were given w(1) weights. User-entered terms were weighted as if they had occurred in four out of five fictitious relevant documents (in addition to any real relevant documents they might have been present in). Thus for user-entered terms the numerator in equation 1 becomes (r + 4 + 0.5)/(R + 5 - r - 4 + 0.5) [2]. Query expansion terms were selected from the term pool in descending order of the selection value [11]: termweight × (r + 4)/(R + 5) for user-entered terms, otherwise termweight × r/R, subject to not all documents containing the term having been displayed, and the term not being a semi-stopword^4 (unless it was entered by the user). A maximum of 20 terms was used. These selected terms were then used automatically in an expansion search, again with the BM15 weighting function.

3. It was possible for searchers to change their minds about the relevance of a document. Subsequent feedback iterations handled this correctly, but the DOCNO would be duplicated in the search output. This appears to have led to some minor errors in the frozen ranks evaluation in a few topics.

4. Semi-stopwords are words which, while they may be useful search terms if entered by a user, are likely to be detrimental if used in query expansion: numerals, month-names, common adverbs etc.
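Our reading of this selection procedure can be sketched as follows. The data layout and names are invented for the illustration; the adjustment for user-entered terms follows the "four out of five fictitious relevant documents" device described above.

    import math

    def w1(N, n, R, r):
        return math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))

    def expansion_terms(pool, N, R, semi_stopwords, max_terms=20):
        """pool: list of dicts with keys term, n (collection document frequency),
        r (relevant documents containing the term), user_entered (bool) and
        all_seen (bool: every document containing the term already displayed).
        R is the number of documents judged relevant so far; it is assumed > 0,
        since extract is only invoked once some relevant documents exist."""
        candidates = []
        for t in pool:
            if t["all_seen"]:
                continue
            if t["term"] in semi_stopwords and not t["user_entered"]:
                continue
            if t["user_entered"]:
                # numerator of equation 1 adjusted as if the term occurred in
                # four out of five fictitious relevant documents
                num = (t["r"] + 4 + 0.5) / (R + 5 - t["r"] - 4 + 0.5)
                den = (t["n"] - t["r"] + 0.5) / (N - t["n"] - R + t["r"] + 0.5)
                weight = math.log(num / den)
                select = weight * (t["r"] + 4) / (R + 5)
            else:
                weight = w1(N, t["n"], R, t["r"])
                select = weight * t["r"] / R
            candidates.append((select, t["term"], weight))
        candidates.sort(reverse=True)
        return [(term, weight) for _, term, weight in candidates[:max_terms]]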
Each invocation of extract used all the available relevance information, and there was no "new search" command. This was intended to encourage compliance with the TREC guidelines; it was not possible for a dissatisfied user to restart a search. When the searcher decided to finish, after some sequence of find, show and extract commands, the results command invoked a final iteration of extract (provided there had been at least three positive relevance judgments). Finally, the top 1000 DOCNOs from the current set were output to the results file. Apart from the aforementioned commands, users could do info sets and history.

5.2 Searchers and search procedure

The searches were done by a panel of five staff and research students from City University's Department of Information Science. Search procedure was not rigidly prescribed, although some guidelines were given. There was a short briefing session and searchers were encouraged to experiment with the system before starting. Procedures seemed to be considerably influenced by individual preferences and styles. Some searches were done collaboratively.

Searchers tried to find relevant documents by any means they liked within a single session. The number of iterations of query expansion varied between zero and four, with a mean of two. The IDs of all documents looked at were output to the results file, together with the iteration number. At the end of the session, if at least three relevant documents had been found the system did a final iteration of query expansion and output the top 1000 IDs; if fewer than three, the top 1000 from the set which was finally "current" were output.

There seemed to be an impression that the new topics (101-150) are more difficult than the old. Results may also have been affected by the huge stoplist which was being used at that time because of a breakdown of the only disk large enough to hold the very large scratch files generated during inversion. Lack of the number "6" affected one topic, days of the week another ("Black Monday"). The searcher was urged to leave "Black Monday" to the end in case we were able to reindex before the deadline, but she decided to try it and thought it worked quite well. An edited transcript of one searcher's notes is given below as Appendix B.

5.3 Results

The official results of the manual run (Table 5) are disappointing, with average precision 0.232 (60% of topics below median), precision at 100 docs 0.4 and recall 0.59. The final iteration was later re-run with BM11 instead of BM15, and the results combined with the feedback documents from the original searches for a frozen ranks evaluation.^5 This did somewhat better on a majority of the topics, but overall the manual results were very poor compared to some of the automatic runs.

5. There were two topics where the searcher found no relevant documents, so for these topics the original results were inserted.

6 Other experiments

6.1 Query modification without relevance information

Some iterative automatic ad-hoc runs were done in which the top 10-50 documents obtained by the best existing method were used (a) as a source of additional terms and (b) as a source of "relevance" information for the w(1) weight calculation. Expansion terms were selected as described in Section 4.2, in descending order of w(1) × r/R. The maximum number of additional terms was set at half the number of query terms.
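As an illustration of this procedure, here is a sketch under our own naming and data-layout assumptions; w1 is again the equation 1 weight.

    import math
    from collections import Counter

    def w1(N, n, R, r):
        return math.log(((r + 0.5) / (R - r + 0.5)) /
                        ((n - r + 0.5) / (N - n - R + r + 0.5)))

    def blind_feedback(query_terms, ranked_docs, doc_terms, N, df, top_docs=30):
        """query_terms: set of stemmed topic terms
        ranked_docs:  document ids from the best existing run, best first
        doc_terms:    dict doc id -> set of stemmed terms it contains
        df:           dict term -> collection document frequency
        Returns the reweighted topic terms and the selected expansion terms."""
        pseudo_rel = ranked_docs[:top_docs]          # treated as "relevant"
        R = len(pseudo_rel)
        r = Counter()
        for d in pseudo_rel:
            r.update(doc_terms[d])

        def weight(term):
            return w1(N, df[term], R, r[term])

        # topic terms are reweighted with the pseudo relevance information
        reweighted = {t: weight(t) for t in query_terms}

        # candidate expansion terms ranked by the selection value w(1) * r/R,
        # capped at half the number of query terms
        candidates = sorted((t for t in r if t not in query_terms),
                            key=lambda t: weight(t) * r[t] / R, reverse=True)
        expansion = {t: weight(t) for t in candidates[:len(query_terms) // 2]}
        return reweighted, expansion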
For many of the topics most of the top terms extracted from the feedback documents were in any case topic terms, so the number of additional terms was small.

Example (topic 112)

Topic 112: Funding biotechnology. 30 feedback documents used.

Term sources are given either as doc, in the case of expansion terms, or as a topic field, where tit > con > nar > desc. In this example, final weights involve a qtf component, and were obtained using equation 6 (the resulting weight was multiplied by k3 to obtain adequate granularity in an integer representation); for expansion terms qtf was taken as 1 and the same correction applied. The terms, with their sources, were:

    biotechnolog...(tit), fund(tit), capit(nar), pharmaceut(doc), ventur(nar),
    financ...(nar), startup...(nar), research(nar), financ(doc), partner(doc),
    drug(doc), investor(doc), provid(nar), technologi(doc), company...(doc),
    academ(nar), corpor(nar), stock(nar), industri(doc), develop(doc),
    laborator...(nar), quantifi(nar), profit(nar), enterpr(nar), establish(nar),
    arena*(nar), data(nar), sale(nar), loss(nar), government...(nar),
    assist(nar), much(desc), answer(desc), follow(nar), rel*(desc), eg*(nar),
    question(desc), worldwid*(nar), division*(nar), figur*(nar)
    [a few rows and the qtf, r and weight columns of the original table are
    illegible in this copy and are not reproduced]

(Terms followed by ellipses represent synonym classes.) Nine of the 43 terms are not from the topic. The starred terms were not used in the final search because their selection value w(1) × r/R is zero (to the nearest integer).

    Terms   Weights   AveP   P5     P30    P100   RP     Rcl
    All     Final     0.407  1.000  0.867  0.640  0.457  0.739
    Topic   Final     0.362  1.000  0.733  0.620  0.440  0.698
    Topic   Orig      0.373  0.600  0.800  0.700  0.433  0.680

For this topic, the additional terms were beneficial and reweighting alone rather neutral.

Discussion

The main motive for experimenting with this type of query expansion is that it is one way of finding terms which are in some sense closely associated with the query as a whole. It does not fit particularly well with the Robertson/Sparck Jones type of probabilistic theory [5], the validity of which depends on pairwise independence of terms in both relevant and non-relevant documents. However, it is clear, if only from the results in this paper, that mutual dependence does not necessarily lead to poor results.

There are many variables involved. In our rather limited experiments most of the initial feedback searches were done under the conditions of the first row of Table 2, that is with terms from title, concepts, narrative and description (there were a few runs using title and concepts only, but the results for most topics were not good), and weighting function BM11 with termweights given by equation 6 with large k3 (1000). This gave nearly the best precision at 5 and 30 documents of any of our results. The number of feedback documents was constant across topics and was varied between 10 and 50.
For the final search, terms were always weighted with BM11, but several values of k3 were tried (including zero). Some runs used topic terms only and some used expansion terms as well. There was one run omitting narrative and description terms from the final search, but it was not among the very best and is not reported in the table. The number of terms in the final search was varied from 10 upwards, terms being selected as usual in descending order of termweight × r/R. Some evaluations were done using frozen ranks, in case the initial searches tended to give better low precision, but this turned out not to be the case.

A few of the results are summarised in Table 6. They include results which appear better than the best otherwise obtained, but the difference is small, and these runs have not yet been repeated on the other topic sets. A qtf weight component is still needed (compare rows 2 and 14 of the table). The number of feedback documents is not critical. Speeding searching by using only the top 10 or 20 terms is detrimental.

It is interesting that results do not seem to be very greatly affected by the precision of the feedback set. Looking at the individual topics in the run represented by the top row of Table 6, 25 did better than in the feedback run, 18 did worse and the remainder about the same. Restricting to the 20 topics where the precision at 30 in the feedback set was below 0.5, the corresponding figures are 7, 10 and 3.

6.2 Stemming

A comparison was made on the AP database between the normal Okapi stemming, which removes many suffixes, and a "weak" stemming procedure which only conflates singular and plural forms and removes "-ing" endings. For some weighting functions weak stemming increased precision by about 2% and decreased recall by about 1%, but the observed difference is unlikely to be significant.

6.3 Stoplists

Some runs were done on the AP database to investigate the effect of stoplist size. A small stoplist consisted of the 17 words a, the, an, at, by, into, on, for, from, to, with, of, and, or, in, not, et; a large one contained 209 articles, conjunctions, prepositions, pronouns and verbs. There was no significant difference in the results of the runs, but the index size was about 25% greater with the small stoplist.

7 Conclusions and prospects

7.1 The new probabilistic models

The most significant result is perhaps the great improvement in the automatic results brought about by the new term weighting models. In the ad-hoc runs, with no qtf component, BM15 is 14% better than BM1 on average precision and about 9% better on high precision and recall. The corresponding figures for BM11 are 51% and 34% (Table 3). For the routing runs, where a considerable amount of relevance information had contributed to the term weights, the improvement is less, but still very significant (Table 4). For the manual feedback searches (Table 5) there was a small improvement when they were re-run with BM11 replacing BM15 in the final iteration.

The drawback of these two models is that the theory says nothing about the estimation of the constants, or rather parameters, k1 and k2. It may be assumed that these depend on the database, and probably also on the nature of the queries and on the amount of relevance information available. We do not know how sensitive they are to any of these factors.
Estimation cannot be done without sets of queries and relevance judgments, and even then, since the models are not linear, they do not lend themselves to estimation by logistic regression. The values we used were arrived at by long sequences of trials, mainly using topics 51-100 on the disks 1 and 2 database with the TREC-1 relevance sets.

Taking advantage of the very full topic statements to derive query term frequency weights gives another substantial improvement in the automatic ad-hoc results. Comparing the top row of Table 2 with the top row of Table 1, there is a 20% increase in average precision. The "noise" effect of the narrative and description fields is far more than outweighed by the information they give about the relative importance of terms (compare the "TCND" row of Table 1 with the top row of Table 2).

It remains to be discovered how well these new models perform in searching other types of database. Term frequency and document length components may not be very useful in searching brief records with controlled indexing, but one would expect these models to do well on abstracts. It is also rare to have query statements which are as full as the TIPSTER ones, so there are many situations in which a qtf component would have little or no effect.

7.2 Routing

Our results here (Table 4) were relatively good, and further improved when re-run with BM11. However, the TREC routing scenario is perhaps not particularly realistic, given the large amount of relevance information, which we made full use of as the sole source of query terms. In addition, the best of our runs depended on a long series of retrospective trials in which the number of query terms was varied. In a real-world situation one would have to cope with the early stages, when there would be few documents and little relevance information (initially none at all). It would be necessary to develop a term selection and weighting procedure which was capable of progressing smoothly from a minimum of prior information up to a TREC-type situation. It may be possible to come up with a decision procedure for term selection using something similar to the selection value w(1) × r/R. Perhaps a future TREC could include some more restrictive routing emulations.

7.3 Interactive ad-hoc searching

The result of this trial was disappointing: except on precision at 100 documents (Table 5), it was scarcely better than the official automatic ad-hoc run. On three topics it gave the best result of any of our runs, and two more were good, but the remaining 45 ranged from poor to abysmal. Little analysis has yet been done. For some topics it is clear that the search never got off the ground because the searcher was unable to find enough relevant documents to provide reliable feedback information, but the mean number found per topic was ten, which should have been enough to give reasonable results (cf. Table 6, where ten feedback documents perform quite well). Currently, there are discussions towards a more realistic set of rules for interactive searching for TREC-3, and we hope to develop a better procedure and interface.

7.4 Prospects

Paragraphs

When searching full text collections one often does not want to search, or even necessarily to retrieve, complete documents. Our new probabilistic models do not apply to documents where the verbosity hypothesis does not hold (Section 2.3). Some of the TREC-2 participants searched "paragraphs" rather than documents, and this is clearly right, provided a sensible division procedure can be achieved.
We made some progress towards developing a "paragraph" database model for the Okapi system, but there has not been time to implement it. Further work then needs to be done on methods of deriving the retrieval value of a document from the retrieval value of its constituent paragraphs.

Parameter estimation

Work is in progress on methods of using logistic regression or similar techniques to estimate the parameters for the new models.

Derivation and use of phrases and term proximity

A few results are reported in Table 3. They are not particularly encouraging. There is probably scope for further experiments in this area, not only on tuples of adjacent words but also on Keen-type [9] weighting of query term clusters in retrieved documents.

References

[1] Harman, D.K. (Ed.). The First Text REtrieval Conference (TREC-1). Gaithersburg, MD: NIST, 1993.

[2] Robertson, S.E. et al. Okapi at TREC. In: [1] (pp. 21-30).

[3] Walker, S. and Hancock-Beaulieu, M. Okapi at City: an evaluation facility for interactive IR. London: British Library, 1991. (British Library Research Report 6056.)

[4] Hancock-Beaulieu, M.M. and Walker, S. An evaluation of automatic query expansion in an online library catalogue. Journal of Documentation, 48, Dec. 1992, 406-421.

[5] Robertson, S.E. and Sparck Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science, 27, 1976, 129-146.

[6] Cooper, W. et al. Probabilistic retrieval in the TIPSTER collection: an application of staged logistic regression. In: [1] (pp. 73-88).

[7] Harter, S.P. A probabilistic approach to automatic keyword indexing. Journal of the American Society for Information Science, 26, 1975, 197-206 and 280-289.

[8] Robertson, S.E., van Rijsbergen, C.J. and Porter, M.F. Probabilistic models of indexing and searching. In: Oddy, R.N. et al. (Eds.), Information Retrieval Research (pp. 35-56). London: Butterworths, 1981.

[9] Keen, E.M. The use of term position devices in ranked output experiments. Journal of Documentation, 47, 1991, 1-22.

[10] Harman, D. Relevance feedback revisited. In: SIGIR '92. Proceedings of the 15th International Conference on Research and Development in Information Retrieval (pp. 280-289). ACM Press, 1992.

[11] Robertson, S.E. On term selection for query expansion. Journal of Documentation, 46, 1990, 359-364.

A  2-Poisson model with document length component

Basic ideas

The basic weighting function used is that developed in [8], and may be expressed as follows:

    w(x) = log [ P(x|R) P(0|R̄) ] / [ P(x|R̄) P(0|R) ]        (8)

where
    x is a vector of information about the document;
    0 is a reference vector representing a zero-weighted document;
    R and R̄ denote relevance and non-relevance respectively.

For example, each component of x may represent the presence/absence of a query term in the document (or, as in the case of formula 2 in the main text, its within-document frequency); 0 would then be the "natural" zero vector representing all query terms absent. In this formulation, independence assumptions lead to the decomposition of w into additive components such as individual term weights.

A document length may be added as a component of x; however, document length does not so obviously have a "natural" zero (an actual document of zero length is a pathological case). Instead, we may use the average length of a document for reference; thus we would expect to get a formula in which the document length component disappears for a document of average length, but not for other lengths.
Suppose, then, that the average length of a document is Δ. The weighting formula becomes:

    w(x, d) = log [ P((x, d)|R) P((0, Δ)|R̄) ] / [ P((x, d)|R̄) P((0, Δ)|R) ]        (9)

where d is document length, and x represents all other information about the document. This may be decomposed as follows:

    w(x, d) = w(x, d)_1 + w(x, d)_2

where

    w(x, d)_1 = log [ P(x|R, d) P(0|R̄, d) ] / [ P(x|R̄, d) P(0|R, d) ]

and

    w(x, d)_2 = log [ P((0, d)|R) P((0, Δ)|R̄) ] / [ P((0, d)|R̄) P((0, Δ)|R) ]

These two components are discussed further below.

Hypotheses

As indicated in the main text, one may imagine different reasons why documents should vary in length. The two hypotheses given there ("scope" and "verbosity" hypotheses) may be regarded as opposite poles of explanation. The arguments below are based on the verbosity hypothesis only.

The verbosity hypothesis would imply that document properties such as relevance and eliteness can be regarded as independent of document length; given eliteness for a term, however, the number of occurrences of that term would depend on document length. In particular, if we assume that the two Poisson parameters for a given term, λ and μ, are appropriate for documents of average length, then the number of occurrences of the term in documents of length d will be 2-Poisson with means λd/Δ and μd/Δ.

Second component

The second component of equation 9 is

    w(x, d)_2 = log [ P(0|R, d) P(0|R̄, Δ) ] / [ P(0|R̄, d) P(0|R, Δ) ]
                + log [ P(d|R) P(Δ|R̄) ] / [ P(d|R̄) P(Δ|R) ]

Under the verbosity hypothesis, the second part of this formula is zero. Making the usual term-independence assumptions, the first part may be decomposed into a sum of components for each query term, thus:

    w(t, d)_2 = log { (p' e^{-λd/Δ} + (1 - p') e^{-μd/Δ}) (q' e^{-λ} + (1 - q') e^{-μ}) }
                  / { (q' e^{-λd/Δ} + (1 - q') e^{-μd/Δ}) (p' e^{-λ} + (1 - p') e^{-μ}) }        (10)

where t is a query term and p', q', λ and μ are as in formula 2. Note that there is a component for each query term, whether or not the term is in the document.

For almost all normal query terms (i.e. for any terms that are not actually detrimental to the query), we can assume that p' > q' and λ > μ. In this case, formula 10 can be shown to be monotonic decreasing with d, from a maximum as d → 0, through zero when d = Δ, to a minimum as d → ∞. As indicated, there is one such factor for each of the nq query terms.

Once again, we can devise a very much simpler function which approximates to this behaviour; this is the justification for formula 5 in the main text.

First component

Expanding the first component of 9 on the basis of term independence assumptions, and also making the assumption that eliteness is independent of document length (on the basis of the verbosity hypothesis), we can obtain a formula for the weight of a term t which occurs tf times. This formula is similar to equation 2 in the main text, except that λ and μ are replaced by λd/Δ and μd/Δ. The factors d/Δ in components such as λ^tf cancel out, leaving only the factors of the form e^{-λd/Δ}.

Analysis of the behaviour of this function with varying tf and d is a little complex. The simple function used for the experiments (formula 4) exhibits some of the correct properties, but not all. In particular, the maximum value obtained as d → 0 should be strongly dependent on tf; formula 4 does not have this property.
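As a rough numerical check (with parameter values chosen by us purely for illustration, not taken from the experiments), the exact per-term second component of equation 10 and the simple correction factor of equation 5 can be compared directly: both are zero at d = Δ, positive for shorter documents and negative for longer ones.

    import math

    def exact_component(d, avdl, lam=3.0, mu=0.2, p=0.6, q=0.05):
        """Equation 10 for a single query term (illustrative parameter values)."""
        s = d / avdl
        num = (p * math.exp(-lam * s) + (1 - p) * math.exp(-mu * s)) * \
              (q * math.exp(-lam) + (1 - q) * math.exp(-mu))
        den = (q * math.exp(-lam * s) + (1 - q) * math.exp(-mu * s)) * \
              (p * math.exp(-lam) + (1 - p) * math.exp(-mu))
        return math.log(num / den)

    def simple_correction(d, avdl, nq=1, k2=0.2):
        """Equation 5 for nq query terms."""
        return k2 * nq * (avdl - d) / (avdl + d)

    # for d in (50, 200, 400, 800, 1600):   # avdl = 400
    #     print(d, round(exact_component(d, 400), 3), round(simple_correction(d, 400), 3))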
B  Extracts from a searcher's notes

Choice of search terms

Suitable words and phrases occurring in the title, description, narrative, concept and definition fields were underlined; often this provided more than enough material to begin with. Sometimes they were supplemented by extra words, e.g. for a query on international terrorism I added "negotiate", "hostage", "hijack", "sabotage", "violence", "propaganda", as well as the names of known terrorist groups likely to fit the US bias of the exercise.

I did not look at reference books or other on-line databases, and tended to avoid very specific terms like proper names from the query descriptions, as I found they could lead the search astray. For instance, the 1986 Immigration Law was also known as the Simpson-Mazzoli Act, but the name Mazzoli also turned up in accounts of other pieces of legislation, so it was better to use a combination of "real" words about this topic.

In some queries, it was necessary to translate an abstract concept, e.g. "actual or alleged private sector economic consequences of international terrorism", into words which might actually occur in documents, e.g. "damage", "insurance claims", "bankruptcy", etc. For this purpose the use of a general (rather than domain-specific) thesaurus might be a useful adjunct to the system.

Like the other participants I was surprised at the contents of the stop-word list, e.g. "talks", "recent", "people", "new", but not "these"! However it was usually possible to find synonyms for stop-words and their absence was not seriously detrimental to any query.

Grouping of terms, use of operators

Given the complexity of the queries, it was obviously necessary to build them up from smaller units. My original intention was to identify individual facets and create sets of single words representing each, then put them together to form the whole query. [...] For example, for a query about the prevention of nuclear proliferation I had a set of "nuclear" words (reprocessing, plutonium, etc.), a set of "control" words (control, monitor, safeguards, etc.) and sets of words for countries (argentina, brazil, iraq, etc.) suspected of violating international regulations on this point. This proved a bad strategy: the large sets (whether ORed or BMed^7 together) had low weightings because of their collectively high frequencies, and the final query was very diffuse.

A more successful approach was to build several small, high-weighted sets using phrases with OP=ADJ or OP=SAMES[entence] (e.g. economic trends, gross national product, standard of living, growth rate, productivity gains), and then to BM them together, perhaps with a few extra singletons (e.g. decline, slump, recession). Because of the TREC guidelines, I didn't look at any documents for the small sets as I went along, although under normal circumstances I would have done so.

Our initial instructions were to use default best-matching if at all possible, rather than explicit operators. As already suggested, ADJ and SAMES were an absolute necessity given the length of documents to be searched, but AND and OR were generally avoided; on the occasions when I tried AND (out of desperation) it was not particularly useful. For one query where I thought it might be necessary (to restrict a search to documents about the US economy) it luckily proved superfluous because of the biased nature of the database; indeed it would have made the results worse, as the US context of these documents was implied rather than stated.

7. BM = "best match"; the default weighted set combination operation was BM15 (see Section 2.6).

Viewing results, relevance feedback

Normally I looked at about the top 5-10 records from the first full query.
If 40% or more seemed relevant, the query was considered to be fairly satisfactory and I went on down the list trying to accumulate a dozen or so records for the extraction phase. As ... noted by other participants, there was a conflict between judging a record relevant because it fitted the query, and because it was likely to yield useful new terms for the next phase. On the one hand were the "newsbyte" type of documents containing one clearly relevant paragraph amidst a great deal of potential noise, and on the other the documents which were in the right area, contained all the right words, but failed the more abstract exclusion conditions of the query. I tried to judge on query relevance, but erred on the side of permissiveness for documents containing the right sort of terms.

The competition conditions discouraged a really thorough exploration of possibilities when a query was not initially successful. In one very bad case, having seen more than 20 irrelevant records and knowing that they would appear at the head of my output list, I felt that the query would show up badly in the [results] anyway and that it was not worth exploring further, as I might have done had there been a real question to answer.

Extracting new terms

I tried to get at least six relevant documents for the extraction phase, and usually managed a few more. As already noted, sets generated by term extraction contain only single words, so before looking at the new records I sometimes added in a few phrases to this set, either important ones from the original query or others which had occurred in relevant documents. The extracted sets of terms tended to be larger than the original query and certainly included items which a human searcher (at least one unfamiliar with this genre of literature) would not have thought of. It was amusing, for instance, to see "topdrawer" and "topnotch" (epithets for companies) extracted from documents about investment in biotechnology, and "leftist" (an invariable collocate for Sandinista) pulled out of documents about Nicaraguan peace talks. Some material for socio-linguistic analysis here!

My impression ... is that where the original document set from which terms were extracted was fairly coherent, the derived set [from query expansion] also had a high proportion of relevant documents. Not surprisingly, where I had scraped the barrel and tried several different routes to a few relevant documents, extraction produced equally miscellaneous and disappointing results.

Normally I went through two or three cycles of selection/extraction, but looking at fewer records each time. The set of extracted terms did not seem to change materially from one cycle to the next, and I would expect that the final result file reflected the query quite well even though the phrases had been lost.

Conclusion

In spite of the frustrations of this exercise, I found it a more interesting retrieval task than normal bibliographic searching, mainly because it was possible to see the full documents to gauge the success of the query, and to use a broader range of natural-language skills to dream up potentially useful search terms.

Acknowledgments

We are most grateful to the British Library Research & Development Department and to DARPA/NIST for their financial support of this work. Our advisers have been unstintingly helpful. We blame the system, not our panel of searchers, for the poor results of the interactive trial.
Above all, we wish to thank Donna Harman of NIST for her outstandingly efficient and courteous organisation and management of the TREC projects.

Table 5: Manual searches with feedback

    Run              AveP   P5     P30    P100   RP     Rcl    % of tops where AveP > median
    Official (BM15)  0.232  0.492  0.468  0.400  0.297  0.591  40
    Re-run (BM11)    0.247  0.480  0.477  0.411  0.315  0.607  48

    Database: disks 1 and 2. Topics: 101-150.

Table 6: Some results from query modification

    #fbk  Term                                                                   % of tops where
    docs  source   k3     Terms   AveP   P5     P30    P100   RP     Rcl    AveP > median
    50    TCNDdoc  8      all     0.369  0.660  0.591  0.487  0.408  0.754  92
    30    TCNDdoc  8      all     0.368  0.668  0.585  0.486  0.407  0.749  88
    10    TCNDdoc  8      all     0.363  0.668  0.584  0.485  0.400  0.748  88
    50    TCNDdoc  9      all     0.360  0.624  0.573  0.482  0.399  0.748  88
    50    TCNDdoc  9      top 40  0.354  0.620  0.573  0.478  0.394  0.741  86
    50    TCNDdoc  9      top 30  0.353  0.632  0.567  0.480  0.395  0.742  88
    30    TCNDdoc  8      top 30  0.360  0.676  0.577  0.479  0.402  0.741  88
    30    TCNDdoc  8      top 20  0.348  0.636  0.571  0.474  0.392  0.734  82
    30    TCNDdoc  8      top 10  0.318  0.604  0.537  0.449  0.366  0.702  78
    50    TCND     8      all     0.364  0.636  0.573  0.487  0.406  0.755  92
    30    TCND     8      all     0.362  0.644  0.573  0.484  0.408  0.749  90
    10    TCND     8      all     0.363  0.640  0.574  0.481  0.406  0.754  88
    30    TCND     0      all     0.334  0.652  0.559  0.458  0.374  0.711  80
    30    TCNDdoc  0      all     0.310  0.645  0.546  0.448  0.359  0.675  66

    Initial feedback run for comparison (top row of Table 2)
    None  TCND     large  all     0.360  0.652  0.569  0.479  0.401  0.754  92

    Retrospective run using all known relevant documents to reweight the topic terms
    Var.  TCND     0      all     0.371  0.708  0.600  0.497  0.408  0.758  92

    Database: disks 1 and 2. Topics: 101-150.