Automatic Retrieval With Locality Information Using SMART

Chris Buckley,* Gerard Salton and James Allan

Abstract

The Smart project at Cornell University, using a completely automatic approach for both routing and ad-hoc experiments, performed extremely well in the first Text Retrieval Conference. The basic ad-hoc approach uses local/global matching to achieve its results. A global match ensures that each retrieved document uses the same vocabulary as the query; a local match then attempts to guarantee that some local part of the document (e.g. a paragraph or sentence) focuses on the query topic. Runs were made with and without simple adjacency phrases. A simple relevance feedback algorithm is used for routing experiments; the original query is expanded by terms occurring in relevant documents, with term weights being based upon occurrence in the relevant documents. In addition, a set of system design issues and tradeoffs are examined.

Introduction

For over 30 years, the Smart project at Cornell has been interested in the analysis, search, and retrieval of heterogeneous text databases, where the vocabulary is allowed to vary widely, and the subject matter is unrestricted. Such databases may include newspaper articles, newswire dispatches, textbooks, dictionaries and encyclopedias, manuals, magazine articles, and so on. The usual text analysis and text indexing approaches that are based on the use of thesauruses and other vocabulary control devices are difficult to apply in unrestricted text environments, because the word meanings are not stable in such circumstances and the interpretation varies depending on context.
The applicability of more complex text analysis systems that are based on the construction of knowledge bases covering the detailed structure of particular subject areas, together with inference rules designed to derive relationships between the relevant concepts, is even more questionable in such cases. Complete theories of knowledge representation do not exist, and it is unclear what concepts, concept relationships and inference rules may be needed to understand particular texts.[5]

Accordingly, a text analysis and retrieval component must necessarily be based primarily on a study of the available texts themselves. Fortunately very large text databases are now available in machine-readable form, and a substantial amount of information is automatically derivable about the occurrence properties of words and expressions in natural-language texts, and about the contexts in which the words are used. This information can help in determining whether two or more texts are semantically homogeneous, that is, whether they cover similar subject areas. When that is the case, such semantically homogeneous texts can be linked, thereby generating an automatic structured text (hypertext) representation; alternatively, in a retrieval setting a text can be retrieved when another semantically homogeneous text is submitted as a query.

*Department of Computer Science, Cornell University, Ithaca, NY 14853-7501. This study was supported in part by the National Science Foundation under grant IRI 89-15847.

Automatic Indexing

In the Smart context, the vector-processing model of retrieval is used to transform both the available information requests as well as the stored documents into vector form of the type

$$D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$$

where $D_i$ represents a document (or query) text and $w_{ik}$ is a term weight of term $T_k$ attached to document $D_i$.
A weight of zero is used for terms that are absent from a particular document, and positive weights characterize terms actually assigned. The assumption is that $t$ terms in all are available for the representation of the information.

In choosing a term weighting system, low weights should be assigned to high-frequency terms that occur in many documents of a collection, and high weights to terms that are important in particular documents but unimportant in the remainder of the collection. The weight of terms that occur rarely in a collection is relatively unimportant, because such terms contribute little to the needed similarity computation between different texts.

A well-known term weighting system following that prescription assigns weights $w_{ik}$ to term $T_k$ in document $D_i$ in proportion to the frequency of occurrence of the term in $D_i$, and in inverse proportion to the number of documents to which the term is assigned.[6,9] Such a weighting system is known as a tf * idf (term frequency times inverse document frequency) weighting system. In practice the document length, and hence the number of non-zero term weights assigned to a document, varies widely. To give each text item an equal chance of being retrieved, it is convenient to use a length normalization factor as part of the term weighting formula. A high-quality term weighting formula for $w_{ik}$, the weight of term $T_k$ in document $D_i$, is

$$w_{ik} = \frac{f_{ik} \cdot \log(N/n_k)}{\sqrt{\sum_{j=1}^{t} \left(f_{ij} \cdot \log(N/n_j)\right)^2}} \qquad (1)$$

where $f_{ik}$ is the occurrence frequency of $T_k$ in $D_i$, $N$ is the collection size and $n_k$ the number of documents with term $T_k$ assigned. The factor $\log(N/n_k)$ is an inverse collection frequency factor which decreases as terms are used widely in a collection, and the denominator in expression (1) is used for weight normalization.

The terms $T_k$ included in a given vector can in principle represent any entities assigned to a
document for content identification. In the Smart context, such terms are derived by a text transformation of the following kind:[2]

1. recognize individual text words
2. use stop list to eliminate unwanted function words
3. perform suffix removal to generate word stems
4. optionally use term grouping methods based on statistical word co-occurrence, or word adjacency, computations to form term phrases (alternatively syntactic analysis computations can also be used)
5. assign term weights to all remaining word stems and/or phrase stems to form the term vector for all information items.

Once term vectors are available for all information items, all subsequent processing is based on term vector manipulations.

The fact that the indexing of both documents and queries is completely automatic means that the results obtained are reasonably collection independent and should be valid across a wide range of collections. No human expertise in the subject matter is required for either the initial collection creation, or the actual query formulation.

Text Similarity Computation

When the texts are represented by term vectors of the form $D_i = (w_{i1}, w_{i2}, \ldots, w_{it})$ and $D_j = (w_{j1}, w_{j2}, \ldots, w_{jt})$ for documents $D_i$ and $D_j$, a similarity ($S$) computation between two items can conveniently be obtained as the inner product between corresponding weighted term vectors as follows:

$$S(D_i, D_j) = \sum_{k=1}^{t} w_{ik} \cdot w_{jk} \qquad (2)$$

Thus, the similarity between two texts (whether query or document) depends on the weights of coinciding terms in the two vectors.

Information retrieval and text linking systems based on the use of global text similarity measures such as that of expression (2) will be successful when the common terms in the two vectors are in fact used in semantically similar ways.
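
Expressions (1) and (2) can be illustrated with a small sketch. It is not SMART code: the function names `ntc_vectors` and `inner_product`, the toy token lists, and the use of the natural logarithm are assumptions made for illustration, and the texts are taken to be already tokenized, stopped and stemmed.

```python
import math
from collections import Counter


def ntc_vectors(docs):
    """Weight each tokenized text with tf * idf and cosine-normalize it, as in expression (1)."""
    n_docs = len(docs)
    # n_k: the number of texts in which term k occurs at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in Counter(doc).items()}
        length = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / length for t, w in raw.items()} if length > 0 else {})
    return vectors


def inner_product(u, v):
    """Similarity of expression (2): sum of weight products over the terms both vectors share."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)


# Hypothetical toy collection: "sound" occurs in two texts with different senses.
docs = [
    ["sound", "wave", "audible"],
    ["sound", "water", "bay"],
    ["water", "bay", "harbor"],
]
vecs = ntc_vectors(docs)
print(inner_product(vecs[0], vecs[1]))  # non-zero only because both texts contain "sound"
print(inner_product(vecs[0], vecs[2]))  # no shared terms
```

Because every vector has unit length after normalization, the inner product here lies between 0 and 1, and a text compared with itself scores 1.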
In many cases it may however happen that highly-weighted terms that contribute substantially to the text similarity are semantically distinct. For example a sound may be an audible phenomenon or a body of water.

In determining the meaning of individual words we take advice from Wittgenstein and others who suggest that text understanding must be based on a study of how text words are used in the language ("word use" theory of text meaning).[11] In a mechanized text manipulation environment, "word use" may be interpreted as the contexts in which the words are used in the texts in which they occur. The assumption then is that identical words used in identical contexts (that is, in substantially similar phrases, sentences and paragraphs) are in fact semantically homogeneous. Contrariwise, similar words such as "sound" are expected to occur in different local environments when they represent different entities such as bodies of water and audible phenomena.

To detect similarity of local term environments, we carry out text similarity measurements such as those suggested by expression (2), but applied to small text units such as text sentences and text paragraphs. Two texts are then accepted as related only when a sufficiently high global text similarity exists, as well as sufficient local text similarities in the form of similarities between sentences and/or paragraphs in the texts under study.[7,8]

A complete text retrieval system based on text similarity computations is then generated in the following way:

1. formation of term vectors for the text items
2. computation of text similarities, and elimination of text pairs with insufficient global text similarities
3. computation of local text similarities for the remaining texts
4. retrieval of text items with sufficiently large global and local similarities
5. use of user relevance judgments for rephrasing of search requests using relevance feedback.

This process is expected to perform effectively in producing both high precision as well as high recall.

System Description

The Cornell TREC experiments use the SMART Information Retrieval System, Version 11, and were run on a dedicated Sun Sparc 2 with 64 Mbytes of memory and 5 Gbytes of local disk.

SMART Version 11 is the latest in a long line of experimental information retrieval systems, dating back over 30 years, developed under the guidance of G. Salton. Version 11 is a reasonably complete re-write of earlier versions, and was designed and coded by C. Buckley. The new version is approximately 41,000 lines of C code and documentation.

SMART Version 11 offers a basic framework for investigations into the vector space and related models of information retrieval. Documents are fully automatically indexed, with each document representation being a weighted vector of concepts, with the weight indicating the importance of a concept to that particular document. The document representatives are physically stored on disk as an inverted file. Natural language queries go through the same indexing process. The query representative vector is then compared with the indexed document representatives to arrive at a similarity. The documents are then fully ranked by similarity.

Specific Methodology Used for TREC Study

There are two major sets of Cornell TREC experimental runs. The first set is the official TREC set with ad-hoc runs using the local/global matching procedure described above in steps 1-4.
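
Steps 1-4 of the process above can be sketched roughly as follows. This is an illustrative sketch rather than the SMART implementation: the names `weigh`, `sim`, `flatten` and `local_global_search`, the hand-picked idf table, both threshold values, and the choice to leave sentence vectors unnormalized are all assumptions made for the example.

```python
import math
from collections import Counter


def weigh(tokens, idf, normalize):
    """tf * idf weights for one text unit, optionally cosine-normalized."""
    vec = {t: f * idf.get(t, 0.0) for t, f in Counter(tokens).items()}
    if normalize:
        length = math.sqrt(sum(w * w for w in vec.values()))
        if length > 0:
            vec = {t: w / length for t, w in vec.items()}
    return vec


def sim(u, v):
    """Inner product over shared terms."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def flatten(sentences):
    return [token for sentence in sentences for token in sentence]


def local_global_search(query_sents, docs, idf, global_min, local_min):
    """docs maps a document id to its list of tokenized sentences."""
    q_global = weigh(flatten(query_sents), idf, normalize=True)
    q_local = [weigh(s, idf, normalize=False) for s in query_sents]
    hits = []
    for doc_id, sents in docs.items():
        g = sim(q_global, weigh(flatten(sents), idf, normalize=True))
        if g < global_min:  # step 2: discard pairs with insufficient global similarity
            continue
        best_local = max(
            (sim(qs, weigh(s, idf, normalize=False)) for qs in q_local for s in sents),
            default=0.0,
        )
        if best_local >= local_min:  # steps 3-4: require one sufficiently similar sentence pair
            hits.append((doc_id, g, best_local))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Hypothetical data: document B shares vocabulary with the query, but never in one sentence.
idf = {"sound": 1.0, "bell": 2.0, "loud": 1.5, "ferry": 2.0,
       "crossed": 1.5, "tower": 1.5, "rang": 1.5}
query_sents = [["loud", "bell", "sound"]]
docs = {
    "A": [["bell", "sound", "loud"], ["tower", "rang"]],
    "B": [["ferry", "crossed", "sound"], ["tower", "bell", "rang"]],
}
for doc_id, g, best in local_global_search(query_sents, docs, idf, global_min=0.3, local_min=5.0):
    print(doc_id, round(g, 3), best)
```

In this toy data both documents pass the global test, but only document A contains a single sentence that matches the query closely enough, which is the situation the local match is meant to separate.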
There are two automatic runs in this set; one using only single term indexing and the second using both single term and two term phrases. There is also an official routing run using a simple relevance feedback technique to form a revised query based on relevance judgements from the training set.

The other set of runs provide an examination of some of the tradeoffs (disk space, memory, time, and effectiveness) encountered within a single information retrieval system. There are many decisions that need to be made when designing a system; the goal in this set of runs is to explore the consequences of some fundamental choices including

1. Degree of stemming
2. Size of stopword list
3. Inverse document freq weighting
4. Phrases
5. Query Optimization

Both sets of runs use completely automatic indexing of queries and documents. Queries and documents are treated as flat text; some sections (like DOCID) might be omitted, but all indexable text is treated the same (unfortunately, even if preceded by a NOT!). This ignores the structure (in both form and semantic meaning) of the queries which could be very useful. SMART has the capability to treat different parts of a query or document in different appropriate manners. However, using this structure would have tremendously complicated the second set of runs by adding another large set of variables to the experiments. Now that the choices investigated in the second set of runs have been made, future runs can use the structure of documents and queries.

Official Runs (Ad-hoc Queries)

The overall procedure used for both of the official runs (one with single terms only, the other including phrases) is given in Figure 1.

  1. Automatically index document collection D1+D2 using tf * idf weights, cosine normalized (ntc), creating inverted file.
  2. Automatically index query collection Q2 using tf * idf weights, cosine normalized (ntc)
  3. For each Q in Q2
     3.1 Compute sim of Q to each of documents in D1+D2 keeping track of the top 500 documents.
     3.2 Re-index Q, breaking it down into sentences. Each sentence is reweighted with tf * idf weights (ntn) and formed into a vector.
     3.3 For each D in the top 500 global sim documents
         3.3.1 Re-index D, breaking it down into sentences. Each sentence is reweighted with tf * idf weights (ntn) and formed into a vector.
         3.3.2 Do a pairwise comparison of every sentence vector of Q against every sentence vector of D. If some sentence match satisfies the local criteria, add a large constant to the global sim of Q vs D already computed.
     3.4 Return the top 200 documents out of the set of 500 documents.

Given the method above, first will be the documents satisfying the local criteria, sorted by global sim, and then the documents not satisfying the local criteria, sorted by global sim. The local criteria varied in the two official runs.

Single Term Global/Local Matching

Indexing the document collection for the single term run (step 1) took [illegible] hours, creating an inverted file of 690 Mbytes. The actual retrieval took [illegible] CPU seconds for all 50 queries and considerably longer in elapsed time (a large amount of time being spent waiting for disk io). For each query the text of 500 documents had to be read in, broken down into sentences, indexed, and weighted with idf weights. Then every indexed sentence in the query had to be compared against every indexed sentence in the document.

The local matching criteria are based on the detection of matching substructures in both the query texts and the texts of retrieved documents. The criterion used for the sample runs required the presence of at least one pair of matching text sentences with a
pairwise sentence similarity of at least 100.0. This was chosen to be high enough so that very few matches of only one term would satisfy the threshold but low enough so that several medium weighted terms could match and reach the threshold.

Phrase Global/Local Matching

The phrases being used were two-term SMART adjacency phrases. Phrases were adjacent non-stopwords, term components stemmed, that occurred at least 25 times in the learning (D1) document set. The term components were put into alphabetical order, thus the text phrases "information retrieval" and "retrieving information" both mapped to the same phrase concept. The phrases were treated as a separate concept type (ctype) within an indexed vector, and had their own dictionary and inverted file separate from those of the single terms. The components of phrases remained in the single term ctype.

Determination of phrases took 5.[illegible] hours, finding 4,700,000 phrases occurring in D1 at least once. Of those phrases 158,000 occurred at least 25 times. These phrases were then put into a dictionary and used as controlled vocabulary for phrases when doing the indexing of D1+D2. The single term indexing remained exactly as it was in the single term run (terms occurring in phrases were not removed from the vector). Both the terms and phrases were given a "natural" tf * idf weight (i.e., the idf weight was based on the collection frequency of the phrase itself rather than being a function of the idf values of the single term components). The cosine normalization was handled in the following way. All terms in the vectors had their weight divided by the cosine length of the single term sub-vector instead of the vector as a whole. Thus the weights of single terms in the final vector were exactly the same as if phrases were not being used. At retrieval time, the effect of a
phrase match was divided by 2. (The same effect could have been obtained by dividing the indexed weight of a phrase by sqrt(2).)

Indexing the document collection with phrases took 10.6 hours, creating an inverted file of 840 Mbytes. The actual retrieval took 2105 CPU seconds for all 50 queries and again considerably longer in elapsed time.

A more complicated local match criterion was used for the phrase run. The basic threshold was reduced from 100.0 (used in the single term run) to 75.0, but an additional restriction was placed on the match to ensure that no one term would contribute more than 65 percent of the computed pairwise sentence similarity for any sentence pair. This effectively eliminates sentence matches due to the presence of a single highly weighted term. The 65% was determined empirically from tests using the learning query/document sets; percentages ranging from 55% to 75% performed equally well.

The more complicated local match, in conjunction with the phrases, produced a very significant improvement in the phrase run as opposed to the single-term run. The 11-point average over 50 queries for the single-term run was 0.1738 while the phrase run did very well at 0.2032.

Official Routing Queries

Standard SMART relevance feedback techniques were used to automatically construct routing queries to be run on the test set of documents (D2). Each routing query was composed of terms from the original indexed query plus the "best" 30 terms from the documents in the learning set that were relevant to that query. The weight of each routing query term was a linear combination of the tf x idf weight in the original query, the tf x idf weight in each of the relevant documents, and the tf x idf weight in a single non-relevant document. E.
Ide's [4,10] feedback formula was used:

$$Q_{new} = Q_{old} + \sum_{i \in \mathrm{rel}} D_i - D_{\mathrm{nonrel}}$$

The idf component in the query and document weights was based on occurrences of the term in the learning set of documents only.

The routing query was then run against each of the documents in the test set. Those documents were indexed in the standard SMART fashion, with each term receiving a tf x idf weight, cosine normalized. Again the idf document weight was determined by the occurrences of the term in the learning set of documents only. Thus no collection information from the test set of documents was used.

It took 306 seconds to construct the full feedback query set (most of the time spent deciding which terms should be added to each query). It took 1.9 hours to index D2, forming an inverted file, and then 293 seconds to run the 50 reformulated queries against the inverted file.

Effectiveness of this simple method was reasonable but not spectacular. The 11-point average over 50 queries was 0.1924.

Tradeoff runs

This set of runs provide an examination of some of the tradeoffs (disk space, memory, time, and effectiveness) encountered within a single information retrieval system. There are many decisions that need to be made when designing a system; the goal in this set of runs is to explore the consequences of some fundamental choices including stopwords, stemming, phrases, and term weighting.

Conceptually the standard SMART indexing and retrieval algorithms are given below.

  INDEXING
  For each document/query text
     1.1 Break the text into tokens
     1.2 Determine if token is a common word (stopword) to be discarded.
     1.3 Stem all remaining tokens to their root forms.
     1.4 Assign concept numbers to each root, forming a "vector" of concepts.
     1.5 Weight each term in the vector.
     1.6 Store vector in an inverted file

  RETRIEVAL
  For each term in the query
     2.1 Get the inverted list of documents containing that term.
     2.2 For each document/weight on inverted list
     2.3    Add Qi * Di to the partial similarity computed for this document so far (Qi is this term's query weight and Di the document weight)
     2.4    Add document to current list of top documents if similarity is high enough.
  Return list of top documents to the user.

Table 1: Tradeoff runs (* indicates official TREC run)

  STANDARD
      1. ntc.ntc  (single terms) full 2 pass indexing
      2. ntc.ntc  (single terms) alternate indexing method making document vectors
  STOPWORD
      3. ntc.ntc  automatic stopword (added 69 terms occurring in 10% of coll)
      4. ntc.ntc  automatic stopword (added 350 terms occurring in 5% of coll)
      5. ntc.ntc  automatic stopword (added 1286 terms occurring in 2% of coll)
  STEMMING
      6. ntc.ntc  only plural stemming
      7. ntc.ntc  no stems
  LOCAL/GLOBAL
     *8. ntc.ntc  local/global (single terms)
      9. ntc.ntc  local/global (single terms, same thresholds as 2nd official run)
  QUERY OPTIMIZATION
     10. ntc.ntc  query efficiency optimization (15 docs guaranteed good)
  PHRASES
     11. ntc.ntc  phrase dictionary (> 25 times in D1, 158,000 out of 4.7 million)
    *12. ntc.ntc  local/global (phrases)
  OTHER WEIGHTS
     13. nnc.ntc  (single terms)
     14. lnc.ltc  (single terms)
     15. lnc.ltc  (phrases)

  Run   Doc indexing   Query indexing   Inverted file   Other file   Retrieval speed,   Retrieval effectiveness (averaged over 47 queries)
        time (hours)   time (seconds)   size (Mbytes)   size         50 queries         11-pt     NumRel   Recall/Prec
                                                        (Mbytes)     (seconds)                    total    at 200
    1   4.5/4.9        2.3 (13.6)       667             100          358                0.1813    3114     0.2614/0.3313
    2   4.7/0.7        2.7              ?               790          "                  "         "        "
    3   4.3/4.6        3.2 (13.1)       624             100          306                0.1828    3101     0.2587/0.3299
    4   4.0/4.3        3.0 (12.8)       528             100          166                0.1750    2978     0.2524/0.3168
    5   3.7/3.9        2.8              381             100          78                 0.1538    2658     0.2237/0.2828
    6   4.3/5.0        2.7              724             98           251                0.1745    3148     0.2605/0.3349
    7   4.2/4.7        1.6              752             98           235                0.1709    3101     0.2545/0.3299
   *8   4.7/0.7        2.7              667             790          1465               0.1783    3150     0.2636/0.3351
    9   ?              ?                ?               ?            "                  0.1982    3400     0.2856/0.3617
   10   ?              ?                ?               ?            97                 0.1693    2983     0.2476/0.3173
   11   7.5/8.0        3.8              892             104          415                0.1903    3298     0.2814/0.3509
  *12   9.7/0.9        2.7              892             1040         2405               0.2080    3555     0.3076/0.3782
   13   4.5            ? (88.5)         667             89           262**              0.1818    3203     0.2614/0.3407
   14   4.5            2.7              "               "            ?                  0.2249    3746     0.3272/0.3985
   15   8.1            ?                892             104          396                0.2424    3886     0.3394/0.4134

  **: timing on machine with 128 Mbyte memory.
  Query timing numbers in parentheses indicate CPU time using dictionary on disk.
  " = same as the row above; ? = not legible in this copy.

The weighting scheme used in step 1.5 determines whether the entire indexing process can be done in one pass or requires two. The standard SMART weight which we recommend when nothing is known about the collection is a straight tf * idf cosine normalized weight (ntc). Unfortunately the "idf" value cannot be computed without knowing the document frequency of the term. Thus an accurate idf requires a two pass algorithm, the first finding the collection frequency of all terms and the second actually assigning the tf * idf weight. Alternative one pass weighting schemes are discussed at the end of the tradeoff discussion.
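
The two-pass indexing just described, together with RETRIEVAL steps 2.1-2.4, might look like this in miniature. The sketch is not SMART code: `two_pass_index`, `retrieve`, the toy texts and the hand-written query vector are assumptions for illustration, tokens are taken to be already stopped and stemmed, and the ranking is done once at the end instead of maintaining the list of top documents during accumulation.

```python
import math
from collections import Counter, defaultdict


def two_pass_index(texts):
    """texts maps a document id to its list of stemmed, stopword-free tokens."""
    # Pass 1: only record, for every term, how many documents contain it.
    doc_freq = Counter()
    for tokens in texts.values():
        doc_freq.update(set(tokens))
    n_docs = len(texts)

    # Pass 2: re-read each text, assign tf * idf weights, cosine-normalize,
    # and append the result to the inverted lists.
    inverted = defaultdict(list)  # term -> [(doc_id, weight), ...]
    for doc_id, tokens in texts.items():
        raw = {t: f * math.log(n_docs / doc_freq[t]) for t, f in Counter(tokens).items()}
        length = math.sqrt(sum(w * w for w in raw.values()))
        for term, w in raw.items():
            if w > 0:  # a positive weight implies length > 0
                inverted[term].append((doc_id, w / length))
    return inverted, doc_freq, n_docs


def retrieve(query_vec, inverted, top_n):
    """Accumulate Qi * Di per document over the inverted lists, then rank."""
    partial = defaultdict(float)
    for term, q_weight in query_vec.items():
        for doc_id, d_weight in inverted.get(term, []):
            partial[doc_id] += q_weight * d_weight
    return sorted(partial.items(), key=lambda item: item[1], reverse=True)[:top_n]


# Hypothetical toy collection and query weights.
texts = {
    "d1": ["inform", "retriev", "system"],
    "d2": ["inform", "theori"],
    "d3": ["disk", "system"],
}
inverted, doc_freq, n_docs = two_pass_index(texts)
for doc_id, score in retrieve({"retriev": 1.0, "inform": 0.5}, inverted, top_n=10):
    print(doc_id, round(score, 3))
```

The first pass keeps nothing but the document frequencies, which is why the second pass has to read and tokenize every text again before it can weight and store it.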
Until then all indexing runs discussed will be two pass runs indexing the documents with ntc weights.

Intermediate Document Vectors. A two pass approach that uses minimal space involves doing steps 1.1 through 1.4 above on pass one but, instead of weighting and storing the actual vectors, just keeping track of the collection frequency of each term. Then pass 2 repeats steps 1.1-1.4 but can go ahead and compute the weight in 1.5 and store the vector. RUN 1 in Table 1 gives the timing figures for this approach: 4.5 hours for pass 1 and 4.9 hours for pass 2 (pass 2 takes longer because it needs to construct the inverted index).

An alternative to this approach is to store unweighted document vectors (as document vectors, not in inverted index form) for pass 1. Then pass 2 can ignore steps 1.1-1.4 and go directly to the weighting and inverted index construction. As RUN 2 shows, this is much quicker (4.7 hours for pass 1 and 0.7 hours for pass 2) but at a cost of doubling the amount of disk space needed.

Obviously the choice between these two approaches depends on whether indexing time or disk space is important to the database administrator.

Stopwords

No retrieval system wants to store inverted indices for all words in the text (at least for retrieval purposes). Words like "the", "of" and "a" are not useful for distinguishing relevant documents and take up an extremely large amount of space since they occur in nearly every document. The question is how many stopwords to ignore. SMART has a
standard collection independent list of 571 words that seem to convey little information about relevance. But individual collections often contain an additional number of words that give little information for the particular subject matter covered by that collection.

Three runs, RUNS 3-5, were made on TREC adding the most frequently occurring words in TREC to the standard SMART stopword list. RUN 3 added the 69 terms occurring in more than 10% of the collection; RUN 4 added 350 terms occurring in more than 5% of the collection; and RUN 5 added the 1286 terms occurring in more than 2% of the collection. The space savings are substantial, ranging from 7% to 43% of the inverted file size, with a corresponding savings in indexing time. Retrieval time is even more affected, as many of the very long inverted lists for common words no longer have to be dealt with. RUN 4 saves 54% and RUN 5 79%.

The penalty that needs to be paid for these savings is the retrieval effectiveness. There is no penalty for RUN 3, and the reduced effectiveness in RUN 4 is insignificant, but RUN 5 loses about 15%. Unless maximal effectiveness is needed, RUN 4 would seem to be worthwhile in practice.

One other potential problem with removing the most common words of the collection is user mystification. Users can understand that words like "the" don't help retrieval but may be surprised when sentences like

  The head and president of an American computer system company based in Washington said she expected to make a million systems by the end of the year.

contain no indexable words at all!
All words are among the standard SMART stopwords or occur in more than 10% of the TREC documents.

Stemming

Stemming is one of those areas where the tradeoffs can be somewhat subtle.[2,3] The standard SMART approach uses full stemming where most suffixes are removed (RUN 1). RUN 6 removes only plurals and RUN 7 does no stemming at all. An often-cited drawback of not doing full stemming is the increase in the dictionary size; however that is reasonably insignificant. More important is the increase in inverted file size due to multiple forms of the same word occurring in a document. Plural indexing increased the inverted file by 8.5% and using no stems increased it by 12.7%.

Indexing speed is given as an advantage of not doing full stemming but again that's reasonably insignificant. If full stemming is efficient then the cost is almost completely counter-balanced by the cost of creating a larger inverted index.

Retrieval speed is not normally mentioned as a disadvantage of full stemming, but seems to be a considerable factor. RUN 6 (plurals) is 30% faster than RUN 1 (full).

Retrieval effectiveness is often given as an advantage of full stemming over just plural removal, but the results here agree with other recent results: the differences between the two are insignificant. No stemming at all is noticeably worse, but not by an extraordinary amount (6%).

Local/global

The basic local/global algorithm is described in a previous section of this report.
Our current implementation is designed to increase flexibility at the cost of retrieval time: for every query, we had to go out and index and weight 500 documents from scratch. That means at retrieval time, we can do any sort of indexing, weighting, and restrictions we like. In an operational system, however, it's expected that the local restriction operation would be determined in advance, and would use preindexed sentence vectors. Thus indexing time and space would increase, but retrieval speed would go to a reasonable level (it currently takes 4 times as long for retrieval with local/global matching).

RUN 8 gives the timing and effectiveness figures for the first Cornell official run. Effectiveness for that run was disappointing; about the same as RUN 1 without local matching. However, between the time we submitted our first official run and the time of our second official run, we were able to get a better local restriction method working. Using the restriction method of the second official run (described in the main portion of the writeup), but on the single term collection (our second official run used phrases), we get a 10% improvement (RUN 9).

Query Optimization. In the stopword section above, the tradeoff between retrieval speed and retrieval effectiveness was examined by completely removing long inverted file stopword lists from the collection. This tradeoff can be examined directly at retrieval time by considering schemes to avoid looking at the longest inverted lists for query terms unless forced to.
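
One scheme of this kind processes query terms in decreasing order of query weight and stops once the terms not yet processed can no longer push the current leading documents out of the final list. The sketch below illustrates that idea under stated assumptions; it is not the implementation described in [1]. The function name `optimized_retrieve`, its parameters, and the toy inverted lists are hypothetical, weights are assumed non-negative, and the stopping bound assumes that no document weight for a term exceeds the query weight of that term.

```python
from collections import defaultdict


def optimized_retrieve(query_vec, inverted, top_n, guaranteed):
    """Return (top documents, number of query terms actually processed)."""
    terms = sorted(query_vec, key=query_vec.get, reverse=True)
    # Upper bound on what the unprocessed terms can still add to any one document,
    # assuming each document weight is at most the matching query weight.
    remaining = sum(query_vec[t] ** 2 for t in terms)
    partial = defaultdict(float)
    used = 0
    for term in terms:
        ranked = sorted(partial.values(), reverse=True)
        if len(ranked) >= guaranteed:
            sim_x = ranked[guaranteed - 1]
            # Documents outside the current top_n score at most this much right now.
            sim_n = ranked[top_n - 1] if len(ranked) >= top_n else 0.0
            if sim_n + remaining < sim_x:
                break  # the current top `guaranteed` documents cannot drop out of the top_n
        for doc_id, d_weight in inverted.get(term, []):
            partial[doc_id] += query_vec[term] * d_weight
        remaining -= query_vec[term] ** 2
        used += 1
    top = sorted(partial.items(), key=lambda item: item[1], reverse=True)[:top_n]
    return top, used


# Hypothetical inverted lists: "common" has the long list and the low query weight.
inverted = {
    "rare": [("d1", 0.9)],
    "common": [("d1", 0.1), ("d2", 0.2), ("d3", 0.2)],
}
query_vec = {"rare": 0.95, "common": 0.3}
top, used = optimized_retrieve(query_vec, inverted, top_n=2, guaranteed=1)
print(top, used)
```

In this toy case the long list for "common" is never read when one document is to be guaranteed, at the price of a partial similarity value and an incomplete second rank.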
The basic method used here (described in more detail in [1]) is to sort the query by decreasing query weight (thus hopefully putting query terms with long lists, and therefore low idf, at the end), go through the query term by term, and stop when it is guaranteed that a certain number of "good" documents will appear in the final list of 200 top documents. Here, "good" means retrieved in the top 200 if all query terms are used.

If X good documents are to be guaranteed out of the 200 documents, the decision procedure invoked before each term k of the query is examined tests whether

    S(D_200) + [...] < S(D_X)

where the elided term depends on the weights of the remaining query terms. This makes the assumption that the weight of a term in a document will be less than the weight of that term in the query. This is almost always true with tf*idf weights (except for very short documents).

[Table: timing and retrieval figures for X = 1, 5, 10, 15, 25, 50, 75, 100, 150 and 200 (full search). The individual entries are not legible in the source.]

As X decreases, retrieval effectiveness and CPU time decrease pretty smoothly, with retrieval effectiveness remaining reasonable for quite a long time. Exactly which point is suitable for any particular application is determined by the relative priorities of efficiency and effectiveness.

Phrase runs

The basic adjacency phrase approach used by SMART is described in the official run portion of the paper. We have looked at other methods of finding phrases, but our other
implementations were too slow to be of use with TREC. Adjacency phrases have the advantage of being fast, simple, and producing reasonable results. They have the disadvantage that some sort of filtering operation has to be performed to come up with a good phrase list; there are just too many pairs of terms to index all of them. For these runs, we used the criterion that the phrase had to occur more than 25 times in D1, the learning document set.

We had hoped that phrases would help substantially in TREC since, as the collection grows, the need to be more specific in the query grows, and phrases would be a good way of increasing precision. We got improvement, but it remained in the range of 5-8%, about what it is on the very small conventional test collections of the past. Perhaps other phrase approaches can do better.

Phrases are indexed with tf*idf weights, but the cosine normalization of the entire vector is done over the length of the single term subvector only. This means that the single terms end up with exactly the same weight as they would if the entire collection were indexed with only single terms. Thus, phrases only increase similarity. This seems to be quite important for some collections, although not crucial for our phrase selection on TREC.

In our runs, the global phrase run RUN 11 does about 5% better in retrieval effectiveness than RUN 1, at a cost of increased indexing time, indexing space and retrieval time. Similarly, the local/global phrase run RUN 12 is about 5% better than the local/global single term run with the same parameters, RUN 9.

Alternative Weighting Schemes

The two passes needed for idf weights in documents are a definite burden.
In earlier experiments with other collections we found that not using idf in documents (while still using it in queries) was very reasonable, just a bit less effective than using idf. When tried on TREC 1, tf-cosine normalized (nnc) document weights even proved to be marginally better than the ntc document weights. The one pass nnc weights, RUN 13, took less than half the total indexing time of RUN 1. There is no question that this is a major advantage of nnc. The possible advantage that ntc weights might have is in feedback, where normally query weights and document weights are combined to form new query weights.

A variant of term frequency weighting that has been in SMART for a couple of years is the l scheme (e.g., ltc instead of ntc). The l stands for log; (1.0 + ln(tf)) is used instead of tf, the number of times a term occurs in a document. The goal is to downweight the importance of the tf factor in collections which have very long documents. That fits TREC very well. RUN 14 and RUN 15 describe using lnc document weights and ltc query weights for single terms and phrases respectively. They work remarkably well, about 20% better than the corresponding nnc.ntc runs. That is an enormous improvement.

Actually, half of the improvement is somewhat questionable. About 10% out of the total 20% is due to the ltc query weights and the other half is due to the lnc document weights. The document weight improvement is reasonable: there are hundreds of thousands of documents of all sizes indexed with lnc weights.
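The weighting triples discussed here can be illustrated with a short sketch. The (1.0 + ln(tf)) factor and the cosine normalization come from the text; the function name, the dictionary arguments and the idf formula ln(N/df) are assumptions made for the example rather than the SMART code itself.

```python
import math

def smart_weights(tf, df, num_docs, scheme="ltc"):
    """Weight a {term: tf} vector under a three-letter scheme.
    Letter 1: 'n' raw tf, 'l' 1.0 + ln(tf).
    Letter 2: 'n' no idf, 't' multiply by idf (assumed here to be ln(N/df)).
    Letter 3: 'n' no normalization, 'c' cosine normalization."""
    tf_code, idf_code, norm_code = scheme
    weights = {}
    for term, freq in tf.items():
        w = 1.0 + math.log(freq) if tf_code == "l" else float(freq)
        if idf_code == "t":
            # Needs collection statistics, hence a second indexing pass.
            w *= math.log(num_docs / df[term])
        weights[term] = w
    if norm_code == "c":
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {t: w / length for t, w in weights.items()}
    return weights
```

Note that the nnc and lnc schemes never touch `df` or `num_docs`, which is why they can be computed in a single pass over the documents. For a term occurring 10 times next to a term occurring once, nnc gives a weight ratio of 10, while lnc gives 1 + ln(10), roughly 3.3, which is the intended dampening for long documents.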
I feel the strong improvement due to ltc query weights is almost certainly an artifact of the TREC queries, and possibly even an artifact of the second query set (queries 51-100). 50 queries is a small enough number that random effects can be important. An average user-supplied query will not have the distribution of terms that the TREC queries have. For that reason, ltc query weights cannot be generally recommended. In tests on small collections, ltc performs about the same as ntc. It shouldn't hurt to use ltc, but don't bet the farm on it for TREC 2!

Failure Analysis

There seems to be little consistent that can be said about the performance of Smart in the ad-hoc experiments. Smart does comparatively better (when compared with the median values) on queries with a lot of relevant documents, as opposed to those with few relevant documents, where it often does substantially worse. But it is hard to tell whether that is a feature of the system or of the queries.

For some queries, the Smart performance is very poor because the query structure is ignored. That is especially true of queries using NOT clauses. The NOT is ignored and the following words are treated as positive relevance indications.

In general, the local match requirement does not have as big an effect on these queries as it has on other collections. There are definite successes; for example, query 69 on "Attempts to Revive the SALT II Treaty." The local requirement rejects all documents that deal with industrial salts instead of a peace treaty. But in TREC 1 at least, there are few queries in which ambiguous words played an important part.

Query 69 - Global Match Only

Num      Rel?  Sim    Title
349983   Y     0.48   REVIEW & OUTLOOK (Editorial): Breaking With SALT II
339345   Y     0.45   Letters to the Editor: Salt Ceilings Serve U.S. Interests
204128         0.36   [disposal of waste salt]
187056         0.33   [Superconducting compositions of the general formula]
370883         0.33   Diamond Crystal Peppered by Rivals Admits It's Licked --- Maker
582868         0.32   Salt Rationed in Many Parts of China
358376   Y     0.32   REVIEW & OUTLOOK (Editorial): No-Sweat SALT
232619         0.31   [process for recovering metals and metallic salts]
132873         0.30   [Salt deposits have economic significance]
206721         0.30   [interaction between intact salt and crushed salt]

Query 69 - Local/Global Match

Num      Rel?  Sim    Title
349983   Y     10.48  REVIEW & OUTLOOK (Editorial): Breaking With SALT II
339345   Y     10.45  Letters to the Editor: Salt Ceilings Serve U.S. Interests
358376   Y     10.32  REVIEW & OUTLOOK (Editorial): No-Sweat SALT
342288   Y     10.27
90352          10.27
530499   Y     10.27
167476   Y     10.25
353163   Y     10.23
534139   Y     10.23
166087         10.23
340954   Y     10.22

Titles given for the remaining documents, in the order listed (fewer titles than rows survive, so the row alignment is uncertain):
[House banned funds for deployment of weapons]
[Arms control purposes include strengthening]
Arms Control Restrictions Figure In Pentagon Budget Battle
House OKs Pentagon Spending Bill Eds: To update i
[examines all the major arms control treaties]
House Passes Bill Slashing $33 Billion From Reagan's Military

The routing run performs quite badly on a number of queries because of query expansion. The query length after terms from relevant documents are added is about twice the query length of the original query. A large portion of the added terms are general terms that have nothing to do with the query topic, but still get a high weight because of the number of relevant documents they occur in.
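The expansion problem can be made concrete with a small sketch. It is deliberately simplified and is not the feedback formula used in the routing runs: the function name and the weight (the fraction of relevant documents containing the term) are assumptions chosen only to show how a general term can outrank topical ones.

```python
from collections import Counter

def expand_query(query, relevant_docs, num_new_terms):
    """query: {term: weight}; relevant_docs: list of term lists.
    Adds the terms that occur in the most relevant documents."""
    doc_freq = Counter()
    for doc in relevant_docs:
        doc_freq.update(set(doc))  # count each term once per document
    expanded = dict(query)
    candidates = [t for t, _ in doc_freq.most_common() if t not in query]
    for term in candidates[:num_new_terms]:
        # Hypothetical weight: share of relevant documents containing the term.
        expanded[term] = doc_freq[term] / len(relevant_docs)
    return expanded
```

For a query on "salt" and "treaty" whose relevant documents all happen to contain the word "said", this sketch adds "said" first and gives it the maximum weight, even though the word says nothing about the topic.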
Two approaches to try in the future are improving the query expansion process and using local/global matching to ensure that a retrieved document has something in common with the original query as well as with the expanded query.

Automatic versus Manual

One of the great unresolved debates of information retrieval is whether automatic approaches using no direct human expertise are better or worse than manual approaches, where human expertise is directly involved in fashioning a query. In one sense, of course, the answer is obvious. By definition the automatic approach is barred from using manual techniques, while the manual approach can use the best automatic approach and then add a bit of human knowledge on top of that. But absent such piggy-backing, the results from TREC 1 suggest that, for the time being, the approaches are roughly equal. The best automatic runs and the best manual runs end up about the same. As expected, the manual runs seem to do relatively better on precision and the automatic runs better on recall, but the effects are so small as to be insignificant. It is clear that the local match requirement of the Cornell local/global approach, which is almost purely a precision enhancing device, still does not increase precision to a point that is as good as a manual search.

Conclusion

The Cornell approach of using completely automatic methods to index and retrieve works extremely well. Requiring, as well, that some small part of a document (a sentence) highly matches the query gains about 10% in effectiveness.
Using simple adjacency phrases in addition to single terms improves effectiveness by 5% to 8%.

There are a host of tradeoffs to be considered when designing and creating an information retrieval collection. Many of them produce surprising gains in efficiency, at only a minor cost in effectiveness.

References

1. C. Buckley and A. Lewit, Optimization of Inverted Vector Searches, Proc. Eighth Int. ACM/SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1985, 97-110.
2. D. Harman, A Failure Analysis of the Limitation of Suffixing in an Online Environment, Proc. Tenth Int. ACM/SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1987.
3. D. Harman, Towards Interactive Query Expansion, Proc. Eleventh Int. ACM/SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1988, 321-331.
4. E. Ide, New Experiments in Relevance Feedback, in The SMART Retrieval System - Experiments in Automatic Document Processing, G. Salton, ed., Prentice Hall, Englewood Cliffs, NJ, 1971, Chapter 16.
5. G. Salton, Developments in Automatic Text Retrieval, Science 253, 30 August 1991, 974-980.
6. G. Salton, Automatic Text Processing - The Transformation, Analysis and Retrieval of Information by Computer, Addison-Wesley Publishing Co., Reading, MA, 1989.
7. G. Salton and C. Buckley, Global Text Matching for Information Retrieval, Science 253:5023, 30 August 1991, 1012-1015.
8. G. Salton and C. Buckley, Automatic Text Structuring and Retrieval - Experiments in Automatic Encyclopedia Searching, Proc. Fourteenth Int. ACM/SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, 1991, 21-30.
9. G. Salton and C. Buckley, Term-Weighting Approaches in Automatic Text Retrieval, Information Processing & Management, Vol 24 No 5, 1988, 513-523.
10. G. Salton and C. Buckley, Improving Retrieval Performance by Relevance Feedback, JASIS, Vol 41 No 4, 1990, 288-297.
11. L. Wittgenstein, Philosophical Investigations, Basil Blackwell & Co. Ltd., Oxford, England, 1953.