APPENDIX B

This appendix contains charts created from the supplemental forms filled out by each group about their system. These charts are meant to supplement the papers and contain a standardized and formatted description of system features and timing aspects.

IA. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- METHODS USED

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers [6] | Siemens | UMASS
- Stopword List: Yes, standard list | Yes, SMART | Yes, SMART | Augmented SMART | Yes, SMART | Yes
- Number of Words: 571 | 571 | 600 | 571 | 424
- Controlled Vocabulary: No | No | No | Yes, 4 terms, assigned automatically
- Stemming: Yes | Yes | Yes | Yes | Yes
- Standard Algorithms: SMART | SMART [3] | SMART ver. 10 | SMART | Porter
- Morphological Analysis: None | None
- Term Weighting: Yes [1] | Yes [4] | Yes [5] | Yes [7] | None during indexing
- Phrase Discovery: Yes | Yes | None | None during indexing
- Statistical Methods: [2] | [2]
- Syntactic Parsing: None | None during indexing
- Word Sense Disambiguation: None | Yes [8] | None during indexing
- Heuristic Associations: None | Yes (WordNet links) | None during indexing
- Spell Checking (With Manual Correction): None | None during indexing
- Spelling Correction: None | None during indexing
- Proper Noun Identification: Not used | Not used | None | None during indexing
- Tokenizer: Not used | Not used | None | Yes (company, city, & country names)
- Use of Manually Indexed Terms: No | No
- Other Techniques to Build Data Structures: None

NOTES:
[1] "ltc" document weights for dortV1; "lsp" document weights for dortP1; "lsr" document weights for dortP2 and dortL2; "lsr" query regression query weights for dortP2; "lnn" query weights for dortL2.
[2] Any pair of adjacent non-stopwords that occurs 25 times in D1.
[3] Modified Lovins algorithm.
[4] "ltc" document weights for adhoc and routing; "ltc" query weights for adhoc; Rocchio-based feedback query weights for routing.
[5] Weights determined from various frequency statistics by logistic regression.
[6] Because we used the UMASS INQUERY system and its indexing, all of the answers to the questions in this section for our systems are identical to those for the UMASS system.
[7] Document vectors use "lnc" weights suggested by Buckley et al. in the TREC-1 Proceedings: the weight of a term is proportional to its frequency in that document and a cosine factor. Query vectors use "ltn" weights: a term frequency factor multiplied by an inverse document frequency factor. Original query terms are normalized; additional terms are normalized by the length of the original terms.
[8] Done by hand when selecting synsets to add to topic text.
wgt = 0.5 + 0.5 * tf/max_tf(doc); SMART "ann" weighting. Source text is preparsed into SMART format before being processed by SMART according to the above parameters.
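The last note quotes the standard augmented term-frequency formula. As a quick illustration, here is a minimal Python sketch of that "ann"-style weighting; the function name and data layout are ours, not any group's code:

```python
# Augmented tf weighting quoted above: wgt = 0.5 + 0.5 * tf / max_tf(doc)
# (SMART "ann": augmented tf, no idf factor, no normalization).
from collections import Counter

def ann_weights(doc_tokens):
    """Return {term: weight} using augmented tf over one document."""
    tf = Counter(doc_tokens)
    max_tf = max(tf.values())
    return {term: 0.5 + 0.5 * freq / max_tf for term, freq in tf.items()}

print(ann_weights(["oil", "spill", "oil", "tanker"]))
# {'oil': 1.0, 'spill': 0.75, 'tanker': 0.75}
```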
IA. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- METHODS USED (CONT)

SYSTEM NAME: GE [1] | CLARIT | LSI | HNC | NYU | SYRACUSE | QUEENS
- Stopword List: Yes | No | Yes, SMART | Yes | Yes | Yes (Stage 2) | Yes
- Number of Words: 5 [2] | 570 | 375 | 380 | 48 (Stage 2) | 630
- Controlled Vocabulary: No | No | No | No | No | No
- Stemming: None | Yes | Yes | Yes | Yes | Yes
- Standard Algorithms: SMART | Lovins | No | Kelly & Stone [13] | Porter
- Morphological Analysis: Yes [3] | No | Yes | No
- Term Weighting: ranking only | Yes [4] | "ltc" [10] | tf, idf | Yes | Yes [14] | Yes
- Phrase Discovery: No | Yes | Yes | Yes | Yes
- Kind of Phrase: simplex noun phrases [5] | 2-word [12] | NP's, VP's, others | proper nouns & complex nominals
- Statistical Methods: No | Yes | Partially | No | Yes
- Syntactic Methods: Yes [6] | Yes | Yes
- Syntactic Parsing: No | Yes, see above | No | Yes | Yes (partially) [15] | No
- Word Sense Disambiguation: No | Yes [7] | No | No | Yes [16] | No
- Heuristic Associations: No | No | No | Yes (synonymy, specializations) | Yes (semantic relations between concepts) | No
- Spell Checking (With Manual Correction): No | No | No | No | No | No
- Spelling Correction: No | No | No | No | No | No
- Proper Noun Identification: No | Yes [8] | No | Partial | Yes | No
- Tokenizer: No | No | No | Yes (year numbers) | Yes (dates) | No
- Use of Manually Indexed Terms: No | No | N/A | No | No | No
- Other Techniques to Build Data Structures: Yes [9] | Yes [11] | [17]

NOTES:
[1] No pre-indexing; automatic query generation only.
[3] A comprehensive inflectional morphology is used to produce word roots. Participles are retained in surface forms (although normalization is possible). No derivational morphology is used.
[4] 1) IDF*TF over phrases for retrieval. 2) A combination of statistics, including frequency and distribution, for thesaurus discovery.
[5] Simplex noun phrases -- not including prepositional phrases or relative clauses.
[6] A deterministic, rule-based parser that nominates noun phrases based on testing for phrase-boundary conditions. The parser grammar includes heuristics for syntactic category disambiguation.
[8] Words not identified in the lexicon (about 100,000 root forms of English) are assumed to be "candidate" proper nouns. This technique does not appeal to case information, etc.
[9] 1) Thesaurus Discovery -- which we use for query vector augmentation -- involves the identification of core characteristic terminology over a document set. It scores terminology according to several parameters, including frequency and distribution, and then selects the subset of terminology that optimizes these scores. 2) Documents are broken into smaller, paragraph-size units called "subdocuments." The subdocuments are the units from which statistics are drawn and over which similarity is measured.
[10] Yes, "ltc": log(term freq in doc) * term-idf weight * document length normalization.
[11] The LSI SVD analysis of the term-by-document matrix. LSI takes a term-document matrix, transforms it by a user-specified weighting scheme (SMART's "ltc" in these experiments), and then calculates the best k-dimensional approximation to this matrix using singular value decomposition (SVD). The number of dimensions k was about 200 for TREC-2. All retrieval is done in this 200-dimensional LSI space rather than using raw term overlap.
[12] A table of 528 manual and 13,787 automatic 2-word phrases. When these are identified in adjacent positions in documents or queries, they are used as additional terms.
[13] Kelly & Stone's algorithm with minor modifications, using LDOCE as the accompanying lexicon for table look-ups.
[14] Stage 1: based on the text structure (discourse structure) of texts. Stage 2: implicitly with the conceptual graph matching/scoring scheme.
[15] Part-of-speech tagging, phrase and clause bracketing, and special handlers in the RCD module.
[16] Stage 1: for all words in text. Stage 2: only when RIT codes are assigned.
[17] 1) Assignment of subject field codes to individual words. 2) SFC vector construction for documents. 3) Proper noun knowledge base construction. 4) Inverted index construction. 5) Assignment of RIT codes. 6) Conversion of text into conceptual graph representation.
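Note [11] describes the LSI construction. The following minimal Python sketch shows the same pipeline on a dense matrix (real TREC collections need sparse SVD routines; the names and the dense layout are our illustrative assumptions):

```python
# Truncated-SVD construction of an LSI space, per note [11].
import numpy as np

def lsi_space(term_doc, k=200):
    """term_doc: weighted term-by-document matrix -> rank-k embeddings."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    term_vecs = U[:, :k] * s[:k]      # one k-dim vector per term
    doc_vecs = Vt[:k, :].T * s[:k]    # one k-dim vector per document
    return term_vecs, doc_vecs, U, s

def fold_in(term_weights, U, s, k=200):
    """Project a query (or unseen document) term vector into LSI space."""
    return term_weights @ U[:, :k] / s[:k]
```

Retrieval then compares these k-dimensional vectors (e.g., by cosine) instead of raw term overlap.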
IA. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- METHODS USED (CONT)

SYSTEM NAME: City | ERIM | TMC | CITRI | UCLA | SEC | ETH | TRW
- Stopword List: Yes | No | Yes | No | Yes | No
- Number of Words: 658 stopwords + 284 semi-stopwords [1] | 438 stopwords + 58 semi-stopwords | 250 | 370 stopwords [5]
- Controlled Vocabulary: No | No | No | No | No | No | No
- Stemming: Yes | No | Yes | Not currently | Yes | based on Porter | suffix stripping
- Standard Algorithms: based on Porter [2] | Lovins | (Porter, 1980)
- Morphological Analysis: Not yet | No
- Term Weighting: No | No | Yes | IDF | No | Yes | Yes [7]
- Phrase Discovery: No, a few phrases are recognized | No | Yes | No | Yes [6]
- Kind of Phrase: adjacent words | noun phrases
- Statistical Methods: statistical tagging
- Syntactic Methods: Yes
- Syntactic Parsing: No | No | No | No | No | No
- Word Sense Disambiguation: No | No | No | No | None | No
- Heuristic Associations: No | No | No | No | No
- Spell Checking (With Manual Correction): No | No | No | No | No | No
- Spelling Correction: No | No | No | No | No | No
- Proper Noun Identification: No | No | No | No | Yes | No
- Tokenizer: Not used [3] | No | No | Not used [3] | No | No
- Which Patterns: date ranges | date ranges
- Use of Manually Indexed Terms: No | No | No | No | No | No
- Other Techniques to Build Data Structures: No | [4] | Yes [8]

NOTES:
[1] Two stoplists were used. System 1 (for routing and manual/feedback): 411 stopwords + 58 semi-stopwords. System 2 (for the automatic adhoc run): 247 stopwords + 226 semi-stopwords. Semi-stopwords are not used in query expansion unless they also appear in the query.
[2] Based on Porter with enhancements to deal with peculiar plurals and partial conflation of British/American spellings.
[3] Date ranges are recognized, but no use was made of this.
[4] Fast inversion method operating in limited main memory (peak requirement 40 MB) and limited temporary disk (peak requirement 50 MB more than the final index). Both text and index are stored compressed. All alphanumeric strings are indexed.
[5] Semi-stopwords are not used in query expansion unless they also appear in the query.
[6] Some phrases are recognized by the default Okapi go-list. New phrases were "discovered" from the concepts sections of topics 1-100 by treating comma-separated text as phrases.
[7] Weights are nff * nidf, with n documents in total, feature f, and document d:
    feature frequency: ff(f,d)
    normalized feature frequency: nff(f,d) = ff(f,d) / max{ff(f',d) | f' in d}
    document frequency: df(f) = |{d_j | f in d_j}|
    inverse document frequency: idf(f) = log((n+1)/(df(f)+1))
    normalized inverse document frequency: nidf(f) = idf(f)/log(n+1) = 1 - log(df(f)+1)/log(n+1)
    The nidf values are pre-computed from disks 1 & 2 only; new features from disk 3 have an nidf of 1.0.
[8] Each term t is mapped to a 32-bit integer by applying two hash functions to the term and hashing the two resulting numbers into one number. At most 3 terms map to the same integer (only in two cases); only 426 terms (0.1% of all terms) are mapped to ambiguous numbers. We used no special index data structures for TRW1 (proximity queries).
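Note [7] fully specifies the ETH weighting, so it transcribes directly into code. A minimal Python sketch, with illustrative function names:

```python
# ff*nidf weighting from note [7].
import math

def nidf(df, n):
    """nidf(f) = idf(f)/log(n+1) = 1 - log(df(f)+1)/log(n+1)."""
    return 1.0 - math.log(df + 1) / math.log(n + 1)

def feature_weights(doc_ff, df_table, n):
    """doc_ff: {feature: ff(f,d)}; df_table: {feature: df(f)} from training.
    Unseen features get df = 0, hence nidf = 1.0, as the note requires."""
    max_ff = max(doc_ff.values())
    return {f: (ff / max_ff) * nidf(df_table.get(f, 0), n)
            for f, ff in doc_ff.items()}
```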
IA. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- METHODS USED (CONT)

SYSTEM NAME: ADS | UIC [2] | MEAD | UIF | IDSR
- Stopword List: Yes | Yes | No | Yes | Yes
- Number of Words: 421 | 632 | None | 166 [6] | 399
- Controlled Vocabulary: No | No | No | No | No
- Stemming: Yes (ads1) | Yes | No | Yes, but not used | Yes
- Standard Algorithms: Paice, 1990 | extracted from SMART, coded in SPITBOL | None | Lovins, modified | Paice
- Morphological Analysis: No | No | No | None | No
- Term Weighting: No | IDF [3] | No | tf * idf | Yes
- Phrase Discovery: No | Yes [4] | No | No | No
- Statistical Methods: statistics on word pairs computed & used | None
- Syntactic Methods: No | None
- Syntactic Parsing: No | No | None | No | No
- Word Sense Disambiguation: No | No | No | Yes [7] | No
- Heuristic Associations: No | No | No | No
- Short Definition: word co-occurrences | None
- Spell Checking (With Manual Correction): No | No | No | No | No
- Spelling Correction: No | No | No | No | No
- Proper Noun Identification: No | No | No | No | No
- Tokenizer: No | No | No | No | No
- Use of Manually Indexed Terms: No | No | No | No | No
- Other Techniques to Build Data Structures: Yes [1] | list of offsets [5] | None

NOTES:
[1] Binary classification trees built automatically from the original documents and the topic statements.
[2] Searching is done on the fly, as raw text is processed. The intermediate data structure is discarded as the search is completed. What is saved, however, is a record of all words appearing within three word positions of each query word.
[3] Inverse document frequency with a base of only those documents containing at least one of the query words.
[4] All word co-occurrences within 3 word positions of a query word are listed as word pairs.
[5] A list of offsets to the beginning of records (articles) is generated at the beginning of the session for each data file.
[6] 166 stop words, 122 abbreviations, 47 hyphenated words, 24 entries for abbreviations and alternate notation for months, 35 entries for legitimate words to be prefixed, and 6 entries for legitimate prefixes.
[7] The semantic lexicon we used is based on word senses found in Roget's Thesaurus.
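Notes [2] and [4] describe UIC's windowed word-pair extraction. A minimal Python sketch of that pairing (the function name and tokenization are illustrative):

```python
# Collect (query word, neighbor) pairs within +/-3 word positions,
# per notes [2] and [4] above.
def cooccurring_pairs(tokens, query_words, window=3):
    """Return the set of (query_word, nearby_word) pairs in one text."""
    pairs = set()
    for i, tok in enumerate(tokens):
        if tok in query_words:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.add((tok, tokens[j]))
    return pairs

print(cooccurring_pairs("the oil spill near the coast".split(), {"spill"}))
```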
IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES

SYSTEM NAME: DORTMUND | CORNELL | BERKELEY | RUTGERS [8] | SIEMENS | UMASS
Inverted Index: Yes | Yes | Yes | Yes | Yes | N/A
- Total Storage (MB): 863 (D1/D2) [2], 460 (D3) | 863 (D1/D2) [4], 390 (D3) | 667 (D1/D2), 337 (D3) | 882 (D1/D2), 415 (D3) | 1 GB [5]
- Total Computer Build Time (Hours): 9 (D1/D2) [2], 4.8 (D3) | 9 (D1/D2) [2], 4.8 (D3) | 8 (D1/D2), 2 (D3) [9] | 40 | 6-20 [6]
- Automatic Process: Yes | Yes | Yes | Yes | Yes
- Term Positions Stored: No | No | Yes [7] | No | Yes
- Single Terms Only: No [3] | No [3] | Yes | Yes | Yes
Clusters: None | Not used | N/A
N-grams, Suffix Arrays, Signature Files: None | Not used | N/A

NOTES:
[2] 863 Mbytes for D12 (learning set for routing and test set for adhoc), 460 Mbytes for D3 (test set for routing); 9 CPU hours for D12, 4.8 CPU hours for D3.
[3] Above figures are roughly 25% phrases.
[4] 863 Mbytes for D12 (learning set for routing and test set for adhoc), 390 Mbytes for D3 (test set for routing).
[5] One GB total for all 7 collections, 3 disks.
[6] Ranged from 6 to 20 hours for each collection.
[7] Not used in final runs.
[8] Because we used the UMASS INQUERY system and its indexing, all of the answers to the questions in this section for our systems are identical to those for the UMASS system.
[9] Slightly over 2 hours to build the inverted index for Disk 3 given structures for Disk 2.

IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: GE [1] | CLARIT | LSI | HNC | NYU | SYRACUSE
Inverted Index: None [2] | Yes | Yes | Yes
- Total Storage (MB): 1500 MB (D1, D2), 760 MB (D3) | 305 MB [3] | 1,284 [5]
- Total Computer Build Time (Hours): ~100 | 250 (WSJ), 100 (SJM) | 24+ hrs [6]
- Automatic Process: Yes | Yes | Yes
- Term Positions Stored: No | No | No | No
- Single Terms Only: N/A | No | Yes (Stage 2) | No
Clusters: Yes (tested on smaller corpora) | None | No | No | No
- Total Storage (MB): ~1 KB/cluster
- Total Computer Build Time: ~48 hrs for 75K documents (230 MB)
- Description of Method: k-means
- Automatic Process: Yes
N-grams, Suffix Arrays, Signature Files: None | No | No | No | No

NOTES:
[1] No pre-indexing -- most of the below don't apply.
[2] Used SMART's pre-processing to construct a term-document matrix for input to the SVD. This took about 9-10 hours. After this, SMART is used only to access the documents. We do not store an inverted index, since we use the LSI space for matching and retrieval.
[3] 204 MB (0.55 GB WSJ text); 101 MB (0.3 GB SJMN text).
[4] The system creates a network. Files created are described in B.5 (special routing structures) and B.6 (other data structures).
[5] Proper noun, complex nominal, and text structure index: 1,000 MB for WSJ and SJM; conceptual graph: 284 MB (WSJ).
[6] Proper noun, complex nominal, and text structure index: 24 hours (for both WSJ and SJM); conceptual graph: unknown.

IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: CITY | ERIM | TMC | CITRI | UCLA | SEC
Inverted Index: Yes | No index, the system was a filter | Yes | Yes | Yes | No
- Total Storage (MB): System 1: 1260, System 2: ~1420 | 512 MB for the entire (2.2 GB) training text collection | 130 (document index), 185 (page index) | approximately equal to the uncompressed text, ~750 MB
- Total Computer Build Time (Hours): 40 CPU hours | 0.3 (20 min.) | 4 | 12 hrs [4]
- Automatic Process: Yes [1] | Yes | Yes | Yes [1] | Yes
- Term Positions Stored: Yes (field, sentence, word) | Yes [2] | No [3] | Yes (field, sentence, word) | No
- Single Terms Only: some prespecified phrases are treated as words | Yes | Yes | some prespecified phrases are treated as words | No
Clusters: No | None | No | No | No
N-grams, Suffix Arrays, Signature Files: No | None | No | No | No

NOTES:
[1] Yes, when enough disk is available --> needs > 100% scratch space.
[2] (Paragraph, sentence, word position.)
[3] Term frequencies within document are stored.
[4] Don't know the exact time; estimate 12 hours. The database was indexed by Stephen Walker at City University.
IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: ADS | UIC | Dalhousie | MEAD | UIF | IDSR
Inverted Index: No | No | Yes | No
Clusters: No | No | Yes | No
N-grams, Suffix Arrays, Signature Files: No | No | No | Yes
- Total Storage (MB): equal to size of original text [1]
- Total Computer Build Time (Hours): 100 hours
- Description of Method: [2]
- Automatic Process: Yes

NOTES:
[1] Files of all word pairs for which at least one member is in the query word list are nearly equal in size to the original text.
[2] Find shortest paths in the network of all word pairs including at least one query word. (Full realization of the shortest-path approach was not done for TREC-2; directly matching pairs were used for official results.)

IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: DORTMUND | CORNELL | BERKELEY | RUTGERS [3] | SIEMENS | UMASS | VPI
Knowledge Bases: None | N/A
Routing Structures: Yes [1] | Yes [1] | None | None | N/A
- Total Storage (MB): 1.5 | 1.5
- Total Computer Build Time: 1.8 | 1.8
- Automatic Process: Yes | Yes
Other Data Structures: Yes | None | Yes [4] | None | Yes [5]
- Total Storage (MB): 45 (D1/D2), 36 (D3)
- Total Computer Build Time (Hours): 2 (D1/D2), 1.75 (D3)
- Automatic Process: Yes | Yes
- Description of Method: [2] | [6]

(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" for completely machine-built completion (d) Other
(+) (a) Machine-readable dictionary (b) Other

NOTES:
[1] Occurrence statistics for the most frequently occurring (in learning-set relevant documents) 1000 terms for each routing query.
[2] For the adhoc runs, the "query regression" method was used. The query regression coefficients were computed from the query.nnn and doc.lsp files (which were created by polynomial regression), followed by reweighting of the q3 query file: query.nnn -> query.lsp.
[3] Because we used the UMASS INQUERY system and its indexing, all of the answers to the questions in this section for our systems are identical to those for the UMASS system.
[4] Document vector files and term dictionary produced by SMART. Each individual collection was indexed separately, so sizes/times are averages per collection, with the range of values specified. The collection statistics are based on the summation of individual collection values and so are perhaps less accurate; the collection size of the term dictionary cannot be effectively estimated with this approach. Term positions are not stored within the document vectors.

                                          Average   Range    Collection
    Document Vector Files (MB)            120       31-124   1100
    Term Dictionary (MB)                  16        15-17    Unknown
    Time to create both files (Hours)     10        6-14     120

[5] Standard process as implemented by SMART, following parameters as in Part I, Section A.

IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Knowledge Bases: None | Yes | No | No
- Total Storage (MB): 1.7
- Number of Concepts Represented: 31,320
- Type of Representation: associations
- Total Computer Build Time (Hours): 20
- Total Manual Time: 0
- Use of Manual Labor (*): None
Routing Structures: None | No | No
- Total Storage: 1 GB | 8 x 0.?
- Automatic Process: Yes
- Description of Method: [1]
Other Data Structures: expansions from corpus (20,000 terms) | Yes [2] | Yes [3] | word & document context vectors | Yes [15] | Yes
- Total Storage (MB): 2 | 1 | 676 MB [4] | 620 MB for words, 740 MB for docs | 2,201 MB [16]
- Total Computer Build Time (Hours): 12 | 3 | 42 hrs [5] | ~72 hrs for the disks | less than 1 hour [17]
- Automatic Process: Yes | Yes | Yes | Yes | Yes | No
- Description of Method: see [2] | [6] | [18]

(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" (d) Other
(+) (a) Machine-readable dictionary (b) Other

NOTES:
[1] Used TF*IDF to pull out significant terms from documents; found significant term pairs using an entropy-based statistic.
[2] A first-order thesaurus (which is a collection of core terminology) is constructed for each topic. In the case of routing topics, the source for the thesaurus for each topic is the set of query-specific known relevant documents. In the case of adhoc queries, relevant documents are automatically identified via an initial retrieval (querying against the entire collection).
[3] LSI uses "reduced-dimensional" term and document vectors (see below for details of how they are constructed). For TREC-2, we used approximately 200 dimensions, so each term and each document has a real-valued vector in this 200-dimensional space. For the routing queries, we used an 88112-term x 68385-document sample to construct an LSI space; term and document vectors are embedded in this space. (At this point, all the documents used in the scaling could be removed. We did not do so.) The resulting data structure could be reduced to 70 MB if documents were omitted. For adhoc queries, we constructed an 82968-term x 69997-document LSI space. We folded in the 672358 CD-1/2 documents that were not in this sample. The resulting data structure was 549 MB.
[4] MB for routing (could be reduced to 70 MB); 549 MB for adhoc.
[5] Approximately 20 hours for the routing SVD and 20 hours for the adhoc SVD. In addition, 672K documents were added for the adhoc run, taking about 2 hours. (Run on a Sparc10 with 128 MB RAM or 384 MB RAM.)
[6] LSI/SVD analysis of document collections: 1. Create a raw term-by-document matrix and transform the cell entries by the appropriate weighting scheme (SMART pre-processing was used for this). 2. Calculate the best reduced k-dimensional approximation to this matrix using singular value decomposition (SVD). For the TREC-2 experiments, about 200 dimensions were used in the approximation: 199 dimensions for adhoc and 204 dimensions for the routing queries. Retrieval uses this 200-dimensional LSI space. 3. If necessary, fold in any terms or documents that are not in the original SVD analysis. Necessary for adhoc queries, not for routing queries.
[NYU] Network node and edge files; routing using network node and edge files is straightforward. Node file: 8x20; edge file: 8x5 for 1 GB. Routing: 1. Process 1 GB from Disk 1 (WSJ1, AP1, DOE, FR1, ZIF1). 2. Process queries against Disk 1 (training). 3. Process new Disk 3 documents as if they were queries, to make use of Disk 1 statistics. 4. Combine queries, the (old) dictionary, and Disk 3 into a network for retrieval. Files: docnum file; termnum (dictionary) file; node file; edge file; subdocument file; coded file (direct file); DOC ID checking file; TERM ID checking file. Sizes (for 1 GB of text): subdocument file 1 GB; coded file 1.1 GB; node file 8x20; edge file 8x5; checking files 8x0.75 = 6. Automatic: yes, if sufficient RAM and disk space; for this experiment, no. Two hours of manual labor. Build pipeline: raw text --> subdocument file; subdocument --> coded file, DOC ID file, TERM ID file, docnum file, termnum (dictionary) file; coded, termnum, docnum --> node and edge files.
[15] Synonymous complex nominal listings; concept-relation-concept triples.
[16] MB (synonymous complex nominal listings); 2,198 MB for WSJ (concept-relation-concept triples).
[17] Less than one hour for building complex nominal lists based on 50 topic statements.
[18] 1) A special-purpose grammar written to extract complex nominals. 2) A set of special handlers process tagged and bracketed text, based on the knowledge base, to extract concept-relation-concept triples.

IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: CITY | ERIM | TMC | CITRI | UCLA | SEC | ETH | TRW
Knowledge Bases: No | None | No | lexicons | No
- Total Computer Build Time: ? | None
Routing Structures: No | None | No | None | No
Other Data Structures: No | None | TREC text was compressed | No | Yes [3] | Yes
- Total Storage (MB): 632.3 [1] | [4] | 5
- Total Computer Build Time (Hours): 8 [2] | [5] | <24
- Automatic Process: Yes | Yes | Yes
- Description of Method: Huffman coding, word-based model | [6] | [7]

(+) (a) machine-readable dictionary (b) other
(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" (d) Other

NOTES:
[1] 605.1 MB for compressed text, 27.2 MB for auxiliary structures. The complete retrieval system, including index, occupies 40% of the space required by the original unindexed text.
[2] 4 hours for compression plus 4 hours for indexing; 8 hours total build time.
[3] Combination of signatures and non-inverted document descriptions.
[4] Experiment 1 (topics 51-100 versus disk 3): signatures 169 MB; non-inverted document descriptions 278 MB; normalized inverse document frequencies 2.1 MB; document lengths 0.4 MB; mapping of features to numbers 4.1 MB. Experiment 2 (topics 101-150 versus disks 1 and 2): signatures 374 MB; non-inverted document descriptions 618 MB; normalized inverse document frequencies 2.1 MB; document lengths 0.4 MB; mapping of features to numbers 4.1 MB.
[5] Uncompressing and indexing: ca. 21.5 h CPU (all collections of all 3 disks); loading descriptions into the access structure: 10 msec/document.
[6] For each feature occurring in a document description, a bit is set in the signature of the document by applying a hash function to the feature number. The signatures are used to determine an approximate RSV0. The documents are ranked according to these RSV0's. Beginning at the top of the ranked list, exact RSVs are computed using the non-inverted document descriptions. It is not necessary to compute all exact RSVs, because documents can already be provided to the user as soon as their exact RSV is bigger than the RSV0 of the currently regarded document.
[7] For TRW2 (statistical queries), we built a combined word frequency table, a phrase frequency table (2- and 3-word phrases), and a special features frequency table. These were based on a selected subset of the training database and were used to calculate term weights. They had no direct role in the execution of the queries.
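Note [6] describes ETH's two-stage signature search. A minimal Python sketch of the bit-setting and the approximate score RSV0; the signature width and the use of Python's built-in hash() are illustrative assumptions, not the group's actual parameters:

```python
# Signature construction and approximate ranking, per note [6].
SIG_BITS = 1024  # illustrative signature width

def signature(feature_numbers):
    """Set one bit per feature via a hash of the feature number."""
    sig = 0
    for f in feature_numbers:
        sig |= 1 << (hash(f) % SIG_BITS)
    return sig

def approx_rsv0(query_features, sig):
    """Upper-bound score: count query feature bits present in the signature.
    Exact RSVs are then computed down the RSV0-ranked list only as needed."""
    return sum(1 for f in query_features
               if sig & (1 << (hash(f) % SIG_BITS)))
```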
IB. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- STATISTICS ON DATA STRUCTURES (CONT)

SYSTEM NAME: ADS | UIC | Dalhousie | MEAD | UIF | IDSR
Knowledge Bases: No | No | No
Routing Structures: No | query/filter files, 1 per query | Yes, neural network
- Total Storage: 10-500 characters each | 15 KB
- Total Computer Build Time: negligible | 5 minutes
- Automatic Process: No | Yes
- Description of Method: [3] | [5]
Other Data Structures: Yes, classification vectors (actually integer arrays) | record offset file (1 per article in the database) | None | Yes
- Total Storage (MB): a few KB [1] | 770 KB | 2 MB
- Total Computer Build Time (Hours): 4 [2] | 3 seconds CPU | 1.5 hours
- Automatic Process: Yes | Yes | Yes
- Description of Method: [4] | [6]

(+) (a) machine-readable dictionary (b) other
(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" (d) Other

NOTES:
[1] Only a few KB for the training sets used for the official scores -- we used TOPIC for the actual test.
[2] Feature extraction takes on the order of 10 seconds per document; total time for the training data (disk 2 only) was on the order of 4 hours.
[3] Ran queries as adhoc queries against the test data (WSJ), then typed in the query sequence to be used as a filter.
[4] Scan the data file and save the offset for each occurrence of the string.
[5] The neural network was used to represent the topics. Each output node is associated with a topic. Each input node is associated with a Roget category.
[6] Document frequency for each topic for a list of 1400 candidate features.
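Note [5] describes a routing network with one output node per topic and one input node per Roget category. A minimal Python sketch of scoring a document with such a single-layer network; weight learning and feature extraction are omitted, and all names are illustrative:

```python
# Single-layer topic-scoring network, per note [5]: inputs are Roget
# category activations for a document, outputs are routing topics.
def score_topics(doc_category_counts, weights):
    """weights: {topic: {category: strength}} -> {topic: score}."""
    return {topic: sum(w * doc_category_counts.get(cat, 0.0)
                       for cat, w in cats.items())
            for topic, cats in weights.items()}
```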
IC. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- DATA BUILT FROM OTHER SOURCES

SYSTEM NAME: DORTMUND | CORNELL | BERKELEY | RUTGERS [1] | SIEMENS | UMASS
Internally Built Auxiliary Files: None | Yes
- Domain Independent or Domain Specific: domain independent
- Type of File: countries & cities [3]
- Total Storage (MB): <1
- Number of Concepts: 2 (country, city)
- Type of Representation: list of strings
- Total Computer Build Time (Hours): <1 [4]
- Manual Time to Build: 2 [5]
- Use of Manual Labor (*): b
Externally Built Auxiliary Files: None | Yes
- Type of File: WordNet 1.3
- Total Storage (MB): 6.5 (nouns only, our format)
- Number of Concepts: 77,656 noun senses in 41,263 synonym sets
- Type of Representation: semantic net [2]

(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" for completely machine-built completion (d) Other

NOTES:
[1] Because we used the UMASS INQUERY system and its indexing, all of the answers to the questions in this section for our systems are identical to those for the UMASS system.
[2] Closest to a semantic net: sets of synonyms interrelated through lexical relations.
[3] Lists of countries and cities, extracted from a gazetteer.
[4] Built for TREC-1 in <1 hour. Not modified for TREC-2.
[5] Built for TREC-1 in ~2 hours. Not modified for TREC-2.

IC. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- DATA BUILT FROM OTHER SOURCES (CONT)

SYSTEM NAME: CLARIT | CLARIT | HNC | HNC | QUEENS | SYRACUSE
Internally Built Auxiliary Files: lexicon for English | word frequency statistics for common English | stemming exception list | word pair list | stopword file | Yes
- Domain Independent or Domain Specific: domain independent (all)
- Type of File: lexicon | database of words with frequency statistics | exception list | word pair list | 1) lexicon 2) knowledge bases
- Total Storage (MB): 2 MB | 2 MB | 28 KB | 61 KB | 0.004 | 15.3 [3]
- Number of Concepts: >100,000 [1] | 139,481 words | 1,300 | 3,700 | 630 | 82,669 [4]
- Type of Representation: database records | database records | list | list | [5]
- Total Computer Build Time (Hours): N/A | 20 min. | N/A | N/A | 0 | 20 hours [6]
- Computer Time to Modify (Hours): None | None
- Manual Time to Build: N/A | None | ~96 hrs | 96-120 hrs | 48 hours [7]
- Manual Time to Modify: None | None
- Use of Manual Labor (*): [2] | None | b | b (lexicon & proper noun KB)
Externally Built Auxiliary Files: None | None | None | None

(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" for completely machine-built completion (d) Other

NOTES:
[1] About 100,000 root forms for English words (lexical items).
[2] The CLARIT lexicon was manually constructed using word lists extracted from on-line sources during early phases of the CLARIT research project (1988-1...).
[3] 13.9 MB (lexicon built from LDOCE); 0.3 MB (proper noun knowledge base); 1.1 MB (verb and normalization verb case frames).
[4] 43,941 (lexicon); 9,889 (proper noun knowledge base); 28,839 (case frames).
[5] Inverted index (lexicon); frames (proper noun knowledge base); case frames used as rules for concept-relation-concept triples.
[6] 10 hours (lexicon); 10 hours (proper noun knowledge base).
[7] 24 hours (lexicon); 24 hours (proper noun knowledge base).

IC. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- DATA BUILT FROM OTHER SOURCES (CONT)

SYSTEM NAME: CITY | ERIM | TMC | CITRI | UCLA | SEC | ETH
Internally Built Auxiliary Files: one manually built file (for each system) | None | Yes [4] | No
- Domain Independent or Domain Specific: mostly very general [1] | mostly very general [1]
- Type of File: synonym classes, go-phrases, stopwords & semi-stopwords [2] | synonym classes, go-phrases, stopwords & semi-stopwords [2]
- Total Storage (MB): <<1 MB | <<1 MB
- Number of Concepts: ~1500 | 1200
- Type of Representation: look-up table | look-up table
- Total Computer Build Time (Hours): manually built, "compiled" at runtime, ~1 sec | manually built, "compiled" at runtime, ~1 sec
- Manual Time to Build: not recorded [3] | not recorded
- Use of Manual Labor (*): manual, using text editor | manual, using text editor
Externally Built Auxiliary Files: No | None | No | No

(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" for completely machine-built completion (d) Other

NOTES:
[1] Somewhat angled towards American data, but mostly very general.
[2] Contains synonym classes (e.g. [child, children]), go-phrases (e.g. Des Moines), stopwords and semi-stopwords.
[3] Not recorded. Because of disk shortage, System 1 included a number of additional stopwords suggested by high frequencies in the TREC data.
[4] One manually-built file that contains common business acronyms and abbreviations and abbreviations of organizations. The business acronyms were compiled ...; the abbreviations of organizations are based on entries from the files "un.txt" and "organizations.txt" that are available from Project Gutenberg (the *.txt files were made available by Bruce Croft, UMASS).

IC. CONSTRUCTION OF INDICES, KNOWLEDGE BASES, AND OTHER DATA STRUCTURES -- DATA BUILT FROM OTHER SOURCES (CONT)

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF | IDSR
Internally Built Auxiliary Files: None | No | semantic lexicon | None
- Domain Independent or Domain Specific: domain independent
- Type of File: semantic lexicon built from the 1911 Roget's Thesaurus
- Total Storage (MB): 0.5 MB
- Number of Concepts: 1000 [1]
- Type of Representation: could be viewed as rules
- Total Computer Build Time (Hours): 22 hrs [2]
- Computer Time to Modify (Hours): --
- Manual Time to Build: None
- Use of Manual Labor (*): (a); see [2]
Externally Built Auxiliary Files: None | No | No | stopword list

(*) (a) Mostly manually built using special interface (b) Mostly machine built with manual correction (c) Initial core manually built to "bootstrap" for completely machine-built completion (d) Other

NOTES:
[1] There are around 1000 semantic categories used. The original 1911 Roget major categories are used by removing the suffix on our semantic codes; for example, the semantic category 121nv.3 is shortened by ignoring nv.3.
[2] Since the 1911 edition of Roget's Thesaurus became public domain recently, we spent approximately 16 hours creating the software to process the Thesaurus. Approximately 6 hours of processing time was required to automatically extract 20,000 lexicon entries.

IIA. QUERY CONSTRUCTION -- AUTOMATICALLY BUILT QUERIES (ADHOC)

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers | Siemens | UMASS | VPI
- Topic Fields: All | [4]
- Total Computer Time to Build: 12 h 20 min [1] | 6/50 CPU seconds per query | 0.7 seconds per query | 12.1
Methods Used to Construct:
- Term Weighting (based on topic terms): Yes [2] | Yes | Yes | Yes [5]
- Phrase Extraction: Yes | Yes | No | simple noun phrases
- Syntactic Parsing: tagging & noun phrase bracketing | No
- Word Sense Disambiguation: No | No
- Proper Noun Identification: capitalized words (adjacent) | No
- Tokenizer: No | None
- Heuristic Associations: No | None
- Expansion of Queries Using Previously Constructed Structure: Yes | None | None
- Which Structure: phrases from controlled list
- Automatic Addition of Boolean Connectors/Proximity Operators: No | Yes [6]
- Other: reweighting of terms in a local context [3]

NOTES:
[1] Computing of query regression coefficients: 12 h 20 min; reweighting of the query file: 3 sec.
[2] Query regression, query weighting, polynomial regression document weighting.
[3] Optimally relativised frequencies used to weight stems.
[4] Title, description, concepts, factors, narrative.
[5] Weights initially based on number of occurrences in the topic; 3x multiplier for the title field and a multiplier for the concepts field.
[6] Capitalized noun groups, hyphenated words, concept words delimited by commas, words occurring in the title.
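Note [5] describes query term weights built from topic occurrence counts with a per-field boost. A minimal Python sketch of that idea; the field names and all boost values other than the 3x title multiplier are illustrative assumptions, since the note leaves the concepts multiplier unspecified:

```python
# Field-boosted query term weighting, per note [5].
from collections import Counter

FIELD_BOOST = {"title": 3.0, "concepts": 2.0, "desc": 1.0, "narr": 1.0}

def query_weights(topic_fields):
    """topic_fields: {field_name: [tokens]} -> {term: weight}."""
    weights = Counter()
    for field, tokens in topic_fields.items():
        boost = FIELD_BOOST.get(field, 1.0)
        for tok in tokens:
            weights[tok] += boost
    return dict(weights)
```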
IIA. QUERY CONSTRUCTION -- AUTOMATICALLY BUILT QUERIES (ADHOC) (CONT)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE | QUEENS

NOTES:
- Turned in various runs using everything; Description, Factors, Narrative, Nationality, and Title used in the automatic part.
- 5-10 minutes per query to select the synonym sets to add; an average of 1.2 seconds to automatically process the topic text (1 minute for 50 queries [there were other jobs on the machine!]).
- Title, Description, Concepts, Factors, Narrative.
- Addition and deletion of terms selected from the narrative field.
- Three sets of queries were constructed: one pnorm Boolean query set and two vector query sets, one longer than the other. They are called pnorm, long vector, and short vector queries below. All query sets: title, description, concepts; pnorm and long vector: narrative; long and short vector: definitions.
- Domain knowledge of a computer system expert, with limited use to compensate for omissions in topic descriptions.
[10] Boolean operators were assigned equal weights (P-values) for the pnorm queries; P-values of 1.0, 1.5 and 2.0 were used for different evaluations of the queries.

IIB. QUERY CONSTRUCTION -- MANUALLY CONSTRUCTED QUERIES (ADHOC)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Manually Constructed Queries: N/A | No | No
- Topic Fields: title, desc, narr, con
- Average Build Time (minutes): 2-5 min. per query
- Type of Query Builder (*): b [1]
- Tools (+): None
Methods Used in Construction:
- Term Weighting / Boolean Connectors / Proximity Operators: --
- Addition of Terms / Source of Terms: Yes [2]

(*) (a) domain expert (b) computer systems expert
(+) (a) word frequency list (b) knowledge base browser (c) other lexical tools

NOTES:
[1] ...but this is merely the fact of the matter: the people who rendered judgments about the CLARIT-nominated query terms happened to be computer experts; the task does not require special expertise of any sort.
[2] Manual queries are constructed by correcting the query terms nominated via automatic query construction. Team members analyzed the output of the automatic process and (1) deleted/corrected any parser errors, (2) adjusted importance coefficients in situations where the weighting heuristics fail, (3) corrected errors in the application of the ... heuristic, and (4) added additional relevant terms where appropriate. (New or additional terms were rarely added.)

IIB. QUERY CONSTRUCTION -- MANUALLY CONSTRUCTED QUERIES (ADHOC) (CONT)

SYSTEM NAME: CITY | ERIM | TMC | CITRI | UCLA | SEC | ETH | TRW
Manually Constructed Queries (Adhoc): None | N/A | no manually built queries
- Topic Fields: any
- Average Build Time (minutes): varies
- Type of Query Builder (*): three faculty, two research students (Inf. Sci.)
- Tools Used (+): None
Methods Used in Construction:
- Term Weighting: Yes [1]
- Boolean Connectors: Yes
- Proximity Operators: Yes (adj, same sentence)
- Addition of Terms: Yes
- Source of Terms: searcher's knowledge, records seen during search, ...
- Other: [2]

(*) (a) domain expert (b) computer systems expert
(+) (a) word frequency list (b) knowledge base browser (c) other lexical tools

NOTES:
[1] The system does this. It was possible for searchers to influence the weights, but I believe none of them did so (they were not told how to).
[2] The default operator (when the searcher didn't specify an op) was a "best-match" operator utilizing Robertson/Sparck-Jones F4 with a within-document term frequency factor and an overall document length factor.
IIB. QUERY CONSTRUCTION -- MANUALLY CONSTRUCTED QUERIES (ADHOC) (CONT)

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF | IDSR
Manually Constructed Queries (Adhoc): N/A | No
- Topic Fields: Yes
- Average Build Time (minutes): 15 minutes
- Type of Query Builder (*): b
- Tools Used (+): No
Methods Used in Construction:
- Term Weighting: No
- Boolean Connectors: Yes
- Proximity Operators: No
- Addition of Terms: No
- Other: None

(*) (a) domain expert (b) computer systems expert
(+) (a) word frequency list (b) knowledge base browser (c) other lexical tools

IIC. QUERY CONSTRUCTION -- FEEDBACK (ADHOC)

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers | Siemens | UMASS | VPI
Feedback (adhoc): Not used | None | None | N/A
(Initial query manual or automatic; type of person doing feedback; average time for complete feedback; number of iterations and documents per iteration; what determines end of iteration; feedback methods used: not applicable.)

(*) (a) domain expert (b) system expert
(+) (a) all terms in relevant documents added (b) only top X terms added (c) user selected terms added

IIC. QUERY CONSTRUCTION -- FEEDBACK (ADHOC) (CONT)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Feedback (adhoc): No | No
- Initial Query Manual or Automatic: Automatic
- Type of person doing feedback (*): b
- Average time for complete feedback: CPU time < 1 second (only did one iteration); clock time 10 min.
- Average number of iterations: 1
- Average number of documents per iteration: 20
- Minimum / maximum number of iterations: N/A
- What determines end of iteration: N/A
Feedback methods used:
- Automatic Query Expansion (+): adding relevant-document context vectors into the query context vector
- Manual Methods: N/A [1]

(+) (a) all terms in relevant documents added (b) only top X terms added (c) user selected terms added

IIC. QUERY CONSTRUCTION -- FEEDBACK (ADHOC) (CONT)

SYSTEM NAME: CITY | ERIM | TMC | CITRI | UCLA | SEC | ETH
Feedback (adhoc): None | no feedback methods
- Initial Query Manual or Automatic: Manual (see IIB) | (see IIA)
- Type of person doing feedback (*): searchers (as IIB) | searchers: 1 faculty, 1 Ph.D. student
- Average time for complete feedback: CPU time varies, about 20-300 seconds [1] | information not available; clock time 10-120 minutes | information not available
- Average number of iterations: 2.0 | 2.0 (initial query + 1 feedback iteration)
- Average number of documents per iteration: 13.3 | 17.8
- Minimum number of iterations: 0 | 2
- Maximum number of iterations: 4 | 2
- What determines end of iteration: searcher decides | when 10 relevant documents were found, or when 20 documents had been examined
Feedback methods used:
- Automatic Query Expansion (+): 2, 3 [2] | 2, 3 [3]

(+) (a) all terms in relevant documents added (b) only top X terms added (c) user selected terms added

NOTES:
[1] Varies widely, about 20-300. Mean not recorded but probably about 80.
[2] The term pool consisted of all terms from relevant documents and all terms entered by the searcher in search statements. Terms were weighted using Robertson/Sparck-Jones F4; searchers' terms were given a bonus as if they had occurred in four out of five hypothetical relevant records. The top 20 terms were selected using the criterion: selection_value = term_weight * rels_containing_term / rels.
[3] The term pool consisted of all terms from relevant documents. The top 10 terms as determined by w(p-q) were chosen for expansion and were searched together with the query terms. Terms were weighted using Robertson/Sparck-Jones F4.
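Notes [2] and [3] select expansion terms with the Robertson/Sparck-Jones F4 relevance weight. A minimal Python sketch using the standard 0.5-smoothed form of F4, with the selection rule from note [2]; variable names are illustrative:

```python
# RSJ F4 term weighting and feedback term selection, per notes [2]-[3].
import math

def f4_weight(r, R, n, N):
    """r: relevant docs containing the term, R: relevant docs,
    n: docs containing the term, N: collection size."""
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def top_expansion_terms(stats, R, N, k=20):
    """stats: {term: (r, n)}; rank by weight * r/R and keep the top k."""
    scored = {t: f4_weight(r, R, n, N) * (r / R)
              for t, (r, n) in stats.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]
```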
Terms were weighted using Robertson~ archers' terms were given a bonus as if they had occurred in four out of five hypothetical relevant records. The top 20 terms were selected using the crit selection~value= term~weight * rels_containing~termIrels erm pool consisted of all terms from relevant documents. The top 10 terms as determined by w(p-q) were chosen for expansion and were searched togeth~ ~eiy terms. Terms were weighted using Robertson/Sparck-Jones F4. HC. QUERY CONSlItUCIION -- FEEDBACK (ADHOC) SYSThM NAME ~ ADS ~ UIc [ DAUIOUSIE ~ MEAD ~ UIF [ IDSR Feedback (adboc) N/A Not used No Initial Query Manual or ~utomatic [ype of person doing feedback (*) _____________ ~verage Time for complete feedback - cpu time (cpu seconds) ______________ - clock time (minutes) _______________ ~erage number of iterations - Average number of Documents Per Iteration ~inimum number of Iterations ~aximum number of Iterations Vilat determines end of Iteration Eeedback metbods uaed - Automatic Term Reweighting ______________ _______________ - Automatic Ouery E~pansion (+) _____________ ______________ - Other automatic methods - Manual Methods (a) all terms in relevant documents added (a) domain expert (1,) only top X terms added (c) user selected terms added (1)) system expert I'D. QUERY CONSTRU~ON~~AUTOMATICALIN BUILT QUERIES ~OUTING) [ SYSThM NAME ~ Dortmund [ Cornell [_Berkeley J_Rutgers ~ Siemens ~ UMASS i:~~ F Automatically Built Queries (routing) Yes None [1] NM Topic Fields Used Everything except Everything except ____________________________________ Defmitions Defmitions All See [1) ____ 212/50 (crnlRl) Total Computer Time to Build 28/50 for each 50/50 (crnlCl) 0.7 seconds per 4623 for (cpu seconds) query per query query 50 queries Me~hods Used in Building Query - Terms Selected from (*) 1, 3 1, 3 1 3 - Term Weighting with weights based on (*) 1, 2 1, 2 1 3 - Phrase Extraction from (*) 1, 3 1, 3 No No - Syntactic Parsing of (*) No No - Word Sense Disambiguation using (*) ________________ No No - Proper Noun Identification Algorithm from (*) No No - Tokenizer from (*) _______________ _______________ No No -- Which Patterns? - Heuristic Associations to add terms from (*) No No History of term History of term - Expansion of Queries Using occurrence in occurence in No No Previously Constructed Data Structure relevant docs relevant docs -- Which Structure? - Automatic Addition of Boolean Connectors Proximity Qperators Using Information from (*) No No Additional term specific weights None - Other added for 2nd ________________ routing run (1) Topic (2) All training Documents (3) Documents with relevance judgements lID. QUERY CONSTRUCTION-AUTOMATICMIY BUILT QUERIES (ROUTING) The general approach was to use the automatic method of query formation, and then apply relevance feedback to the results obtained from runnin automatic query on a different IREC collection. lID. 
IID. QUERY CONSTRUCTION -- AUTOMATICALLY BUILT QUERIES (ROUTING) (CONT)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Automatically Built Queries (routing): Yes [5]
- Topic Fields Used: same as adhoc | title, desc, narr, con | "lsir1": all topic info; "lsir2": no topic info, used only text of relevant documents (from training) | same as adhoc | topic, desc
- Total Computer Time to Build: 10-300 seconds (depending on amount of training data) | less than 1 second | 180 seconds per query [1] | 0.1-0.2 sec/query [6] | 3.2 (Stage 1)
Methods Used in Building Query:
- Terms Selected from (*): 1, 3 [2] | 1 ("lsir1"), 3 ("lsir2") | 1 | 2 | 1 | 1, 2
- Term Weighting with weights based on (*): 1, 2, 3 [3] | 3 ("lsir1" & "lsir2") | No | 2 (Stage 2)
- Phrase Extraction from (*): No | 1, 2 | 1, 2
- Syntactic Parsing of (*): 1, 2, 3 [4] | No | 1, 2 | 1
- Word Sense Disambiguation using (*): rule-based syntactic category disambiguation | No | 1
- Proper Noun Identification Algorithm from (*): unknown words treated as proper nouns | No | 2 (partial) | 1
- Tokenizer from (*): No | No | 1 (dates)
- Heuristic Associations to add terms from (*): No | No | 2 | No
- Expansion of Queries Using Previously Constructed Data Structure: Yes, the first-order thesaurus described in I | No | the clusters from training data
- Automatic Addition of Boolean Connectors/Proximity Operators Using Information from (*): No | No | Yes (Stage 1): Boolean operators on the concepts portion of topics | No

(*) (1) Topic (2) All training documents (3) Documents with relevance judgments

NOTES:
[1] 180 seconds per query -- encompassing the parsing of all known relevant documents, thesaurus extraction, and merging of the thesaurus and query terms.
[2] Thesaurus extraction over the set of relevant documents results in a set of terms to add to a query.
[3] The Importance Coefficient (described above) is derived from the "discourse location" of terms in the topic. Statistics for the IDF*TF calculation are derived from the training documents. All terms extracted from known relevant documents via thesaurus extraction are assigned an Importance Coefficient of 0.5.
[4] Yes. (This is necessary in order to extract statistics.)
[5] We submitted two routing runs: lsir1 simply used the text of the query topics to construct a routing filter; lsir2 used the text of all relevant documents for the routing filter for that topic.
[6] Time required to take the vector sum of terms in the query (lsir1) or the relevant documents (lsir2).
[7] Yes. Subset of 1 GB from WSJ1, AP1, DOE, FR1, ZIF1. Subset of them.
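Per notes [5] and [6], the lsir2 routing filters are simply the vector sum of the known relevant documents' term vectors (lsir1 uses the topic text instead). A minimal sketch of that centroid-style filter, assuming documents are already weighted term dictionaries:

```python
# Routing filter as a vector sum of relevant documents, per notes [5]-[6].
from collections import defaultdict

def routing_filter(relevant_doc_vectors):
    """Sum {term: weight} vectors of known relevant docs into one query."""
    query = defaultdict(float)
    for vec in relevant_doc_vectors:
        for term, w in vec.items():
            query[term] += w
    return dict(query)
```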
IID. QUERY CONSTRUCTION -- AUTOMATICALLY BUILT QUERIES (ROUTING) (CONT)

SYSTEM NAME: CITY | ERIM | TMC | CITRI | UCLA | SEC | ETH
Automatically Built Queries (routing): N/A
- Topic Fields Used: None | concept strings; nationality strings | All | All | All
- Total Computer Time to Build: 8 hours for initial queries, 2 hours for optimization | 1-1.5 seconds for individual queries | 10-20 (not known exactly) | 600 msec/query
Methods Used in Building Query:
- Terms Selected from (*): 3 [1] | 1 | 1 | 1 | 1 (stoplist and suffix stripping)
- Term Weighting with weights based on (*): 3, weights were Robertson/Sparck-Jones F4 | 2 | 2 | 1 (nff), 2 (nidf)
- Phrase Extraction from (*): 1 | 1, 2 | No
- Syntactic Parsing of (*): 1, using simple identification of concepts | 1, 2 | No
- Word Sense Disambiguation using (*): No
- Proper Noun Identification Algorithm from (*): 1 | No
- Tokenizer from (*): No
- Heuristic Associations to add terms from (*): No
- Expansion of Queries Using Previously Constructed Data Structure: No
- Automatic Addition of Boolean Connectors/Proximity Operators Using Information from (*): No | Yes [2] | No

NOTES:
[1] All non-stop terms were extracted from all known relevant docs for each query (disks 1 & 2). For cityr1, the top 20 terms by the above selection criterion were used. For cityr2, the same term pool and weights were used, but the number of terms varied between 3 and 31, based on the results of a long series of runs against the training set.
[2] For cityr1, a simple best-match operator was used (doc weight = sum of term weights). For cityr2, the operator used was based on the results of experiments with the training set: for some queries it was as for cityr1, for others as in Section IIB.

IID. QUERY CONSTRUCTION -- AUTOMATICALLY BUILT QUERIES (ROUTING) (CONT)

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF | IDSR
Automatically Built Queries (routing): Yes | Not used | Yes
- Topic Fields Used: narr, con, def | None [2] | just the concept section
- Total Computer Time to Build (seconds): less than 30 seconds [1] | 70 seconds | 1 second
Methods Used in Building Query:
- Terms Selected from (*): 1, 3 | 3 | 1 (4)
- Term Weighting with weights based on (*): No | None | 1 (4)
- Phrase Extraction from (*): No | No
- Syntactic Parsing of (*): No | No
- Word Sense Disambiguation using (*): No | No
- Proper Noun Identification Algorithm from (*): No | No
- Tokenizer from (*): No | No
- Heuristic Associations to add terms from (*): No | No
- Expansion of Queries Using Previously Constructed Data Structure: No | Yes, the semantic lexicon in Section IA
- Automatic Addition of Boolean Connectors/Proximity Operators Using Information from (*): No | No
- Other: term weighting with weights based on terms in topics | No

(*) (1) Topic (2) All training documents (3) Documents with relevance judgments

NOTES:
[1] It takes less than 30 seconds to build the CART tree; this depends on the size of the training set.
[2] None. Queries were constructed by finding all word pairs within three words that were found only in the relevant documents.

IIE. QUERY CONSTRUCTION -- MANUALLY CONSTRUCTED QUERIES (ROUTING)

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers | Siemens | UMASS | VPI
Manually Constructed Queries: None | [6] | 3 sets [7]
- Topic Fields Used: All | see adhoc
- Average Time to Build (minutes): 125 min. for combined queries [1] | up to 15 min. | 15 min [5]
- Type of Query Builder: system expert | system expert | system expert
- Tools Used for Building Query (*): a, c [2] | c, indirectly
- Tools Used to Build Query (+): none, other than searcher's knowledge | b, WordNet interface | N/A
Methods Used to Build Query:
- Term Weighting: Yes [3] | Yes | pnorm: assigned
- Boolean Connectors: Yes | pnorm: AND, OR
- Proximity Operators: Yes | No
- Addition of Terms: Yes | Yes | Yes
- Source of Terms: [4] | WordNet synonym sets | None

(*) (a) word frequency list (b) knowledge base browser (c) other lexical tools (d) machine analysis of training documents
(+) (a) from training topic (b) from all training documents (c) from documents with relevance judgements (d) from other sources
NOTES:
[1] For combined queries, 125 minutes (65 minutes for the five individual queries, 60 minutes for translating them into INQUERY format, including incorporation of weights).
[2] Specifically, the performance of query formulations on the training set was used to determine their weights in the scheme used for routing.
[3] In the sense that each individual query formulation within the combined query was weighted, and as implemented by INQUERY.
[4] From the personal knowledge of those forming the searches.
[5] The process of selecting synsets was iterated based on the effectiveness of the previous selection. Some topics probably took more than 15 minutes total over the iterations. Creation of a query from text averaged 0.7 seconds (35 seconds for 50 queries; less competition on the machine than in the adhoc case!).
[6] A combination of automatic query formation from Part IID and a manually modified query using the method from Part IIB.
[7] Again, three query sets were constructed as per adhoc manual. All query sets: title, description, concepts; long vector, pnorm: narrative; long and short vector: definitions. Domain knowledge of a computer system expert, with limited use to compensate for omissions in topic descriptions.

IIE. QUERY CONSTRUCTION -- MANUALLY CONSTRUCTED QUERIES (ROUTING) (CONT)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE | QUEENS
Manually Constructed Queries: No | No | No | No
- Topic Fields Used: All | title, desc, narr, con
- Average Time to Build (minutes): about 60 [1] | 2-5 min. per query
- Type of Query Builder: system expert | Yes [3]
- Tools Used for Building Query (*): b, c [2] | a, b, c, d [4]
- Tools Used to Build Query (+): a, b | None
Methods Used to Build Query:
- Term Weighting: Yes, ranking only | IDF*TF and Importance Coefficients
- Boolean Connectors: Yes | No
- Proximity Operators: No
- Addition of Terms: Yes | Yes
- Source of Terms: corpus, relevant documents | additional terminology nominated by team members as appropriate to the topic
- Other: Yes [5]

(*) (a) word frequency list (b) knowledge base browser (c) other lexical tools (d) machine analysis of training documents
(+) (a) from training topic (b) from all training documents (c) from documents with relevance judgments (d) from other sources

NOTES:
[1] About 60 (but this is a tough question, because these were adhoc queries before).
[2] b) Word frequencies. c) Topic weights (for ranking, manual correction).
[3] Yes -- but this is merely the fact of the matter: the people who rendered judgments about the CLARIT-nominated query terms happened to be computer experts; the task does not require special expertise of any sort.
[4] a) Terms extracted from the topic by natural-language processing. b) Statistics from the training documents used for IDF*TF scoring. c) Terms extracted from relevant documents by natural-language processing and thesaurus extraction. d) Additional terminology nominated by team members as appropriate to the topic.
[5] As with manual adhoc topics, manual routing queries reflect hand-correction of the output of the automatic query generating process. Correction took place prior to augmentation of the query vector with thesaurus terminology.
"K QUERY CONSTRUCTiON -- MANUAlLY CONSTRUCIED QUERIES ~OUInNG) SYSThM NAME I c~ ~ ERIM ~ TMC Jcriiu ~ UCLA [ SEC ~ ~wi 11 lidly Constructed Queries ing) None (Proximity queries) (Stat Fields Used Title, description, narrative, concepts ige Time to Build (minutes) 250 -5~ of Query Builder System expert Systei Used fbr Building Query (*) b, C ( docur a, b, C relev~ Used to Build Query (+) a, alo word d [1] specia ds Umd to Build Query n Weighting Implicitly derived from Lean~~~Conriectors ________ query rankings Yes Yes Yes Cperato~ Yes Yes ition of Terms roe of Terms r Counting operators (term frequency in documents) antecedent - reference recognition rd frequency list (1)) knowledge base browser U training topic (1') from all training documents (c) other lexical tools (d) machine analysis of training documents) (c) from documents with relevance (d) from other sources judgments HE. QUERY CONS¶1~UC'TlON -- MANUAlLY CONSTRUCIED QUERIES ~OUllNG) ach topic was represented by a set of queries. Individual queries were parameterized invocations of a single template query. &lectivity vs. paaameter value as used to rank different queries. 5 seconds for automatically generated first cut. 5 minutes average tweek time per topic. 5 days to develop special features queries used across all topics. HE. QUERY CONSIItUCliON -~ MANUAILY CONSIItUCTED QUERIES ~OU'flNG) YSTEM NAME ~ ADS [ UIc ~ DAUIOUSIE ~ MEAl) ] UIF ~ ID lily Constructed Queries dom, title, de~ ng) N/A ______________ _________________ ______________ None def (and head Fields Used Yes :e Time to Build (minutes) 15 minutes 10 minutes f Query Builder _____________ ____________ System expert ___________ _____________ System expert Jsed for Building Query (*) _______________ ______________ a, b, C ______________ _______________ a, C (used won LJsed to Build Query (+) No a `Is Used to Build Query _____________ ____________ i' Weighting No _________________________ _________________ _______________ ___________________ _______________ ________________ Yes lean Connectors Yes No Directly adja~ imity Operators No words ition of Terms No rce of Terms From other toJ Manual assessi r None relationships b Drd frequency list ~rn training topic (1)) knowledge base browser (c) other lexical tools (d) machine analysis of training documents) ~) from all training documents (c) from documents with relevance (d) from other sources judgments ilL SEARCHING SThM NAME ~ Dortmund ~ Cornell f_Berkeley ~_Rutgers ] Siemens [ UMASS ]~i 632/50 to 825/50 to mputer Time to Search 11%/SO per 8568/50 per 4-7 )nds) query [1] query [2] [5] _____________ ______________ topic 16 sec/query for 1 cpu Sec per val Time (cpu seconds) any collection 13.8 to 19.5 query term [8] _________________________ _______________ _______________ separately ______________ [6] ______ included in search 2 cpu sec for ng Time (cpu seconds) time N/A [7] top 1000 ______ Used in Machine kg ____________ r Space Model Yes Yes No Yes Yes Probabilistic balistic Model Yes Yes inference nets _________________ [3] (INQUERY) Yes r Searching No m Matching No an Matching No logic No Yes fext Scanning _____________ No il Networks No ptual Graph Matching ______________ No Yes [4] _____________ _____ ilL SEARCHING 32150 cpu seconds per query if no query expansion (dortVI). 1196/50 cpu seconds per query if expand by 20 terms (dortpl). ~/50 (i.e., 16.5) seconds for each query for crnlV2. 8568/50 seconds for each query for crnl12 (involves re-indexing from scratch the top 1750 docs for each que! 
~h indexed local component against the query). 3273/50 seconds for each query for crnlRl on a Sparc 1 with 12MB of memory (all other work was done C ~ Mbytes). 2928/50 seconds for each query for crnlCl on a Sparc 1 with 12MB of memory. robabalistic searching based on linked dependence assumption and logistic regression. n equation derived by logistic regression was used to estimate a probability of relevance for each query and document. ince our experiments were done on a time-shared system, we present here both the mean cpu time of seven runs of each experiment, as an estimate of search a realistic multi-user environment; and, the minimum cpu time of the seven runs, as an upper bound on search and sorting time for a single-user system. I ere using, we were unable to separate search and sorting time and therefore, present the total cpu time for producing the sorted lists of 1000 retrieved ite ur timing results are reported as follows: first for the routing topics, using all 50 topics, giving total cpu time for the experiment (mean and minimum) and me )u time per topic (our rutcombx run); then for the adhoc topics, in which we used only 25 topics, the same figures (our rutcombl run). For the rutfined run ~h having five individual searches, the resulting lists being subsequently combined), we estimate the upper bound of total time as being five times the minimu r runcombi, plus the time required for combining the lists. These are presented as total cpu time for the experiment, and mean cpu time per topic. itcombx Total: 981.55 (973.55) Per Topic: 19.631 (19.471) itcombl Total: 1226.225 (1208.325) Per Topic: 49.049 (48.333) itmedf Total: 5799.96 + 373.2 = 6173.16 Per Topic 241.665+15.55 = 257.215 3£ cpu seconds on average for routing queries against test documents (691 cpu seconds for 50 queries). 17A cpu seconds on average for unexpanded adhi `cuments on disks 1 & 2 (869 cpu seconds for 50 queries). 19.5 cpu seconds on average for expanded adhoc queries versus documents on disk 1 & 2 (974 q ~eries). ot applicable; list of top 1000 maintained during search. cpu second per query term on per gigabyte. Queries averaged about 45 terms each. pproximately 4 - 7 minutes per topic for combination runs (multiple queries) per topic, for a given collection. ombination of results from both pnorm (flizzy logic) and vector (vector space model) queries. I I I I I STEM NAME GE CLARIT I-SI HNC NYU SYRACUSE Can compute ahout mputer Time to Search 60,000 query-document ~nds) [1] similarities per minute when vectors are in _________________ memory Total time (cpu & N/A (there is I/O) search and val Time (cpu seconds) adhoc - -10 min. no unranked ranking is ahout 1 30 - 4 Hrs. (2] routing - - 0.1 sec [3] document list minute per query ____________ qu Te du Done in comparison 300 seconds sul ng Time (cpu seconds) routine, included in ahove (for disks 1 an 12 min. time & 2) _________________ _____________ (ci Used in Machine Ig Yes - cosine A modified vector space r Space Model distance measure model [4] Yes Yes Yes (Stage 1) ___ balistic Model No Y~ r Searching No m Matching No ________________ ____________ an Matching Yes No Yes (Stage 1) ___ logic No Sc fext Scanning No Li Networks Yes Y ptual Graph Matching No Yes (Stage 2) I [ HL SEARCHING is a batch routing system9 with no pre-indexing. The system processes about 1,000 texts per minute. querying program processed all 50 topics (either routing or adhoc) in parallel on four machines. Processing took approximately four hours. 
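Note [3] above (Berkeley) estimates a probability of relevance from an equation derived by logistic regression over query-document clues. A hedged illustration of such a scorer -- the actual clue set and coefficients are not given in the form, so the feature names and numbers below are placeholders:

    import math

    # Hypothetical coefficients; the group's fitted equation is not reproduced here.
    COEF = {"intercept": -3.5, "qtf": 0.9, "dtf": 0.6, "idf": 1.2}

    def relevance_probability(features):
        """Logistic model: P(relevant | clues) = 1 / (1 + exp(-z)),
        where z is a linear combination of the matching clue values."""
        z = COEF["intercept"] + sum(COEF[name] * value
                                    for name, value in features.items())
        return 1.0 / (1.0 + math.exp(-z))

    # Example: one query-document pair described by three clue values.
    p = relevance_probability({"qtf": 1.4, "dtf": 2.0, "idf": 0.8})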
III. SEARCHING

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Total Computer Time to Search (seconds): [1] | Can compute about 60,000 query-document similarities per minute when vectors are in memory
Retrieval Time (cpu seconds): Total time (cpu & I/O) | adhoc: ~10 min.; routing: ~0.1 sec [3] | N/A (there is no unranked document list) | search and ranking is about 1 minute per query | 30 min. - 4 hrs. [2]
Ranking Time (cpu seconds): Done in comparison routine, included in above | 300 seconds (for disks 1 & 2) | 12 min.
Methods Used in Machine Searching:
  Vector Space Model: Yes -- cosine distance measure | A modified vector space model [4] | Yes | Yes | Yes (Stage 1)
  Probabilistic Model: No | Yes
  Cluster Searching: No
  N-gram Matching: No
  Boolean Matching: Yes | No | Yes (Stage 1)
  Fuzzy Logic: No | S...
  Free Text Scanning: No | L...
  Neural Networks: Yes | Yes
  Conceptual Graph Matching: No | Yes (Stage 2)

III. SEARCHING

[1] This is a batch routing system, with no pre-indexing. The system processes about 1,000 texts per minute.
[2] The querying program processed all 50 topics (either routing or adhoc) in parallel on four machines. Processing took approximately four hours.
[3] For adhoc, we need to compare a query to ALL documents, so it takes about 10 minutes. For routing, we compare each new document to each of the 50 routing filters, which takes about 0.1 seconds.
[4] A modified vector space model. Fewer dimensions than a typical vector space -- e.g., 200-dimensional real-valued term and doc vectors.
- ...60 per query without soft-boolean (combining 2 methods); 200 per query with soft-boolean (combining 3 methods).

III. SEARCHING

SYSTEM NAME: CITY | ERIM | TMC | CHRI | UCLA | SEC | ETH | TRW
Total Computer Time to Search (seconds): [9]
Retrieval Time (cpu seconds): Mean was 84 sec [1] | Ranged from 3-10 sec. depending upon selected parameters | Mean 11,000 sec.; individual topic was about 84 sec [7] | 12-24 hours for running all queries [2] | From less than 1 sec to 5 sec [4] | 5 topics: 27... sec; topic: 3300 sec | Average ~4.5 sec
Ranking Time (cpu seconds): Included in above | Typical times were from 1500 to 2000 cpu sec. total | 0.05 seconds [5] | Negligible [6] | Included above | 0 [8] | < 1
Methods Used in Machine Searching:
  Vector Space Model: Yes | Cosine rule using tf-idf weightings | Yes, see 1B
  Probabilistic Model: Yes | Yes | Yes
  N-gram Matching: Yes [3] | Yes | Yes
  Boolean Matching: Yes | Yes
  Free Text Scanning: Yes | Yes
  Other: Retrieval of pages of text rather than whole documents

III. SEARCHING

[1] Depends on the number of terms, etc. For the automatic adhoc run (cityan), the mean was 84 cpu seconds including ranking.
[2] Hard to give a meaningful number. The system was divided across 7 or 8 different SUN workstations of differing speeds. The total elapsed time for running the adhoc task ranged from 12 to 24 hours, depending on the distribution of machine load.
[3] Our system uses n-gram matching of multiple query strings while scanning the original free text. It uses a weighting scheme on the different query strings based on ... the original topic text.
[4] Less than a second for a small query (3 terms) and 5 seconds for a query with 80 terms.
[5] 0.05 seconds total: 0.03 for local ranking, 0.02 for global ranking (among the CM5 processors).
[6] Negligible -- a heap is used to extract the top r accumulators; the time is included above.
[7] Depends on the number of terms, etc. For the automatic adhoc run with automatic query expansion (uclaa2), the mean was 84 wall-clock seconds including the merge.
[8] Hits come out tagged by query. Individual queries are already ranked, so in our approach retrieval time and ranking time cannot be separated.
[9] Experiment: topics 51-100 versus disk 3:

    Collection   # of docs   # of feats   First rank avg. (median) [sec CPU]   1000th rank avg. (median) [sec CPU]
    AP3          78'325      142          1.84 (1.12)                          4.67 (4.58)
    PATN3        6'708       181          0.21 (0.15)                          0.80 (0.80)
    SJM3         90'253      126          1.30 (1.27)                          4.78 (4.60)
    ZIFF3.1      100'000     26           1.04 (0.97)                          1.78 (1.67)
    ZIFF3.2      61'021      120          1.01 (0.88)                          3.74 (3.75)

    Experiment: topics 101-150 versus disks 1 and 2:

    Collection   # of docs   # of feats   First rank avg. (median) [sec CPU]   1000th rank avg. (median) [sec CPU]
    AP1          84'677      141          1.47 (1.50)                          5.18 (5.02)
    AP2          79'923      138          1.38 (1.37)                          4.96 (4.92)
    DOE1.1       100'000     42           1.42 (1.43)                          2.93 (2.18)
    DOE1.2       100'000     43           1.41 (1.47)                          2.98 (2.15)
    DOE1.3       26'087      43           0.41 (0.42)                          0.92 (0.90)
    FR1          26'207      166          0.68 (0.55)                          2.22 (1.98)
    FR2          20'108      170          0.57 (0.42)                          1.77 (1.63)
    WSJ1         98'627      121          1.64 (1.68)                          5.96 (5.55)
    WSJ2         74'520      115          1.17 (1.20)                          3.57 (3.62)
    ZIFF1        75'180      107          1.27 (1.23)                          4.22 (3.80)
    ZIFF2        56'920      103          0.90 (0.88)                          2.90 (2.80)

III. SEARCHING

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF
Total Computer Time to Search (seconds): 2-5 minutes [1] | 3-5 seconds per document for routing | 18,000 total cpu time ... decompressing test data
Retrieval Time (cpu seconds): After indexing, approximately 60 seconds | 110 cpu seconds
Ranking Time (cpu seconds): No ranking; items are ordered by date (all items satisfy the query) | 1 second | 600 total cpu time
Methods Used in Machine Searching:
  Probabilistic Model: Yes
  Free Text Scanning: Yes
  Other: Shortest paths among query words as they occur in documents (although only direct matches were used in the official results) | Binary-classification trees mapped into TOPIC query trees

III. SEARCHING

[1] We are not able to give separate answers for these. Using the disk 3 data indexed under TOPIC, it takes between 2-5 minutes to perform a complete search; this depends on the complexity of the topic.

III. SEARCHING (CON'T)

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers | Siemens | UMASS | VPI
Factors Used in Ranking:
  Term Frequency: Yes | Yes | Yes | See INQUERY queries | Yes
  Inverse Document Frequency: Yes | Yes | tried but discarded (query terms only) | Yes
  Other Term Weights: Yes | Yes | Yes, see "Other" | Yes [3] | Yes (Query)
  Semantic Closeness: No | Yes [4] | No
  Position in Document: [1] | No
  Syntactic Clues: No | No
  Proximity of Terms: Yes | Yes | tried but discarded | Yes
  Information Theoretic Weights: Yes | No | Yes [7]
  Document Length: Yes | Yes | Yes, to calculate relative frequencies | Yes [5] | No [8]
  Completeness: No | No
  N-gram Frequency: No | No
  Word Specificity: No | Yes [6] | No
  Word Sense Frequency: No | No
  Cluster Distance: No | No
  Other: Yes [2]

III. SEARCHING (CON'T)

[1] Stem occurrence frequencies in titles were doubled in some collections.
[2] Variables used were: optimally relativized query and document stem frequencies, global relative frequency of the stem in all document texts; 2nd routing run: stem ...
[3] Link type weight that reflects the relative importance of different lexical relations; set to 0.5 for these experiments.
[4] Insofar as terms closely related to the selected synonym sets are added to the query.
[5] Cosine normalization of weights in both documents and queries.
[6] Used subjectively when choosing synsets to add.
[7] A Mutual Information Measure determines how phrases are evaluated, which indirectly affects the rank.
[8] We use maximum TF though, which seems to be correlated with document length.
- Modified per the SMART "ann" ranking given above.
- Indirectly, via use of maximum term frequency.
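Several of the ranking-factor rows above (term frequency, inverse document frequency, document length via cosine normalization) combine into the weighting used by the vector-space systems in these charts. A minimal sketch, assuming the SMART-style "ltc" form -- (1 + log tf) x idf with cosine normalization -- mentioned elsewhere in the appendix; all names are illustrative:

    import math
    from collections import Counter

    def ltc_weights(doc_terms, idf):
        """SMART-style 'ltc' weighting: (1 + log tf) * idf, cosine-normalized.
        doc_terms: list of tokens; idf: {term: inverse document frequency}."""
        tf = Counter(doc_terms)
        w = {t: (1.0 + math.log(f)) * idf.get(t, 0.0) for t, f in tf.items()}
        norm = math.sqrt(sum(x * x for x in w.values())) or 1.0
        return {t: x / norm for t, x in w.items()}

    def cosine(q, d):
        """Inner product of two cosine-normalized sparse vectors."""
        return sum(w * d.get(t, 0.0) for t, w in q.items())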
III. SEARCHING (CON'T)

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Factors Used in Ranking:
  Term Frequency: Yes | TF_AUG (0.5 + 0.5 * TF/MAX TF) | Used in the term-document input to the SVD, and also in the query weight | Yes | Yes | Yes (Stage 2)
  Inverse Document Frequency: Yes | Yes | Used in the term-document input to the SVD, and also in the query weight | Yes | Yes | Yes [3]
  Other Term Weights: Yes, relevance ... weights for routing | Yes [1] | Term similarities | Yes (Stage 1) | ...
  Semantic Closeness: The cosine between a query and a document (in LSI-space) represents their derived similarity | Yes (Stage 2)
  Syntactic Clues: Yes [4]
  Document Length: Yes | Yes, as a part of the cosine measure | Yes [3]
  Completeness: Yes [5]
  Word Sense Frequency: Yes
  Cluster Distance: Document vector distance to query vector | Semantic relationship between concepts (Stage 2) | Yes [2]
  Other: Rank is determined by the cos(query, doc) in LSI-space

III. SEARCHING (CON'T)

[1] An importance coefficient (with a value of 0.5, 1, 2, or 3) was assigned based on the position of the term in the topic statement. Assignments were subject to correction during manual processing.
[2] Similarity scores are calculated over sub-documents; the maximum score for any sub-document is assigned as the score for the whole document.
[3] Yes (Stage 1's subject field code module).
[4] Yes (Stage 2).
[5] Yes (the existence and nature of relationships between concepts are important -- Stage 2).
- Yes (Stage 1's proper noun, complex nominal, and text structure boolean criteria matching module).
- Yes (Stage 2).

III. SEARCHING (CON'T)

SYSTEM NAME: CITY | ERIM | TMC | CHRI | UCLA | SEC | ETH | TRW
Factors Used in Ranking:
  Term Frequency: Sometimes | Yes | Yes | Sometimes | Normalized feature frequency | Yes
  Inverse Document Frequency: Yes | Yes | Yes | Yes | Yes | Normalized inverse document frequency
  Other Term Weights: From relevance information when available | Based on relevant (training) documents retrieved by a given term | From relevance information when available | derived ... similarity analysis ... database ... relevance
  Position in Document: Based inversely on the distance in paragraphs from the beginning of the document
  Syntactic Clues: Lexical recognition of some antecedent-reference combinations
  Proximity of Terms: Adjacent words are used for phrases | Yes
  Document Length: Sometimes | Yes | Sometimes | Euclidean length of the document vector
  Cluster Distance: [1]

III. SEARCHING (CON'T)

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF
Factors Used in Ranking:
  Term Frequency: No | Yes | Yes, within the subset of documents containing at least one query word
  Inverse Document Frequency: No
  Other Term Weights: No
  Semantic Closeness: No | Yes, semantic net distance is used
  Position in Document: No
  Syntactic Clues: No
  Proximity of Terms: No | Not a weight, but a threshold of 3 word positions is used in initial processing | Two ...
  Information Theoretic Weights: No
  Document Length: No
  Completeness: No
  N-gram Frequency: No
  Word Specificity: No
  Word Sense Frequency: No
  Cluster Distance: No
  Other: Statistical estimates of the misclassification rate of the classifier | Order in the data collection file ...
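Note [2] above scores a whole document by the best of its paragraph-sized sub-documents. A minimal sketch of that aggregation, assuming a similarity function such as the cosine helper sketched earlier is already available (all names illustrative):

    def document_score(subdocuments, query, sim):
        """Score each paragraph-sized sub-document against the query and
        assign the maximum sub-document score to the whole document,
        as note [2] above describes."""
        return max(sim(query, sub) for sub in subdocuments)

    # Usage, with the ltc/cosine helpers sketched earlier:
    # score = document_score(paragraph_vectors, query_vector, cosine)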
IV. MACHINE SPECIFICATIONS

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers | Siemens | UMASS
Type of Machine Used: Sparc 2; Adhoc: Sparc 10 [1] | Sparc 2 | DECstation 5000/125 | Sun Sparc 10/51 | Sparc 10/41 | Sparcserver 690 | Sun ...
Amount of RAM: 64 MB; Adhoc: 160 MB | 64 MB | 48 MB | 96 MB | 128 MB | 128 MB | 40 ...
Clock Rate of CPU: 25 MHz | 50 MHz | 40 MHz | Unknown | 25 ...

IV. MACHINE SPECIFICATIONS

[1] Except actual retrieval runs for routing, which were done on a Sparc 1 with 12 MB RAM.

IV. MACHINE SPECIFICATIONS

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE
Type of Machine Used: Sparc 10 | 4 x DEC 3000/400 workstations (ALPHA AXP) running DEC OSF/1 | Sparc 10 | Sparc 10 | Sparc 2 | SUN 4 (several different kinds)
Amount of RAM: 64 MB | 1 @ 128 MB, 1 @ 64 MB, 2 @ 32 MB | One machine had 128 MB; another had 384 MB | 512 MB | 96 MB
Clock Rate of CPU: 40 MHz | 133.33 MHz (however, effective performance with currently available compilers is approximately 2 times slower than the clock rate would suggest) | 28.5 MIPS | 40 MHz ...

IV. MACHINE SPECIFICATIONS

SYSTEM NAME: CITY | ERIM | TMC | CHRI | UCLA | SEC | ETH | TRW
Type of Machine Used: SPARC 2 | Sparcstations, mainly 6 Sparcstation 2's | A 64-node CM5 | Sun Model 512 MP 690 (one processor only) | SunSPARC 2 | PC compatible 486/25 | SPARCstation 10 Model 41 | FDF-3 search ...
Amount of RAM: 40 MB some of the time | Each machine had at least 32 MB | 32 MB/node | 160 MB | 32 MB | 12 MB | 128 MB | N/A; 2 GB for the FDF-3 configuration
Clock Rate of CPU: Don't know | 33 MHz for the Sparc 2 (I think) | 40 MHz (Sparc chip) | 50 MIPS | Don't know | 25 MHz | 40 MHz | 3-3.5 ...

IV. MACHINE SPECIFICATIONS

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF
Type of Machine Used: Tree building primarily done on a Sparc 1; routing results generated using a Sparc 10 | Sparcstation 10 | Sparcserver 690 MP | IBM 3090-300J for initial indexing, then an RS-6000 for weighting
Amount of RAM: 16 MB Sparc 1, 32 MB Sparc 10 | 128 MB | 128 MB | IBM 3090 running CMS allows a maximum of 16 MB ... as a virtual machine; the RS-6000 had 128 MB of RAM
Clock Rate of CPU: Unknown -- but in any case, most of the computing was done with networked data servers, so clock rates don't tell us anything | 40 MHz | 4 Cypress CY 6050 processors | IBM 3090: 69 MHz; RS-6000: 66 MHz

V. SYSTEM COMPARISON

SYSTEM NAME: Dortmund | Cornell | Berkeley | Rutgers | Siemens | UMASS | VPI

How much "software engineering" went into the system's development?
- Dortmund: Several years.
- Cornell: Several years. SMART has been well-engineered with a primary goal of flexibility, not raw speed.
- Berkeley: None except for the probabilistic logic. The Berkeley system is an experimental prototype only, programmed as a minimal modification of the SMART system.
- Rutgers: For the data fusion part, approximately 60 hours; for the query combination parts, approximately 150 hours.
- Siemens: "Our" system is essentially SMART, with many enhancements. Modifications made by the SMART group at Cornell for last year's TREC were used in these runs.
- UMASS: INQUERY is a research system. About 10 person-years went into its development prior to these experiments.
- VPI: The basic system was a version of SMART with many enhancements added in the years before TREC; the TREC work consisted of outside programs merging/combining results from individual SMART retrieval runs made during a single query and index ...

With appropriate resources, could the system be speeded up? By how much?
- Dortmund: Query regression would take 20-30% less time.
- Cornell: Yes -- see the discussion in SMART's documentation: SMART is "not strongly optimized for any one particular use."
- Berkeley: Use of inverted vectors; if the query terms were stored in a cache ... The Berkeley system has roughly the same efficiency characteristics as SMART.
- Rutgers: For the data fusion part, by a factor of 8; for the other parts, unknown.
- Siemens: Yes, at least a factor of 2.
- UMASS: With enough disk space ... the retrieval time could be reduced significantly.
- VPI: Multiple retrieval runs were added to SMART and are restricted by the SMART code; changes have been made ... implemented as a retrieval system being fitted into existing code.

What features are missing that would benefit your system?
- Dortmund: Feature recognition (e.g., company names, geographical locations, dates, amounts of money).
- Cornell: Might benefit from a conflator, thesaurus, disambiguator, and the use of many other clue types.
- Rutgers: For the data fusion part, a lookup procedure to convert raw scores to ranks, based on the training set. This is necessary for true routing as opposed to batch scoring of "routing" queries.
- Siemens: Word finder (an on-line concept association database).
- UMASS/VPI: Phrase identification ... matching; proper noun identification.

V. SYSTEM COMPARISON

SYSTEM NAME: GE | CLARIT | LSI | HNC | NYU | SYRACUSE

How much "software engineering" went into the system's development?
- GE: Zero.
- CLARIT: The system used for TREC-2 processing was developed as a university-research prototype. It is engineered for robustness and flexibility, rather than speed. Most of the components of the system are less than two years old.
- LSI: The LSI system was built as a research prototype to look at human interface issues, and designed to work on much smaller databases. The research-prototype code (essentially all C) is not production-quality.
- HNC: Quite a lot of code rewriting was done to adjust the NIST system to handle the large index (8 times larger than without compound terms). I'd guess that about 1-2 person-years were spent on various aspects of the system.
- NYU: Not much. It was the first prototype ...
- SYRACUSE: ...

With appropriate resources, could the system be speeded up? By how much?
- GE: For routing, it could be a bit faster. For retrieval, it is speed-compatible with any inverted indexing strategy.
- CLARIT: We anticipate at least an order of magnitude speed improvement in the system within the next six months. This will be possible due to (1) re-engineering of the system and (2) the use of optimization utilities sold for the DEC ALPHA platform. (The current OSF compiler does not optimize code appropriately for the ALPHA (64-bit) architecture, with disappointing results.) [1]
- LSI: Yes, 20%-40% searching the entire database; many orders of magnitude faster with document clustering (we had an order-of-15 speed-up on a foreign corpus).
- HNC: The base IR system is much better than it was during TREC-1. However, the second phase of index building is still slow and fragile.
- NYU: Yes. With careful design of the data structures and elimination of the features added for experimental purposes, the speed can be improved significantly, at least by an order of magnitude.
- SYRACUSE: ...

What features are missing that would benefit your system?
- GE: This approach is very simple and has no fancy features. Better tokenization, special query handling, proximity, and negation, for example, could help a lot, as would better ranking.
- CLARIT: The CLARIT TREC-2 system did not take advantage of several processing options that may have given improved results, including tokenization, sub-lexicon discovery over training sets, and EQ-class discovery for thesaurus terms.
- LSI: Better tokenization, including proper noun identification, phrases, and perhaps some better treatment of "nots." Precision-enhancing methods would also help some.
- HNC: Word sense disambiguation (already in early NLP development); document clustering (to speed up retrievals); faster indexing.
- NYU: There is still a lot of room for improvement of basic NLP programs. A feedback mechanism would be helpful.
- SYRACUSE: [2]
V. SYSTEM COMPARISON

[1] (a) I'd guess that simple database and matching operations could be speeded up by a factor of 2-3 with a rewrite of the software to do what we now know we need (rather than what we thought we might want when we started). (b) Most of the initial analysis time is spent in computing the SVD decomposition of the term-document matrix. The sparse-iterative algorithm is orders of magnitude faster than the dense algorithm we used 2 years ago. We might find additional improvements of 2-3 times by using lower-precision arithmetic. Parallel algorithms might help, but again, only by a factor of 2. This analysis is a one-time cost for relatively stable collections. (c) Query processing is slow. Although the LSI vectors have many fewer dimensions than standard vector representations, the vectors are dense: every query is related to every document; it's just a matter of how much. Thus, we cannot take advantage of efficient inverted indices or other structures. It is trivial to match queries to documents in parallel; improvements here are limited only by the number of processors we have! We are also exploring heuristic methods for finding near neighbors in high-dimensional spaces.
[2] There are many features that have not been included in the system because it was the very first prototype. As in the paper, many improvements are under way in reducing errors in text processing, reducing the complexity of the representation, improving the quality of the knowledge bases, and improving time and space through a redesign of the data structures and implementations.

V. SYSTEM COMPARISON

SYSTEM NAME: CITY | ERIM | TMC | CHRI | UCLA | SEC | ETH | TRW

How much "software engineering" went into the system's development?
- CITY: Very little TREC-specific in the search system, although TREC has spurred the development of some additional features. It is a generalized bibliographic retrieval system which has undergone continual modification to meet the requirements of a number of research projects since 1983, involving many thousands of person-hours.
- ERIM: Our system is a ... rather than a research prototype; ... it took two person-months to build.
- TMC: A reasonable amount. We have been interested in algorithmic aspects -- speed, and memory and disk requirements.
- TRW: About 1 week ... for the TRW1 test run; about 1 month of tools for the TRW2 test run. The underlying FDF has been developed over ... at TRW and PAR...

With appropriate resources, could the system be speeded up? By how much?
- CITY: Disk I/O is the most serious bottleneck in searching and in outputting documents. Keeping the entire database in core could speed real time by more than an order of magnitude, CPU time by much less. Perhaps 3 GB of core is too much to expect yet; more practically, faster disks and a faster bus. Indexing, on the other hand, is CPU-bound most of the time; this could be distributed over N processors. Time would behave approximately like A + B/N + CN for a given database, where typically B > A and B >> C.
- ERIM: By converting it to a true retrieval system, I would guess that we could speed it up by at least 20-fold.
- TMC: Yes, it scales linearly with the size of the CM5; but even at the current size, retrieval could be speeded up by a factor of 2 or 3. Memory optimization is the main limiting factor for now.
- Others: Same as CITY | 30%?? | A lot.
- TRW: Sure. We expect substantial performance increases ... a single FDF-3 unit (3-1...) ... a variety of hardware and firmware improvements ... releases of FDF software include features to harness multiple FDFs running the queries in parallel; performance improves linearly with the number of FDFs used. Incorporating ideas from the TREC experiments on automatic query generation will reduce query size and improve system performance. Such tools are under development.

What features are missing that would benefit your system?
- CITY: Normalization (document length, cosine normalization, etc.), more elaborate use of proximity information and clustering of terms, query expansion, stemming. Use of relevance feedback.
- ERIM: The system has a relatively basic user interface. No mechanism for feedback or other query modification.
- TMC: The system has a major feature that is designed but not yet implemented: the use of query feedback.
- UCLA: Same as CITY.
- Other: The itemized feature list in my paper.

V. SYSTEM COMPARISON

- We just need something, because other systems do better! We've not done much with phrases: it seems likely that extensive use of phrases would help. ... What we need is a good method of phrase discovery. We were attracted by the idea of treating paragraphs as documents this time, but didn't ...
- Needs a more elaborate database model.
- We are more directly incorporating term weighting into our system. Better query construction, evaluation, and refinement tools are under development. We expect to incorporate several ideas developed in this exercise into automatic query generation and refinement tools; incorporation of these ideas should improve the performance of our system. Some specific improvements include: better integration of term weights; better tools for initial query construction; better stemming and stop-word elimination; evaluation of search term independence; better document similarity metrics.
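Note (b) in the LSI comments above concerns the truncated SVD behind LSI: the term-document matrix is reduced to its best k-dimensional approximation, and retrieval happens in that reduced space. A minimal sketch using a sparse iterative solver -- the matrix, weighting, and k here are toy stand-ins, not the group's setup:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # Toy stand-in for a weighted term-document matrix (terms x documents).
    A = sparse_random(1000, 200, density=0.01, format="csc", random_state=0)

    # Best k-dimensional approximation via sparse iterative SVD.
    k = 50
    U, s, Vt = svds(A, k=k)

    # Documents are compared in the k-dimensional space:
    doc_vectors = Vt.T * s  # one k-dim vector per document

    def fold_in(query_term_vector):
        """Project a dense term-space query vector into the k-dim space
        (the usual fold-in convention, q @ U / s)."""
        return (query_term_vector @ U) / s

    q = np.zeros(1000)
    q[[3, 17]] = 1.0  # a two-term toy query
    q_k = fold_in(q)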
V. SYSTEM COMPARISON

SYSTEM NAME: ADS | UIC | DALHOUSIE | MEAD | UIF

How much "software engineering" went into the system's development?
- ADS: Approximately 4 person-weeks to "clean up" last year's TREC-1 experimental code -- both the CART algorithm and the tool we used to convert CART trees into TOPIC trees. TOPIC was "off-the-shelf."
- UIC: TREC-2 upgrades involved approximately one person-month.
- DALHOUSIE: Strictly a research prototype. Retrofit of an adhoc interactive system to process filters; upsized to work on the large item sets generated from the large data files.
- MEAD: Approximately 280 hours of programming have been used to develop the neural network. The system is implemented in C and uses a scanner for text processing. The scanner is borrowed from the QA system built for NASA.
- UIF: 4 person-months ... to develop the system ...

With appropriate resources, could the system be speeded up? By how much?
- ADS: Undoubtedly -- if we had started out intending to use TOPIC as the actual test environment, we would have designed a system that made use of TOPIC's data preparation utilities, giving us an order-of-magnitude speed-up in tree building.
- UIC: With parallel processing, an order-of-magnitude increase in speed would be expected. Without parallel processing, improvements on the order of 100% would be expected from optimizations of the current software. Restructuring the data representation probably results in an order-of-magnitude increase in a serial processing mode.
- DALHOUSIE: Yes, 100-200% faster. Being a prototype, optimization of searching for multiple terms was not implemented, and lots of messages, including one per record, are still displayed on the screen as the system churns away.
- MEAD: Yes. Our system can easily run in parallel. The disk processing time can be approximately reduced by the ratio [time required using one CPU] / [total number of CPUs].
- UIF: With 5 ... could decompress data ... before runs ...

What features are missing that would benefit your system?
- ADS: We still have not experimented with external resources such as part-of-speech taggers and lexicons that might be used to expand both the feature set and the complexity of the CART trees; nor have we experimented with using low-level topics as features -- all of these could be expected to give improved results.
- UIC: The shortest path algorithm needs to be implemented. For TREC-2, only direct pair matches were involved; identifying indirect paths is under development. Tests with several topics after the official results were submitted showed that use of indirect paths results in improvements to capture of relevant documents at 93%. Weighting, however, needs further enhancement, because improvements are at 45% for the top 1000 document cutoff.
- DALHOUSIE: Functions to screen for primitives other than simple strings, such as dates and names. Automatic analysis of the contents of the data files to assist the user in recognizing patterns that "may" be useful when searching that particular file.
- MEAD: The following software improvements would benefit retrieval performance: 1. Adding inverted index and term frequency information for term weighting. 2. Using a larger and more accurate semantic lexicon. 3. Using training text to improve the neural network performance.
- UIF: ...
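The UIC answer above ranks documents by paths among query words as they occur in documents, with the official TREC-2 results limited to direct pair matches. A minimal sketch of that direct-pair part -- counting distinct query words that co-occur within a small positional window -- with the function name and window size as illustrative assumptions:

    def direct_pair_matches(tokens, query_words, window=3):
        """Count occurrences of two distinct query words within `window`
        word positions of each other (the 'direct matches' used officially;
        indirect paths through intermediate words are not attempted here)."""
        positions = [(i, t) for i, t in enumerate(tokens) if t in query_words]
        pairs = 0
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                if positions[b][0] - positions[a][0] > window:
                    break
                if positions[a][1] != positions[b][1]:
                    pairs += 1
        return pairs

    # Example: two query words within three positions count as one direct pair.
    score = direct_pair_matches("trade pact signed between nations".split(),
                                {"trade", "pact"})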