APPENDIX C

This appendix contains the supplemental forms filled out by each group about their system. These forms are meant to supplement the papers and contain a standardized, formatted description of system features and timing aspects.

System Summary and Timing
City University, London

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list  yes
      a. how many words in list?  126 general stop words + 6 function words, excluded from indexes and queries. There is also a semi-stopword list of 256 words and phrases; these are not used in query expansion following relevance feedback unless they occur in the original query.
    2. is a controlled vocabulary used?  No. But see I C 1 b.
    3. stemming  yes
      a. standard stemming algorithms  A moderately weak suffixing algorithm based on M. F. Porter, "An algorithm for suffix stripping," Program, 14(3), Jul 1980, 130-137. We also use a degree of British/American spelling conflation. (An illustrative sketch of this kind of token filtering follows this section.)
      b. morphological analysis  no
    4. term weighting  No. Query terms are weighted, but not index terms.
    5. phrase discovery  no
    6. syntactic parsing  no
    7. word sense disambiguation  no
    8. heuristic associations  no
    9. spelling checking (with manual correction)  no
    10. spelling correction  no
    11. proper noun identification algorithm  no
    12. tokenizer (recognizes dates, phone numbers, common patterns)  no
    13. are the manually-indexed terms used?  no
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  810
      b. total computer time to build (approximate number of hours)  43
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  No. Insufficient disk space to do this.
      e. single terms only?  Single terms and pre-specified phrases (see I C 1 b below)
  C. Data built from sources other than the input text
    1. internally-built auxiliary files  One manually-built file.
      a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file)  Loosely domain-dependent
      b. type of file (thesaurus, knowledge base, lexicon, etc.)  Small quasi-thesaurus containing synonym classes, prefixes, go phrases, stopwords, function words and semi-stopwords (see I A 1 a for semi-stopwords).
      c. total amount of storage (megabytes)  0.013
      d. total number of concepts represented  About 1500
      e. type of representation (frames, semantic nets, rules, etc.)  Simple
      f. total computer time to build (approximate number of hours)  Manually built. Structured at runtime; time negligible.
      g. total manual time to build (approximate number of hours)  Perhaps 8 person-hours. Several iterations, based on frequency counts from indexing runs, other similar files, TREC queries and documents.
      h. use of manual labor
        (4) other (describe)  Manually built using a text editor
    2. externally-built auxiliary file  no lookup table
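A minimal sketch of the kind of index-time token filtering described in I A above: stoplist and function-word exclusion, light British/American spelling conflation, and deliberately weak suffix stripping. The word lists and rules below are illustrative stand-ins, not City University's actual data or algorithm.

```python
# Illustrative sketch only; lists and rules are toy placeholders.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "are"}   # 126 + 6 in the real system
FUNCTION_WORDS = {"not", "no"}                             # placeholder function words

def weak_stem(word):
    """A deliberately weak suffix stripper, far weaker than full Porter."""
    for suffix in ("ings", "ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def conflate_spelling(word):
    """Very rough British -> American conflation (toy rules)."""
    if word.endswith("ise"):
        return word[:-3] + "ize"
    if word.endswith("our"):
        return word[:-3] + "or"
    return word

def index_terms(text):
    """Tokens that would reach the inverted index under these toy rules."""
    out = []
    for tok in text.lower().split():
        tok = tok.strip(".,;:!?\"'()")
        if not tok or tok in STOPWORDS or tok in FUNCTION_WORDS:
            continue
        out.append(conflate_spelling(weak_stem(tok)))
    return out

print(index_terms("The colours of the flags are fading"))
# -> ['color', 'flag', 'fad'] under these toy rules
```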
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Concepts. Other fields were tried but gave overall (though not uniformly) worse results.
    2. total computer time to build query (cpu seconds)  0.02 seconds/query to parse the topic and extract terms
    3. which of the following were used?
      j. other (describe)  Concept terms processed and weighted. Term weight = constant * log(((r+c)/(R-r+1-c)) / ((n-r+c)/(N-n-R+r+1-c))), where N is the number of indexed documents, n the number of documents containing the term, R the number of known relevant documents, r the number of known relevant documents containing the term, and c = 0.5. Weights are rounded to the nearest integer. (A sketch of this weighting follows this section.)
  B. Manually constructed queries (ad hoc)
    1. topic fields used  Any: searchers' free choice.
    2. average time to build query (minutes)  About 40 minutes (often including trial searches)
    3. type of query builder  Six searchers were used. None was a domain expert. Two might be described as experts on the search system.
    4. tools used to build query
      c. other lexical tools (identify)  Trial lookups giving frequency. Trial searches.
    5. which of the following were used?
      a. term weighting  As in II A 3 j above
      b. Boolean connectors (AND, OR, NOT)  All available. AND and OR were used in a number of searches.
      d. addition of terms not included in topic
        (1) source of terms  Searchers' world knowledge and terms from relevant documents found in trial searches.
  C. Feedback (ad hoc)
    1. initial query built by method 1 or method 2?  Method 2
    2. type of person doing feedback  Searchers were Masters students in Information Science and two people working on the TREC project.
    3. average time to do complete feedback
      a. CPU time (total CPU seconds for all iterations)  About 20 seconds
      b. clock time from initial construction of query to completion of final query (minutes)  About 20 minutes
    4. average number of iterations  One
      a. average number of documents examined per iteration  About 20
    5. minimum number of iterations  One
    6. maximum number of iterations  One
    7. what determines the end of an iteration?  Searchers were recommended to stop after assessing 20 documents or when they had found 10 relevant documents. These guidelines were not always adhered to.
    8. feedback methods used
      b. automatic query expansion from relevant documents
        (2) only top X terms added (what is X)  The term pool was all query terms plus all non-semi-stop terms from relevant documents. The former were given an R-value of R + 3 and an r-value of r + 2. The top 20 terms were used, selected in descending order of (term_weight * r) and weighted using the formula given previously. See sections II A 3 j and I A 1 a for "R", "r", and "semi-stop".
  D. Automatically built queries (routing)
    1. topic fields used  Concepts
    2. total computer time to build query (cpu seconds)  Depended strongly on the number of known relevant documents in the training set and their length. Average perhaps 10 minutes.
    3. which of the following were used in building the query?
      a. terms selected from
        (1) topic
        (3) only documents with relevance judgments
      b. term weighting  As in II A 3 j above, except R = R + 10 and r = r + 10 for concept terms.
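The relevance weight in II A 3 j and the expansion rule in II C 8 b translate directly into code. Below is a minimal sketch under stated assumptions: the function names and statistics layout are ours, not City University's, and no guards are included for degenerate counts (e.g., r boosted close to n).

```python
# Sketch of the relevance weight (II A 3 j) and feedback term
# selection (II C 8 b); layout and names are illustrative.
import math

C = 0.5  # the constant c in the formula above

def relevance_weight(N, n, R, r, constant=1.0):
    """constant * log(((r+c)/(R-r+1-c)) / ((n-r+c)/(N-n-R+r+1-c))),
    rounded to the nearest integer as stated on the form."""
    w = constant * math.log(((r + C) / (R - r + 1 - C)) /
                            ((n - r + C) / (N - n - R + r + 1 - C)))
    return round(w)

def expand_query(query_terms, rel_doc_terms, stats, N, R, semi_stop):
    """Pick the top 20 expansion terms as described in II C 8 b.

    stats maps term -> (n, r); original query terms get the R+3 / r+2
    boost; semi-stopwords from relevant documents are excluded unless
    they were already in the query."""
    pool = set(query_terms) | {t for t in rel_doc_terms if t not in semi_stop}
    scored = []
    for t in pool:
        n, r = stats[t]
        if t in query_terms:                  # boost original query terms
            w = relevance_weight(N, n, R + 3, r + 2)
        else:
            w = relevance_weight(N, n, R, r)
        scored.append((w * r, t, w))          # rank by term_weight * r
    scored.sort(reverse=True)
    return [(t, w) for _, t, w in scored[:20]]
```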
III. Searching
  A. Total computer time to search (cpu seconds)
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)  Typical figure for a 12-term search producing an output list of 350,000 document identifiers: 45 seconds. (Note that in an interactive production system we would use a weight threshold, which would reduce this by perhaps 50%; a sketch of such a cutoff appears at the end of this form.)
    2. ranking time (total cpu seconds to sort document list)  For the list above: about 65 seconds (a weight threshold would reduce this by 50-90%).
  B. Which methods best describe your machine searching methods?
    2. probabilistic model
  C. What factors are included in your ranking?
    2. inverse document frequency  Inverse document frequency, and relevance information when available (see above for the weighting scheme).
IV. What machine did you conduct the TREC experiment on?  Sun SPARCserver 4/330 with a Sun IPC as fileserver
    How much RAM did it have?  16 megabytes
    What was the clock rate of the CPU?  Not specified. Sun claims 16 MIPS.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  The system is non-commercial. It has undergone continual modification since 1982 to meet the requirements of a number of different research projects, mainly on end-user bibliographic searching.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Faster hardware would of course increase speed. The main bottleneck is disk and network I/O. Very large amounts of RAM (of the order of a gigabyte per process--or which could be shared between processes searching the same database) would greatly reduce I/O dependence. On the software side, earlier versions of the system were often optimised for speed at some cost in added complexity and reduced flexibility. This optimisation has been removed from the version produced for TREC, partly because interactive searching by general users was not envisaged. It is impossible to give definite estimates of speed improvements, but it would not be unreasonable to expect an order of magnitude improvement within current hardware and software constraints.
  3. What features is your system missing that it would benefit by if it had them?  Given enough disk we would have stored positional information in the indexes, and probably used it to modify document weights, perhaps by giving weight bonuses for term proximity. This would have increased inversion storage overheads to a little over 100% of bibliographic file size. (This is not really a "missing feature," because the system does have the capability.) We might have considered some form of weight adjustment for document length. This would involve a modification of the index structure which might just have been feasible within the disk constraints. Other possibilities worth investigating include phrase discovery and term dependency statistics.
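A toy sketch of the weight-threshold idea mentioned in III A 1: term-at-a-time accumulator scoring, with documents below a weight threshold dropped before the separately timed ranking (sort) step. The data layout and threshold rule are illustrative assumptions, not City University's implementation.

```python
# Accumulator scoring with a weight threshold; all names are ours.
from collections import defaultdict

def search(query_weights, postings, threshold=0):
    """query_weights: term -> integer weight (see II A 3 j).
    postings: term -> list of doc ids containing the term."""
    acc = defaultdict(int)                 # doc id -> accumulated weight
    for term, w in query_weights.items():
        for doc in postings.get(term, ()):
            acc[doc] += w
    # raising the threshold shrinks the list handed to the sort,
    # which is where the separately timed ranking cost lives
    hits = [(score, doc) for doc, score in acc.items() if score >= threshold]
    hits.sort(reverse=True)
    return hits
```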
System Summary and Timing
University of Pittsburgh

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  2,529 words on the list, including digits (0-9).
    3. stemming
      a. standard stemming algorithms  We use the Porter stemming algorithm, as implemented in C by C. Fox.
    4. term weighting
    5. phrase discovery
    6. syntactic parsing
    7. word sense disambiguation
    8. heuristic associations
    9. spelling checking (with manual correction)
    10. spelling correction
    11. proper noun identification algorithm
    12. tokenizer (recognizes dates, phone numbers, common patterns)
    13. are the manually-indexed terms used?
    14. other techniques used to build data structures (brief description)
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  For the storage space information, only the data on disk one is available. The following table gives the data in megabytes.

                           DOE     AP    ZIFF    WSJ
         Inverted files  162.3  199.8  143.7  223.4
         Indexed files      2.2    2.4    2.1
         Address files      4.3    1.7    1.7    2.6

         Notes: data on FR is not available (loaded onto tapes); address files are index files which contain document numbers and their offsets in the text files where the documents are stored.
      b. total computer time to build (approximate number of hours)  Please refer to Table 1 in our paper.
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  yes
  C. Data built from sources other than the input text  --no
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Training queries: title and concepts are used, although nationality may be included if necessary to meet the narrative item. Routing queries: the routing queries are the final converged queries from the training queries, with no modification. Ad hoc queries: title and concepts are used, and some keywords from the narrative items are added.
    2. total computer time to build query (cpu seconds)  Computing time to build queries is not available.
    3. which of the following were used?
      a. term weighting with weights based on terms in topics  Term weighting for queries is assigned by the system; term weight modification is a research topic of ours. Note that the stemming algorithm used in document processing was also applied to query terms. Training queries: all term weights were assigned automatically by the system and adjusted by the system using feedback information. Routing queries: the term weights are those from the last generation of the training queries; no changes are applied. Ad hoc queries: for one query individual the term weights were assigned manually by the researchers; the other query individuals' term weights were generated by the system. (Note: our system uses 10 query individuals searching documents simultaneously.) The term weights were also adjusted using the feedback information.
  C. Feedback (ad hoc)
    1. initial query built by method 1 or method 2?  Initial queries were built by method 1 (automatic).
    2. type of person doing feedback  Evaluation is done by our researchers.
    3. average time to do complete feedback  Please refer to Table 2 in our paper.
    4. average number of iterations  3 iterations on average.
    5. minimum number of iterations  0
    6. maximum number of iterations  9
    7. what determines the end of an iteration?
       No more relevant documents are retrieved, or further feedback is not worthwhile given the time constraints.
    8. feedback methods used  Query terms are automatically modified by the system using its genetic algorithm.
III. Searching
  A. Total computer time to search (cpu seconds)
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)  Please refer to Table 2 in our paper.
    2. ranking time (total cpu seconds to sort document list)  not available
  B. Which methods best describe your machine searching methods?
    1. vector space model  A distance function (Lp metric) is used as the similarity measurement.
  C. What factors are included in your ranking?
    15. other (specify)  Document ranking is based on the distance: the shorter the distance, the higher the rank. That is, the document with the shortest distance is put at the top of the list. (A sketch of this ranking appears at the end of this form.)
IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?
    Two types of systems were used. Sun-670: 32 MB RAM and 40 MHz CPU clock rate; Sun SPARC/IPC: 24 MB RAM and 25 MHz clock rate.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  If our system could be implemented on a parallel machine, retrieval could be 10 times faster.
  3. What features is your system missing that it would benefit by if it had them?  There are many parameters which can be adjusted to make our system more flexible and more adaptive. We need to build a good user interface through which several parameters can be controlled and manipulated by the users.
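A minimal sketch of the distance-based ranking described in III B and III C of this form: documents are ranked by an Lp metric between query and document vectors, shortest distance first. The choice p = 2 and the vector layout are assumptions for illustration.

```python
# Lp-metric ranking sketch; p and the data layout are assumptions.
def lp_distance(q, d, p=2):
    """Lp metric between a query vector q and a document vector d."""
    return sum(abs(qi - di) ** p for qi, di in zip(q, d)) ** (1.0 / p)

def rank(query_vec, doc_vecs, p=2):
    """Smaller distance -> higher rank (shortest distance first)."""
    scored = [(lp_distance(query_vec, d, p), doc_id)
              for doc_id, d in doc_vecs.items()]
    return sorted(scored)

docs = {"d1": [1.0, 0.0, 2.0], "d2": [0.5, 0.5, 1.5]}
print(rank([1.0, 0.0, 1.5], docs))   # d1 is closer under L2, so it ranks first
```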
System Summary and Timing
Cornell University
Run 1: Single term automatic ad hoc run (global/local match)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  570
    2. is a controlled vocabulary used?  no
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART
    4. term weighting
       In docs, tf * idf; cosine normalization (ntc)
       In queries, tf * idf; cosine normalization (ntc)
       In sentences, tf * idf; no normalization (ntn)
       (The ntc scheme is sketched at the end of this form.)
    5. phrase discovery
    6. syntactic parsing
    7. word sense disambiguation
    8. heuristic associations
    9. spelling checking (with manual correction)
    10. spelling correction
    11. proper noun identification algorithm
    12. tokenizer (recognizes dates, phone numbers, common patterns)
    13. are the manually-indexed terms used?
    14. other techniques used to build data structures (brief description)
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  690
      b. total computer time to build (approximate number of hours)  4.7 hours to create doc vectors from text; 0.7 hours to reweight doc vectors and produce the inverted file
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  yes
    5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
      a. total amount of storage (megabytes)  68 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  Automatic
      d. brief description of methods used
       other data structures built from TREC text (what?)  Map from internal concept to token string
      a. total amount of storage (megabytes)  18 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  Automatic
      d. brief description of methods used
  C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
    2. total computer time to build query (cpu seconds)  1.5 seconds for all queries
    3. which of the following were used?
      a. term weighting with weights based on terms in topics (idf)
III. Searching
  A. Total computer time to search (cpu seconds)  1465 seconds (includes retrieval + ranking + indexing 500 docs per query).
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    1. vector space model
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    7. proximity of terms within sentence  needed for local similarity
    8. information theoretic weights
    9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
    How much RAM did it have?  64 MB
    What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! Retrieval with local similarity needed to index 500 docs per query; this could all be done in advance if a single local approach had been decided on, reducing retrieval time by a factor of 5. A 6-machine distributed version of SMART should be faster by a factor of 3 for both indexing and retrieval.
  3. What features is your system missing that it would benefit by if it had them?  The distributed version has not been fully implemented yet.
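A small sketch of SMART-style "ntc" weighting (natural tf, idf, cosine normalization) named in I A 4 of this form. The dictionary-based layout is an illustrative assumption; with both vectors ntc-normalized, the cosine similarity reduces to a dot product.

```python
# ntc weighting sketch: tf * idf, then cosine normalization.
import math

def ntc_vector(term_freqs, df, N):
    """term_freqs: term -> raw tf in one document (or query).
    df: term -> number of documents containing the term. N: collection size."""
    w = {t: tf * math.log(N / df[t]) for t, tf in term_freqs.items()}
    norm = math.sqrt(sum(x * x for x in w.values()))   # cosine normalization
    return {t: x / norm for t, x in w.items()}

def cosine(q, d):
    """With ntc vectors already unit-length, cosine is a dot product."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())
```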
System Summary and Timing
Cornell University
Run 2: Phrase automatic ad hoc (Cornell Global/Local)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  570
    2. is a controlled vocabulary used?  Not for single terms. A phrase list was automatically constructed from phrases occurring 25 times or more in the first doc set (D1). Only those phrases were used.
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART
    4. term weighting
       In docs, tf * idf; cosine normalization over length of single terms (ntc)
       In queries, tf * idf; cosine normalization over length of single terms (ntc)
       In sentences, tf * idf; no normalization (ntn)
       Phrases weighted using their natural tf*idf, cosine normalized by the length of single terms, and divided by sqrt(2). [A phrase match is worth 0.5 of a single-term match.]
    5. phrase discovery
      a. what kind of phrase?  Adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set. (Sketched below, after section II.)
    6. syntactic parsing
    7. word sense disambiguation
    8. heuristic associations
    9. spelling checking (with manual correction)
    10. spelling correction
    11. proper noun identification algorithm
    12. tokenizer (recognizes dates, phone numbers, common patterns)
    13. are the manually-indexed terms used?
    14. other techniques used to build data structures (brief description)
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  840
      b. total computer time to build (approximate number of hours)  9.7 hours to create doc vectors from text; 0.9 hours to reweight doc vectors and produce the inverted file
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  no
    5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
      a. total amount of storage (megabytes)  68 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  Automatic
       other data structures built from TREC text (what?)  Map from internal concept to token string
      a. total amount of storage (megabytes)  25 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  Automatic
       other data structures built from TREC text (what?)  Phrase dictionary (controlled vocabulary). Phrases were adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
      a. total amount of storage (megabytes)  14 Mbytes to store the dictionary.
      b. total computer time to build (approximate number of hours)  It took 5.8 CPU hours to index D1, finding candidate phrases and their collection statistics. Of those phrases, 158,000 occurred at least 25 times.
      c. is the process completely automatic?
  C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
    2. total computer time to build query (cpu seconds)  2.7 seconds for all 50 queries
    3. which of the following were used?
      a. term weighting with weights based on terms in topics (idf)
      b. phrase extraction from topics  yes, using the controlled list of phrases
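An illustrative sketch of the phrase dictionary construction described in I 5 and I B above: pairs of adjacent non-stopwords, components stemmed, kept only if they occur at least 25 times in the D1 document set. We read "adjacent" literally here; SMART may instead pair words that become adjacent once stopwords are skipped, and the stemmer and stoplist below are placeholders.

```python
# Phrase-dictionary sketch; stoplist and stemmer are toy stand-ins.
from collections import Counter

STOP = {"the", "of", "and", "a", "in", "to"}    # stand-in stoplist

def stem(w):                                     # placeholder stemmer
    return w[:-1] if w.endswith("s") else w

def phrase_counts(documents):
    counts = Counter()
    for text in documents:
        words = text.lower().split()
        for w1, w2 in zip(words, words[1:]):
            if w1 in STOP or w2 in STOP:         # only adjacent non-stopwords
                continue
            counts[(stem(w1), stem(w2))] += 1
    return counts

def phrase_dictionary(documents, min_count=25):
    """Keep only phrases meeting the frequency cutoff (25 in D1)."""
    return {p for p, c in phrase_counts(documents).items() if c >= min_count}
```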
III. Searching
  A. Total computer time to search (cpu seconds)  2405 seconds (includes retrieval + ranking + indexing 500 docs/query).
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    1. vector space model
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    7. proximity of terms  for phrases and for local similarity between sentences
    9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
    How much RAM did it have?  64 MB
    What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course!
  3. What features is your system missing that it would benefit by if it had them?

System Summary and Timing
Cornell University
Run 3: Automatic routing (Cornell Ide feedback)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  570
    2. is a controlled vocabulary used?  no
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART
      b. morphological analysis
    4. term weighting  In docs + queries, tf * idf; cosine normalization (ntc) (in docs, idf is based on collection frequency within doc set D1 only)
    5. phrase discovery
    6. syntactic parsing
    7. word sense disambiguation
    8. heuristic associations
    9. spelling checking (with manual correction)
    10. spelling correction
    11. proper noun identification algorithm
    12. tokenizer (recognizes dates, phone numbers, common patterns)
    13. are the manually-indexed terms used?
    14. other techniques used to build data structures (brief description)
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  275
      b. total computer time to build (approximate number of hours)  1.9 hours (not including time to index D1 to obtain collection frequency info)
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  yes
    5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
      a. total amount of storage (megabytes)  24 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  Automatic
       other data structures built from TREC text (what?)  Map from internal concept to token string
      a. total amount of storage (megabytes)  13 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation for D1.
      c. is the process completely automatic?  Automatic
  C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
  D. Automatically built queries (routing)
    1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
    2. total computer time to build query (cpu seconds)  300
    3. which of the following were used in building the query?
      a. terms selected from
        (1) topic
        (3) only documents with relevance judgments
      b. term weighting
        (1) with weights based on terms in topics
        (2) with weights based on terms in all training documents
        (3) with weights based on terms from documents with relevance judgments
      l. expansion of queries using previously-constructed data structure (from part I)
        (1) which structure?  30 best terms from relevant docs (a sketch appears at the end of this form)
III. Searching
  A. Total computer time to search (cpu seconds)  293 seconds (includes retrieval + ranking).
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    1. vector space model
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
    How much RAM did it have?  64 MB
    What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course!
  3. What features is your system missing that it would benefit by if it had them?
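A rough sketch of Ide-style routing-query expansion consistent with II D 3 of this form: document vectors from judged-relevant documents are summed into the topic vector, and the 30 best new terms are kept. The form does not say which Ide variant (e.g., dec-hi) Cornell used, so this is a generic reading, with names and data layout of our choosing.

```python
# Ide-style expansion sketch: topic vector + sum of relevant doc vectors,
# keeping the 30 best terms not already in the topic.
from collections import defaultdict

def expand_routing_query(topic_vec, relevant_doc_vecs, n_expansion=30):
    acc = defaultdict(float)
    for dv in relevant_doc_vecs:          # Ide: unweighted sum over rel docs
        for term, w in dv.items():
            acc[term] += w
    new_terms = [(w, t) for t, w in acc.items() if t not in topic_vec]
    new_terms.sort(reverse=True)
    expanded = dict(topic_vec)
    for w, t in new_terms[:n_expansion]:  # the "30 best terms" of II D 3
        expanded[t] = w
    for t in topic_vec:                   # original terms also gain feedback mass
        expanded[t] += acc.get(t, 0.0)
    return expanded
```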
System Summary and Timing
University of California, Berkeley

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list  Yes, augmented SMART stoplist
      a. how many words in list?  About 600
    2. is a controlled vocabulary used?  no
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART system (Version 10) stemmer
      b. morphological analysis  none
    4. term weighting  yes. Weights determined from various frequency statistics by logistic regression
    5. phrase discovery  none
    6. syntactic parsing  none
    7. word sense disambiguation  none
    8. heuristic associations  none
    9. spelling checking (with manual correction)  none
    10. spelling correction  none
    11. proper noun identification algorithm  none
    12. tokenizer (recognizes dates, phone numbers, common patterns)  none
    13. are the manually-indexed terms used?  no
    14. other techniques used to build data structures (brief description)
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  Ranges from 70 to 180 MB for each of the five collections
      b. total computer time to build (approximate number of hours)  Ranges from 6 to 14 hours for each of the five collections
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  yes
  C. Data built from sources other than the input text  --no
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  all
    2. total computer time to build query (cpu seconds)  around 3 seconds total per query
    3. which of the following were used?
      a. term weighting with weights based on terms in topics
      j. other (describe)  Absolute and relative frequency of each stem in the query were used to weight the stems, using a formula obtained by logistic regression from the WSJ relevance data.
III. Searching
  A. Total computer time to search (cpu seconds)
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    2. probabilistic model  Yes, probabilistic searching based on the linked dependence assumption and two stages of logistic regression, as described in Proceedings ACM/SIGIR, Copenhagen, June 1992.
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    3. other term weights (where do they come from?)  see 15 below
    5. position in document  stem occurrence frequencies in titles were doubled in some collections
    9. document length
    15. other (specify)  Variables used were: absolute and relative frequency of stem in query; absolute and relative frequency of stem in document; inverse document frequency of stem in collection; global relative frequency of stem in all document texts; document length measured in stem-occurrences. (A schematic sketch appears at the end of this form.)
IV. What machine did you conduct the TREC experiment on?  Three different machines:
    1. DECstation 5000/125 with 16 megabytes RAM for most work.
    2. DECstation 5000/125 with 64 megabytes RAM for a little.
    3. IBM Model 3090 for the logistic regression analysis.
    How much RAM did it have? What was the clock rate of the CPU?  25 MHz for the 16-megabyte DECstation; this was used for the timed retrieval runs. 40 MHz for the 64-megabyte DECstation.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  None, except for the novel two-stage probabilistic logic. The Berkeley system is an experimental prototype only, programmed as a minimal modification of the SMART system.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Yes, see the discussion in SMART's documentation: SMART is "not strongly optimized for any one particular use." The Berkeley system has roughly the same efficiency characteristics as SMART.
  3. What features is your system missing that it would benefit by if it had them?  Would probably benefit from a conflator, a thesaurus, a disambiguator, phrase discovery, stem proximity detection, etc. The Berkeley system is a bare-bones design, intended only to explore the workability of staged logistic regression.
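A schematic sketch of ranking by logistic regression over the variables listed in III C 15 of this form: a per-stem relevance log-odds is a learned linear function of frequency statistics, per-stem contributions are combined under the linked dependence assumption, and documents are ranked by the resulting probability. The coefficients and feature names below are placeholders, not Berkeley's fitted values, and the real system uses two regression stages rather than the single stage shown.

```python
# Single-stage schematic of logistic-regression ranking; all numbers
# and names here are made up for illustration.
import math

COEF = {"bias": -3.5, "rel_tf_doc": 6.0, "rel_tf_query": 1.5, "idf": 0.07,
        "global_freq": -2.0, "doc_len": -0.0001}   # placeholder coefficients

def stem_logodds(x):
    """x: dict of the III C 15 variables for one query stem in one doc."""
    return (COEF["bias"]
            + COEF["rel_tf_doc"] * x["rel_tf_doc"]
            + COEF["rel_tf_query"] * x["rel_tf_query"]
            + COEF["idf"] * x["idf"]
            + COEF["global_freq"] * x["global_freq"]
            + COEF["doc_len"] * x["doc_len"])

def doc_score(stem_variables):
    """Combine per-stem evidence for one document; under the linked
    dependence assumption the per-stem log-odds are summed."""
    logodds = sum(stem_logodds(x) for x in stem_variables)
    return 1.0 / (1.0 + math.exp(-logodds))   # logistic transform to a probability
```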
System Summary and Timing
Universitaet Dortmund
Single term automatic ad hoc run (Fuhr learning)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  570
    2. is a controlled vocabulary used?  no
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART
      b. morphological analysis
    4. term weighting
       In docs, linear combination of several factors
       In queries, tf * idf; cosine normalization (ntc)
    5. phrase discovery  no
    6. syntactic parsing  no
    7. word sense disambiguation  no
    8. heuristic associations  no
    9. spelling checking (with manual correction)  no
    10. spelling correction  no
    11. proper noun identification algorithm  no
    12. tokenizer (recognizes dates, phone numbers, common patterns)  no
    13. are the manually-indexed terms used?  no
    14. other techniques used to build data structures (brief description)  Coefficients for the linear combinations used in weighting were determined automatically using Q1, D1, and judgments of Q1 on D1. This took 1.7 hours (not including 2.6 hours to index Q1, D1). (A sketch appears after section III below.)
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  690
      b. total computer time to build (approximate number of hours)  4.7 hours to create doc vectors from text; 1.7 hours to reweight doc vectors and produce the inverted file
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  yes
    5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
      a. total amount of storage (megabytes)  68 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  yes
       other data structures built from TREC text (what?)  Map from internal concept to token string
      a. total amount of storage (megabytes)  18 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  yes
  C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
    2. total computer time to build query (cpu seconds)  1.5 seconds
    3. which of the following were used?
      a. term weighting with weights based on terms in topics (idf)
III. Searching
  A. Total computer time to search (cpu seconds)  383 seconds (includes retrieval + ranking).
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    1. vector space model
    2. probabilistic model
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    8. information theoretic weights
    9. document length
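A small sketch of the learned "linear combination of several factors" document weighting described in I A 4 and I A 14 of this form: indexing weights are a linear function of term/document features, with coefficients fitted against the Q1-on-D1 relevance judgments. The factor set and the plain least-squares fit are illustrative assumptions; the actual procedure follows Fuhr's retrieval-function learning, not this code.

```python
# Learned linear indexing weights, sketched with ordinary least squares.
import numpy as np

def fit_coefficients(features, relevance):
    """features: (n_samples, n_factors) factor values for query-term/
    document pairs drawn from Q1 x D1; relevance: 0/1 judgments."""
    X = np.hstack([np.ones((features.shape[0], 1)), features])  # add intercept
    coef, *_ = np.linalg.lstsq(X, relevance, rcond=None)
    return coef

def indexing_weight(coef, factor_values):
    """Weight of one term in one document under the learned combination."""
    return coef[0] + factor_values @ coef[1:]

# toy usage with three hypothetical factors (e.g., tf, idf, position info)
X = np.array([[3, 1.2, 0.5], [1, 0.4, 0.9], [5, 2.0, 0.1]], dtype=float)
y = np.array([1.0, 0.0, 1.0])
coef = fit_coefficients(X, y)
print(indexing_weight(coef, np.array([2, 1.0, 0.3])))
```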
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
    How much RAM did it have?  64 MB
    What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself; 2 person-weeks for the Fuhr weighting code
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! A 6-machine distributed version of SMART should be faster by a factor of 3 for both indexing and retrieval.
  3. What features is your system missing that it would benefit by if it had them?  The distributed version has not been fully implemented yet.

System Summary and Timing
Universitaet Dortmund
Phrase automatic ad hoc (Fuhr learning)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  570
    2. is a controlled vocabulary used?  Not for single terms. A phrase list was automatically constructed from phrases occurring 25 times or more in the first doc set (D1). Only those phrases were used.
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART
      b. morphological analysis
    4. term weighting
       In docs, linear combination of several factors
       In queries, tf * idf; cosine normalization (ntc)
    5. phrase discovery
      a. what kind of phrase?  Adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
      b. using statistical methods
      c. using syntactic methods
    6. syntactic parsing  no
    7. word sense disambiguation  no
    8. heuristic associations  no
    9. spelling checking (with manual correction)  no
    10. spelling correction  no
    11. proper noun identification algorithm  no
    12. tokenizer (recognizes dates, phone numbers, common patterns)  no
    13. are the manually-indexed terms used?  no
    14. other techniques used to build data structures (brief description)  Coefficients for the linear combinations used in weighting were determined automatically using Q1, D1, and judgments of Q1 on D1. This took 2.4 hours (not including 5.6 hours to index Q1, D1).
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  840
      b. total computer time to build (approximate number of hours)  9.7 hours to create doc vectors from text; 2.9 hours to reweight doc vectors and produce the inverted file
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  no
    5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
      a. total amount of storage (megabytes)  68 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  yes
       other data structures built from TREC text (what?)  Map from internal concept to token string
      a. total amount of storage (megabytes)  25 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  yes
       other data structures built from TREC text (what?)  Phrase dictionary (controlled vocabulary). Phrases were adjacent non-stopwords, components stemmed, that occurred at least 25 times in the D1 document set.
      a. total amount of storage (megabytes)  14 Mbytes to store the dictionary.
      b. total computer time to build (approximate number of hours)  It took 5.8 hours to index D1, finding candidate phrases and their collection statistics. Of those phrases, 158,000 occurred at least 25 times.
      c. is the process completely automatic?
  C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Topic, Nationality, Narrative, Concepts, Factors, Description
    2. total computer time to build query (cpu seconds)  2.7 seconds
    3. which of the following were used?
      a. term weighting with weights based on terms in topics (idf)
      b. phrase extraction from topics  yes, using the controlled list of phrases
III. Searching
  A. Total computer time to search (cpu seconds)  374 seconds (includes retrieval + ranking).
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    1. vector space model
    2. probabilistic model
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    7. proximity of terms (for phrases)
    8. information theoretic weights
    9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
    How much RAM did it have?  64 MB
    What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself; 2 person-weeks for the Fuhr weighting code
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! Phrase indexing can be sped up by 40% through an algorithm change (the speed-up has been done for single terms, but not for phrases).
  3. What features is your system missing that it would benefit by if it had them?

System Summary and Timing
Universitaet Dortmund
Automatic routing (RPI feedback)

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list
      a. how many words in list?  570
    2. is a controlled vocabulary used?  no
    3. stemming  yes
      a. standard stemming algorithms  which ones?  SMART
      b. morphological analysis
    4. term weighting  In docs + queries, tf * idf; cosine normalization (ntc) (in docs, idf is based on collection frequency within doc set D1 only)
    5. phrase discovery  no
    6. syntactic parsing  no
    7. word sense disambiguation  no
    8. heuristic associations  no
    9. spelling checking (with manual correction)  no
    10. spelling correction  no
    11. proper noun identification algorithm  no
    12. tokenizer (recognizes dates, phone numbers, common patterns)  no
    13. are the manually-indexed terms used?  no
    14. other techniques used to build data structures (brief description)  no
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index
      a. total amount of storage (megabytes)  275
      b. total computer time to build (approximate number of hours)  1.9 hours (not including time to index D1 to obtain collection frequency info)
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  yes
    5. other data structures built from TREC text (what?)  Map from doc id to text location (also gives the title for each doc)
      a. total amount of storage (megabytes)  24 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation above.
      c. is the process completely automatic?  yes
       other data structures built from TREC text (what?)  Map from internal concept to token string
      a. total amount of storage (megabytes)  13 Mbytes
      b. total computer time to build (approximate number of hours)  Time to create included in inverted file creation of D1.
      c. is the process completely automatic?  yes
  C. Data built from sources other than the input text  None, other than the stopword file.
II. Query construction (please fill out a section for each query construction method used)
  D. Automatically built queries (routing)
    1. topic fields used  all
    2. total computer time to build query (cpu seconds)  1300 seconds, not including time to index D1 (3.0 hours)
    3. which of the following were used in building the query?
      a. terms selected from
        (1) topic
      b. term weighting
        (1) with weights based on terms in topics
        (2) with weights based on terms in all training documents
        (3) with weights based on terms from documents with relevance judgments
III. Searching
  A. Total computer time to search (cpu seconds)  312 seconds (includes retrieval + ranking).
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
    2. ranking time (total cpu seconds to sort document list)
  B. Which methods best describe your machine searching methods?
    1. vector space model
    2. probabilistic model
  C. What factors are included in your ranking?
    1. term frequency
    2. inverse document frequency
    8. information theoretic weights
    9. document length
IV. What machine did you conduct the TREC experiment on?  Sun SPARC 2
    How much RAM did it have?  64 MB
    What was the clock rate of the CPU?  40 MHz
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  About 3 person-years for the SMART system itself
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Of course! Due to an algorithm flaw, the CPU time for constructing a routing query is about a factor of 5 too much (the algorithm found the best terms to expand by even though we had requested expansion by 0 terms).
  3. What features is your system missing that it would benefit by if it had them?

System Summary and Timing
University of Illinois at Chicago

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
   Each document is represented as a set of word pairs. Pairs were formed from all adjacent words, plus all words separated by one and two intermediate words. Documents were the unit of organization for the data structure. If a pair occurred only once in a document, it was dropped from the data structure for that document only. A sample record is as follows:

      MULTIMEDIA ENCYCLOPEDIA 2 WSJ880815-0014

   The number of times the pair occurred in the document appears in the third field, just before the document id. (A sketch of this pairing appears after II A below.)
  A. Which of the following were used to build your data structures?
    1. stopword list  The stopword list from SMART version 10 was used, plus some additional stop words from TREC markup codes.
      a. how many words in list?  The total size of the stoplist was 631 words.
    2. is a controlled vocabulary used?  none
    3. stemming  none
      a. standard stemming algorithms  which ones?  Some small stemming experiments were later performed using the code from SMART version 10 and three training queries. For query 002 stemming had no effect, while for query 006 it resulted in a 43% increase in recall, and for query 009 a 73% improvement in recall.
      b. morphological analysis  none
    4. term weighting  None. Weighting was planned but could not be implemented given limitations that arose.
    5. phrase discovery
      a. what kind of phrase?  Word pairs occurring within three word positions of one another.
      b. using statistical methods  All such pairs were identified.
      c. using syntactic methods
    6. syntactic parsing  none
    7. word sense disambiguation  none
    8. heuristic associations
      a. short definition of these associations  Only the basic pairing associations were used.
    9. spelling checking (with manual correction)  none
    10. spelling correction  none
    11. proper noun identification algorithm  none
    12. tokenizer (recognizes dates, phone numbers, common patterns)  none
    13. are the manually-indexed terms used?  none
    14. other techniques used to build data structures (brief description)  none
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    1. inverted index  Based only on pairs, not individual terms.
      a. total amount of storage (megabytes)  819 megabytes
      b. total computer time to build (approximate number of hours)  100 hours
      c. is the process completely automatic?  yes
      d. are term positions within documents stored?  no
      e. single terms only?  none
    2. n-grams, suffix arrays, signature files  See B 1.
  C. Data built from sources other than the input text  --no
II. Query construction (please fill out a section for each query construction method used)
  A. Automatically built queries (ad hoc)
    1. topic fields used  Title, Description, Narrative, and Concepts (only the first two).
    2. total computer time to build query (cpu seconds)  0.26 seconds
    3. which of the following were used?  none
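A sketch of the word-pair document representation described in section I of this form: pairs of adjacent words plus pairs separated by one or two intermediate words, with pairs occurring only once in a document dropped for that document. The record format mimics the sample record above; the real system also applied its 631-word stoplist, omitted here for brevity.

```python
# Word-pair record sketch; text handling details are assumptions.
from collections import Counter

def word_pairs(doc_id, text):
    words = text.upper().split()
    counts = Counter()
    for gap in (1, 2, 3):                 # adjacent, one and two words apart
        for i in range(len(words) - gap):
            counts[(words[i], words[i + gap])] += 1
    # a pair occurring only once in a document is dropped for that doc
    return [f"{a} {b} {c} {doc_id}" for (a, b), c in counts.items() if c > 1]

sample = "multimedia encyclopedia on cd rom multimedia encyclopedia"
for rec in word_pairs("WSJ880815-0014", sample):
    print(rec)   # -> "MULTIMEDIA ENCYCLOPEDIA 2 WSJ880815-0014"
```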
  D. Automatically built queries (routing)
    1. topic fields used  Title, Description, Narrative, Concepts (first two).
    2. total computer time to build query (cpu seconds)  55 seconds
    3. which of the following were used in building the query?
      c. phrase extraction
        (2) from all training documents  Word pairs occurring in the relevant training documents for the query but not in the irrelevant documents were used.
III. Searching
  A. Total computer time to search (cpu seconds)
    1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)  This was not optimized for the current experiments. Run time was approximately 20 minutes per search. Proper optimization will reduce this time.
    2. ranking time (total cpu seconds to sort document list)  .22 seconds
  B. Which methods best describe your machine searching methods?
    4. n-gram matching
  C. What factors are included in your ranking?
    11. n-gram frequency
IV. What machine did you conduct the TREC experiment on?  IBM 3090/300J
    How much RAM did it have?  16 Meg for a virtual machine.
    What was the clock rate of the CPU?  14.5 nanoseconds, or 69 MHz.
V. Some systems are research prototypes and others are commercial. To help compare these systems:
  1. How much "software engineering" went into the development of your system?  40 hours of new development, beyond using word-pairing tools that were developed earlier over a period of years.
  2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?  Yes, search time could be reduced, but a reliable estimate of how much cannot be made at this time.
  3. What features is your system missing that it would benefit by if it had them?  Phrase weighting; term weighting and auxiliary single-term search; stemming; removal of pair-order effects; shortest-path network search.
System Summary and Timing
Bellcore

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
  A. Which of the following were used to build your data structures?
    1. stopword list  yes (though some experiments without stoplist)
      a. how many words in list?  n=439; standard SMART list, I think
    2. is a controlled vocabulary used?  no
    3. stemming  none (except truncation at 20 characters/word)
    4. term weighting  yes, log(tf) * (1 - entropy) (sketched after I B below)
    5. phrase discovery  no
    6. syntactic parsing  no
    7. word sense disambiguation  no
    8. heuristic associations  no
    9. spelling checking (with manual correction)  no
    10. spelling correction  no (not directly, but the LSI analysis does some of this for free)
    11. proper noun identification algorithm  no
    12. tokenizer (recognizes dates, phone numbers, common patterns)
    13. are the manually-indexed terms used?  no
    14. other techniques used to build data structures (brief description)  LSI/SVD analysis of the term-by-document matrix. Takes the raw term-by-doc matrix; transforms entries using log-entropy term weightings; calculates the best "reduced-dimensional" approximation to the transformed matrix using the SVD. Number of dimensions: 250-350. Does all query-doc matching in this reduced-dimension vector space.
  B. Statistics on data structures built from TREC text (please fill out each applicable section)
    5. other data structures built from TREC text (what?)  LSI/SVD uses reduced-dimensional vectors (see below for a description of how they are derived). The number of dims was between 235 and 250. There is one such vector for each term and for each document. Queries are also represented as vectors and compared to every document.
      a. total amount of storage (megabytes)  All reduced-dimensional vectors are stored in a binary database. The database consists of a vector for every doc and every term occurring in more than one doc. The vectors currently consist of single-precision real values. For TREC, we built one database for each collection. Approx. 50,000 docs are sampled; terms that occur in more than one of these documents are used in the SVD analysis; the remaining docs are added to the database.

         DOE1  - docs: 226087, terms: 42221, ndim: 250  -> 262 meg db
         WSJ1  - docs:  99111, terms:      , ndim: 250  -> 169 meg db
         AP1   - docs:  84930, terms: 78167, ndim: 250  -> 163 meg db
         ZIFF1 - docs:  75180, terms: 60565, ndim: 250  -> 135 meg db
         FR1   - docs:  26207, terms: 54713, ndim: 250  ->  80 meg db
         WSJ2  - docs:  74520, terms:      , ndim: 235  -> 141 meg db
         AP2   - docs:  79923, terms: 82997, ndim: 235  -> 153 meg db
         ZIFF2 - docs:  56920, terms: 72197, ndim: 235  -> 121 meg db
         FR2   - docs:       , terms: 48728, ndim: 235  ->  64 meg db

         Used 250 dims for routing and 235 dims for ad hoc queries. In general, database size will be: (ndocs + nterms) * ndim * 4. The totals here are 1288 meg (750,000 docs and 585,000 terms). If a single database had been used, the total would have been smaller because of term overlap--currently, many of the terms are represented in more than one database; there are only 200,000 unique terms.
      b. total computer time to build (approximate number of hours)  Four main stages:
         1. indexing (extracting keys; calculating wts; etc.)
         2. SVD (number of dimensions extracted ranged from 235-310)
            NOTE 1: only 235-250 dims were actually used for retrieval. I don't have timing data for extracting only this smaller number of dimensions, but I'd estimate that the numbers for AP1, ZIFF1 and FR1 could be reduced by about 20%.
            NOTE 2: initial indexing and SVD are typically done on a subset of 50,000 docs and their terms
         3. various i/o translations (much of this will go away soon)
         4. adding new docs to dbase (if sub-sampled for SVD). SVD done on 50,000 docs; the remaining docs are indexed and added to the database after the SVD.

         All times in MINUTES (SVD run on DEC5000; rest on SPARC2):

         DOE1  - index:  49  SVD: 1219  io: 194  add: 591  SUM: 2053 mins
         WSJ1  - index: 241  SVD: 1474  io: 174  add: 404  SUM: 2293 mins
         AP1   - index: 271  SVD: 1644  io: 214  add: 455  SUM: 2584 mins
         ZIFF1 - index: 241  SVD: 1359  io: 156  add: 352  SUM: 2108 mins
         FR1   - index: 241  SVD:  939  io: 133  add:   0  SUM: 1313 mins
         WSJ2  - index: 427  SVD: 1382  io: 220  add: 461  SUM: 2490 mins
         AP2   - index: 338  SVD: 1210  io: 218  add: 273  SUM: 2039 mins
         ZIFF2 - index: 260  SVD: 1452  io: 208  add:   0  SUM: 1920 mins
         FR2   - index: 187  SVD:  486  io: 105  add:   0  SUM:  778 mins
      c. is the process completely automatic?  YES
      d. brief description of methods used  LSI/SVD analysis of the document collection:
         1. creates raw term-by-doc matrix; transforms entries using log-entropy term weightings
         2. calculates best "reduced-dimensional" approximation to the transformed matrix using the SVD. The number of dimensions in the SVD calculations ranged from 235 to 310, BUT only 235 or 250 were used for the comparisons. Fewer dims could have been calculated, so some reported SVD times are higher than necessary; I'd estimate about 20% reductions in SVD times for AP1, ZIFF1, and FR1.
         3. performs various database translations. The current SVD program outputs vectors in a different format and order than we need for the database. It will eventually output vectors in the appropriate database format, and this entire step can be omitted.
         4. SVD calculations usually run on ~50,000 docs x nterms matrices. The remaining docs (if any) were indexed and added to the database here.
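A small sketch of the log-entropy transform named in I A 4 and I B 5 d above: each cell of the term-by-document matrix becomes a log-scaled tf damped by the term's normalized entropy over documents. The log(1 + tf) form, natural logs, and entropy normalization are common conventions assumed here; the form does not spell them out.

```python
# Log-entropy weighting sketch; conventions are assumptions.
import math

def log_entropy_matrix(tf):
    """tf: dict term -> {doc: raw count}. Returns transformed weights."""
    ndocs = len({d for by_doc in tf.values() for d in by_doc})
    out = {}
    for term, by_doc in tf.items():
        total = sum(by_doc.values())
        # entropy of the term's distribution over documents, normalized
        # so that an evenly spread term has entropy ~1 (low usefulness)
        h = -sum((c / total) * math.log(c / total) for c in by_doc.values())
        entropy = h / math.log(ndocs) if ndocs > 1 else 0.0
        out[term] = {d: math.log(1 + c) * (1 - entropy)
                     for d, c in by_doc.items()}
    return out
```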
B. Statistics on data structures built from TREC text (please fill out each applicable section)
5. other data structures built from TREC text (what?)
   LSI/SVD uses reduced-dimensional vectors (see above for a description of how they are derived). The number of dims was between 235 and 250. There is one such vector for each term and for each document. Queries are also represented as vectors and compared to every document.
   a. total amount of storage (megabytes)
      All reduced-dimensional vectors are stored in a binary database. The database consists of a vector for every doc and every term occurring in more than one doc. The vectors currently consist of single-precision real values. For TREC, we built one database for each collection. Approx. 50,000 docs are sampled; terms that occur in more than one of these documents are used in the SVD analysis. The remaining docs are added to the database.

      DOE1  - docs: 226,087  terms: 42,221  ndim: 250  -> 262 meg db
      WSJ1  - docs:  99,111  terms:         ndim: 250  -> 169 meg db
      AP1   - docs:  84,930  terms: 78,167  ndim: 250  -> 163 meg db
      ZIFF1 - docs:  75,180  terms: 60,565  ndim: 250  -> 135 meg db
      FR1   - docs:  26,207  terms: 54,713  ndim: 250  ->  80 meg db
      WSJ2  - docs:  74,520  terms:         ndim: 235  -> 141 meg db
      AP2   - docs:  79,923  terms: 82,997  ndim: 235  -> 153 meg db
      ZIFF2 - docs:  56,920  terms: 72,197  ndim: 235  -> 121 meg db
      FR2   - docs:          terms: 48,728  ndim: 235  ->  64 meg db

      Used 250 dims for routing and 235 dims for ad hoc queries. In general, database size will be (ndocs + nterms) * ndim * 4 bytes. The totals here are 1288 meg (about 750,000 docs and 585,000 terms). If a single database had been used, the total would have been smaller because of term overlap--currently, many of the terms are represented in more than one database; there are only about 200,000 unique terms.
   b. total computer time to build (approximate number of hours)
      Four main stages:
      1. indexing (extracting keys; calculating weights; etc.)
      2. SVD (number of dimensions extracted ranged from 235-310)
         NOTE 1: only 235-250 dims were actually used for retrieval. I don't have timing data for extracting only this smaller number of dimensions, but I'd estimate that the numbers for AP1, ZIFF1 and FR1 could be reduced by about 20%.
         NOTE 2: initial indexing and SVD are typically done on a subset of 50,000 docs and their terms
      3. various i/o translations (much of this will go away soon)
      4. adding new docs to the database (if sub-sampled for SVD). SVD done on 50,000 docs; the remaining docs are indexed and added to the database after the SVD.

      All times in MINUTES (SVD run on DEC5000; rest on SPARC2):

      DOE1  - index:  49  SVD: 1219  io: 194  add: 591  SUM: 2053 mins
      WSJ1  - index: 241  SVD: 1474  io: 174  add: 404  SUM: 2293 mins
      AP1   - index: 271  SVD: 1644  io: 214  add: 455  SUM: 2584 mins
      ZIFF1 - index: 241  SVD: 1359  io: 156  add: 352  SUM: 2108 mins
      FR1   - index: 241  SVD:  939  io: 133  add:   0  SUM: 1313 mins
      WSJ2  - index: 427  SVD: 1382  io: 220  add: 461  SUM: 2490 mins
      AP2   - index: 338  SVD: 1210  io: 218  add: 273  SUM: 2039 mins
      ZIFF2 - index: 260  SVD: 1452  io: 208  add:   0  SUM: 1920 mins
      FR2   - index: 187  SVD:  486  io: 105  add:   0  SUM:  778 mins

   c. is the process completely automatic? YES
   d. brief description of methods used
      LSI/SVD analysis of the document collection:
      1. creates raw term-by-doc matrix; transforms entries using log-entropy term weightings.
      2. calculates the best "reduced-dimensional" approximation to the transformed matrix using SVD. The number of dimensions in the SVD calculations ranged from 235 to 310, BUT only 235 or 250 were used for the comparisons. Fewer dims could have been calculated, so some reported SVD times are higher than necessary; I'd estimate about 20% reductions in SVD times for AP1, ZIFF1, and FR1.
      3. performs various database translations. The current SVD program outputs vectors in a different format and order than we need for the database. It will eventually output vectors in the appropriate database format, and this entire step can be omitted.
      4. SVD calculations usually run on ~50,000 docs x nterms matrices. The remaining docs (if any) were indexed and added to the database here.

C. Data built from sources other than the input text
   no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
   yes. Submitted two sets of ad hoc queries; the queries were the same in both cases; the only difference was how information from different sub-collections was combined.
1. topic fields used
   all (except NO manually indexed terms used)
2. total computer time to build query (cpu seconds)
   Queries are vector sums of their constituent term vectors (see the sketch after section III.A below). A separate query vector is created for matching against each of 9 databases (DOE, WSJ1, AP1, FR1, ZIFF1, WSJ2, AP2, FR2, ZIFF2).
   Time = 0.4 sec/query/database -> 3.6 secs/query
   NOTE: These times simulate handling each query separately (so there is no i/o buffering). There are big improvements if you initially read in all the term vectors and create all the ad hoc queries at once.
3. which of the following were used?
   a. term weighting with weights based on terms in topics
      term weighting, but weights based on term usage in the document collections
   h. expansion of queries using previously-constructed data structure (from part I)
      not really

D. Automatically built queries (routing)
   yes. Submitted two sets of routing queries. Both were automatically created, from 1) the text of the topics and 2) the relevant documents.
1. topic fields used
   all (except NO manually indexed terms) for both 1) and 2)
2. total computer time to build query (cpu seconds)
   Queries are vector sums of constituent term vectors [case 1)] or document vectors [case 2)]. A separate query vector is created for matching against each of 4 separate databases (WSJ1, AP1, FR1, ZIFF1).
   Time = 0.4 sec/query/database in case 1) -> 1.6 secs/query
   Time = 0.1 sec/query/database in case 2) -> 0.4 secs/query
   NOTE: These times simulate handling each query separately (so there is no i/o buffering).
3. which of the following were used in building the query?
   a. terms selected from
      (1) topic -- case 1)
      (3) only documents with relevance judgments -- case 2)
   b. term weighting
      (2) with weights based on terms in all training documents

III. Searching
A. Total computer time to search (cpu seconds)
1. retrieval time (total cpu seconds between when a query enters the system until a list of document numbers is obtained)
   Time = ~50,000 query-doc comparisons/minute when all vectors are pre-loaded. Currently, we compare ALL docs to each query. For ad hoc queries, the time to compare a query to the ~750,000 docs is ~12 minutes. For routing queries, the time to compare a query (new doc) to the profiles (50 profiles in each of 4 databases) is about 0.3 sec.
2. ranking time (total cpu seconds to sort document list)
   none; it's included in the times given in 1. Currently both comparisons and ranking are done in the same routine.
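A minimal sketch of the query construction described in II.A.2 and II.D.2 above, reusing the term_vecs and doc_vecs arrays from the earlier sketch plus an assumed term-to-row mapping. These names are illustrative, not Bellcore's actual code:

    import numpy as np

    def query_vector(terms, weights, term_index, term_vecs):
        """Ad hoc queries and routing case 1): weighted sum of the
        query's constituent term vectors."""
        q = np.zeros(term_vecs.shape[1])
        for term, w in zip(terms, weights):
            if term in term_index:        # terms absent from the database are skipped
                q += w * term_vecs[term_index[term]]
        return q

    def routing_vector(relevant_doc_ids, doc_vecs):
        """Routing case 2): sum of the vectors of known relevant documents."""
        return doc_vecs[relevant_doc_ids].sum(axis=0)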
B. Which methods best describe your machine searching methods?
   1. vector space model

C. What factors are included in your ranking?
   Hmm, not sure I get this. The similarity between a query and a document is the cosine between the query vector and the document vector; this cosine determines the rank. Term weights are used to determine the location of the query vector: the query is located at the weighted vector sum of its constituent terms. (A sketch follows item 8 below.)
   1. term frequency
      log(tf)*(1-entropy) term weight; so there's a tf part
   3. other term weights (where do they come from?)
      log entropy; weights come from the training docs (disk 1) for routing queries, and from both the training and test docs for ad hoc queries
   4. semantic closeness (as in semantic net distance)
      sort of, if you think of term vector locations as reflecting semantic associations. But these locations are automatically derived from the SVD analysis.
   8. information theoretic weights
      log(tf)*(1-entropy)
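A minimal sketch of the cosine ranking just described, again over the illustrative arrays from the earlier sketches; comparison and ranking happen in one routine, as noted in III.A.2 above:

    import numpy as np

    def rank_documents(q, doc_vecs):
        """Return document indices sorted by cosine similarity to query q
        (assumes a non-zero query vector)."""
        qn = q / np.linalg.norm(q)
        dn = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        cosines = dn @ qn                 # one score per document
        return np.argsort(-cosines)       # best match first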
IV. What machine did you conduct the TREC experiment on? How much RAM did it have? What was the clock rate of the CPU?
   SVDs run on a DEC5000 with ~400 meg; clock is ??? MHz.
   All else run on a SPARC 2 with 384 meg; clock is 25 MHz (I think).

V. Some systems are research prototypes and others are commercial. To help compare these systems:
1. How much "software engineering" went into the development of your system?
   Real hard to say. The system was built as a research prototype to look at many different issues. I'd say about 1-2 person-years, but this is much more than would have been required if the specs had been fixed at the beginning.
2. Given appropriate resources, could your system be made to run faster? By how much (estimate)?
   The existing tools were used pretty much as-is for TREC, even though they were developed to work with much smaller databases. Also, there are far more parameters and options than we typically use. Almost no effort went into re-engineering for large databases or into handling our current default parameters more efficiently.
   Time in query construction and retrieval is spent:
   1) seeking for vectors in a single large database of term and doc vectors. The database could easily be split.
   2) many calculations (scalings of various sorts) are done on the fly. This could be eliminated if one knew that users wanted to retrieve only documents, for example. Currently both terms and docs can be retrieved with the same programs, and scaling isn't done until we see what the user wants retrieved.
   3) all calculations are done in floating point. They could be done with integers.
   4) each ad hoc query was compared to EVERY document. This can be sped up by some document clustering algorithms that we have looked at. It can also be sped up tremendously by using more than one machine or by using a parallel machine; all vectors are independent, so it's trivial to split query processing.
   I'd guess that improvements of a factor of 2-5 could be obtained just by tweaking items 1), 2) and 3). Parallel query matching is the way to go. For example, we got speed-ups of 50-100 times using a MasPar for query storage and processing with no attempt to optimize.
   In terms of pre-processing and SVD analyses:
   1) about 10% of the time is spent in unnecessary i/o translation (because we've patched together pre-existing tools). Much of this will eventually go away.
   2) more than 50% of the time is spent in the SVD. These algorithms get better and faster all the time (the algorithm we now use is about 100 times faster than what we used initially). There are speed-memory tradeoffs among SVD algorithms, so time can probably be decreased by a factor of 2 or 3 by using more memory. Parallel algorithms will help some, but probably only by a factor of 2 or 3.
   These are one-time costs for relatively stable domains. We've found that new items can be added to the existing solutions without redoing the scaling for a while.
   Others ???
3. What features is your system missing that it would benefit by if it had them?
   Precision would probably be increased by many of the standard things--phrases, proper noun identification, a tokenizer (for dates, phone numbers, addresses, etc.), and some better handling of negation and union. Some form of literal string matching might be useful in combination with LSI for some types of queries.
   Others ???

System Summary and Timing
Queens College, CUNY

General Comments
The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge text sections, or manually building a knowledge base. Please do your best.

I. Construction of indices, knowledge bases, and other data structures (please describe all data structures that your system needs for searching)
A. Which of the following were used to build your data structures?
1. stopword list: yes
   a. how many words in list? 595
2. is a controlled vocabulary used? no
3. stemming
   a. standard stemming algorithms: yes. which ones? Porter's Algorithm
   b. morphological analysis: no
4. term weighting: yes
5. phrase discovery: no
6. syntactic parsing: no
7. word sense disambiguation: no
8. heuristic associations: no
9. spelling checking (with manual correction): no
10. spelling correction: no
11. proper noun identification algorithm: no
12. tokenizer (recognizes dates, phone numbers, common patterns): no
13. are the manually-indexed terms used? no
14. other techniques used to build data structures (brief description)
    A table of 396 manually created 2-word phrases. When these are identified in adjacent positions in documents or queries, they are used as additional index terms (a sketch follows).
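A minimal sketch of how such a phrase table might be applied at indexing time, assuming one 2-word phrase per line in a plain text file and whitespace tokenization. These are illustrative assumptions; the form does not describe the actual indexer at this level:

    def load_phrase_table(path):
        """Each line of the file holds one 2-word phrase, e.g. 'interest rate'."""
        with open(path) as f:
            return {tuple(line.split()) for line in f if line.strip()}

    def index_terms(text, phrase_table):
        """Single words, plus a joined term for each adjacent word pair
        found in the phrase table."""
        words = text.lower().split()
        terms = list(words)
        for a, b in zip(words, words[1:]):
            if (a, b) in phrase_table:
                terms.append(a + "_" + b)  # added as an extra index term
        return terms

The same routine would run over both documents and queries, so the phrase terms can match during retrieval.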
B. Statistics on data structures built from TREC text (please fill out each applicable section)
1. inverted index
   a. total amount of storage (megabytes): 378
   b. total computer time to build (approximate number of hours): 95+11+2 = 108 for 500 MB, clock time
   c. is the process completely automatic? Yes, if sufficient disk. Not in this experiment.
      if not, approximately how many hours of manual labor? 0.5
   d. are term positions within documents stored? No, but sentence positions are. Can modify to capture word positions.
   e. single terms only? Yes, except for I.A.14.
4. special routing structures (what?)
   See I.B.5, network node and edge files. Routing using the network node and edge files is straightforward.
   a. total amount of storage (megabytes): node file: 4x7.5; edge file: 4x4. The network is segmented into 4, because of insufficient RAM.
   b. total computer time to build (approximate number of hours): 40+5+1+4x0.2 = 46.8, starting from the text file.
   c. is the process completely automatic? Yes, if sufficient RAM and disk space.
   d. brief description of methods used
      1. Process (old) collection A.
      2. Process queries against collection A.
      3. Process new collection B as if they were queries--to make use of collection A statistics.
      4. Combine queries, the (old) dictionary and collection B into a network for retrieval.
5. other data structures built from TREC text (what?)
   1. Subdocument file
   2. Coded file
   3. Docid checking file
   4. Termid checking file
   5. Docnum file
   6. Termnum (dictionary) file
   7. Direct file
   8. Index to direct file
   9. Node file
   10. Edge file
   a. total amount of storage (megabytes)
      1. 481  2. 324  3. 7  4. 4  5. 11  6. 6  7. 372  8. 19  9. 4x14  10. 4x9
      The system was developed for experimental research, with flexibility to generate other data. Some of the files are not necessary for retrieval.
   b. total computer time to build (approximate number of hours)
      1. 1.5  2,3,4,5,6. 95  7,8. 11  9,10. 4x0.25 = 1
   c. is the process completely automatic? Yes, if sufficient RAM and disk space. For this experiment, no.
      if not, approximately how many hours of manual labor? 2
   d. brief description of methods used
      raw text --> subdocument file
      subdocument --> coded file, docid file, termid file, docnum file, termnum (dictionary) file. A Zipf-law program truncates the dictionary via user-assigned limits.
      coded, termnum --> direct file with index
      direct --> inverted file
      direct, inverted --> node, edge files

C. Data built from sources other than the input text
1. internally-built auxiliary files
   a. domain independent or domain specific (if two separate files, please fill out one set of questions for each file): phrase file
   b. type of file (thesaurus, knowledge base, lexicon, etc.): word pair
   c. total amount of storage (megabytes): 0.005
   d. total number of concepts represented: 396
   f. total computer time to build (approximate number of hours): 0 (this is a file created via editor).
   g. total manual time to build (approximate number of hours): 16
   h. use of manual labor
      (4) other (describe): Search for WSJ terminology in the library and from the topics.
2. externally-built auxiliary file: no

II. Query construction (please fill out a section for each query construction method used)
A. Automatically built queries (ad hoc)
1. topic fields used