APPENDIX C

  This appendix contains the supplemental forms filled out by each group about their system.  These forms
are meant to supplement the papers and contain a standarded and formatted description of system features and
timing aspects.


                                   435

System Summary and Timing

City University, London

General Cominents

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
he reasonably accurate. This sometimes will be difficult, such as getting total time f(ir document indexing of huge
text sections, or mailually building a kilowledge base. Please do your best.


I.      Construction of indices, knowledge bases, and other data structures (please describe all data structures that
        your system needs for searching)

        A. Which of die fi)lk)wing were used to build your data structures?
               1. stopword list yes
                      a. how Inaily words in list?
                                126 general stop words + 6 function words. Excluded from indexes and
                                (lueries. Semi-st()pw()rd list of 256 words and phra~es. These are not used
                                in query expansion following relevance feedl)ack unless they occur in the
                                original query.
               2. is a controlled v(~abulL.uy used'?   No. But see I C I I).
               3. stelnining yes
                      a. st~uid~'u'd stelninilig ~`LIgon' thins
                                A moderately weak suffixing algorithm Ilased on M. F. Porter, "An
                                algorithm for suffix stripping." Program, 14(3), Jul 1980, 130-137. We also
                                use a degree (~f British/American spelling contlation.
                      b. morphological ~alysis no
               4. terin weighting No. Query terms are weighted, l)ut not index terms.
               5. phrase discovery 110
               6. syntactic parsing no
               7. word sense disainbiguŁ~tion 110
               8. heuristic associations no
               9. spelling checking (with manual correction)    no
               10. spellin~ correction no
               11. proper noun identific~'ition al'~orithm no
               12. tokenizer (reco(2nizes dates, phone numbers, coininon pattenis) no
               13. are the manually-indexed terins used'!    ~()

        B. Statistics on &L'ita structures built ~in TREC text (please fill out each applicable section)
               1. inverted index
                      ti. total ainount of storage (megabytes)     810
                      b. t()t~'~l computer tilne to build (approxil nate number of hours) 43
                      c. is the ~1*()(:C55 completely Ł`iutomatic? yes
                      d. ~`u'e terin positions within d(~uments stored'? No. Insufficient disk space to do this.
                      C. single telins only'? Single terms and pre-specified phr~ses (see I C I 1) l)elow)

        C. Data built from sources other th~ui the input text
               1. interii~'illy-built auxiliary files One manually-Iluilt tile.
                      a.   domain independent or domain specific (if two separate files, please fill out one set
                                of (juestions for each file)  Loosely domain-dependent
                      b. type of file (thesaurus, knowledge base, lexicon, etc.)
                                Small quasi-thesaurus containing synonym classes, prefixes, go phrases,


                                                       436

                          st()pw()rds,  function words and semi-stopwords (see I A  1  a for
                          semi-st()pw()rds).
                    C. total ainount of StOra~LTC (megabytes) 0.013
                    d. total number of concepts represented About 1500
                    e. type of representation (fr~es, seinailtic nets, rules, etc.) Simple
                    f. total computer t~e to build (approximate number of hours)
                          Manually built. Structured at runtime, time negligible.
                    g. total mai1u(~l tilne to build (approximate nuinber of hours)
                          Perhaps 8 person.h()urs. Several iterations, based on fre(luency counts from
                          indexing runs, other similar files, TREC (lueries and documents.
                    li. use of manuŁ~l labor
                          (4) o~er (describe) Manually built using text editor
             2. extenially-built auxiliary file no

                                                                        lookup table

II. Query construction
             (please fill out a section for each query construction method used)

      A. Automatically built queries (ad hoc)
             1. topic fields used
                    Concepts. Other tields were tried but gave overall (though not uniformly) worse
                    results.
             2. total computer tilne to build query (cpu seconds)
                    0.02 secondsI(~uery to parse topic and extract
             3. which of the ft)llowin~ were used?
                    j. other (describe)
                          Concept terms processed and weighted.
                          Term weight = constant * log(((r+c)I(R.r+1-c)) I ((n.r+c)I(N-n.R+r+1-c)))
                          where N is the nullll)er of indexed documents, n the number of documents
                          containing the term, R the nunll)er of known relevant documents, r the
                          nullll)er of known relevant documents containing the term, c = 0.5. Weights
                          rounded to nearest integer.

      B. Manually constructed queries (ad h(~)
             1. topic fields used Any: searchers' free choice.
             2. average tilne to build query (minutes) About 40 minutes (often including trial searches)
             3. type of query builder
                    Six searchers were used. None was a domain expert. Two might be described as
                    experts on the search system.
             4. t(iols used to build query
                    c. other lexical tools (identify) Trial lookups giving fre(luency. Trial searches.
             5. which of the f~)llowing were used?
                    a. terin weighting As in II A 3 j above
                    b. Boolean connectors (AND, OR, NOT)
                          All available. AND and OR were used in a number of searches.
                    d. addition of terins not included in topic
                          (1) source of terms
                                 Searchers' world knowledge and terms from relevant documents
                                 found in trial searches.

      C. Feedback (ad hoc)
             1. initial query built by metliod 1 or meth(XI 2? Method 2
             2. type of person doing feedback
                    Searchers were Masters students in Information Science and two people working on
                    the TREC project.


                                        437

             3. aver~ige tilne to do complete feedback
                   a. CPU time (total CPU .`;econd.~ for all iterations) About 20 seconds
                   b. ck~k tilne from initial Construction of query to completion of final query (minutes)
                   About 20 minutes
             4. average number of iteratk)ns One
                   a. average nwnber of d(~uInCnts exainined per iteration About 20
             5. minimum number of iterations One
             6. maximum number of iterations One
             7. what determines the end of an iteration?
                   Searchers were rec~~mmended to stop after assessing 20 documents or when they had
                   found 10 relevant documents. These guidelines were not always adhered to.
             8. feedback methods used
                   b. automatic query expansion from relevant d(Kuments
                         (2) only top X tefins added (what is X)
                         Term p(x)l was all ~~uery terms + all non-semi-stop terms from relevant
                         documents. The former were given an R-value of R + 3 and an r-value of
                         r + 2.  Top 20 terms were used, selected in descending order of
                         (term~weight * r) and weighted using the formula given previously.
                                                       II II II
                         See section 11 A 3 j & I A I a for "R ,  r , semi-stop".

      D. Automatically built quefles (routing)
             1. topic fields used Concepts
             2. total computer tilne to build query (cpu seconds)
                   Depended strongly on number of known relevant documents in training set and their
                   length. Average perhaps 10 minutes.
             3. which of the followin(j were used in building the query?
                   a. terms selected from
                         (1) topic
                         (3) only docuineuL'; with relevance judginenis
                   b. tenri weighting
                         As in II A 3 j above except R = R + 10 and r = r + 10 for concept terms.

III. Searching

      A. Total computer tilne to search (cpu seconds)
             1. retrieval tilne (total cpu secon(1~ between when a query enters the system until a list of
                   document numbers are obtained)
                   Typical figure for 12-term search producing output list of 350,000 document
                   identifiers: 45 seconds (note that in an interactive production system we would use
                   a weight threshold which would reduce this by perhaps 50%).
             2. ranking time (total cpu seconds to sort d&~ument list)
                   For list above: about 65 seconds (weight threshold would reduce by 50-90%).

      B. Which metliods best describe your machine searching methods?
             2. probabilistic model

      C. What factors are included in your ranking?
             2. inverse document frequency
                   Inverse document frequency and relevance information when available (see above
                   for weighting scheme).

IV. What machine did you conduct the TREC experilnent on?
             Sun SPARC Server 41330 with Sun IPC as fileserver
             How much RAM did it have? 16 megabytes


                                       438

             What w~'L~ the clock rate ot' ~e CPU? Not specified. Sun claims 16 Mips.

V. Some systems are research prototypes and others are com'nerciŁ~.
             To help compare tliese systems:

             1. How much "soflware en(~iI'eering" went into the development of your system?
                    System is non-commercial. It has undergone continual modification since 1982 to
                    meet the requirements of a number of different research projects, mainly on
                    end-user bibliographic search mg.
             2.  Given appropriate resources, could your system be made to run faster? By how much
                    (estimate)?
                    Faster hardware would of course increase speed. Main bottleneck is disk & network
                    `10. Very large amounts ~`f RAM (of the order of a gigabyte per process--or could
                    be shared between processes searching the same database) would greatly reduce `10
                    dependence. On the software side, earlier versions of the system were often optimised
                    f('r speed at s(~me cost in added complexity and reduced flexibility. This optimisation
                    has been removed from the version produced for TREC, partly because interactive
                    searching by general users was not envisaged.
                    It is impossible to give definite estimates on speed improvements. It would not be
                    unreasonable to expect an order of magnitude improvement within current hardware
                    and software c(~nstraints.
             3. What features is your system missing that it would benefit by if it had them?
                    Given enough disk we would have stored positional information in the indexes, and
                    probably used it to m(~dify document weights, perhaps by giving weight bonuses for
                    term proximity. This would have increased inversion storage overheads to a little
                    over 1(~~% of bibliographic tile size. (This is not really a "missing feature," because
                    the system d(~s have the capability.)
                    We might have considered some form of weight adjustment for document length.
                    This would involve a modification of the index structure which might just have been
                    feasible within the disk constraints.
                    Other possibilities worth investigating include phrase discovery and term dependency
                    statistics.


                                        439

System Summary and Timing

University of Pittsburgh

General Commenis

The timings should be the time to replicate rujis from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for document indexing of huge
text sections, or manually building a knowledge base. Please do y~~ur best.


I.      Construction of indices, knowledge b~'tses, and other data structures (please describe all data structures that
        your system needs for searching)

        A. Which of the ft)llowing were used to build your data structures?
              1. stopword list
                     a. how many words in list'? 2,529 words on the list, including digital (0-9).
              3. stemmin~2
                     a. standard stemming algorithms
                            We use Porter stemming algorithm and it was implemented by C. Fox, using
                            C programs.
              4. telin weighting
              5. phrase discoveiy
              6. syntactic parsin&'
              7. word sense dis~'unbi(2uation
              8. heun's+~ic ~`L';sociations
              9. spellint~ checking (with manual correction)
              10. spelling correction
              11. proper noun identification algoritlim
              12. tokenizer (recognizes dates, phone numbers, common patterns)
              13. are the manually-indexed terins used'!
              14. other techniclues used to build dŁ~ta structures (brief description)

        B. Statistics on data structures built from TI{EC text (please fill out each applicable section)
              1. inverted index
                     a. total ainount of 5tOra~~~C (me~abytes)
                            For the storage space information, only the data on dLsk one is available.
                            Following table provides the data in Megabytes unit.

                                       DOE      AP      ZIFF    WSJ

                     Inverted tiles    162.3    199.8   143.7   223.4

                     Indexed tiles              2.2     2.4     2.1

                     Address tiles     4.3      1.7     1.7     2.6

                            Note: * Data on FR is also n(Jt available (loaded into tapes);
                                     * Address tiles are the indexed tiles which include document
                                     numl)ers and their offspring in the text tiles where the document
                                     stored.
                     b. total computer time to build (approximate number of hours)
                            Please refer t(~ TahIe 1 in (~ur paper.
                     c. is the pr~icess completely automatic? yes


                                            440

                    d. Łtre teun po.~iti()n.'; witliiu d(~umeflts ~tored? no
                    e. .`;ingle tennN only? yes

      C. Data built from sources other thaii the input text --no

II. Query construction
             (please fill out a section fi)r each query construction method used)

      A. Automatically built quenes (ad hoc)
             1. topic fields used
                    Training queries: Title and concepts are used. However, the nationality might l)e
                    included if it's necessary to meet the narrative item.
                    Routing queries: The routing queries are the final converged queries from the
                    training queries. There is 110 moditication.
                    Ad hoc queries: Title, concepts are used, and some keywords from narrative items
                    are added.
             2. total computer time to build query (cpu seconds)
                    Computing time to l)uild queries is not availaille.
             3. which of the ft)llowiIl~ were used,?
                    a. term weighting with weights bŁ~sed on te~s in topics
                          Term weighting for queries is assigned ily the system. It is our research
                          t(~pic on term weights modification. Note the stemming algorithm used on
                          document processing was aLso used on query terms.
                          Training queries: All term weights were assigned automatically hy the
                          system and also adjusted l)y the system using feedl)ack information.
                          Routing queries: The term weights are those from the last generation of the
                          training queries. No changes are applied.
                          Ad hoc queries: For one query individual the term weights were assigned
                          manually hy the researchers.  The other query individuals' term weights
                          were generated hy the system. (Note: our system uses 10 query individuals
                          searching documents simultaneously.) Also the term weights were adjusted
                          hy using the feedl)ack information.

      C. Feedback (ad hoc)
             1. initial query built by method 1 or Ineth(xl 2?
                    Initial queries were l)uilt Ily method I (automatic).
             2. type of ~C~5()fl doing feedback Evaluation is done l)y ()U~ researchers.
             3. average time to do complete feedback Please refer to TaIlle 2 in our paper.
             4. average number of itera(i()ns 3 iterations on average.
             5. minimum number of iterations 0
             6. maximum nuinber of iterations 9
             7. what determines the end of an iteration?
                    No more relevant documents are retrieved or it is not valuahle to do more feedl)ack
                    due to the time constraint.
             8. feedback methods used
                    Query terms are automatically modified hy the system using the genetic algorithm
                    in our system.

III. Searching

      A. Total computer time to search (cpu seconds)
             1. retrieval time (total cpu seconds between when a query enters the system until a list of
                    document numbers are obtained) Please refer to TahIe 2 in our paper.
             2. ranking time (total cpu seconds to sort d(lcument list) not availaille


                                         441

      B. Which rneihod.~ best descnbe y~~ur machine seŁirching methods?
             1. vector space m(xiel A distance function (Lp metric) is used as similarity measurement.

      C. What factors ~ire included in your railking?
             15. other (.`;pecify)
                    Document ranking bases on the distance. The shorter the distance, the high the
                    rank. That is, the document with the shortest distance is put on the top of the list.

IV. What machine did you conduct the TREC experilnent on?
             How much RAM did it have?
             What was the clock rate of the CPU?
             Two types of systems are used.
             Sun-670: 32 MB RAM and 40 MHz CPU clock rate;
             Sun SPARC/IPC: 24 MB RAM and 25 MHz clock rate.

V. Some systems are research prototypes and others are commercial.
             To help compare these systems:

             1. How much "software engineerin(~" went into the development of your system?
             2.  Given appropnate resources, could your system be made to run f~';ter? By how much
                    (estimate)?
                    If our system can be implemented on a parallel machine, the retrieval could be 10
                    times faster.
             3. What features is your system missin{' that it would benefit by if it had them?
                    There are a lot (~ parameters which can l~ adjusted to make our system more
                    flexible and more adaptive. We need to build a go(~ user interface on which several
                    parameters can be controlled and manipulated by the users.


                                        442

System Summary and Timing

Cornell University
Run 1: Single term automatic ad hoc run (global/local match)

General Comments

The timings should be the time to replicate fulls trom scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time for d(lcument indexing of huge
text sections, or manually building a kilowledge base. Please do your best.


I.     Construction of indices, knowledge bases, and other data structures (please describe all data structures that
       your system needs for searching)

       A. Which of the following were used to build your data structures?
              1. stopword list
                     a. how many words in list'? 570
              2. is a controlled v(~abulary used'? no
              3. stelnining yes
                     a. standard steniming algorithms
                            which ones.? SMART
              4. tenn weighting
                     In docs, tf * idt; cosine n(~rmalization (ntc)
                     In (lueries, tf * idt; Cosine normalization (ntc)
                     In sentences, tf * idI, no normalization (ntn)
              5. phrase discovery
              6. syntactic parsing
              7. word sense disainbiguation
              8. heuristic associations
              9. spelling checking (with manual correction)
              10. spelling correction
              11. proper noun identification algorithm
              12. tokenizer (recognizes dates, phone numbers, COmlflOfl pattenis)
              13. are the manually-indexed terins used'?
              14. other techniques used to build data structures (brief description)

       B. Statistics on data structures built from TREC text (please fill out each applicable section)
              1. inverted index
                     a. total amount of storage (megabytes) 690
                     b. total computer time to build (approximate number of hours)
                            4.7 hours to create doc vectors from text
                            0.7 hours to reweight doc vectors and produce inverted file
                     c. is the ~f(lcC55 completely automatic? yes
                     d. are term positions within d(~uments stored'? no
                     e. single terms only'? yes

              5. other dat~t structures built from TREC text (what'?)
                     Map from d(lc'id to text location (also gives title for each doc)
                     a. total amount of st()r~y&1c (megabytes) 68 MI)ytes.
                     b. total computer time to build (approximate number of hours)
                            Time to create included in inverted file creation al)ove.
                     c. is the ~f(~C55 completely automatic? Automatic


                                         443

                     d. brief descflpti()u of methods used

                other data structures built from TREC text (what?)
                      Map from internal concept to token string
                     a. total (i[n()unt of st()r~ge (megabytes) 18 Ml)ytes
                     b. total computer tizue to build (approxu nate number of hours)
                             Time to create included in inverted file creation above.
                     C. is the pr(~ess completely automatic? Automatic
                     d. brief desaiption of methods used

      C. Data built from sowces other th~~ the input text
                      None, other than st()pword file.

II. Query construction
             (please fill out a section for each query construction method used)

      A. Automatically built queries (ad hoc)
             1. topic fields used Topic, Nationality, Narrative, Concepts, Factors, Description
             2. toLd computer tilne to build query (cpu seconds) 1.5 seconds for all (lueries
             3. which of the ft)llowing were used?
                     a. term weightin(~ with weights bŁLsed on terms in topics (idf)

III. Searching

      A. Total computer t~e to search (cpu sec()nds)
               1465 seconds (includes retrieval + ranking + indexing 50() docs per (luery).
             1. renieval tilne (total cpu sec()nds between when a query enters the system until a list of
                     document numbers are obtained)
             2. ranking time (total cpu seconds to sort d(icument list)

      B. Which methods best describe your machine searching methods?
             1. vector space rn(xlel

      C. What factors are included in your raiiking?
             1. term frequency
             2. inverse d(~ument frequency
             7. proxilnity of terms within sentence ne~ded for local sim.
             8. information theoretic weights
             9. doc~ent length

IV. What machine did you conduct die TREC experilnent on? Sun SPARC 2
             How much RAM did it have? 64 MB
             What wL~ flie clock rate of the CPU? 4(~ MHz

V. Some systems are research prototypes and others are commercial.
             To help compare these systems:

             1. How much "software engineerin(!" went into the development of your system?
                     About 3 person-years fi)r the SMART system itself
             2.   Given appropriate resources, could y~~ur system be made to run f~~ter? By how much
                     (estimate)?
                     Of course!
                     Retrieval local similarity needed to index 5O(~ docs per (luery; this could all be done
                     in advance if a single l(~al appr()3ch had been decided on.


                                         444

      Reduce retriev~l tinie by a factor of 5.
      A 6 machine di~tri1)uted version of SMART should be faster by a factor of 3 for
      1)0th indexing and retrieval.
3. What featuve.~ is your system missing that it would benefit by if it had them?
      Distributed versi~,n has not fully been implemented yet.


                           445

System Summary and Timing

Cornell University
Run 2: Phrase automatic ad hoc (Cornell Global/Local)

General Comments

The timings should be the lime to replicate runs from scratch, not including Irial runs, etc. The times should also
be reasonably accurate. This Solnetimes will be diff~cult, such as getting total time for document indexing of huge
text Sections, or mailually building a kilowledge base. Please do your best.


I.     Construction of indices, knowledge bases, and oIlier data structures (please describe all data structures that
       your system needs for searching)

       A. Which
              1.

               of the following were used 10 build your data structures?
               stopword list
                    a. how many words in list,! 570
             2. is a conlrolled v(~abulary used?
                    Not t')r single terms.
                    A phrase list was automatically constructed from phrases occ~rflng 25 times or more
                    in the first d(~ set (Dl). Only those phrases were used.
             3. stemming yes
                    a. staildard stemming algorithms
                           which ones? SMART
             4. tenn weighting
                    In docs, tf * idt; cosine normalization over length of single terims (ntc)
                    In queries, tt. * idi; cosine n(~rmalization over length of single, terms (ntc)
                    In sentences, tt. * idt; no normalization (ntn)
                    Phrases weighted using their natural tt~idl, cosine normalized by length of single
                    terms, and divided by sqrt(2). [Phrase match worth 0.5 of single term match]
             5. phrase discovery
                    a. what kind of phrase?

                           Adjacent non-stopwords, components stemmed, that occurred at least 25
                           times in the Dl document set.
              6. syntactic parsing
              7. word sense disainbiguation
              8. heuristic associations
              9. spelling checking (with mailual correction)
              10. spelling correction
              11. proper noun identification algorithm
              12. tokenizer (recognizes dates, phone numbers, common pattenis)
              13. are the mai1ually~indexed terms used?
              14. other techniques used 10 build data structures (brief description)

       B. Statistics on data structures built from TREC text (please fill out each applicable section)
              1. inverted index
                    a. total amount of storage (me(Jabytes) 840
                    b. total computer lime to build (approximate number of hours)
                           9.7 hours to create doc vectors from text
                           0.9 hours to reweight doc vectors and produce inverted file
                    c. is the pr(~ess completely automatic? yes
                    d. are tenn positions within d(~uments stored? no


                                         446

                    e. single terins only'? no
              5. other data structures built from  TREC text (what'?)
                    Map from d(~cid to text location (also gives title for each doc)
                    a. totŁ~l ~`un()unt of storage (megabytes) 68 Ml)ytes.
                    b. tot~Ll computer time to build (approximate number of hours)
                           Time to create included in inverted tile creation ahove.
                    C. is the pr(~ess completely automatic? Automatic

                other data structures built from TREC text (what'?)
                    Map from internal concept to token string
                    a. total ainount of storage (megabytes) 25 Ml)ytes
                    b. total computer time to build (approximate number of hours)
                           Time to create included in inverted tile creation ahove.
                    C. is the pr(~ess completely automatic'? Automatic

                other data structures built from TREC text (what'?)
                    Phrase dicti~~nary (controlled vocal)ulary)
                    Phrases were adjacent n()n-st()pw()rds, components stemmed, that (~curred at least
                    25 times in the Dl document set.
                    a. tot~'tl amount of stora~e (me~abytes)  14 Ml)ytes to store dictionary.
                    b. total computer time to build (approximate number of hours)
                           It took 5.8 CPU hours to index Dl, tinding ~ phrases and their
                           collection stats. Of those phrases l58,(~(~(~ occurred at least 25 times.
                    C. is the pr(~ess completely automatic'?

       C. Data built from sources other than the input text
              None, other than st()pword tile.

II. Query constructirm
              (please fill out a section for each query construction method used)

       A. AutomaticŁtlly built quenes (ad hoc)
              1. topic fields used Topic, Nationality, Narrative, Concepts, Factors, Description
              2. total computer time to build query (cpu seconds) 2.7 seconds for all 50 (lueries
              3. which of the following were used?
                    a. term weighting with weights b~'i~ed ~ terms in topics (idf)
                    b. phrase extraction from topics yes, using controlled list of phr~ses

III. Searching

       A. Total computer time to search (cpu seconds)
              2405 seconds (includes retrieval + ranking + indexing 5(~~ docs/query).
              1. retrieval time (total cpu seconds between when a query enters the system until a list of
                    document numbers are obtained)
              2. ranking time (total cpu seconds to sort d(~ument list)

       B. Which Ineth(XIs best describe your machine searching methods'?
              1. vector space m(XIel

       C. What factors are included in your railking?
              1. tenn frequency
              2. inverse d(~ument frequency
              7. proximity of terms fi)r phrases and fi)r local similarity hetween sentences
              9. document length


                                                    447

IV. What machine did you conduct tlie TREC expeI.ilnellt on? Sun SPARC 2
             How mUch RAM did it have? 64 MB
             What wa~~ the clock rate of flie CPU? 4(~ Mhz

V. Some Systems are research prototypes and others are commercial.
             To help compare these systems:

             1. How mu~ "software engineen.n(7" went into the development of your system?
                     AI)()ut 3 per.%()n-years f~)r the SMART system itself
             2.  Given appropriate resources, could your system be made to run faster? By how much
                    (estimate)? Of course!
             3. What features is your system missing that it would benefit by if it had them'?


                                        448

System Summary and Timing

Cornell University
Run 3: Automatic routing (Cornell Ide f~edback)

General CominenLs

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The ti'nes should also
be reasonably accurate. This solnetilnes will be difficult, such as gettilig total time for document indexing of huge
text sections, or `naiiually buildilig a ~owledge base. Please do your best.

I.      Construction of indices, knowledge b~'L~s, and other data structures (please describe
        your system needs for searching)

all data structures that

      A. Which of the following were used to build your data structures?
             1. stopword list
                    a. how many w(Mds in list? 570
             2. is a controlled v(~abulary used? no
             3. stei~ing  yes
                    a. st~uidard stemming algorithms
                             which ones? SMART
                    b. mo~hological ~ui~ilysis
             4. tenn wei~hting
                    In docs + (1UeriC5, tf ~ idt; cosine normalization (ntc) (in docs idf is hased on
                    c(~llection frequency within d(~ set Dl only)
             5. phrase discovery
             6. syntactic pL~5ing
             7. word sense dis(~Tlbiguati()n
             8. heuristic associations
             9. spelling checking (with manual correction)
             10. spelling correction
             11. proper noun identification Ł..Llgori~m
             12. tokenizer (recognizes dates, phone numbers, coi~on pattellis)
             13. are the maimally-indexed terms used?
             14. other techniques used to build ~ita structures (brief description)

      B. Statistics on ~ita structures built from TREC text (please fill out each applicable section)
             1. inverted index
                    a. total amount of storage (megabytes) 275
                    b. total computer t~e to build (approxilnate number of hours)
                             1.9 hours (not including time to index Dl to o1)tain collection frequency info)
                    c. is the pr(~ess completely automatic? yes
                    d. are term positions within documents stored? no
                    C. sin(TlC terins only? yes
             5. other data structures built from TREC text (what?)
                    Map from d(K:id to text location (also gives title for each doc)
                    a. total ainount of storage (megabytes) 24 MI)ytes.
                    b. total computer time to build (approx~ate number of hours)
                             Time to create included in inverted file creation al)ove.
                    c. is the pr~~ess completely automatic? Automatic

                other data structures built from TREC text (what?)


                                          449

                     Map from internal concept to token string
                     a. total (linount of stora~'e (me&'abytes) 13 M1)ytes
                     b. total CompUter time to build (approximate number of hours)
                           Time to create included in inverted tile creation f~~r Dl
                     C. is the pr(~ess completely automatic? Automatic

      C. Data built from sources other thaji the iuput text
             None, other than st()pword tile.


II. Query construction
             (please fill out a sectioll l;()r cich query construction method used)

      D. Automatic~'tlly built queries (routing)
             1. topic fields used Topic, Nationality, Narrative, ConcepL~, Factors, Description
             2. total computer time to build query (cpu seconds) 3(tti
             3. which of the ftillowing were used in building the query?
                     a. terms selected from
                           (1) topic
                           (3) only documents with relevance judgments
                     b. term weighting
                           (1) with weights based on terms in topics
                           (2) with weights kised oil terms in all training documents
                           (3) with weights based on terms from documents with relevance judgments
                     1. expansion of quelies using previ()usly-constructed data structure (from part I)
                           (1) which structure? 3(~ hest terms from relevant docs


III. Searching

      A. Total computer time to se~irch (cpu seconds) 293 seconds (includes retrieval + ranking).
             1. reliieval time (total cpu scc()nds between when a query enters the system until a list of
                     document numbers uc obtained)
             2. ranking time (t()tŁd cpu seconds to sort d()(:ument list)

      B. Which methods best describe your machine searching methods'!
             1. vector space m(KIel

      C. What factors are included ill ~()w. railking'?
             1. term frequency
             2. inverse d(icument frequency
             9. document length

IV. What machine did you conduct the TRLC experiment on'? Sun SPARC 2
             How much RAM did it have'! 64 MB
             Wh~it was the clock rate of the CPU? 4(J Mhz

V. Some Systems are rese(~ch prototypes and others are commercial.
             To help compare these systems:

             1. How much ~software enL'ineering~' went into the development of your system?
                     AI)out 3 person-years for the SMART system itself
             2.   Given appropriate resources, could your system be made to run f~ster?  By how much
                     (estimate)? Of course!


                                            450

3. ~hat feature~ j~ ~()U~ .~y~teIn mi~~ing thai it would beiiefit by if ii had fliem?


                            451

System Summary and Timing

University of California, Berkeley


General Cominents

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time fi)r d(~ument indexing of huge
text sections, or manu(dly building a knowledge base. Please do your best.

I.      Construction of indices, knowledge bases, ~md other data structures (please describe all
        your system needs for seŁ~chin~)

data structures that

       A. Which of the f~)ll()wing were used to build your data structures?
              1. stopword list Yes, augmented SMART stoplist
                     a. how many words in list? AI)()ut 6(~0
              2. is a controlled v(~abul~~y used? no
              3. steI~ing
                     a. st~~d~'trd stelnining (~g()n.tl1ms yes
                            which ones? SMART system (Version 10) stemmer
                     b. morphological (uldysis flOlle
              4. tenn weighting
                     yes. Weights deterniin~d fi~()m various frequency statistics by logistic regression
              5. phrase discovery none
              6. syntactic p~~sin~ none
              7. word sense dis~~nbiguati()n flofle
              8. heuristic associations none
              9. spelling checking (with inahual colTection) none
              10. spelling correction none
              11. proper noun identification Łdgon.thm n(~ne
              12. tokenizer (recognizes dates, phone numbers, coininon pattenis) none
              13. are the m~~1ually-indexed terms used? no
              14. other techniques used to build d~ta structures (brief description)

       B. Statistics ()~ d~ta structures built fiom TREC text (please fill out each applicable section)
              1. inverted index
                     a. total ~unount of storŁige (megabytes)
                            Ranges from 7E~ to 18(~ ml) for each of the five collections
                     b. total computer time to build (approximate number of hours)
                            Ranges from 6 to 14 hours t')r each of the five collections
                     c. is the pr(~ess completely automatic? yes
                     d. are term positions within documents stored? no
                     C. single terms only? yes

       C. Data built from s~)urces other thui the input iex~ --no


II. Query construction
              (please fill out a section ft)r each query construction method used)

       A. Automatically built queries (ad hoc)


                                          452

               1. topic fields used all
               2. toLtl computer time to build query (cpu seconds) around 3 seconds total per (luery
               3. which of the f~)llowin(~' were used?
                      a. term weightiug with weights b~~';ed on te~s in topics
                      j. other (describe)
                            Al)5()Iute and relative frequency of each stem in query were used to weight
                            the stems, using a f~)rmula ol)tained l)y logistic regression from the WSJ
                            relevance data.


III. Searching

       A. Total computer tilne to 5CL'IŁCh (cpu seconds)
               1. retrieval t~e (total cpu seconds between when a query enters the system until a list of
                      document numbers ~`ire obtained)
               2. ranking time (tot~Ll cpu seconds to sort d~~ument list)

       B. Which methods best describe your machine seŁu-chi'ig methods?
               2. probabilistic model
                      Yes, prol)al)ilistic searching l)ased on linked dependence assumption and two stages
                      of logistic regression as descril)ed in Proceedings ACM/SIGIR Copenhagen June
                      1992.

       C. VVhat factors are included in your rL'U~king?
               1. telin frequency
               2. inverse d()(2ument frequency
               3. other terin weights (where do they come from'?) see 15. l)elow
               5. position in document stem occurrence frequencies in titles were douhled in some collections.
               9. docwnent length
               15. other (specify)
                      variahles used were:
                            al)s()lute and relative frequency of stem in query
                            al)solute and relative frequency of stem in document
                            inverse document frequency of stem in collection
                            glohal relative frequency of stem in all document texts
                            document length measured in stem-occurrences.

IV. What machine did you conduct the TREC experilnent on'?
               Three ditTerent machines:
                      1. DECStation 5(K)()1125 with 16 Megal)ytes RAM for most work.
                      2. DECStation 5(~X~125 with 64 Megal)ytes RAM for a little.
                      3. IBM Model 3O9(~ for the logistic regression analysis.
               How much RAM did it have?
               What was the clock rate of the CP~~?
                      25 MHz for the 16 Megal)yte DECStation. This was used for the timed retrieval
                      runs.
                      40 MHz f~)r the 64 Megabyte DECStation.

V. Some systems are research prototypes and others ~`u-e commercial.
               To help compare these systems:

               1. How much "software engineering't went into the development of your system?
                      None except for the novel two-stage probabilistic logic. The Berkeley system is an
                      experimental prototype only, programmed as a minimal modification of the SMART


                                           453

       system.
2.  Given appropliate re%.~ource.%.', could your system be made to run f~ter? By how much
       (estimate)?
       Yes, see discussion in SMART's documentation: SMART is "not strongly optimized
       f(~r any ofle particular use." The Berkeley system has roughly the same efficiency
       charactenstics as SMART.
3. What features is your system missing that it would benefit by if it had them?
       Would pr(~l)al)ly l)eflefit from a conflator, a thesaurus, a disaml)iguator, phrase
       discovery, stem proximity detection, etc. The Berkeley system is a l)are-bones design,
       intended only to explore the workal)ility of staged logistic regression.


                          454

System Summary and Timing

Universitaet Dortmund
Single term automatic ad hoc run (Fuhi. 1ea~ing)

General CoininenLs

The timin~,'s should be the time to replicate runs from scr~itch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be dimcult, such as getting total time f~)r d(~ument indexing of huge
text sections, or mailually building a kliowledge base. Please do your best.


I.      Construction of iiidices, knowledge b~~es, and other data structures (please describe all data structures that
        your system needs for se~'irchin~)

        A. Which of the following were used to build your data structures?
               1. stopword list
                      a. how many words iii lisi? 57(~
               2. is a controlled v(~abuI~Lry used? no
               3. ste1111nifl~,' yes
                      a. st.'uid~ird ste[nlnint? aigon. thins
                            which ones? SMART
                      b. morphological aiialysis
               4. tenn weighting
                      In docs, linear c()nll)inati()n of several factors
                      In ~iueries, tf * i(1t; COsIfle nornlalizati(~n (ntc)
               5. phrase discovery no
               6. syntactic PlrS~Il~r1 flO
               7. word sense dis~~nbiguation no
               8. heuristic ass&~iatk)ns no
               9. spelling checking (with manual correction) no
               10. spelling correction no
               11. proper noun identification algorithm no
               12. tokenizer (recognizes dates, phone numbers, coininon patterns) no
               13. are the maiiually-indexed terms used? no
               14. other techniques used to build data structures (brief description)
                      Coefficients for linear coml)inations used in weighting were determined automatically
                      using QI,Dl4udgrnenis of QI (~n Dl. This to~~ 1.7 hours (not including 2.6 hours
                      to index Q1,DI).

        B. Statistics on data structures built. from TREC text (please fill out each applicable section)
               1. inverted index
                      a. to~Ll ~unount of stor~ige (me~~Lbytes) 69(~
                      b. total computer time to build (approximate number of hours)
                            4.7 hours to create doc vectors from text
                            1.7 hours to reweight doc vectors and pr(KIuce inverted flle
                      c. is the pr(~ess completely ~`~utomatic? yes
                      d. ~ term position5 within d(~uments stored? no
                      e. single terms only? yes
               5. other data structures built from TREC text (what?)
                      Map from d(~id to text location (als(~ gives title for each doc)
                      a. total ainount of stor~ige (megabytes) 68 MI)ytes.
                      b. total computer time to build (approximate number of hours)


                                           455

                              Tinie to creite included in inverted file creation al)()Ve.
                      C. is the pr~~ess completely automatic'? yes

                 other data structures built from TREC text (what'?)
                      Map froni internal concept to token string
                      ~`i. total ~`un()uIlt of stor~'ige (ineg~'ibytes) 18 MI)ytes
                      b. total computer tilue to build (approx~ate number of hours)
                              TIme t(~ create Included in inverted file creation ahove.
                      C. is the pRXess completely Ł`~utomatic? yes

        C. Data built from sources other th~'ui the input text
               None, other than .`;t()pw()rd file.

II. Query construction
               (please fill out a section for eŁ~h query construction method used)

        A. Automatic(~ly built queries (ad hoc)
               1. topic fields used T()pk, Nationality, Narrative, Concepts, Factors, Description
               2. total computer tilne to build query (cpu seconds) 1.5 seconds
               3. which of the f()llowiIl(T were used'?
                      a. term weighting with weights b~'~sed on tenus in topics (idf)

III. Searching

        A. Total computer tilne to seardi (cpu seconds) 383 seconds (includes retrieval + ranking).
               1. retrieval tilne (to~1 cpu seconds between when a query enters the system until a list of
                      document numbers are obtained)
               2. ranking tune (tot~d cpu seconds to sort d(~ument list)

        B. Which methods best describe your machine se~'irching methods'?
               1. vector spŁ'lce m(idel
               2. probabilistic model

        C. What factors are included ill your ranking?
               1. tenn frequency
               2. inverse d(icument frequency
               8. infonnation theoretic weights
               9. docuinent length

IV. What machine did you conduct the TREC experilnent on'? Sun SPARC 2
               How much RAM did it have? 64 MB
               What was the clock rate of the CPU? 40 Mhz

V.  Some Systems are researdi prototypes and otliers are commercial.
               To help compare these systems:

               1. How much "software engineerinLY" went into the development of your system?
                      Aliout 3 person-years f~)r the SMART system itself
                      2 person-weeks ft)r the Fulir weighing code
               2. Given appropriate resources, could your system be made to run fŁ'ister?        By how much
                      (estimate)'?
                      Of course!
                      A 6 machine distril)uted version of SMART should he faster hy a factor of 3 for
                      hoth indexing and retrieval.


                                             456

3. What feature~ i.~ your ~ysIe1n Ifl~~S~fl(tT that it would beuefit by if it had them?
      DI.%tflI)uted version has not fully I)een implemented yet.


                         457

System Summary and Timing

Universitaet Dortmtind
Phrase automatic ad hoc (Fuhr leariling)

General Cominent.~

The fiming.~ .~hould be the time to replicate run~ from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be ditTicult, such as getting total time ft)r d()(:ument indexing of huge
text sections, or mailually building a ~owledge base. Please do your best.


I.      Construction of indices, kfl()wlCdLTe bases, and other data structures (please describe all data structures that
        your system needs fi)r seŁircliin~)

        A. Which of the following were used to build your data structures?
               1. stopword list
                      a. how many words in list? 570
               2. is a coiltrolled v(~abul~iry used?
                      Not f~)r single tern's.
                      A phrase list was aut(~rnatically c(~nstructed from phrases occurring 25 times or more
                      In the first doc set (Dl). ouly those phrases were used.
               3. ste1~ing yes
                      a. st~d~ud stelnining algorithms
                             which ones? SMART
                      b. morphological alialysis
               4 tenn weighting
                      In docs, linear c()ml)inati()n of several factors
                      In (lueries, tf * idf, c(jsine normalization (ntc)
               5. phrase discovery
                      a. what kind of phrase?
                             Adjacent n(Jn-st()pwords, comp()nenLs stemmed, that occurred at least 25
                             times In the Dl document set.
                      b. using statistical meth(ids
                      c. using syiltactic methods
               6. syiltactic p(~sing no
               7. word sense disambiguation no
               8. heuristic associations no
               9. spelling checking (with manual correction) no
               10. spelling correction ~
               11. proper Iloull identification algorithm no
               12. tokenizer (reco(2nizes dŁttes, phone numbers, cominon patterns) no
               13. ~ire the m~~ually-indexed terms used? no
               14. other techniques used to build data structures (brief description)
                      Coefficients f~)r linear c(~mhinati()Ils used in weighting were determined automatically
                      using QI,DI,judgments of QI (~n DI. This took 2.4 hours (not including 5.6 hours
                      to index QI,Dl).

        B. Statistics on data structures built from TREC text (please fill out each applicable section)
               I. inverted index
                      a. total amomit of storage (megabytes) 840
                      b. total computer time to build (approximate number of hours)
                             9.7 hours to create doc vectors from text


                                           458

                                 2.9 hours to reweiglit doc vectors and pr(KIuce inverted tile
                     C. is the pr(~ess coiiipletely aut()InŁ'Itic? yes
                     d. ~`ire term positions wi~in d(X'ulnellts stored? no
                     e. single terins only? Ilo
              5. other data structures built from TREC text (what?)
                     Map from d(~id t(~ text location (also gives title f(~r each dE)c)
                     a. total ainoulit of storuge (niegabytes) 68 Ml)ytes.
                     b. total computer tilne to build (approxu nate number of hours)
                                 Time t(~ create included in inverted tile creation al)()ve.
                     C. is the pr(xess completely aut()JnŁitic? yes

                 other data structures built from TREC text (what?)
                     Map from internal concept to t~~ken string
                     a. total ~unount of storŁige (megabytes) 25 Ml)ytes
                     b. total computer tilne to build (approxiznate number of hours)
                                 Time to create included in inverted tile creation ahove.
                     C. is the pr(~ess completely automatic? yes

                 other data structures built from TREC text (what?)
                     Phrase dictionary (controlled v(~al)ulary)
                     Phrases were adjacent n()n-stopw()rds, components stemmed, that occurred at least
                     25 times in the Dl document set.
                     ~i. total unount of stor~ge (me~abytes) 14 Ml)ytes to store dictionary.
                     b. total computer tillie to build (approx~~'ite number of hours)
                                 It took 5.8 hours to index Dl, finding ~ phrases and their collection
                                 stats. Ot those phrases l58,(~()() (~curred at least 25 times.
                     C. is the ~r(icC55 completely automatic?

       C. Data built from source5 other thul the input text
              None, (~ther than st()pw()rd tile.

II. Query construction
              (please fill out a section for each query construction method used)

       A. Autx)lnatically built queries (ad hoc)
              1. topic fields used Topic, Nationality, Narrative, Concepts, Factors, Description
              2. total computer tilne to build query (cpu seconds) 2.7 seconds
              3. which of the f~)llowing were used?
                     a. term weighting with weights b~~~ed on terms in topics (idf)
                     b. phrase extraction from topics yes, using controlled list of phra~es

III. Searching

       A. Tot~~ computer tilne to search (cpu seconds)
              374 seconds (includes retrieval + ranking).
              1. retrieval tilne (total cpu seconds between when a query enters the system until a list of
                     document numbers al-c obtained)
              2. railking time (total cpu seconds to sort d('cument list)

       B. Which methods best describe y~~ur machine searching methods?
              1. vector space m(XIel
              2. probabilistic model

       C. What factors cLrC included in your ranking?


                                                   459

              1. tenn frequency
              2. inverse d(~urnent frequency
              7. proxilnity of terins (for phra.%'e'%*.)
              8. infoi~ation theoretic weiglit.~
              9. document lengili

IV. What machine did you conduct tlie TREC experilnent on'? Sun SPARC 2
              How much RAM did it have? 64 MB
              VVhat was the clock rate of die CPU? 4(~ MHz

V. Some systems Ł`tre reseŁ'~ch prototypes and others Ł`irC coin'nerciŁ'~.
              To help coinp~'ire diese systems:

              1. `low much "soflw~ire en~ineering" went into the development of your system.?
                     AI)oUt 3 person-years for the SMART system itself
                     2 person-weeks for the Fulir weighing code
              2.  Given appropriate resources, could your system be made to run faster?  By how much
                     (esti'nŁ'tte)?
                     Of course!
                     Can speed up phrase indexing l)y 4~~% l)y algorithm change (speed up has l)een done
                     for single terms, l)ut not for phrases)
              3. What features is your system missing that it would benefit by if it had them?


                                           460

System Summary and Timing

Universitaet Dortmund
Automatic routing (RPI feedback)

General Coininents

The timings should be the time to replicate runs from scr'~tch, not including trial runs, etc. The tilnes should also
be re~~onably accurate. This sometilnes will be difficult, such as gettilig total time ft)r d&~ument jiidexilig of huge
text sections, or m~uiually buildilig a kiiowledge base. Please do your best.


I.      Construction of indices, knowledge bases, and other dattt structures (please describe all data structures that
        your system needs t;()r searching)

        A. Which of the ft)llowin(T were used 10 build your data structures?
               1. st()pword list
                       a. how many words in list? 57(~
               2. is a controlled v(~abul~u~y used? no
               3. stelnilling yes
                       a. stŁ~idard stemming ~`tlg()n thins
                              which ones?     S~~ART
                       b. m()1~h()l()gicŁ'1l ~~alysis
               4. 1dm weighting
                       In docs + queries, tt. * idt; cosine normalization (ntc) (in docs idf is l)ased on
                       collection frequency within doc set Dl only)
               5. phrase discovery no
               6. syntactic parsing no
               7. word sense dis~~nbiguation  n(~
               8. heuristic associations no
               9. spelling checking (with manual correction)       no
               10. spelling correction  no
               11. proper noun identification algorithm n(i
               12. tokenizer (rec()L'nizes d~tes, phone numbers, CoifliflOli patterils) no
               13. are the mŁ'uiuŁ.illy-indexed terins used? Ilo
               14. other techniques used to build ~ta structures (bnef description) no

        B. S~itistics on data structwes built from Tl~EC text (please fill out each applicable section)
               1. inverted index
                       a. total ~`uli()unt of storŁ'ige (ineg~ibytes) 275
                       b. totil computer tilne to build (approxilnate number of hours)
                              1.9 hours (not including tllue to index Dl to o')tain collection frequency info)
                       c. is the pr(x:ess completely ~`Lut()ln'Ltic? yes
                       d. (`ne term positions wi~in (1(iculnents stored?   Ilo
                       e. single terms only?  yes
               5. other dali structures built from Tl~EC text (wh~'it?)
                       Map from dodd to text location (also gives title for each doc)
                       ~`i. total Ł`ui'ount of st()rŁ'ige (megabytes) 24 M')ytes.
                       b. t()~l computer tilfle to build (approxilnate number of hours)
                              Tinie t(~ create included in inverted tile creation ahove.
                       c. is the pr~'cess completely (`tutomatic?    yes

                 other data structures built from TREC text (what?)


                                                  461

                      Map from int~rnal concept to token string
                      a. total aiflOulit of st()r~1~e (`ne~~'~bytes) 13 Ml)ytes
                      b. tot~'Il co'npu ter time to build (approxli~ate number of hours)
                            TIme to create included in inverted tile creation of Dl.
                      C. is the pr(~ess completely automatic? yes

        C. Data built from sources other th~~i ~e input text
               NE~Ile, other than st()pw()rd tile.

II. Query Construction
               (please fill out a section for each query construction method used)

        D. Aut()Inatic~illy built queries (routing)
               1. topic fields used all
               2. toL~ computer tilne to build query (cpu seconds)
                      1300 seconds, not including time to Index Dl (3.0 hours)
               3. which of the fi)llowiIlg were used in building the query?
                      a. terms selected from
                            (1) topic
                      b. tefln weighting
                            (1) with weights based on terms in topics
                            (2) with weights b~ised on terins in all training documents
                            (3) with weights ~sed on terms from documents with relevance judgments

III. Searching

        A. TotŁ~ computer tilne to se~irch (cpu seconds) 312 seconds (includes retrieval + ranking).
               1. retrievLil tilne (t()~l cpu seconds between when a query enters the system until a list of
                      document numbers aic ()bL~inCd)
               2. ranking time (tot~'il cpu seconds to sort d(~uInenI list)

        B. Which methods best describe your in~~chii'e se~irching methods?
               1. vector space m(xlel
               2. probabilistic model

        C. What fac(()rs ~Lre included ill ~()w. raiiking?
               1. terin frequency
               2. inverse d(~uInent frequency
               8. infonnation theoretic weights
               9. document length

IV. What machine did you conduct the TREC experilnent oil'? Sun SPARC 2
               How much RAM did it have? 64 MB
               What wŁL~ the clock rate of the CPU? 40 MHz

V.  Some systems are research prolotypes ~ind others Łue c()mmerci~d.
               To help compare these systems:

               1. how much "software engineering" went into tl)e development of your system?
                      AI)out 3 person-years f~)r the SMART system itself
               2. ("iv en appropriate resources, could your system be made to run f~'tster?    By how much
                      (Cstim~ite)?
                      Of course!
                      Due to algorithm flaw, CPU time f~)r constructing routing ~iuery is al)out a factor of


                                           462

      5 t()( much (Algoritlilil tound I)e%'t terfll% to expand I)y even though we had re(1uested
      expan%~i()n I)y () terni).
3. What feature.~ i~ Y()U~ ~y~tein mi~~iflg that it would benefit by if it had them?


                           463

System Summary and Timing

University of Illinois at Chicago

General Coininents

The timin~~s should be the tilne to replicate fulls from scratch, not including trial fulls, etc. The tilnes should also
be reasonably accurate. This sometilnes will be difficult, such as getting total time for d&~ument indexing of huge
text sections, or maiiually building a kuowledge base. Please do yoUr best.


      Construction of indices, knowledge b~'i~es. and other data structures (please describe all data structures that
      your system needs for se~'irchi'ig)
             Each document is represented as a set of word pairs. Pairs were formed from all adjacent
             words, plus all words separated Ily ()flC and two intermediate words. Documents were the
             unit of ()rgani7~ti(~n f(~r the data structure. If a pair occurred only once in a document it was
             dropped from the data structure for that document ~~nly.

             A sample record is as f(JII()ws:

             MULTIMEDIA                 ENCYCLOPEDIA                2 WSj88081S-E~14

             The numl)er of times the pair occurred in tile document appears in the third field, just l)efore
             the document id.

      A. Which of the following were used to build your data structures'?
             1. stopword list
                    The stopword list fr(~m SMART versi(~n 10 was used. Some additional stop words
                    from TREC markup codes were used.
                    a. how many words in list'? The total size of the stoplist was 631 words.
             2. is a controlled vocabulŁ'iry used'? none
             3. stelnining none
                    a. st~'rnd~'~d stemming algorithms
                          which ones'!
                          Some small stemming experimenLs were later perf(wmed using the code from
                          SMART versi~~n 10 and three training (lueries. For (~uery 002 stemming
                          had n(~ effect, while t')r (~uery (~6 it resulted in a 43% increase in recall,
                          and f(~r (~uery (~9 a 73% impr(~vement in recall.
                    b. InoIi,li()loI~ic~'~l ~`ui~'ilysis Ilolle
             4. tenn weighting
                    None. Weighting w~~' planned Ilut could not l)e implemented given limitations that
                    arose.
             5. phrase discovery
                    a. what kind of phr~'tse?
                          Word pairs occurring within three word positions Of one another.
                    b. usin(g st~'~istical ineth(Kls All such pairs were identified.
                    c. usin~ sylitaclic methods
             6. syntactic p~'u'sint,' none
             7. word sense dis~nbiguation fl()flC
             8. heuristic aNsociations
                    a. short definition of these Ł`L';s()ci~'~ti()IIs Only the l)asic pairing ass(iciatioIL~ used.
             9. spellinL' checkinLY (with manuŁ'il correction) none
             10. spellin~ correction none


                                                        464

                11. proper noun identific~~ti()n (dg()n~rn  n~~ne
                12. tokenizer (rec()~nize.~ d~te~, phone nuiuber.~, coifliflon patteni.';) Il()~C
                13. aic the rn~~u~~ly-iiidexed tenn~ used?   none
                14. other techniques used to build ~ita structwes (brief description)      flofle

         B. Statistics on d~~ita structures built from Tl~EC text (please fill out each applicable secti()n)
                1. inverted index Based only on pairs, not individual ternis.
                       a. total ainount of storage (megabytes)    819 ~egahytes
                       b. total computer tilne to build (approxu nate number of hours)        1(11) hours
                       C. is the pr('cess completely automatic?   yes
                       d. Łire terin positions witlim    d(icuments stored? no
                       e. single terms only? flolie
                2. n-grains, suffix (~ays, signature files  See BI.

         C. Data built from sonices other th~ ~e input text    --no

II.  Query construction
                (please fill out Ł,i section flir each query c()nsti~ucti()n method used)

         A. Automatically built queries (ad hoc)
                1. topic fields used Title, Description, Narrative, and Concepts (only tirst two.)
                2. to~l computer time to build query (cpu seconds)   (1.26 SeCond.
                3. which of the following were used?      fiolle

         D. Automatically built queries (r()utin(~)
                1. topic fields used Title, Description, Narrative, Concepts (first two).
                2. total computer tune to build query (cpu seconds)  55 SeConds
                3. which of thc following were used in building ~e query?
                       c. phrase extrLlcti()n
                              (2) from ~~Lll trunilig documents
                                      Word pairs occurring in the relevant training documents for the
                                      query 1)ut not in the irrelevant documents were used.

III.  Searching

         A. Total computer tilne to se(u~di (cpu seconds)
                1. retiiev~tl tilne (total cpu seconds between when a query enters tlie system until a list of
                       document numbers are ()bt~1ined)
                       This was not optimized f~)r the current experiments. Run time was approximately
                       2(1 minutes per search. 1~r()per ()ptiniizati(~n will reduce this tinie.
                2. r~~mkin~ time (total cpu seconds to sort d('cument list) .22 seconds

         B. Which metliods best describe your michine searching metliods?
                4. n-grŁ~ matching

         C. What f~~ct()rs aic included in y~~ur r~~tnking?
                11. li-grun fiequency

IV.   What mŁ~hine did you conduct ~e TRFŁC experiment on?       IB~1 3(19()/3()(lj
                How much RAM did it h(tve?          16 Meg f(~r a virtual machine.
                Wh~~it w~Ls the clock rate (if ~e Cl~~ 1? 14.5 nanoseconds, or 69 MH7.

V.    Some systems aic research prototypes (md others ~ue c()InmerciŁ(1.
                To help c(impŁue ~ese systems:


                                                       465

1. How much ".`;oflw~ire ellgilleerillg" went into the development of your sy.~tem?
       4(J h(~urs (~f new developmellt, 1)eyond using word pairing tools that were developed
       earlier over a peri(~ of years.
2.  Given appropn'aie reN()urce.~, could your sy.~Iein be made to run f~~ter? By how much
       (e.';Iiinale)?
       Yes, search time could l)e reduced, hut a reliahle estimate of how much cannot he
       made at this time.
3. What features is your system missing (hal it would benefil by it' it had them?
              Phrase weighting
              Term weighting and auxiliary single term search
              Stemming
              Removal of pair order etTects
              Shortest path network search


                           466

System Summary and Timing

Belicore

General CoinmexiLs

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The t~nes should also
be reasonably accurate. This soinetilnes will be difficult, such `~ gettin~ total time ft)r document indexing of huge
text sections, or manually buildin(2 a k'iowledge base. Please do your best.

I.       Construction of indices, knowledge b~~es, and other data structures (pleŁ~~e describe all data
         your system needs for se~irching)

structures that

       A. Which of the following were used to build your data structures?
              1. stopword list yes (though SoniC experiments without stoplist)
                     a. how many words in list? n=439; standard SMART list, I think
              2. is a controlled v(~abul~uy used? no
              3. steinining fl()flC (except truncation at 20 character.'~wd)
              4. tenn weighting yes, l()g(tt)*(1~entr()py)
              5. phrase discovery no
              6. syntactic p~'irsing no
              7. word sense disainbiguation no
              8. heuristic Ł~ssociations no
              9. spelling checking (with manual correction) ~()
              10. spelling correction
                     no (not directly, l)ut the LSI analyses does some of this fi)r free
              11. proper noun identification ~dg()rithIn Ilo
              12. tokenizer (recognizes dates, phone numbers, coininon pattenis)
              13. are the manually-indexed terms used? no
              14. other techniques used to build ~ta structures (brief description)
                     LSIISVD analysis of term~l)y-d()cument matrix.    Takes raw term-hy-doc matrix;
                     transforms  entries   using      log  entr(~py term weightings; calculates hest
                     "reduced-dimensi()nal" approximation to transformed matrix using SVD. Numl)er
                     of dimensions 250-350.       Does all (1uery-doc matching in this reduced-dimension
                     vector space.

       B. Statistics on data structures built from TREC text (please fill out each applicable section)
              5. other data structures built from ~~REC text (what?)
                     LSIISVD uses reduced-dimensi()nal vectors (see l)elow fi)r description of how they are
                     derived). The numl)er of dims was I)etween 235 and 250. There is one such vector
                     for each term and fi)r each d(K:ument. Queries are also represented as vectors and
                     compared to every document.
                     a. total ainount of st()rŁige (Ine(2~.Ibytes)
                            All reduced dimensional vectors are stored in a hinary datahase. Datahase
                            c(~sists ~ a vector fi)r every doc and every term occurring in more than
                            one doc. The vectors currently consist (~f single precision real values. For
                            TREC, we huilt (jne datahase fi)r each collection.
                            Approx. 50000 docs are sampled. Terms that occur in more than one of
                            these documents are used in the SVD analysis.    The remaining docs are
                            added to the database.

                            DOEI    - docs: 226(~7, terms: 42221, ndim: 250-> 262 meg dl)


                                                 467

wSjI   - docs:  99111, terms:       ndim: 250
API    - docs: 8493(~, terms: 78167, ndim: 25(~
ZWFI   - docs: 7518(), terms: 6()565, ndim: 250
FRI    - docs: 26207, terms: 54713, ndim: 25()
W5j2   - docs:  7452~), terms:      ndim: 235
AP2    - docs: 79923, terms: 82997, ndim: 235
ZIFF2  - docs:  5692(~, terms: 72197, ndim: 235
FR2    - docs:        terms: 48728, ndim: 235

-> 169 meg dl)
-> 163 meg dl)
-> 135 meg dl)
-> 80 meg dl)
-> 141 meg dl)
-> 153 meg dl)
-> 121 meg dl)
-> 64 meg dl)

      Used 25() dims fi)r routing and 235 dims for ad hoc (~uerjes
      In general, database size will be: (ndocs+nterms)*ndim*4
      The totals here are 1288 meg (750000 docs and 585(~00 terms).
      If a single database had been used, the total would have been smaller
      becauSe of term overlap--currently, many of the terms are represented in
      more than oliC datal)ase; there are only 2000(M) Uni(Jue terms.

b. t()t~l Computer tilne to build (~~ppr()xilnate iiuinber of hours)
      Four main stages:
      I. indexing (extracting keys; calculating wts; etc.)
      2. SVD (number ~)f d1111C1151()fl5 extracted ranged from 235-310)
      NOTE 1: only 235-25() dims were actually used fi)r retrieval. I don't have
             timing data for extracting only this smaller numl)er of dimensions,
             but I'd estimate that the numbers t~)r APi, ZIFFi and FRi could
             l)C reduced by about 20%.
      NOTE 2: initial indexing and SVD are typically done on a subset of 50()00
             docs and uterms
      3. various i/o translations (much (~f this will g(j away soon)
      4. adding new docs to dl)a5e (if sul)-sampled for SVD).
      SVI) done oil 5(~(H)() docs; the remaining docs are indexed and added to the
      datal)ase after the SVl).
      all times in ~1INUTES (SVD
      DOEI   - index: 49 SVD: 1219 io:
      wSj1   - index: 241 SVD: 1474 i():
      APi    - index: 271 SVD: 1644 i():
      ZIFFi  - index: 241 SVD: 1359 i():
      FRi    - index: 241 SVD: 939 io:
      WSJ2   - index: 427 SVD: 1382 io:
      AP2    - index: 338 SVD: 1210 i():
      ZIFF2  - index: 260 SVD: 1452 i():
      FR2    - index: 187 SVD: 486 io:

                              run on DECS()()(); rest on SPARC2)
                                   194 add: 591 SUM: 2053 mins
                                   174 add: 4()4 SUM: 2293 mins
                                   214 add: 455 SUM: 2584 mins
                                   156 add: 352 SUM: 2108 mins
                                   133 add:   0 SUM: 1313 mins
                                   22(J add: 461 SUM: 2490 mins
                                   218 add: 273 SUM: 2(~9 mins
                                   2()8 add:  0 SUM: 1920 mins
                                   loS add:   0 SUM: 778 mins

C. i.~ the pr(~es.~ Completely ~ut()Jn~'ttiC? YES
d. brief deNcription of Ineth()d.~ u.~ed
      LSI/SVD analysis (~f document collection
      1. creates raw term-l)y.d()c matrix; transf~)rms entries using log entropy
      term weightings
      2. calculates beSt "red uced-dimensional" approximation to transformed
      matrix using SV1). Number of dimensions in the SVD calculations ranged
      fi~()m 235 to 3I(~.  BUT, only 235 ()~ 250 were used f(Jr the comparisons.
      Fewer dims could have been calculated, So Some reported SVD times are
      higher than necessary. I'd estimate about 2()% reductions in SVD times for
      API, ZIFFI, and FRI.
      3. perf~)rm various datal)ase translations. Current SVD program outputs
      vectors in a different f(~rmat and order than we need for the database. It


                         468

                            Will eventually output vectors in the appropriate datahase format, and this
                            entire step can l)e omitted.
                            4. SVD calculations usually run on -5(),()(M) docs x nterm%' matrices. The
                            remaining docs (if any) were indexed and added to the datal)ase here.

       C. Data built from sources other th~ui tlie input text --no

II. Query c()i'structioil
              (please till out a sectioll f()r each query consti-uction method used)

       A. Automatic~illy built quefles (ad hoc)    yes
              5u1)Initted two sets of ad hoc (1ueries; (1ueries were the same in hoth c~%'es; only difference
              was how information from diflerent sul)-c()llecti()ns was coml)ined
              1. topic tields used
                      all (except NO manually indexed terms used)
              2. to~l computer tilne to build query (cpu seconds)
                      Queries are vect(~r sums (~f constituent term vectors
                      Separate query vector created fi~r matching against each of 9 datal)ases (DOE,
                      WSJI, API, FRi, ZIFFI, WSJ2, AP2, FR2, ZIFF2)
                      Time = .4 secI(1ueryldatal)ase -> 3.6 secs/(~uery
                      NOTE: These times simulate handling each query separately (so      there is no ilo
                      l)utfering). There are l)ig improvements if you initially read in all the term vectors
                      and create all the ad hoc queries at once.
              3. which of the following were used?
                      a. term weighting wi~ weights b~~ed on teims in topics
                            term weighting, but weights based on term usage in document c(~llections
                      Ii. expalision of quenes usin(~ previ()usly-constructed dr~ta structure (from pŁirt I)
                            not really

       D. Automatic~-dly built queries (routin{',) yes
              submitted two sets of routing queries. Both were automatically created from
                      I) the text of the topics and
                      2) the relevant documents
              1. topic tields used
                      all (except NO manually indexed terms) for 1)0th 1) and 2)
              2. total computer tilne to build qucly (cpu seconds)
                      Queries are vector sum~. of constituent term vectors [case I)] ()~ document vectors
                      (case 2)].
                      Separate query vector created for matching against each of 4
                      (WSJ1, APi, FRI, ZIFFI)
                      Time = .4 seclqueryldatal)ase in case I) -> 1.6 secs/query
                      Time = .1 sec/query/database in case 2)-> ().4 secsI(~uery
                      NOTE: These times simulate handling each query separately
                      buffering).

                                                                          separate databases

              3. which of the ft)ll()win~ were used in buildin~ ~e query?
                     a. terms selected from
                             (1) topic case I)
                             (3) only documents with relev~~ice judgments case 2)
                     b. telin weighting
                             (2) with wei{',hts b-~sed oil terms in all training d()c~ents

                                                                          (so  there is no i/o

III. Searching


                                          469

      A. TotŁ~ coinpuler tune to seŁ'uch (cpu seconds)
              1. retrieval tilne (to~Il CPU seconds between When a query enters tlie system Until a list of
                     document numbers ~ire obtained)
                     Time = -5()(~)(~ query-doc c()mparisonsiminute when all vectors are pre-loaded.
                     Currently, we c~~mpare ALL d(~cs t(~ each query.
                     For ad hoc queries, the time to c~~mpare a query to the 75(~K~~ docs is -12 minutes
                     For r(~uting queries, the time to c~~mpare a query (new doc) to the profiles (50
                     profiles in each (~f 4 datal)ases) is al)out .3 Sec
              2. rankini2 time (totŁ'Ll cpu seconds to sort d(~Ument list)
                     none; it's included in the times given in 1. Currently 1)0th comparisons and ranking
                     are done in the same routine

      B. Which methods best describe ~()U~ `u~~chine se~'irching me~ods?
              1. vector sp~'~e m(KIcl

      C. What factors Ł`ire included in y()w r~'uiking?
              Hum, not sure I get this. Similarity l)etWCeIl a query and a document is the cosine l)etween
              the query vector and the document vectE)r. This cosine determines the rank.
              Term weights are used to determine the location ~)t. the query vector. The query is located
              at the weighted vector sum (~f i~s constituent terms.
              1. tenn fiequency l()g(tf)*(1~efltr~~py) term weight; s(~ there's a tf part
              3. other tenn weights (where do they come from?)
                     log entropy; weight come fi~om training docs (diski) for routing queries, and from
                     both the training and test docs for ad hoc queries
              4. se'nŁuitic Closeness (as in semantic net distance)
                     sort of; if you think of term vector I()cati()ns as reflecting semantic ~ssociations. But
                     these locations are auto derived from the SVD analysis
              8. infonnation theoretic weights IE)g (tt) * (1-entropy)

IV. What machine did you condUct tlie TREC experiment on?
              How much RAM did it have?
              What was the clock rate of ~e CPU?
                     SVDs run on DEC5t)t)() wi --4(~(~ meg; clock is ??? MHz
                     all else run on Si~ARC 2 WI 384 meg; clock is 25 MHz (I think)

V. Some systems ~we rese~~~ prototypes md others LrC c()Inmerci~'tl.
              To help C()InP~C tliese systems:

              1. How much "softw(tre eIl'~illeenn~' went into the development of your system?
                     Real hard. The system was huilt as a research prototype to l(x)k at many different
                     issues. I'd say aboUt 1-2 person-years, l)ut this is much more than would have heen
                     required if specs had l)eell fixed at the beginning.
              2.  (~`iven appr()pfl~1te rcs()Urces. coUld yoUr system be made to run fŁ%';ter?  By how much
                     (estimŁ'~le)?
                     The existing tools were used pretty much as is for TREC, even though they were
                     devel(~ped t(~ work with much smaller databases.  Also, there are far more
                     parameters and options than we typically use.   Alm(~t no effort went into
                     re-engineering for large databases ()~ to more efficiently handle what we now use as
                     default parameters.
                     Time in query c(~nstructi()n and retrieval are spent:
                            1) seeking for vectors in a single large database of term and doc vectors.
                                   The database could easily be split.
                            2) many calculations (scalings (~t. various sorts) are done on the fly.     This
                                   could be eliminated if one knew that users wanted to retrieve only


                                          470

                     documents, f~~r example.     Currently l)()th terms and docs can he
                     retrieved with the same pr~~grams and scaling isn't done until we
                     see thit the user wants retrieved.
              3) all calculations are done in tl()ating point. Could he done with integers.
              4) each ad hoc (~uery was compared to EVERY d(~ument.      This can he
                     speeded up hy 5()~C document clustering algorithms that we have
                     looked at. This can also he speeded up tremendously hy using more
                     than one machine or hy using a parallel machine. All vectors are
                     independent, so it's trivial to split query processing.
       I'd guess that improvements (~f a factor of 2-5 could he (Jl)tained just hy tweaking
       items 1), 2) and 3).
       Parallel query matching is the way to go. For example, we got speed-ups of 5()-1(M)
       times using a MasPar for query storage and processing with no attempt to optimize.
       In terms of pre-processing and SVD analyses:
              I) ahout 1(~% (~f the time is spent in unnecessary `10 translation (hecause
                     we've patched together pre-existing t()()ls).  Much of this will
                     eventually g(i away.
              2) more than 5(J% of the time is spent in the SVD.  These alg()flthms get
                     hetter and faster all the time (the algorithm we n~iw use is ahout
                     I(X~ times       faster than what we used  initially). There are
                     speed-memory trade()ff%' in different SVD algorithms, so time can
                     pr()hal)ly he decreased hy a factor of 2 ()~ 3 hy using more memory.
                     Parallel alg(~rithms will help Some, hut pr()hahly only hy a factor or
                     2 ()~ 3.
       These are (~Ile-tinle costs f~)r relatively stahle domains. We've found that new items
       can he added to the existing solutions without redoing the scaling f~)r a while.
       Others ???
3. VVhat features is your system  inissin~ thŁ'it it would benefit by if it had them?
       Precision would prol)ahly he increased hy many of the standard things--phrases,
       proper noun identitication, tokenizer (f~~r dates, phone nuinhers, addresses, etc.), and
       some hetter handling of negation and union.
       S(~me form (~f literal string matching might he useful to use in *comhinati()n with LSI
       for some types of queries.
       Others ???


                                   471

System Summary and Timing

Queens College, CUNY

General Conunents

The fimings should be the tilne to replic~ite ruiis from scratch, not including trial runs, etc. The tilnes should also
he reasonably accur~'~te. This sometilnes will be difficult, such as getting total time for document indexing of huge
text sections, or `n~~ually building 1 kilowledge base.  Please do your best.

I.       Constructioll of indices, knowledge b~ises, ŁLnd other dat~'~ structures (please describe
         your system needs for se~'ucliin~)

all da~ structures that

       A. Which of the following were used to build your data structures?
               1. st()pw()rd list yes
                       a. how many words in list'? 595
               2. is a c()ntR)lled v(~abul~lry used'!   no
               3. stelninilltT
                       a. stŁuid~ird steimnilig algon thins yes
                                which ones'?   I~()rter's Algorithm
                       b. In()1~h()l()gical (`u1'Llysis 11(~
               4. telin weighting   yes
               5. phr~'ise discoveiy n~j
               6. syntactic p~u;sinL' no
               7. word sense dis'unbitjuati()n 11(~
               8. heuristic associations n(j
               9. spelling checki'i~ (with manual collection)       no
               10. spellitiLl colTection Ilo
               11. proper noun idCntifiC~tti()Il L'il~()ri thin no
               12. tokellizer (reco~flizes d~'LtCs, phone nujnbers, COiThflOfl patterus) no
               13. Ł`u'e the in~tiiually-ii'dexed tenns used'!   110
               14. other techniques used 10 build d~it~i structures (brief descuption)
                       A tal)le of 396 manually created 2-word phrases.           When these are identifled in
                       adjacent positions in documents or (~ueries, they are used as additional index terms.

       B. StŁ~tistics on d:ita structures built fiom TREC text (please fill out each applicable section)
               1. uiverted index
                       a. total ~~n()unt of storage (megabytes)       378
                       b. total computer tune to build (approxilnate number of hours)
                                95+11+2=11)8 fi~r 5(1(1MB. clock tilne
                       c. is the process completely automatic? Yes, if sutlicient disk. Not in this experiment.
                                if not, ~tppr()xiinL'Itely how many hours of manual labor? (1.5
                       d. ~`ue term positions within d('cuments stored'?
                                No, Ilut sentence yes. Call modify to capture word positions.
                       C. single terms only'?  Yes, except t~)r I.A.14.
               4. special routing structures (wh~t?)      See I.B.5
                       Network       node, edge       tiles.    Routing using network     node and  edge files is
                       straightforward.
                       Ł1. total Łun()unt of st()r:lge (ine~abytes)
                                Node tile: 4x7.5              Edge tile: 4x4
                                Netw(~rk segmented int(~ 4, Ilecause (~f insufficient ram.
                       b. t()L~I computer tune to build (appr()xilnL~te number of hours)


                                                        472

                     4(~+5+1+4x().2=46.8, starting from text tile.
              C. is the pr~}cess colupletely Ł~utornatic? yes if sufticient rain and disk space.
              (1. brief descriptioii of methods used
                     1. Process (old) collection A.
                     2. Process (lueries against collection A.
                     3. Process new collection B as if they were (lueries--to make use of collection
                     A statistics.
                     4. C()ml)ine (iuerles, (old) dictionary and collection B into network for
                     retrieval.
       5. other data structures buill from TREC text (what?)
              1. Suhd(~cun1ent file
              2. C(~ed tile
              3. D~~id checking file
              4. Termid checking tile
              5. Docnum tile
              6. Termnum (dictionary) file
              7. Direct tile
              8. Index to direct tile
              9. N('(le tile
              lo. Edge file
              a. totil ~un()uflt of st()r(~ge (me~aby(es)
                     1.481         2.324
                     3.7           4.4
                     5.11          6.6
                     7.372         8.19
                     9. 4x14       lo. 4x9
                     System was developed for experimental research, with tlexil)ility to generate
                     other data. Some of the tiles are not necessary for retrieval.
              b. total computer tillie to build (approxil nate number of hours)
                     1.  1.5
                     2,3,4,5,6. 95
                     7,8. 11
                     9,1(). 4x(~.25=1
              C. is the pr('cess completely aut()mŁ~tic?
                     Yes Ir sutTicient RAM and disk space. For this experiment, no.
                     if not, ~ipproxim~itely how many hours of m~uiual labor? 2
              d brief description of methods used
                     ra~v text -.> sul)d()cunlent tile
                     sul)d()cuIllent --> c(~ded tile, dodd file, termid tile, docnum
                     (dictionary) file.
                     Zipf-law prograni truncates dictionary via user assigned limits.
                     Coded, terninuni --> direct file with index
                     direct -> inverted file
                     direct, inverted --> node, edge tiles.

C. Data built from sources other th('w ~e input text
       1. inte~~illy-built auxiliL~y files
              a. domain independent ()~ d()mŁlin specific (if two
                     of questions for each file) phrase file
              b. type of file (thesaurus, knowledge b~';e, lexicon,
              C. total ainount of st()rŁige (meg~'iby(es) (~.E)E)5
              d. total number of concepts represented 396
              f. tOtL~ computer tilne to build (approxiluate number of hours)
                             (~ (this is a tile created via editor).

                                  473

                                                                       file, termnum

                                                      sepŁLrate files, please fill out one set

                                                      etc.) word pair

                     g. total m~u1u('d tilne to build (approximate number of hours) 16
                     h. use of m(~nu('1l l'!b()r
                             (4) other (describe) Search for WSJ terminology in 1il)rary and from topics.
              2. externally-built auxili('uy file ~()

II. Query construction
              (please fill out a section for each query construction method used)

       A. Automatic~tlly built queries (ad hoc)
              1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
              2. total computer tilne to build query (cpu seconds) 5 (average for each query).
              3. which of the following were used?
                     a. term weighting with weights b~~~ed or' terms in topics yes + others
                     h. Cxp(~1si()n of queries usin~ previously-constructed data structure (from part. I) yes
                             (1) which structure? word-pair phrase tile

       B. Manually constructed queries (ad 11(x)
              1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
              2. average t~e to build query (minutes) 3(XJ mjiiutes for 25 queries
              3. type of query builder
                     b. computer system expert
              4. tools used to build query
                     Łt. word frequency list sometimes
                     b. knowledge base browser (knowledge base described in p~ut I) no
                     c. other lexical tools (identify) 110
              5. which of the following ~vei-e used'?
                     a. term weighting
                     b. BOolean collilectors (AND, oR, N()T)
                     d. additk)n of terms not included in topic
                             (1) source of terms word-pair phrase tile

       C. Feedback (ad hoc)
              1. initial query built by method 1 or meth('d 2'? method 1
              2. type of person doing feedkick
                     b. system expert
              3. average tilne to do complete feedback
                     a. cpu tilne (total cpu seconds for all iterations)
                             12 per query per iteration--no expansion
                                 `I""            "  --with expansion

                     b. cl(xk time from initial construction of query to completion of final query (minutes)
                             6(~ per query to do relevance judgment
              4. average number of iterations I
                     a. average nwnber of d(x'ulnents exŁ~nined per iteration 1(1
              5. minimum number of iterations  I
              6. maximum number of iterations  I
              7. what determines the end of an iterition? deadline + lack of manpower
              8. feedback methods used
                     a. automatic term reweighting "loin relevant documents
                     b. automatic query exp~'wsi()n from relevŁ~it documents
                             (2) only top X terms added (what is X)
                                    Top 2E) most `activated' terms that have document frequency <
                                    2(~()() were used. Because many were already in query, ahout 12 on
                                    the average were new and added per query.


                                               474

                    C. other automatic methods
                           bnef descriptioll feedhack is l)ased on sul)-d()cuments

      D. AutomaticŁ'illy built queries (routing)
             1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
             2. total computer time to build query (cpu seconds) 5 (average f(~r each (luery)
             3. which of the following were used in building tlie query?
                    a. terms selected from
                           (1) topic
                    b. telin weighting
                           (1) with weights bised on terms in topics
                           (2) with weights based on terms in all training documents
                           (3) with weights based on terms from documents wid) relevance judgmenL';
                    1. eXpansion of queries usinLT previously-constructed data structure (from part I)
                           (1) which structure? word-pair phrase tile

      E. Manually constructed queries (routing)
             1. topic fields used <TITLE>, <DESC>, <NARR>, <CON>
             2. average tilne to build query (minutes) 3()(~ minutes for 25 (jueries
             3. type of query builder
                    b. system expert
             4. data used fi)r buildin~ query
                    a. from trijiling topic
             5. tools used to build query
                    a. word frequency list sonletinles
             6. which of the f()ll()win~ were used?
                    a. term weighting
                    b. 13()()leŁ'i'1 connectors (ANt), oR, NoT)

III. Searching

      A. Total computer time to se~uch (cpu seconds)
              1. retriev~'il tilne (total cpu seconds between when ~`i query enters ~e
                     document numbcrs Łue ()bt~iined)
                     8-2(~ per query without soft-Boolean (Conll)ifle 2 methods).
                                  with   "   `  (coml)ine 3 methods).
              2. ranking time (total cpu seconds to sort d(~ument list) 4-12 per query

      B. Which methods best describe y~~ur Inichille searching me~()ds?
              2. pr()babilistic model
              8. neur~il networks

      C. Whai factors are included in your raiiking?
              1. tenn frequency
              3. other term weights (where do they ColfiC fi-om?)
                     inverse collection term frequency total word occurrences
              9. document leng~

IV. What machine did you conduct ~e TRE(' experiment on? SPARC-2GS
              How much RAM did it have? 48 ~1B
              WhLit w~-is the clock r~-ite of ilie CPU? 4(~ ~1Hz

V. Some systems are rese~'irch prototypes and others ~`ue colninercial.
              To help C()mp~u-e tl)ese systems:


                                           475

system until a list of

I. How much "soflw~'ire efl(TiIleerin(~"' went into the developmeiit of your system?
       N(~t much, time wa% spent to truncate record slzes to save space and fit certain
       structures in memory; replace 5()~C linked lists with arrays.
2.  Given appropri~ite resources, could your system be made to run faster? By how much
       (estimate)?
       Yes. 5(~1(N~%. Lots (~f code was translated from PASCAL to C and UL~ed as is.
3. Wh~'u features is your system missing that it would benefit by if it had them?
       Dedicated SPARCstation.
       More RAM.
       More disk space.


                           476

System Summary and Timing

New York University

General Coinments

The fimings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difticult, such as getting total time for document indexing of huge
text sections, or maiiu~llly building a kliowledge base.  Please do your best.

I.      Construction of indices, knowledge b%~es, and other data structures (please describe
        your system needs for se~'irching)

        A. Which of the fo11owin~ were used to build your data structures?
               1. st()pw()rd list yes
                      a. how many words in list.?    38()
               2. is a controlled v(~abul~u~y used? Ilo
               3. stemming yes
                      a. stŁ~d~~d stemming ~dgorithIns    110
                      b. m()1'ph()logical (~(dysis yes
               4. teim weighting    yes
               5. phrase discovely   yes
                      a. whit kind of phrŁise?  NI~'s Vl~'s, others
                      b. using st~itistical meth(xls partially
                      C. using syntictic methods    yes
               6. syntictic pŁ~sing  yes
               7. word sense dis~unbiguati()n no
               8. licuristic associations yes
                      a. short definition of these ~iss()ciations synonymy, specializations
               9. spelling checkiu~ (with in~uiual CoflectiOn) 110
               10. spelling ColTection  no
               11. proper noun identificŁttion algorithm partial
               12. tokenizer (recognizes dates, phone numbers, coinmon patte~s) no
               13. ~ire the m~inuLilly-indexed terms used? 110
               14. other techniques used to build ~ta structures (brief description)

all data structures that

       B. Statistics on d~ta structures built from TREC text (please fill out each applicable section)
              1. inverted index
                      ~i. toial unount of st()rŁige (megabytes) 29(~ MB (().5 GI)yte txt)
                      b. toLd computer time to build ((Ipproxilnate number of hours)     25()
                      c. is the pr(~ess completely ~iutomatic?  yes
                      d. ~ire term positions wi~in d('cuments stored? yes
                      e. sin~le terms only.? 110
              3. knowledge bases  yes
                      a. toLd Ł~ount of stol-Lige (meg~'ibytes) (1.5
                      b. toLd number of concepts represented 3262
                      c. type of represenLitioll (fr(unes, semantic nets, rules, etc.) ass(lciations
                      d. total computer time to build (approximate number of hours)      175
                      C. to~l minual time to build (ipproxilnate number of hours)       (I
                      f. use of manual lalx)r none
                      g. auxili~iry files needed fi)r machine use
                              (1) m~ichine-readible dictionny (which one?)  OALD


                                                477

       C. Data built from sources other th~'ui ~e input text --no

II. Query colistruction
              (ple~i~e fill out a section I;()r e~ich query construction method used)

       A. Aut()Inatic(IIly built queries (~`id hoc) yes
              1. topic fields used <nuni> <title> <desc> and <narr>
              2. toI~'il computer tilne to build query (cpu seconds) 3.()
              3. which of the f()llowin(~ were used?
                      b. phrase extraction from topics
                      C. syntactic p~trsiilg of topics
                      C. proper IIOUll identification al~ontliin partial
                      g. heuristic Ł`Lssociati()I's to add terms
                      h. exp~uision of queries using previously-constructed data structure (from part I)
                             (1) which structure? term clusters
                      j. other (describe) syntactic phra%es

       D. Automatically built queries (routin~)
              1. topic fields used same as in ad hoc
              2. total computer tilne to build query (cpu seconds) 3.2
              3. which of the following were used in building tlie query?
                      a. terms selected from
                             (2) all trŁ'tining documents
                      b. tenil weighting
                             (2) with weiL~hts based on terms in
                      c. phrL~';e extraction
                             (I) from topics
                             (2) from all Inuning d()cument.~
                      d. syn(~ictic pŁ'irsin~
                             (1) of topics
                             (2) of ~`dl tr'.uning d(~uments
                      f. proper noun idCIltiliCL'Lti()II algolithin
                             (1) from lopics partial
                             (2) fr~)in all tr'.Lining documents partial
                      Ii. heuristic Ł~`;s()ciati()ns to add terms
                             (2) from all tr"uni'ig documenL~
                      i. expansion of queries using previ()usly-constructed data structure (from part I)
                             (1) which structure? clusters from training data

                                                              all training docwnents

III. Searching

        A. Total computer tilne to seŁ'uch (cpu seconds)
                 1. retrieval tilne (total cpu seconds between when a query enters tlie system until a list of
                        document numbers are ObtL'lii'Cd)
                        TOTAL TIME (CPU + I/O) search and ranking is al)out 6(~ minutes per query
                 2. ranking tune (total cpu seconds to sort d(x:ument list)

        B. Which methods best describe your machine searching mettiods?
                 1. vector space m(slel

        C.  What f~tctors are included in your rankiug?
                 1. tenn frequency
                 2. inverse d(icument frequency


                                                478

IV. What machine did you conduct tl)e TREC experilnent oii'?
             How much RAM did it have? 56 Ml)ytes
             What w~'~ the clock rate of ~e CPU? 28 MIPS

V. Some system.'; are research pr()t()types and others are commerci~~.
             To help compare ~ese systems:

             1. Flow much "software en&2ineen'ig" went ilito the development of your system? A lot
             2.  (jivell appropriate resources, could your system be made to run faster? By how much
                    (estimate)? hase IR system is very inefticient flow
             3. What features is your system missing that it would benefit by if it had them?
                    There is still a lot (~f room for improvement of NLP pr(~ram~; more time and
                    experiments are re(lllired


                                        479

System Summary and Timing

University of Central Florida

General Conunents

The fimings should be the tune to replicate runs ~m scratch, not including trial runs, etc. The times should also
be re~~onably accurate. This sometunes will be difficult, such as getting total time ror document indexing of huge
text sections, or mŁ'uiuŁ..llly build~ig a kilowledge base. Please do your best.


I.       Construction of indices, knowledge bases, and other data structures (please describe all data structures that
         your system needs for se~lrChin(J)

         A. Which of tlie folk)wing were used to build your data structures?
                1. stopword list yes
                       a. how many words in list?
                              166 stop words, 122 al)l)reviati()ns, 47 hyphenated words, 24 entries for
                              al)I)reviati()ns and alternate n()ti()ns for months, 35 entries for legitimate
                              words `lot to Ile prefixed, and 6 entries for legitimate pretixes.
                2. is ~t coutrolled v(icabul('uy used? Il(~
                3. stemmin~' yes
                       a. st'wdŁu-d stemming alg()ri~Ins
                              which ones? .J.B. I~()vins' Stemming Algorithm (nl()dltied).
                       b. m()1~h()l()gical ~uialysis Ilolle
                4. telin weighting yes
                5. phrase discovery ~
                6. syntactic pŁLr5i11(2 no
                7. word sense dis~nbigu'.1ti()n
                       Yes. The semantic lexicon we used is l)ased ()~ word senses f()und in Roget's
                       Thesaurus.
                8. heuristic ass()ci~Lti()ns Ilo
                9. spelling checking (with inanutI c()11~ecti()n) no
                10. spellin(2 corlection no
                11. proper IloUII identification algoritlim Ilo
                12. tokenizer (recognizes dites, phone numbers, coirunon patterns) yes
                       a. which patterns Ł`ue tokenized?
                              The QA System recognizes dates. But we felt it was not useful f~)r the NIST
                              experiment so we removed this feature to improve text processing speed.
                13. ~`~re the m~'uiuŁilly-indexed tenns used? 110
                14. other techniques used to build d~ta structures (brief descuption)
                       The QA System uses B-tree storage structures ti)r inverted index tile access and
                       semantic lexicon access. But for the NIST experiments, we used the QA System text
                       scanning al)ility and coupled it with hash tal)le access (replacing tile B-tree access)
                       and the use of 32-l)it Codes for text strings.

         B. Statistics on data structures built from TREC text (please fill out each applicable section)
                1. inverted index yes
                       a. tOtLil ~un()unt of st()ra~e (meg~ibytes)
                              For Vol.1 the index storage was 385 megahytes.
                       b. total computer tune to build (approx~nate number of hours)
                              73 hours using nine IBM 5(~ MHz 486 PCs running in parallel.
                       c. is the Pr(~C55 completely automatic? yes


                                                     480

                    d. are te~ positions wi~in d('cuInents stored? no
                    e. single terms only? yes

      C. Data built from sources other thŁ~i tlie input text yes
             1. inteni~'illy-built auxili~iry files yes, a semantic lexicon
                    ~1. domain independent ()~ d()m~Lin specific (if two separate files, please fill out one set
                          of questions for each file) Domain independent
                    b. type of file (thesaurus, knowledge bŁ'L~e, lexicon, etc.)
                          Semantic lexicon l)uilt l)y examination of Roget's Thesaurus.
                    C. tot~il Łunount of storage (megabytes) 0.34 megal)ytes.
                    d. total number of Concepts represented
                          There are 36 semantic categories and there are approximately 24,E~0 words
                          in tw(~ lexicons with the categories they trigger.   The prohahility of each
                          triggered category is aLso stored.
                    e. type of representation (fr~es, semantic nets, rules, etc.) It could he viewed as rules.
                    t.. total computer time to build (approximate number of hours)
                          (1) if abeady built, how much time to modify for TREC?
                                 Since the 1911 edition of' Roget's Thesaurus hecame pul)lic domain
                                 recently, we spent approximately 16 hours creating the software to
                                 pr(~cess the 1911 Thesaurus. Approximately 6 hours of processing
                                 time was required to automatically extract 20,(WO lexicon entries.
                                 However, we did not have time to explore the use of these entries.
                    g. t()t~'~l niwual tilne to build (approximate number of hours)
                          (1) if already built, how much t~e to modify for TREC?
                                 Pn()r to TREC, there were 3,(HX~ entries in the lexicon established
                                 by manual processing of approximately 6,000 words in 300 hours.
                                 For TREC, we made 1,(HHJ new entries (in 85 hours) by examination
                                 ~ 1,7(~) frequently occurring words found in the training topi~s and
                                 the training text. S(~, the lexicon we used had 4,(Hlt~ entries in it.
                    Ii. use of manual l~'ib()r
                          (4) o~er (describe) Refer to (t) and (g).
             2. exten'Ł'dly-built L'1uxili~uy file Ilo

II. Query construction
             (please fill out a section for e~'ich query construction method used)

      A. Automatically built queries ((`Id hoc) yes
             1. topic fields used All fields
             2. toL~ computer tilne to build query (cpu seconds) 1 second
             3. which of the following were used.?
                    f. tokenizer (recoL'nizes d~tes, phone numbers, ~OInlfl()I' patterns)
                          Dates recognizal)le but not used.
                    Ii. exp~'ulsi()n of queries using previously-constructed dŁ~ta structure (from part I)
                          (1) which structure'? Semantic lexicon described in I.C.1.
                    j. other (describe) Term weighting based (~n terms in training text.

      D. Aut()matically built queries (r()utin~) yes
             1. topic fields used All fields
             2. total computer tilne to build query (cpu seconds) 1 second.
             3. which of tlie following were used in buildin~ tlie query'!
                    a. te~s selected from
                          (1) topic
                    b. teun weighting
                          (2) with weights based on temis in all trainin~ docwnents


                                       481

                     g. tokeiiizer (recogIlizes dŁ~tes, phone numbers, coimnon pattenis)
                           Dates are recognized l)y the QA System l)ut were not used for the TREC
                           experiments.
                     i. expansion of queries usint~ previously-constructed data structure (from part I)
                           (1) which structure? Semantic lexicon descril)ed in I.C.1.

III. Searchiug

      A. Total computer ti'ne to se~'trch (cpu seconds) 3-1(~ minutes per (~uery to retrieve and rank.
             1. retrieval ti'ne (total CPU seconds between when a query enters tlie system Until a list of
                     docuineut numbers ~tre obLimed)
             2. ranking time (t()t~tl CPU seconds to sort d(~ument list)

      B. Which methods best describe ~()U~ `n~~chine se~Lrching inetliods?
             1. vector space m(XIel

      C. What f~tciors ~tre included ill ~()U~ ru~ing?
             1. tenn frequency
             2. inverse d(~ument frequency
             9. docwnent length

IV. What machine did yoU conduct the TREC experilnent on?
             We used nine IBM P512 Model 95 computers. These were ~ MHz 486 computers with 8
             megahytes ~ RAM. Tw~~ of them had 16 megal)ytes of RAM. A 33 MHz 486 PC was used
             to distrihute text to the nine IBM PCS fi)r indexing and (~uery processing.
             How much RAM did it have?
             What ~ the clock rate of the CPU?

V. Some Systems are research prototypes and others are coi"inerci~'tl.
             To help compare these systems:

             1. How much "soRware engineeriIl~" went into the development of yoUr system?
                     Our QA System (huilt for NASA and restricted to an IBM compatihle PC platform
                     running under DOS and using ~() other license agreement commercial software such
                     as a DOS extender) is a prototype and has heen under development for one and a
                     half years. Approximately 2,E)()(~ hours (~f programming have heen used to develop
                     the current s(~ftware. The system is implemented in C and uses B-tree structures for
                     the inverted file structure. We felt our system was not fast enough to appear
                     reasonahle f~)r TREC, so we designed a separate system without a pleasant user
                     interface which used a hashing scheme to estal)lish codes for strings to cut down on
                     st(~rage space; we also eliminated the use of B-trees in this separate system. We
                     custom huilt a system for TREC during July and August; approximately 400 hours
                     of programming and dehugging went into this effort. The custom system generated
                     the results which we sent in. H(~wever, we are now trying to pr~'duce some semantic
                     results using the original QA System.
             2.  (jivel' appr()pnate resources. could your system be made to run fŁ~ter?  By how much
                     (estimate)?
                     Assuming we stay with DOS then we could easily run 8 to 16 times faster using the
                     following:
                     Hardware Improvements:
                     1. New 66 MHz PCs now on the market.
                     2. Multiple hard drives.
                     3. 16 ~ 32 megahytes of RAM instead of 8 megahytes to he used for a larger disk
                     cache and for ()U~ hashing algorithms.


                                         482

      S(~ftware Inlprovenlent.%:
      I. Pr(~er U.%C (~ RAM t')r h~.'hing.
      2. Use of a DOS extender or switch t(~ an OS/2 or UNIX environment.
3. What features IS your system  missing that ii would beuefit by if it had them?
      The follE'wing software improvements would l~neflt the retrieval performance:
      1. Relevance Feedback
      2. Larger Semantic Lexicon
      3.   Breakdown of Lexicon into noun, verb, adjective, adverb, preposition,
      conjuncti~'n, intei~jection, use coupled with a part of speech tagger.


                                   483

System Summary and Timing

Advanced Decision Systems


General Coininents

The timings should be the tjine to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as ~ettin~ total time f~)r d(lcument indexing of huge
text sections, or mai~u(..11ly buildin~~1 a k'iowledge base.  Please do your best.

I.      Construction of indices, knowledge bases, and other data structures (please describe all
        your system needs for se~.irchin(T)

data structures that

       A. Which of the followin(T were used to build your data structures?
              1. stopword list yes
                      a. how many words in list? 421
              2. is a controlled voCabul~y used?      no
              3. stenYInin~ ~()
              4. tenn weighting    ~()
              5. phrase discovely   no
              6. syntactic p'Ifsing no
              7. word sense disambigultion        ~()
              8. heuristic Ł~ss()ci~'1ti()ns Ilo
              9. spelling checking (with manuil c()1~ection)     Ilo
              10. spelling ColTeCtion       no
              11. proper noun identificŁ~ti()n ~d~()ritl1In no
              12. tokenizer (rec()t2nizes dates, phone numbers, coinini)n patterns)  no
              13. ~LrC the mŁ~u~illy-indexed terms used?      ~()
              14. other techniques used to build data structures (brief description)    original documents and
                      yes--1)inary classitication trees Iluilt automatically from the
                      topic statements

       B. Statistics on dŁita structures built from TREC text (please fill out each applicable seCtk)n)
              5.  other data structures built from TREC text (what?)
                      yes~~classificati()n vectors; actually integer arrays
                      a. total ~`yln()unt of st()rŁ~ge (megibytes)
                               Only a few Kl)ytes fi)r the training sets used f()r tile oflicial scores--vectors
                               generated oil the tly for routing the test data.
                      b. tOtLtl computer time to build (approximate number of hours)
                               Feature extraction takes less than lo seconds per document.
                      c. is the pr(xess completely ~iutomatic?     yes
                      d. brief description of methods used
                               Give a specification of a set of features, fl)r example, a list of word tokens;
                               tile docunlent is searched fi)r the nunliler of (~currences of each feature.

       C. Data built from sources other th~w the input text    --110

II. Query construction
              (please fill out ~ section for e~'ich query construction method used)

       D. Automatically built queries (routing)


                                                     484

             1. topic fields used <desc>, <narr>, <con>, <det'>
             2. total computer tune to build query (cpu seconds)
                     takes less than ~ Seconds to huild the classitication tree including feature extraction-
                     -this dE)e5 depend on the size of tile training set though
             3. which of the followin~ were used in building ~e query?
                     a. terms selected from
                           (1) topic yes
                           (2) all tr(Lining documents no
                           (3) only documents with relevance judgments
                                  yes--including 5()~C additional judgments generated l)y us
                     k. other (brief description)
                           f~ature counts--in this case these are just word counts


III. Searching

      A. Total computer t~e to search (cpu seconds)
             1. retrieval tilne (total cpu seconds between when a query enters ~e system until a list of
                     document numbers are obtalned)
                     approximately 20 hours (sic) of elapsed time (jn the WSJ test set--no accurate
                     measures of CPU time availal)le to us
             2. rankuig time (total cpu seconds to sort d(~ument list)
                     approximately 5 minutes of elapsed time--no accurate measures of CPU time
                     availahie to us

      B. Which methods best describe your machine searchilig metliods?
             10. other (describe)
                     l)inary classification algorithm liased on counts of feature occurrence in the TES
                     document

      C. What factors are included in your ranking?
             15. other (specify)
                     statistical estimate of the misclassification rate (prol)al)ility) of the classifier

IV. What machine did you conduct tlie TREC experiment on?
                     Sun SPARCstation IPC
             How much RAM did it have?
                     24M1)
             What wLts the clock rate of tlie CP~J?
                     2(~MHz

V. Some Systems are resear~ prototypes and others are commercial.
             To help compare these systems:

             1. How much "software engineerilig" went mx) the development of your system?
                     approximately 4 person-weeks for the TREC infrastructure--the CART algorithm
                     implementation used was `1otT the shelf"
             2.  Given appropriate resources, could your system be made to run t~~Lster?   By how much
                     (estimate)?
                     Al)s()lutely! The feature extraction algorithms were not optimized for speed, and no
                     datal)ase or indexes were huilt to do the testing. With faster algorithms and a set
                     of inverted indexes, we estimate a d(~ument could he classified in less than 1 second.
             3. What features is your system missing that it would benefit by if it had them?
                     We would like t(~ experiment with "(~ff the shelf" to()ls to assist in feature


                                         485

.%`pecirlcati(Il and  xtrdcti(n, f(r xample: a part or .%`peech tagger, a t(kernzer, a
proper name recEpgnizer. We aI.%() did n~~t explore the u.~ (jf concept-l)ased techni(1ues
(e.g., RUBRICFFOPNC) t(P provide low-level concepts as features.


                          486

System Summary and Timing

CITRI, Royal Melbourne Institute of Technology
We are providing 2 reports oil the systeni. This is becLiuse we have tried experiments on Iwo very different systems,
and tested quite different hypotlieses.

Project: retrieval from a compressed datahL'b'e using the CoSine measure Ł~id approximate representations of d&icument
lengths

General CommenL~

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be re~sonably accurate. This sometimes will be difficult, such ~`; getting tot~~ time f~)r d(icument indexing of huge
text sections, or manuilly building a kiiowledge base. Please do your best.


I.      Construction of indices, knowledge kises, and other data structures (please describe all data structures that
        your system needs for se(~chiIi~)

        A. Which of the following were used to build your data structures?
               1. stopword list no
               2. is a controlled v('cabul~iry used? no
               3. sten~ninL' yes, tor Construction (jf index
                      a. staiid~ird stemming ~d gori ~ins
                               which ones'! 1~()vins' 1968 algorithm
               4. tenn weighting no
               5. phrase discovery OE)
               6. syntactic p~siIlg n(~
               7. word sense dis~bit~uL1ti()n no
               8. heuristic ~~5sociations Ilo
               9. spelling checking (with manual correction) n(j
               10. spelling correction 110
               11. proper noun identification Lilgoritlim no
               12. tokenizer (recognizes dates, phone numbers, COifliflOli patterlis) no
               13. are the manually-indexed terms used? no
               14. other techniques used to build data structures (brief description)
                      no, Ilut see discussion (~f compression below

        B. Statistics on ddata structures built fi-om TREC text (please fill out each applicable section)
               1. inverted index
                      a. total ainount of storage (Jne(~Tabytes)
                               5(~.7 Ml) (37.9 MI) f~~r pointers, 12.8 MI) for fre(1uencies)
                      b. total computer time to build (approximate number of hours)
                               4.20 CPU hours, ()flCC a vocabulary has I)een huilt
                      c. is the pr&~ess completely automatic? yes
                      d. are term positions wi~in d(~uInents stored?
                               no, l)ut term frequency within document is stored
                      C. single terms only? yes
               5. other data structures built from TREC text (what?)
                      model for sul)se(1uent c()nlpressi(~n of text
                      a. total ainount of storage (Ine(~Tabytes) 2.4 Ml)
                      b. total computer time to build (approxu nate number of hours) 2.54 hours
                      c. is the pr(~ess completely automatic? yes


                                             487

                    d. brief descriptioll of Illetliods used
                           count word and n(Jn-w(~rd fre(Iuencies using splay tree

                other data structures built from  ~`REC text (what?)
                    a single file (jf the text itselt; compressed
                    a. loLil ~ufl()uflt of sI()r~I~e (incgabytes)  253.2 MI)

                    b. to~l computer time to build (approxilnate number of hours) 3.10 cpu hours
                    C. is the process completely automatic? yes
                    d. brief description of methods used
                           zero-order w()rd-l)ased model using Huffman c(KIing

                other data structures built from  TREC text (what?)
                    a file of document addresses and document lengths (f~)r cosine)
                    a. total ~lin()uflt of st()r~i~e (megabytes) 1.8 Ml)
                    b. total computer Ijine to build (approxilnate iiumber of hours) negligil)le

                other data structures built from  TREC text (what?) vocal)ulary for inverted index
                    a. total ~unount of stor~~~e (megabytes) 3.6 Ml)
                    b. toL~l computer tilne to build (approxilnate number of hours) 2.41 cpu hours
                    C. is the process completely automatic? yes
                    d. brief description of Ineth(xls used count stemmed w~~rd fre(luencies using splay tree

                other da~ structures built from TREC text (what?) a file of inverted index entry addr~sses
                    a. toLd ~unount of st()rŁ1~e (me(Tabytes) 1.2 Ml)
                    b. total computer tilne to build (approxilnate number of hours) negligil)le

                other data structures built from TREC text (what?)
                    a file of approximate document lengths
                    a. total (unount of storaLTe (megabytes) 0.2 Ml)
                    b. total computer tilne (0 build (approxilnate number of hours) negligihle

       C. Data built from sources other th(w the iuput text --no

II. Query construction
              (please fill out a section for each query construction method used)

       A. Automatically built queries (ad hoc)
              1. topic fields used all
              2. toL~ computer tilne to build query (cpu seconds) less than one second
              3. which of the ft)llowin(~ were used?
                    a. tenn weightin~ witli weights k~ed on tenns in topics yes, as in cosine measure
                    j. other (describe)
                           used stop words to eliminate comnion words from query
                           eliminated SGML tags and all punctuation

III. Searching

       A. Total computer tilne to scaich (cpu seconds)
              I & 2 were not timed separately; 35 seconds per query to identify the top 2(10 ranked items
              further 4.6 seconds of cpu decompress the top 200 items, 18.6 seconds in total including
              retrieval time
              1. retrieval time (total cpu seconds between when a query enters the system until a list of
                    document numbers ~ obt'~ined)
              2. rankin~ time (t()tal cpu seconds to sort d(~ument list)


                                                   488

      B. Which rneth()d~ be~t deNcnbe your rnŁichine se~ching me~()d~?
              1. vector ~pace m(KIel cosine measure

      C. What factors ~ue jucluded in y()UI railing?
              1. tenn frequency
              2. inver~e d(~urnent frequency
              9. d(}cuIl'ent length approximate document lengths were used to reduce memory re(1uirements

IV. What machine did ~()U conduct ~e TREC experilnent on? Sun SPARC 2
              How much RAM did it have? 128 Ml)
              What wa~ the clock rate of flie CPU? 25 MIP

V. Some system.~ ~`ire research prototypes and others ~ue commerciLd.
              To help compare ~ese systems:

              1. How much "s()ftwLire en(jineenn(T" went into the development of your system?
                     very little
              2.  Given appi-opriate resources, could y~~ur system he made to run f~~ster?  By how much
                     (estimate)?
                     procesSIng to rank Items can lIe 3(~-5(~% faster; retrieval and decompression of text
                     are currently Ilnilted by characteristics of tile disk and the UNIX operating system
              3. What features is your system missing that it would benefit by if it had them?
                     current transformation of topics into (1ueries is simple-minded
                     the database is static


                                           489

System Summary and Timing

CITRI, Royal Melbourne Institute ol' Technology
We are providing 2 rep~~rts oil the system. This is bec~'Luse we have tried experiments oil two very different systems,
and tested quite differeni hypotheses.

ProjecL' retrieval from a compressed daL~b~'~e using the CoSine measure aiid approximate representations of d(~ument
lengths

General Comments

The fimings should be the tilne to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This soinetilnes will be diff'icult, such as getting total time for d('cument indexing of huge
text sections, or m~ually building a k'iowledge base. Please do your best.


I.      Construction of indices, knowledge kises, ~`ind other data structures (please describe all data structures that
        your system needs for searching)

        A. Which of the f()lk)wing were used to build your data structures'!
               1. st()pword list
                      a. how many words in list'? 42(1
               2. is a controlled v(~abulary used'! n(~
               3. stemming
                      a. stalidard stemming ~`Llg(withms
                             which ones'! I~()vifls' 1968 algorithm
                      b. morphological (`~(`ilysis no
               4. tenn weighting tf.idf
               5. phrase discovery
                      a. what kind of phrase? Adjacent pairs
                      b. using statistical Ineth(ids yes
                      C. using syiltactic methods  n(~
               6. syntactic parsin~ no
               7. word sense disambiguation 110
               8. heuristic associat~ns no
               9. spelling checking (with manual con-ection) (lueries only
               10. spelling correction queries only
               11. proper noun identification ŁLIg()rithm no
               12. tokenizer (recognizes dates, phone numbers, common patterns) no
               13. are the mŁ'wually-indexed terms used'! they were not discarded
               14. other techniques used to build d,'ita structures (brief description)

        B. Statistics on data structures built from Tl~C text (please fill out each applicable section)
               2. n-grams, suffix arrays, si~~nature t'iles
                      a. total alnount of st()r'~L'C (me~abytes)
                             Data (compressed) 220m
                             Index 313m
                      b. total computer time to build (approxil nate number of hours) 23 hrs
                      c. brief description of methods used niulti-organisational signature FILE
                      d. is the process completely aut()m~-Itic? yes

        C. Data built from sourCes oilier th&.ui the input text --no


                                                  490

II. Query construction
             (please till out a section for e~tch query construction method used)
               A large numl)er of techniques were tried.

      A. Automatically built queries (ad hoc)
             1. topic fields used
                    Boolean queries were constructed from a variety of the topic tields. The (lueries
                    were then ranked p()ssil)Iy using ditTerent flelds.
             2. total computer time to build query (cpu seconds) -10
             3. which of the followiug were used?
                    a. term weightilig wi~ weights bŁLsed on tenns in topics
                    b. phrase extraction from topics
                    i. automatic addition of B(x)lean connectors or proximity operators

III. Searching

      A. Total computer tilne to search (cpu seconds)
             1. retrieval tilne (total cpu seconds between when a query enters die system until a list of
                    document numbers are obtained)
             2. ranking time (total cpu seconds to sort d(icument list)
             These operations (wcurred together. It t()()k 6 lirs to ol)tain a ranked list of 1,000 documents
             for each of the 50 queries.
      B. Which methods best describe your machine searching mediods?
             1. vector space m(XIel

      C. What factors are included in your ranking?
             1. tenn frequency
             2. inverse d(icument frequency
             7. proxilnity of terms
             9. docuinent lengtli
             15. other (specify)
                    The location of the term in the query. A variety of modeLs were tried.

IV. What machine did ~()~ conduct die TREC experiment (m? Sun SPARC 2
             How much RAM did it have? 128 Ml)
             What w~is the clock rate of tile CPU? 25 MIP

V. Some systems are research prototypes and others are commercial.
             To help compare diese systems:

             I. How much "software CIIL'iilCCflIit(Y" went into the development of your system?
                    The system is a rol)ust research tool. Limited eff(~rt has heen put into speed.
             2.  Given appropriate resources, could your system be made to run faster? By how much
                    (estimate)?
                    Consideral)Iy faster, hut we estimate it would twice ~s f~~t if we changed the
                    architecture (we use UNIX pipCs to communicate).     It Ls unclear what other
                    speed-ups can (~cur.
             3. What features is your system missing that it would benefit by if it had them?
                    All sorts of things would l)e nice! A go(~ form of transaction management would
                    lie the most useful t(J transform the system into a commercial product.


                                        491

System Summary and Timing

Australian Computing and Communications Institute

General Comments

The fimings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as gettin~ total time for document indexing of huge
text sections, or manually building a ~owledge base. Please do your best.


I.      Construction of indices, knowledge bases, and other data structures (please describe all data structures that
        your system needs for se(~ching)

The software does not invert tile text. It inverts the (lueries (or
through the c()ml)ined index formed from the (~ueries.

filters) and passes the text

II. Query construction
              (please fill out a section for each query construction method used)

       D. Automatically built queries (routin(~)
              1. topic fields used Concept field used
              2. total computer time to build query (cpu seconds) < 5 seconds
              3. which of the following were used in building the query?
                     a. terms selected from
                            (1) k)pic
                     b. tenn weighting
                            (3) with weights based on terms from documents with relevance judgments
                                  Terms weighted with weights hased (~n terms from documents with
                                  relevance judgments, and  dynamically m(xlified through the
                                  training set and the test set.
                     c. phrase extraction
                            (1) from topics
                     j. Łiutomatic addition of B()()leall connectors or proximity operators
                            (1) using inf()rmati()n from the topics

       E. Manually constructed queries (routin(~)
              1. topic fields used All topic fields used
              2. average tilne to build query (minutes) 30 minutes
              3. type of query builder
                     b. system expert
              4. data used fi)r building query
                     a. from training topic
              6. which of the fiAlowing were used?
                     b. Boolean comiectors (AND, OR, NOT)
                     c. proximity operators

III. Searching

       A. Total computer time to search (cpu seconds)
              One message through 200 filters per second. This includes searching and ranking.
              1. retrieval time (total cpu seconds between when a query enters the system until a list of
                     document numbers are obtained)


                                          492

              2. rŁinkin~ time (totil CPU ~ec()I1ds tO .`;ort d(~uIneI1t list)

       B. Which methods best describe ~()U~ niachiiie seirchiug methods?
              6. fuzzy loiTic (juclude your deilnition)
              10. other (describe) Software uses a fuzzy AND and Proximity measure to rank documents.

       C. What factors &`irC jucluded ill your rauking?
              1. tenn frequency
              5. position ill document
              7. proxilnity of terms

IV. What mach~ie did you conduct the TREC experilnent on?
              How much RAM did it h~ive?
              What wŁ~s the clock rate of die CPU?
              The experiments were run on an HP 486/33 with 8 MI)ytes under SCO UNIX. The CD ROM
              drive was accessed via NFS.
V.  Some systems are rese~irch prototypes and others ~`tre c()lnlnerciŁil.
              To help comp~ire these systems:

              1. How much "softw(lle en~i'ieeriIlg" went into the development of your system?
              2. Given ~Ippr()pri~~te resources, could your system be made (0 fun f~~ster?   By how much
                     (estimate)?
              3. What features is your syslein missing that i( would benerit by if it had them?-

              AMR   is c()mn~rciaI strength software developed hy Computer Power Group.               Its
              colilfllercialisati()n software engineering phase to(~k 5(~me three person-years.


                                          493

System Summary and Timing

Carnegie Mellon University

General Comments

The timings should be the time to replicate ruus from scratch, `lot including trial runs, etc. The times should also
be ttasonably accurate. This sometilnes will be difficult, such as getting tot~ time for document indexing of huge
text sections, or m~~iually building a knowledge base. Please do y~~ur best.


I.      Construction of indices, knowledge b~i~es, and other data structures (please describe all data structures that
        your system needs for seŁ'irc'1ing)

        A. Which of the following were used to build your data structures'?
               1. stopword list
                     No. But the NLP/m()rph()l()gical-analysis components of the system do limit the
                     possible lexical categories of SoniC English words to eliminate useless ambiguities.
                     For example, `9l)ut" is given lexical category "cnj" (conjunction) and not alternative,
                     possible categories, such as "sn" (5ingular.n()un); "can" is limited to category
                     "auxm" (nl(KIal.auxiliary.verl)) and not "sn91; etc. Such selective restrictions have
                     some (~f the effects (jf "stop-word" lists, since spurious (or irrelevant) categories will
                     not enter into later indexing stages.
                     Furthermore, the NLPlparsing components of the system return simplex noun
                     phrases (NI's) as candidate terms in which some components of the NP are
                     eliminated, such as (luantitiers (e.g., "many", "one", etc.), determiners (e.g., "the",
                     "a", etc.), and c(Jnjuncti()ns (e.g., "and", "or", etc.). In addition, in normal CLARIT
                     NP processing, the parser does not return prepositions, non-NP adverbs, and
                     extra-NP elements. Tli is practice, therefore, aLso has the effect of eliminating items
                     that normally appear on "stop-word" lists. It clearly goes beyond that practice in
                     eliminating all extra-NP words as well.
                     a. how many words in list'?
                            Approximately 1(~() lexical items have been given restrictive syntactic
                            treatment, ill addition t(~ the words with unambiguously empty categories.
               2. is a controlled v(~abul~'Lry used'? No
               3. stemming No
                     a. st~indŁ~d stelnining algorithms
                            which ones'?
                     b. Inorpholo~'ical alialysis
                            Yes. The Morph component of the system provides for comprehensive
                            inflectional.m()rph()l()gical analysis. In practice, the morph-i~ormal form of
                            nouns and adjectives is used in the NP-based terms of the system.
                            Participles are not morphologically reduced (though it is possible to do so).
                            Derivati()nal.m()rphol()gical analysis is not used. A lexicon of approximately
                            ~ `r()()t-f~)rnl' items (English words) is the principal resource used by
                            Morph in addition to its morphological rule set.
               4. tenn weightin"
                     Yes/No. The CLARiT process uses NLP to identify candidate terms in route to
                     indexing, development of ~ss()ciated resources (e.g., thesauri), and analysis of queries
                     or topics. These are taken as the `information units' of interest and are analyzed
                     statistically and heuristically. `Weights' are ass('ciated with terms at various stages
                     of pr(wessing. In indexing TREC documents, for example, an IDFfrF score was
                     associated with terius for each document. In the case of multi-word terms (the


                                          494

      norm), the terms are assigned IDF/TF scores, and each word in the term is broken
      out and assigned an independent IDF/TF score.
5. phrase discovery Yes
      a. what kind of phrase?
             Simplex noun phrases (= all moditiers and the head of the NP but no
             deternijners,  t1uantitiers,  or post~head~position m(~ifying phrases    or
             clauses).
      b. using statistical rneth(xis
             No. NPs retained fi)r thesaurus creation are scored using statistically-based
             measures ~)f expected `rarity' (based on component words), distribution,
             fre(1uency, and coverage.     But N1~s are not identified in texts based on
             statistical parsing, for example.
      C. using syiltactic methods
             Yes.     NI's are discovered using a parser that implements a `heuristic'
             grammar.     In particular, following word-for-word morphological analysis
             (resulting in a set of syntactic-category tags t~)r each word encountered in
             a text), the parser identities the sul)se(luences that form NI's. Identification
             of NI's is based on rules that perf~)rm NI'~b()undary-c()ndition tests.
6. syntactic p~irsing Yes (see above). A single-pass parser follows morphol()gical analysis.
7. word sense dis~biguati()n
      No. No attempt is made to control for word senses in morphological or syntactic
      analysis. As noted above, disambiguation of grammatical categories is facilitated by
      restricting possible categories for selective items. In addition, absolute preferences
      are established for grammatical categories appearing in n~~un phrases.
8. heuristic ~~~sociations
      a. short definition of these ~L~5ociations
             Yes. The principal relation the system currently uses is that of `similarity'
             of terms.    `Similarity' is determined by different procedures in different
             contexts. For example, partial or `fuzzy' matching of terim~ is facilitated by
             noting whether terms share words or attested sul)phrases. For example, in
             vector-space modeling of documents, the contained words of all terms (in the
             document vector as well as the query vector) are broken out, giving, in
             effect, the possibility ~ matching parts of terms (though, technically, the
             individual words are realized as independent dimensions of the space). In
             addition, in nominating terms for inclusion in thesauri and in matching
             terms to thesauri, CLARIT processing takes account of contained words and
             attested sul)phrases.
9. spelling checkin~ (with rn~'~nu~~l ColTection) No
10. spelling correction No
11. proper noun identification ~Ll~()ri~In
      YesINo. The system provides for identification ~ `candidate proper nouns' b~~ed
      on morphological analy %~~%  (F%sentially, since the morphological analysis is virtually
      exhaustive for English, words that cannot be mapped to specific lexical ite~s are
      given the provisional label "cpn"--'candidate proper noun'--and parsing proceeds
      accordingly.)   There is a facility in CLARIT for highly-reliable proper name
      (including acr(~nym) identification, but it was not used in this round of TREC
      processing.
12. tokenizer (recognizes dates, phone numbers, ColTilTIOll pattenis)
      a. which patteills ~ tokenized?
             Certain common abbreviations are included in the lexicon and, under
             morphol(~gical processing, are rendered into normalized forius. The system
             can     utilize--and  even  partially  discover--supplemental  lexicons  of
             domain-specific abbreviations and other phrasal-lexical patterns, but this
             facility was not used for TREC processing.


                                 495

      13. aje (he `n~rnu~~ly-indexed ienn~ u';ed?
             Yes/No. The manually-indexed terms were treated as additional text and processed
             (for NPs) along with the other sections (~f the topic statement. They `nay or many
             not have survived review; they were not given speCial treatment except as potential
             sources (~f NI's for the topic.
      14. o(1'er techniques used to build &L'ita structures (bijef description)
             The CLARIT system has facilities for the discovery of `first-order' thesauri (= a list
             of important and characteristic terms) over collections of documents. The techni(Jue
             re(Iuires that documents in the collection be from the same `domain' or `topic'
             (broadly conceived) and is relial)le only if the d('cument set is large enough (e.g.,
             minimally 2-3 megabytes).   TREC topics--even when supplemented by sets of
             relevant documents--fall far short of the minimal size re(luired, so general CLARIT
             thesaurus discovery could not be used in preparing topics ()~ to support the indexing
             of texts. However, (~ne effect of the CLARIT thesaurus-discover procedure is to
             rank terms in a c(~lecti()n based on their fre(~uency, distribution, and `rarity' scores.
             In preparing sets of terms to assist in partitioning the TREC corpus (to identify a
             subset (~f documents with the best candidates under any topic), we produced
             pseudo-thesauri for each topic by using CLARIT thesaurus-discovery modules. In
             particular, the pr(~cess produced a list of terms from the available topic-relevant
             documents (or from a small sample of relevant documents that we may have found)
             and automatically chose the top (approximately 2()%) ranked terms to supplement
             the  original (luery (as derived from the topic sta(ement)           to produce  a
             "r()utinglpartiti()ning thesaurus" t~)r the topic. (The use of this resource is described
             below.)
             Furthermore, in developuig extended (lueries for ~~ur final processing step (= a
             vector-space retrieval), we supplemented the original set of terms for the topic with
             *all* of the terms from the small set of top-ranking documents (as determined by
             routing/partitioning score) for each topic.

B. Statistics on data structures built from TREC text (please fill out each applicable section)
      4. special routing structures (what?)
             Yes. Each topic text was automatically analyzed by CLARIT to extract NI's. Terms
             nominated by parsing were reviewed by members (~f the CLARIT team for
             appropriateness (and retained ()~ eliminated) and given a weight of "1", "2", or "3"
             to (luantify relevance. Available topic-relevant d(wuments were processed for
             supplemental ternl% (each given a fi'actional weight, e.g., "0.3").  The combined
             list--terms from the topic text and terms from the topic-relevant documents--formed
             a "r()utinglpartitioning thesaurus" for the topic.
             Each TREC document was `scored' against the routing/partitioning thesaurus for
             each t(~pic. In particular, every NI' in each document was matched against the NI's
             (terms) in the routing thesaurus; partial matches were allowed; a formula yielded
             a composite score for the document based ()~ the number of exact and partial hits
             as a function (~f document length and term `value'.
             In the first round (first So topics) of processing, this approach was used to identify
             the highest-scoring 2()()() documents for each topic.
             a. total ~~~ount of storage (ine~abytes)
                      ().6 megabytes f~)r the merged 50 routing structures,         i.e., the 50
                      "r()utingipartiti()ning thesauri" for the So topics.
             b. total computer tilne to build (approximate number of hours)
                      S minutes of real time--exclusive of the preparatory time to parse, build a
                      simple index, find SOniC relevant documents, review them, and coml)ine them
                      into an input file.
             c. is the process completely automatic?
                      The manual review and weighting of terms from the topic statement took


                                     496

                    approximately 5 minutes per topic.      All additional steps were performed
                    automatically.
             d. brief descriptioll of methods used (See al)()Ve.)
      5. other d~ita structures buili from TREC text (what?)
             Each i'REC docunlelit had to l)e f(Jrmatted for CLARIT processing, hy making the
             uni(~ue text II) accessil)le to CLARIT as a special field and hy delimiting the
             heginning and end (~f each text in a tile.      Intermediate (hut unretained) files
             generated in CLARIT processing include a tile of the words in each text, in their
             original order, annotated with morphological categories. Other files contain the
             output of the parser, as a list of NPs in the order in which they occurred in each
             text. The parsed representation of the text was retained and used at all sul)se(Iuent
             steps of pr('cessing.
             a. total ~Lin()unt of storige (megabytes)
                    Processing steps are piped through the system; intermediate files are not
                    retained. The parsed representation of all the texts takes up appr(~imately
                    98% of the space occupied hy the original text.
             b. total computer tilne to build (~Ipproxilnate number of hours)
                    The total time to transform the original 2-gigahytes of text into parsed text
                    takes ahout 10 real hours, with processing distrihuted over 5 machines.
             C. is the pr('cess completely aut()In~1tic? Yes
             d. brief description of methods used
                    A `lex' pr~~gram was used to reformat the TREC text to CLARIT format.
                    The English m(Irph()l(Jgical analyzer is written in C, and utilizes the lexicon
                    of 97,000 items (mentioned ahove and further descrihed helow).
                    The n(~un phrase parser, also written in C, uses the grammatical categories
                    supplied I)y the m(~rph()l()gical analysis and an ATN-style rule set to extract
                    n~~un phrases.

C. Data built from sources other th(~ ~e iliput (ext
      1. inte~('41ly-built auxili~uy tiles
             a. domaili independeut or dolnaul specific (if two sep~Lrate files, please till outone set
                    of questions for e~~ch tile) Domain independent
             b. type of file (thesaurus, knowledge ~ lexicon, etc.)
             c. total ~un()unt of st()r~ge (megabytes)
                    CLARiT Lexicon (2 megahytes)
                    English -word         statistics derived from the G rolier's Encycl(~pedia (2
                    megahytes)
             d. total number of concepts represented
                    97,000 words (CLARIT Lexicon)
                    139,(X~~ words ((;r()lier's list)
             e. type of representatioli (trwnes, semantic nets, rules, etc.)
                    Lexicon: A sorted word list, giving for each word its possihle grammatical
                    categories and category-dependent normalization.
                    (;r()lier's: A list of words with distribution and frequency counts
             f. tot~-tl computer tilne to build (approxu nate number of hours)
                    (1) if already built, how much tilne to modify ft)r TREC?
                             Already huilt--Not modified for TREC
             g. total matiual tilne to build (approximate number of hours)
                    (1) if aheady built, how much time to modify for TREC?
                             Already huilt--Not modified fi)r TREC
             Ii. use of `nanu~ll labor
                    (1) mostly `nanuŁ~ly built usin~' speci~~ interface
                    (2) mostly machiuc built wi~ manu~~ con-ection
                    (3) initi~d core m~mu~-illy built to "bootstrap" for completely machine-built


                                           497

                                   completion
                            (4) ()~C~ (describe)
                                   Initially denved from on-line sources but substantially modified and
                                   maintained manually
             2. extenially-built auxili~~y file
                    a. type of file (Treebank, WordNet, etc.) None
                    b. toL~l aifloUnt of storage (megabytes)
                    C. to~~ number of concepts represented
                    d. type of representation (fr~unes, selnailtic nets, rules, etc.)


II. Query Construction
             (please fill oUt a section for each query construction method used)

      B. Manually constructed queries (ad h()(:)
             N(~te, as descnbed below, there were only two steps in the CLARIT process that re(luired
             non-automatic  pr(wes.%ing:  (I) initial review and weighting of  the index  terms
             aut()matically-n(Jminated and derived f~~r the topic and (2) review of 1st-pass retrieved
             documents to identify 5-I(~ relevant OneS for "feedback".

             1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <del>
             2. average tilne to build query (minutes)
                    5 minutes--average time to review & weight automatically-nominated terms
             3. type of query builder Graduate students
             4. tools used to build query
                    c. other lexical tools (identify)
                            CLARIT noun-phrase parsing (extraction) nominated query terms from the
                            textual descriptions of topk's.
             5. which of the following were used?
                    a. terin weighting
                            Yes. Graduate students weighted terms with weights of "3", "2", or "1",
                            according to whether the extracted terin was central or peripheral to the
                            topic. (Sonic extracted noun phrases were discarded as irrelevant or
                            ill-formed; the vast majority were retained.)
                    C. proxitnity operators
                            No. Though proximity plays an implicit role when noun phrases are used as
                            terms.
                    d. addition of terins not jucluded ill topic
                            (1) source of terms Not in the first round of routing
                    C. other (describe)
                            The ad hoc queries for the second fifty topi~s were formed in three stages.
                            The first stage was the construction of a topic-derived routingipartitioning
                            thesaurus.
                            The routingipartitioning thesaurus was generated by CLARIT from the
                            method described al)()ve, using only text fields of the topics. The
                            automatically derived noun phrases were hand-weighted by graduate
                            students with weights (~f "3", "2", or "1", according to whether the
                            extracted term was central or peripheral to the topic. Some extraneous
                            terms were deleted.
                            The  routingipartitioning   thesaurus was  passed  over the  parsed
                            representation of original 1.2 gigabyte training set, inducing a ranking of all
                            ~ documents using a scoring method taking account of exact and
                            partial matches and document length. The top 5(~ documents were retained,
                            for the next stage. These documents were manually judged by graduate


                                            498

                     students, st~rting from the highest scored downward until 5-lo relevant
                     documents were found. In etfect, this represented a "relevance-feedl)ack"
                     step in the retrieval pr~~ess.
                     In the next stage, the 5-I(~ "relevant" d(K:uments were used to produce a
                     CLARIT-derived pseudo-thesaurus f~)r the topic. (As descril)ed ahove, this
                     Consists of a list of prominent terms in the collection of documents, h~sed
                     on frequency, distril)uti()n, and "rarity" scores.) To this thesaurus were
                     added the ternis retained from the hand-weighting of the original topics.
                     This thesaurus fi)rmed the second routing/partitioning thesaurus. The entire
                     2-gigahyte   TREC    collection was  rescored  against this  second
                     routingipartitioni ng thesaurus and the highest ranking 2(~(~(~ documents were
                     selected fi)r the final-query stage.
                     The third, ()~ final-query, stage involved, first, calculating an IDF/TF score
                     fi)r each term and all term-contained words in the 2(X~)-document set for
                     the topic.  The query for that topic was created l)y taking the IDF/TF
                     weightings ~ the ternis from the originally chosen 5-1(~ relevant documents
                     and automatically forming a query l)y coml)ining all these terms along with
                     the topic-derived  terms  into a long  query  vector. A vector-space
                     representation (~f the 2EX)(~ documents was generated; the query vector was
                     used to identify the final set of 2()() ranked documents for each topic l)ased
                     oil cosine similarity measures.

D. Automatic~'ylly built queries (routing)
       1. topic fields used <title>, <desc>, <narr>, <con>, <fac>, <det~
       2. total coinpuler tilne to build query (cpu secoilds) (~.()3 cpu seconds
       3. which of the fi)llowing were used iii building ~e query'?
              a. terms selected from
                     (1) topic
                     (3) only documents with relevuice j udgineilts
              b. tenn weighting
                     (1) with weights based oil terms in topics
                            Yes. Topic terms were initially hand weighted.
              c. phr~'ise extraction
                     (1) from topics
                     (3) from d(icuinents with relev~'uice judgments
              d. syntactic pŁirsing
                     (1) of topics
                     (2) of ~`ill irLining documents
                     (3) of documents wi~ relevance judgments
              g. tokenizer (rec()L~nizes d~~tes, phone numbers, CoilliflOil pattenis)
                     (1) which patterns ~`ut tokenized'?
                            Only simple acronyms such     as "I.B.M."
                            recognized as a unit.
                          description)
                            The routing queries were fi)rmed in two stages.
                            The first stage was the construction of a routingipartitioning

                                                                were automatically

              k. other (brief

                          thesaurus.
                          The routing/partitioning thesaurus was generated l)y CLARIT from
                          the supplied list of relevant documents per topic. The text of the
                          topic fields was parsed and added to the pseudo~thesaurus derived
                          from the relevant d(wuments. (Each pseudo~thesaurus consists of
                          automatically  chosen noun  phrases scoring ahove a certain
                          threshold, when scored fi~r rarity, distrihution, and frequency in the
                          relevant document set.)   Partial noun phrases, derived from


                                  499

                                 thesaurus entries, and attested in the documents, were also added
                                 to the thesaurus with a partial score.
                                 The r()utingipartiti()ning thesaurus was passed over the parsed
                                 representation of 1.2-gigal)yte training set, inducing a ranking of all
                                 5(~(),(~()(~ docunleilts. The top 2(~(X) documents were retained for the
                                 next stage.
                                 The next stage of construction of each topic's routing/partitioning
                                 query l)egafl l)y calculating the IDF![F score of all the terms and
                                 their contained words in the 2(K~(~ retained documents for that topic.
                                 The IDF[1~F-weighted terms fi~()m the 5 relevant documents that
                                 were ranked highest in the previous stage were added to the
                                 original hand-weighted query terms, forming the final query.
                                 For the second 900-megal)yte data set, the routingipartitioning
                                 thesaurus developed in the first stage of processing (~q descrihed
                                 al)()ve) was used to select the 2000 highest-ranked documents.
                                 The final query pr(KIuced in the second stage (al)ove) was used as
                                 a vector-space query (with partial matching) over the 2000
                                 documents to produce a tinal set of 2(J(~ ranked documents for each
                                 topic.

III. Searching

      A. Tot~il computer tilile to search (cpu seconds)
             1. retrieval time (total CPU seconds between when a query enters the system until a list of
                    document numbers (`trC ()bL~ined)
                    The final set (jf 2()()(J documents for each topic was collected l)y the use of the
                    r()utinglpartiti()ning thesaurus  (descril)ed al)ove).  This  process was  done
                    simultaneously for all queries and took al)()ut 6 hours f~~r the complete corpus.
             2. rŁmking time (t()tŁ'Ll cpu seconds to sort document list)
                    Once the vector-space matrix for the final set of 2E)(~(~ documents was constructed,
                    the actual comparison of the query vect(~r to all other vectors in the matrix took on
                    the order of 1()-2(J seconds.

      B. Which methods best describe your mŁ'ichine searching methods?
             1. vector space m(xlel
                    Yes. Using whole and partial matching on IDFITF-weighted terms.

      C. What flictors are included in your rai~ing?
             1. tenn frequency
             2. inverse d(~UmCnt frequency
             3. other term weights (where do they come from?)
                    Topic terms were given additional factors of "1", "2", or "3".
             7. proxilility of terms
                    Parts of noun phrases are close. Our partial matching of n~~un phrases implicitly
                    includes proximity.
             9. docwnent length

IV. What machine did you conduct the TREC experilnent on'?
             How much RAM did it have?
             What wŁ~~ the clock rate of the CPU'?
                    Total availalile machines, used variously:

                           I DECstati()fl 582(~ (64-Meg RAM)
                           2 DECstati()Ii 5(X)() (32-Meg RAM)


                                              500

                            1 DECstati()n 5000 (24-Meg RAM)
                            3 DECstati()n 3100 (24-Meg RAM)

V. Some systems are research prototypes Lnd others `uc c()InInerci~'d.
              To help compare ~ese systems:

              1. how much `s()ftw'Łue engineerin'~" went into the development of your system?
                     The CLARIT system is a research prototype and has Ileen under development for
                     4 years. The original system was implemented in Lisp; the current system has 1)eefl
                     re-engineered into C in the past 12 monthS.
                     The specific configuration of the system used in the TREC experiments was
                     produced in less than a week.
                     As a research prototype, tile system has minimal true "software engineering".
              2.  Cuven appropri~'ite rcs~)urces, could your system be made to ilin f~Lster?  By how much
                     (estiIn~'~te)?
                     Size constraints and the lack of gl(~I)al methods ~ attack caused us to duplicate work
                     (I)()tll human and computer). (;l()1)al methods that are smarter aI)out resource
                     CoilSuIliptioll could make an order of magnitude difference. Almost all CLARIT
                     processing is modular and separal)le; results of pr(~cesses are additivelcomposal)le.
                     Splitting the pr(~ess across machines--or running in parallel--would greatly speed
                     up the system.
              3. What features is your system missing that it would benefit by if it had them?
                     User interface.
                     Some datahase mechanism for document storage.
                     Potential "next features" include the f()ll()wing:
                            - automatic spelling correction
                            - integrated pr(~per noun recognition
                            - programmaille token recognition
                            - progranimaille I automated category assignment (guessing)
                            - pr()grammal)le I automated d~)cument structure analysis
                            - automated syn~~nym I related word discovery and use
                            - datahase support tor domains and thesauri, contexts, etc.
                            - an integrated interface for 1)0th datal)ase construction and (Juery
                            elaboration


                                          501

System Summary and Timing

ConQuest Software, Inc.

General Comments

The timings should be the time to replicate runs from scratch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be dimcult, such as getting total time for document indexing of huge
text sections, or mailually building a lalowledge base. Please do your best.

I.      Construction of indices, kuowledge bŁ~';es, and other data structures (ple~~se describe
        your system needs for seŁ~ching)

all data structures that

       A. Which of the following were used to build your data structures?
              1. stopword list yes
                     a. how many words in list? 70
              2. is a controlled v(~abul~lry used? no
              3. stelnining
                     a. stŁ..uid~ud stemming ~dg()ritl)nlN 110
                     b. In()rph()lo('ical ŁulLdysis yes
              4. (Cflfl weighting yes
              5. phrase discovery yes
                     a. what kind of phr~~e? I)araphr~se of Query
                     b. usilig statistical meth(XIs Statistical proximity match
                     c. using syiltactic methods Limited
              6. syntactic parsing Linilted--PoS assignment
              7. word sense disainbiguation In query hy user, & in explosion of terms
              8. heuristic associations yes
                     a. short definition of these associations Terms associated via semantic net
              9. spelling checkin(2 (with manual correction) In query only
              10. spelling correction no
              11. proper noun identification ~~dg()rithIn If identitied l)y lexicon
              12. tokenizer (recognizes dates, phone numbers, common pattenis)
                     a. which pattenis are tokenized? Many
              13. are the m~'~ually-indexed terins used? no
              14. other techniques used to build d~ta structures (brief description)
                     Index organized hierarchically so that best documents (based on a coarse grained
                     ranking algorithm) are returned to user while search continues on very large
                     databases. Linked lists are used to connect and identify idioms. Semantic network
                     term explosion is c(~ntr()lIed by "weighted" links where weights are selected as either
                     numerical or fuzzy sets based upon the link source and relatio~ship.

       B. Statistics on d~ta structures built from TREC text (please till oUt each applicable section)
              1. inverted index
                     a. total ~unount of stonige (me&iabytes) 1.2 Gb for 2.3 Gb text, 52%
                     b. total computer tune to build (approximate number of hours) 150
                     c. is the pr&~ess completely automatic? yes
                            if not, appmximately how many hours of manual labor? Setup--4 hours
                     d. are term positions within d(icuments stored? yes
                     C. single tenils only? no
              3. knowledge bases
                     a. total ainount of storage (meLYabytes) 12 Mbytes


                                         502

                      b. t()~l number of c()1lcepts represented 25(),()(N~+ c(~ncepts, 1.5M links
                      C. type of representLIti()Il (fr~unes, semaxitic Ilets, rules, etc.) Weighted semantic network
                      d. total computer time to build (appr()x~ate iiumber of hours)      (~, already had it
                      e. total muiu~'d time to build (approximate uuinber of hours) (~
                      f. use of manual latx)r
                             (2) mostly m('lchine built with manuŁ'il correction
                                        yes--I)ut prior to TREC, hot DB specitic
                      g. auxili~iry tiles needed for machine use
                             (1) `nŁichine-readable diction~'iry (which one?)   ~erriaIn Wel)ster (al)ridged)
                             (2) other (identify)      Word Net, plus several thesaurus tiles

       C. Data built from 5OurCC5 other th'ui ~e iliput text  See 3(g) al)ove
              1.  inteni~illy-built auxili~'iry files Semantic Netw(Jrk
                      a. do'n~un independent or domain specific (if two sep~irate files, please fill out one set
                             of questions for each file)
                      b. type of file (thesaurus, knowledge bŁ~~e, lexicon, etc.) All in one
                      C. total Łlln()unt of stora~e (Ine~Tabytes) 12
                      d. total number of concepts represented 25(~,()(~(~+
                      C. type of represenL~ti()n (fi-unes, semantic nets, rules, etc.) Semantic net
                      f. t()tŁ'Ll computer tilne to build (approxilnate number of hours)  Already had
                             (1) if Ł`itready built, how much time to modify for TREC?       None
                      g. total m'~u~tl time to build (approximate number of hours)      Already had
                             (1) if ah-eŁ'Ldy built, how much tune to modify for TREC.!      None
                      h. use of manual labor
                             (2) mostly machine built with manu~'il correction

II. Query constructioll
              (please till out a section for each query c()nstl~ucti()n method used)

       A. AutoinaticŁilly built queries (ad hoc)
              1. topic fields used  Used entire topic with s(jme simple tiltering
              2. total computer tilne to build query (cpu seconds)    unknown, est. < (~.1 sec. ea.
              3. which of tlie following were used?
                      a. term weighting with weights bL'L~Cd on teflns in topics
                      b. phrase extraction from topics
                      C. syntactic pusing of topics
                      d. word sense disLnnbiguL~i()n
                      C. proper IIOUII identific~ti()n algorithm (look up)
                      f. tokenizer (reco(2nizes ~ites, phone numbers, coliuflon pattenis)
                             (1) which pattems (`LrC tokenized?      many
                      h. expailsion of queries Usin(T previously-c()nstructed ~ita structure (from part I)
                             (1) which structure?        Tapered wind(~w

       B. Manually constructed queries (ad h(ic)
              1. topic fields used  User judgment
              2. aver~ige tune to build query (minutes)    1-5 minutes
              3. type of query builder
                      b. computer system expert
              4. tools used to build query
                      a. word frequency list        yes
                      b. knowledge base browser (knowledge base described in part I)       yes
                             (1) which structure from pail I
                      c. other lexical tools (identify)   Lexicon
              5. which of the following were used?


                                                      503

                     a. term weighting yes
                     b. Boolean Connectors (AND, OR, NOT) Availal)le. Not used.
                     C. proxilnity operators Automatic
                     d. addition of tenus not jiicluded in topic yes--I)ased on user judgment
                            (1) 5()U~CC of tenfis
                     e. other (describe)

       C. Feedback (ad hoc) AvailaI)Ie. Not 5U1)Illltted in TREC.

       D. AutolnaticUly built quenes (routing) Av~ liable. Not Submitted in TREC.

       E. Manually constructed quenes (I'()utin(') Available. Not 5U1)Illltted in TREC.

III. Searcililig

       A. Tot~~ computer tilne to search (cpu seconds) 2-10 seconds, dep. on (juery
               1. reLlieval t~e (tot~ CPU seconds between when a query enters ~e system until a list of
                     document numbers LUC obtained) see above
               2. ranking time (t()t11 CPU seconds to sort d('cument list) Included in number above

       B. Which methods best describe ~()U~ machine se~tiching me~()ds?
               1. vector space m(xiel Some teCllIlI(1Ue5 used
               2. probabilistic model Some probability used in ranking
               5. Boolean m~~ching Available. Not used in TREC.
               6. fuzzy logic (include y()Lll defluition) Fuzzy semantic net links used in term explosion.
               8. neural networks No--See 6
               9. conceptual graph matching Yes--query concept created by explosion

       C. What factors are included in yow- ranking?
               1. tenn fi-equency
               2. inverse d(~ument fiequency Available. Not used in TREC.
               3. other term weights (where do they COIflC from'?) Manual
               4. sem('~tic Closeness (LL~ in sein~tntic net distance) yes
               5. position in document Available. Not used In TREC.
               6. syntactic clues (state how) Availal)le. Not used in TREC.
               7. proximity of terms
               9. document lengtli
               10. completeness (what (;/,, of the query terms are present)
               15. other (specify) User cII()()ses--pr()grammable

IV. What machine did YOU conduct ~e TREC experilnent on? Sun SPARC II
               How much RAM did it have? 64 Mbytes
               What wa-s the clock rate of ~e CPIJ? 50 MHz

V.  Some systems are research prototypes and others are commercial.
               To help compare fliese systems:

               1. How much "software en(~~ineering" went into the development of your system?
                     The underlying "engine" used f~)r TREC is also used in a commercial product
                     (C()nQuest)--llence, lots of SIW engineering is behind it.
               2. 6iven appropriate resources, could your system be made to ruii fŁ~ter?  By how much
                     (estimate)? Yes--at least a factor (~f 2
               3. What features is your system missing that it would benefit by if it had them?
                     Subject domain add-on dictionary.


                                               504

System Summary and Timing

GE Research and Development Center

General Conuneni~

The timings should be the time to replicate runs from scrŁ'itch, not including trial runs, etc. The times should also
be reasonably accurate. This sometimes will be difficult, such as getting total time ft)r document indexing of huge
text sections, or inailually buildinQ a k'iowledge base. Please do your best.


I.     Construction of indices, knowledge bases, and oIlier data structures (please describe all data structures that
       your system needs for se~ircliing)
              We did rn~ pre-indexing of the data

       B. Statistics on ~ita structures built from TREC text no data provided

       C. Data built from sources other tliaii the input text --no

II. Query construction
              (please fill out ~`i section for each query construction method used)

       B. Manually constructed queries (ad 11(X)
              1. topic fields used Mostly description, narrative, and concepts
              2. averŁ'Ige time to build query (minules) Al)out 2(~ minutes fi)r initial query
              3. type of query builder
                     b. computer system  expert
              4. tools used to build query
                     b. knowledge base browser (knowledge bise described in part I)
                            (1) which structure from part I inverted samples of corpus
              5. which of the f()llowino were used?
                     b. B(x)lean connectors (AND, OR, NOT)
                     c. proximity operators
                     d. addition of telins not included in topic
                            (1) source of terms
                                      system lexicon, statistical analysis of samples matched l)y initial
                                      queries

       C. Feedback (ad hoc) We did not do feedliack, hut we did query refinement

       E. Manually constructed queries (r()utin~) Ad hoc and routing were d(~ne using the same method
              1. topic fields used
              2. average time to build query (minutes) Ahout 2(~ minutes for initial query
              3. type of query builder
                     b. system expert
              4. data used fi)r building query
                     b. from all trŁuning d(Xuments statistical analysis of samples retrieved
                     c. from documents with relevance judgments
                            used for training, testing, and word frequency analysis
              5. tools used to build query
                     d. machine aiialysis of ~iining documents
                            (I) describe
                                      Word weighting analysis t6 determine what terms to add to queries


                                               505

              6. which of the f()ll()wifl(' were u.~ed?
                      b. Booleall C()flIleCtor~ (AN[), OR, NOT)
                      C. proxilnity oper~I(()r.~
                      C. other (bnef de~Cripti()fl)
                            system lexicon, statistical analysis of samples matched by initial (iueries

III. Searching

       A. Total computer tilne to NeLUCh (Cpu NeC()fldN)
              1. retiieval tilne (totil CPU NecondN between when a query enters the system Until a list of
                      document numbers tie obLimed)
                      AI)out 2(~ h(~uI's = 72l)(~) CPU seconds. As the documents are not pre-indexed, this
                      includes all operations oil all documents
              2. ranking time (tolil C~U seconds 10 sort d(X'ulnent list) Al)()ut 3()() CPU seconds

       B. Which methods best descnbe Y()U~ machine searching methods?
              5. Boolean matching
              7. free text sCan'lin(T

       C. What faCt()rs are inCluded in your ri'iing?
              5. position in doCument
              15. other (specify) Numl)er (~f hiL%' on topic description

IV. What maChine did you ConduCt the TREC experilnent on'? SUN SPARCstation-2
              [low muCh RAM did it have? 48 Meg
              What wŁ~~ the clock rate of the (?PU? standard

V. Some systems are reseŁu~ch prototypes and others ~u.e C()mmerciŁ'il.
              To llelp COIn~&UC these systems:
              Our system used a pattern matcher and lexicon that have l)een commercially developed, but
              the basic Boolean document processing engine was developed for TREC in a few days
              1. How much "software en~ineering" went into the development of your system?
                      2 days for the l)asic engine
              2.  Given appr()pflate resources, could your system be made to run t~L~ter?   By how much
                      (estimate)?
                      Processing time per document could easily be improved by a factor of 2. Processing
                      time f(Jr ad hoc retrieval could be impr(Jved by a factor of about 1(K~OOO by using
                      an inverted indexing strategy, at a cost of additional storage and indexing time for
                      the corpus.
              3. What features is your system missing that it would benefit by if it had them?
                      Automatic query  generati~~n, aids f~~r compiling queries from         higher-level
                      descriptions


                                           506

System Summary and Timing

TRW

General Coininenis

The fimings should be tlie tilne to replicate runs from scratch, not including tnal runs, etc. Tlie tilnes should also
be reasonably accurate. This soinetilflCs will be difticult, such as gettin~ total time lor document indexing of huge
text sections, or manually building a k'iowledge base. Please do your best.


I.     Construction of indices, knowledge bases, and other da~ structures (please describe all da~ structures that
       your system needs for se~chi'ig)

       A. Which of ~e following were used to build your da~i structures?
              None--we used a free text scanning approach. CD-ROM data was decompressed and loaded
              onto niagnetic disk in raw form.

       B. Statistics on d~ta structures built from     TREC text (please fill out each applicable sectk)n) --none

       C. Data built from sources other th'ui ~c input text --none

II. Query construction
              (please fill out a section for each query construction method used)

       A. Automatically built queries (ad hoc)
              We performed some initial trials with 1)uilding queries l)ased on word frequency taken from
              documents from the initial relevance judgments supplied l)y NIST.      Unfortunately, this
              appeared to lead us down a l)lind alley, perhapS l)ecause the initial judgments were not all
              that good. We are planning to try this again with the new judgments.

       B. Manually constructed queries (Ł~ h(~) We primarily used this niethod f~)r the TREC queries.
              1. topic fields used
              2. average t~e to build query (minutes)
                        The initial query would take a couple of nimutes to manually form l)y cutting and
                        pasting from the topic descriptionS with a text editor. "Feedliackt1 was the human
                        looking at the retrieved documents, comparing with the sample good documents
                        supplied l)y NIST, making independent judgments on document relevance, and
                        retining the query in an iterative manner.
              3. type of query builder
                        b. computer system expert
              4. tools used to build query Ilo special tooLs
                        a. word frequelicy list
                        b. knowledge base browser (knowledge base described in part I)
                               (1) which structure from part I
                        c. other lexical tools (identify)
              5. which of the f~)llowinLT were used?
                        b. Boolean connectors (AND, OR, NoT)
                        c. proxilnity operators
                        d. addition of terms not included in topic
                               (1) source of terms
                                     Additional terms were supplied l)y human hased on outside
                                     knowledge or from reading the text.


                                                       507

      C. Feedback ~ hoc)
             1. iflitiLil query built by me~()d 1 or Ineth(ld 2?
                     Initial (1UCI-ie.~ were 1)uilt by human fl~()m sul)set of topic keywords.
             2. type of per.~on doing leedh.ick
                    b. .~y.~te'n expert computer .%ystenl analyst
             3. (Lver~Ige time to do complete feedbtek We did this manually.
                    a. cpu tilne (total cpu .`;ec()nd.~ for all iterations)
                          A human refining tlie (lueries fl)r an hour might use 1(~ minutes of FDF
                          time.
                    b. cl(lck time floin initifi construction of query to completion of final query (minutes)
                          Feedl)ackl(1uery refinement was done manually.          Some topics were fairly
                          easy, with reasonable results being achieved in less than an hour. Others,
                          took several hours.
             4. average number of iterations
                    a. average nuinber of documents exatnined per iteration Typically 2()-3().
             5. minimum number of iterItions    ~1aybe 1(~.
             6. m~ixiinum number of iterations  ~Iaybe 1()().
             7. what determines the end of in iteration?
                    Each iteration is (1) the human updates the (lueries, (2) the machine executes, (3) the
                    Ii uman reviews the retrieved documents.
                    We stopped working oil a topic when it seemed that the results were converging to
                    practical limit for our approach, i.e., adding additional synonym keywords, or
                    changing the (~uery structure, wasn't g()iOg to produce more reasonable results.
             8. feedback me~()ds used
                    d. m~~ual methods
                          (1) using individual judgment with 110 Set ~ilgon~m
                                  After working through the first dozen (jr so topics, we started to fall
                                  into a semi-routine. We are still thinking about the nature of this
                                  "routine" and what types of tools could help automate it.

      E. Manually c()nstructed queries (r()utinL')
             Same answers as fl)r ad h(Jc. If fact, given our query language approach, final ad hoc queries
             and r(~uting queries are the same.

III. Searching

      A. Total computer t~e to search (cpu seconds)
             1. retrieval tilne (toLil cpu seconds between when a query enters Ilie system until a list of
                    document numbers are obLimed)
                    Time to process a single set of topic queries against 1.2GB is 2-3 minutes.

                    Time to load the tipster corpus (read from CD-RoM, decompress, and load onto
                    FDF's disk) was less than 8 hours.
             2. ranking time (total cpu seconds to sort d(lcument list) 1-2 seconds

      B. Which methods best describe your machine searching me~()ds?
             7. flee text scanning
                    T(J perf(~rm the actual searches, we used the fast data finder (FDF) text search
                    hardware.  The FDF implements a wide variety of pattern matching functions
                    including w()rdLstringlphrase matching, fuzzy matches, Boolean logic, proximity
                    operators, term counting, term completeness, and numeric ranging.
      C. VVhat factors are included in your ranking?
             5. positi()n in document
             7. proxilnity of terms


                                                508

              9. document lengtli
              10. c()Inpleteness (what ~ of the query ter'n~ Łu-e pre.~ent)
              12. word specificity (i.e., ~`wiinal vs. dog vs. p()()dle)
                     To provide a c(~rse-grain ranking we ran several (lueries per topic, to provide
                     increasing levels (~f recall. The five methods al)()ve were used in addition to Boolean
                     logic, numeric ranging, and word order.

IV. What machine did you conduct ~e Tl~C expelilflellt on? Sun-3/16() with FDF2E)~)() and C-51 disk array
              How much RAM did it have? 8 NIB
              What w~~s the clock rate of ~e CPU?
                     A Sun-3116() is a couple of Mips. N(~te that the Sun is just the host, the FDF does
                     the actual pattern matching. The FDF2()()() model used for TREC clocks at around
                     12 MHz.

V. Some Systems ~ue rese~uch prototypes and others ~`u-e co'n'nerci~'d.
              To help comp~ire ~ese systems:

              1. How much `s()ftwL'ue cllginecnng" went ilito the development of your system?
                     No special programming was done for the TREC conference. The FDF system iLself
                     was the result of extensive pri(~r development.
              2.  Given appropriate resources, could your system be made to run f~L';ter?  By how much
                     (estimate)?
                     How fast would y(~u like it to go? The system used to execute the TREC (~ueries was
                     2()%  (~f a full-up system. We're currently working on software that will
                     automatically c(~E)r(linate multiple FDF systems to working in parallel. We're aLso
                     considering faster FDF chips and data transfer methods.
              3. VVhat features is y~~ur system missing th~it it would benefit by if it had them?
                     The next generation (~f FI)F systems have an in-hardware term weighting capahility
                     that can l)e used, in c()nll)inati()Il with the existing features, to return a numeric score
                     for a document. This wouki allow f~)r finer grain in ranking. New model prototypes
                     were not availal)le for this eff~,rt.


                                           509

System Summary and Timing

vPI & sU

General Coininenis

The timings should be the tilne to replicate runs from      scratch, not including trial run 5, etc. The t~es should also
be re~'tsonably accurate. This soifletitnes Will be dimcult, such Ł~ getting total time ft)r document indexing of huge
text sections, or `naliu~'tlly building a k'iowledge b~'Lse. Please do your best.


I.      Construction of indices, knoWledge bises, and other data structures (ple~~se describe all data structures that
        your system needs for se~'u'chi'ig)

        A. Which of ~e folloWing were used to build y~~ur data structures'!
               1. st()pW()rd list  yes
                      ~I. 110W In my Words in list'? 41 ~
               2. is LI controlled v(~('LbulL'n'y used?     11(~
               3. stCIfl111int~ ~
               4. teilli Weighting
                      Vector and p-n(~rni runs were d~~ne with n(~ term weights. Vector runs were aLso
                      perf~)rIned with aug~n(~rn1 * idf weighting.
               5. phrase discovery   no
               6. syntactic p('~sing 110
               7. Word sense dis~~nbiguati()n      11(~
               8. heuristic ~L~s()ciati()ns Ilo
               9. spelling checking (With manual collection)         no
               lo. spelling correction    110
               11. proper IIOUll identificŁ'Iti()n Ł`Ilg()ri~ln As pr(Jvided in SMART
               12. tokenizer (recogil izes dItes, phone numbers, C()I~()11 patterils) As provided in SMART
               13. Ł`ire the in~'uiuŁ'dly-indexed terms used'!    not used as suggested in guidelines
               14. other techniques used to build d~It~'I structures (brief descuption)
                      1983 VerSIon of SMART, enhanced with VPI&SU routines

        B. SL~tistics on ~Ita structures built fR)In TREC text (ple~'Ise fill out each applicable section)
               Except it you want Us to answer under 4 here re the knowledge l)ase used to help Iluild our
               Boolean (lueries, please advise.
               5. other ~It~'I structures built from TREC text (what?) Document vector tile and term dictionary
                      LI. toLIl Ł`u'1()unt of storige (IllegIbytes)
                                   Approx. 15Nil~ t~)r the dictionary and 121 l,IB for the Document vector file
                                   for the entire ~Vall Street journal collection.
                      b. told computer tune to build (`Ippr()xiln~'Ite number of hours)
                                   Approx. time t(~ build above lo hours (~n ccrdl (DECstation 5(~N~ Model 25,
                                   i.e., a MIPS R3(~()(~ chip running at 2SMHz)
                      C. is the pr('cess completely automatic'! yes
                      d. brief description of methods used
                                   The document text is tokenized, stop words are thrown out, and non-noise
                                   words are kept in the term dictionary along with its occurrence frequency.
                                   Each term ill the dictionary has a unique identitication numller. The vector
                                   tile contains for each document its unique ID, and a vector of term ID and
                                   weights for the term. The weighting scheme is flexulle and can l)e changed
                                   to ()flC (~f several schemes after the indexing is complete. (If necessary we
                                   can till in details here. Please advise.)


                                                            510

      C. Data built from  sources other thŁui ~e input text --no

II. Query construction
              (please fill out Ł`t section for cich query construction method used)

      A. Automatically built queries (~id hoc)
              I. topic fields used Description, Narrative, and Concepts.
              2. total computer tilne to build query (cpu seconds) Vector queries--5(~ Seconds for ~ topics
              3. which of the following were used?
                        a. term weighting wi~ weights b~L~ed on te~s in topics
                               Term weighting was used for vector queries.
                        C. proper noun identific~ition algori~m As provided in SMART
                        f. tokenizer (recognizes dŁ~tes, phone numbers, coininon patterils)
                               As provided in SMART

      B. Manually constructed queries (`id hoc)
              1. topic fields used Description, Narrative, and Concepts.
              2. average time to build query (minutes) 3 mills/query
              3. type of query builder
                        b. computer system expert
              4. tools used to build query
                        b. knowledge base browser (knowledge base described in p(u~t I)
                               (1) which structure from p~ut I
                                      for solliC of our work we build a knowledge base to help suggest
                                      broader/narrower terms--added inf~)rInati()n can be provided if
                                      appropriate
                        c. other lexical t()()ls (ideutify) vi (editor)
              5. which of the following were used?
                        b. Boolean connectors (ANI), ()I~, N(~~)
                        d. Ł`tddition of terms not included in topic
                               (1) source of temis domain knowledge of experts

III. Searching

A. Total computer tilne to se~irch (cpu seconds)
       Approx. 4 minutes for each topic.
       We did a full sequential pass through documents for this since we did
       space fi)r the inverted file.
       1. retrievŁd tilne (total cpu seconds between when a query enters ilie
              document numbers uc obtained)
       2. rankin(2 time (total cpu seconds to sort d('cument list)

not have enough disk

system until a list of

       B. Which metliods best describe your m~icliine sen-ching Ine~()ds?
              Meth~~ds: ~e used three main methods, and a scheme f(a- c()ml)ining the resulLs from those
              runs
              1. vector space In(xlel
              5. B()()lean matching
              6. fuzzy logic (include y~~ur definition) 1)-norm matching

       C. What factors ~`ire included in y~~ur rinking?
              We used several weighting methods in combination with the methods, to get a total of 8 ru~s
              that were the basis fi)r our sul)missi(~n.  We used binary weights, as well as:
              1. terin fi-equency
              2. inverse d('cument frequency


                                           511

             3. oilier tenn weights (where do they come from?)
                     augnorm, c(~mputed by SMART using the above factors

IV. What machine did you couduci ilie TREC experilneilt on? DECstati()n 5()(~~~ Model 25
             How much RAM did it h~~ve? 4(~M bytes
             What wa.'; tile clock rate of ~e CPU? MIPS R3()()(~ at 25MHz
             At the end of our work f~,r the submission, we finally had 3Gbytes of disk storage to work
             with.

V. Some systems [ire rese'.irch protolypes t.~nd other"; we c()Inlnerci[d.
             To help COInpL.~C tiiese systems:

             1. How much "soflwaie eugiucerilig" went into tile development of your system?
                     We began with the 1983 version of SMART, and have enhanced it. We tried to use
                     the new version (~f SMART on an RSI6(i(N~ but could not get reliable results and so
                     went back to our older version. We underwent extensive software development since
                     May but due to lack to disk space could m~t use most of what we developed for the
                     submission.
             2.  Give'i appropnate resources, could your system be m'.ide to run f~~ter?  By how much
                     (estimate)?
                     With m(~re disk space we could have used the inverted tile option and that would
                     have made things much faster. That would have allowed real time interactive
                     searching. Also, with Ill()~C disk space, we c(Juld have used an RSI~~O, a~uming
                     SMART could l)e ported and made fully operational.
             3. What features is your system missing thL.a it would benefit by if it had them?
                     Because (~f the disk space problem we were n(~t al)le t(~ do many of the efforts we
                     wanted to do. Work will continue this fall if disks are received in time. Among the
                     features:
                           - phrase identification and matching
                           - building "decisi~~n trees" after training with a sufficient set of relevance
                                  judgments
                           - implementing the CE() model (~f P. Thompson and trying it out in a
                                  variety of ways to combine results from a variety of runs and
                                  indexing schemes    (that could  include  stemming            andlor
                                  IliorphEJIogical analysis).


                                         512

System Summary and Timing

GTE Laboratoijes

General Coininents

The fimings should be the tilne to replicate runs from saatch, not including trial runs, etc. The tilnes should also
be reasonably accurate. This soluetilnes will be difficult, such ~ getting total time for document indexilig of huge
text sections, or m~ually building a knowledge base.    Pleise do your best.


I.      Construction of indices, knowledge b('Lses, and other datLi structures (ple~~se describe all data structures that
        your system needs for sea~ching)

        A. Which of the following were used to build y~iur d~tta structures?
               1. stopword list
                       a. how muly words in list?       28(~ words
               2. is a controlled v()c~'ibul'iry used? no
               3. steinlnin~
                       a. st~uid~u-d steininin (T L'4g()rithlns
                                which ones?     1~aice conflation
                       b. m()1i)h()l()gical Łui~dysis Ilo
               4. telin weighting  yes
               5. phrase discovely  Ilo
               6. syntactic p~~;~ing Ilo
               7. word 5C115C dis~unbigu~ition  ilo
               8. heuristic ~L~s()ciati()ns n(~
               9. spelling checking (with mŁmu(il colTectioll)   ilo
               10. spelling conection     Ilo
               11. proper noun identificition (ilgori flim Ilo
               12. tokenizer (recognizes dates, phone numbers, common p~'itterns)  Ilo
               13. we the m~uilly-indexed te~s used?         no
               14. other techiuques used to build ckiti structures (brief descuption)

        B. Statistics on ~iti structures built floin T~~C text (ple~'ise fill out each applicable section)
               1. inverted index
                       a. total Ł`~()unt of storige (ineg~'tbytes) 336(~ (f~~r the 24(~(J ~B ~4 text)
                       b. totil computer time to build (~ppr()x~~'ite number of hours)       672
                       c. is the process completely (`tutolnitic?   yes
                       d. Lue terin positions wi~in d(icuments stoled?    yes
                       e. single terms only?    yes
               5. other dati structures built flom TREC text (whit?)     statistics files
                       a. total `unount of storige (meg~'ibytes)   400
                       b. to~l computer time to build ((`ipproxilnate number of hours)       24
                       c. is the pr(icess completely (`wt()mŁ'itic? yes
                       d. brief description of methods used
                                Index is scanned for fre(luency, location, popularity and record size
                                statistics. The results are used in normalizing tile weighting attril)utes.

        C. Data built from sonices other th'~ the input text  --no

II. Query construction
               (please fill out Ł1 section for each query construction method used)


                                                      513

         A. AutoInŁltic(dty built queries ((id hoc)
                1. topic fields used Topic, De.%'cri1)tio)fl, Narrative
                2. total computer ti'ne to build query (cpu seconds) 2 seconds
                3. which of the following were used?
                       a. term weighting witli weights bŁbed on tenns in topics
                       C. syntactic p~u-sing of topics
                       1. automatic addition of B(x)lean connectors or proximity operators


         [). Automatic(Llly built queries (r()utin~)
                1. topic fields used Topic, De.%cripti()n, Narrative
                2. total computer t~e to build query (cpu seconds) 2 secondS
                3. which of the f()llowin(2 were used in buildin~~ the query?
                       a. teflns selected from
                               (1) topic
                       b. tenn weighting
                               (1) with weights based oil terms in topics
                       d. syntactic ~(U5~Il~
                               (1) of topics
                       j . (~ut()In(1tic addition of B()()lCL1'1 connectors or proximity operators
                               (1) using iIlf()rm~~ti()n fr()1n the topics

III.  Searching

         A. Tot~il computer time to se~uch (cpu seconds) 1* all t()pic.% *1 1()8E)E)(l
                1. retiiev~Ll tillie (total cpu seconds between when a query enters the system until a list of
                       document numbers Łue ()bt(~ined)    72()()()
                2. ranking time (totti cpu seconds to sort d('cument list) -36(l()(~

         B. Which methods best describe your machiuc se~irching methods?
                10. other (describe) niulti-level attril)ute weighting

         C. What factors are included in your rinking?
                1. tenn frequency
                2. inverse d(icument frequency
                3. other term weights (where do they CoIIIC from?) explicit term weighting hy user
                5. position in document
                7. proxilility of terins
                9. document length
                10. completeness (wh(~t Y~. of the query terms Łue present)
                15. other (specify) record (d(~unient) id

IV.   What machine did you conduct the TRE(? experilnent on'? IBNI RSI6()(JE) 32(~
                How much RAM did it liLve? 32 NIB
                What w~i~ the clock ritC of the (?PI J? 25 NIHz

V.    Some systems are rese~tich prototypes ŁLnd others ~ue c()mluerci~d.
                To help coinpue these systems:

                1. How much `software engineering" went into the development of your system?
                       This is a prototype.
                2.  Given appropriate resources, could y~~ur system be made to run f~~ter?     By how much
                       (estimate)?
                       yes, given   taster hardware      and  m(~re      RAM,   we can pr()l)al)ly douhie the


                                               514

      pertomlance.
3. What feature%' i~ your sy~teIn ini.~~i'ig that it would benefit by if it had them?
      Varial)le .%~ized I)ucket.% to inipleinent linked lists.
      Iniproved ranking attril)ute range calculation.
      Spelling c()rrecti(~n.


                          515

System Summary and Timing

Siemens Corporate Research, Inc.

General Coininents

The timings should be the time to replicate runs from  saatch, not including trial runs, CtC ~)C tilnes should also
be re~~~onably accurate. This solnetilnes will be diflicult, such ~ getting total time for document indexilig of huge
text sections, or m(~ually building a ~()wledge bŁ~LsC. Ple~Lse do YoUr best.

Summary of method:    Completely aut()mJ tic vector matching where both document and (iuery vectors have
l)eefl expanded using syn~~nyms extracted from W()rdNet.

I.      Consti-uction of indices, know ledge b~i~es. ~~nd other da~ sti-uctures (please describe all data structures that
        your system needs for seuching)

        A. Which of the following were used to build your data sti-uctures?
               1. st()pw()rd list
                        ~i. how many words in list?
                              571 word st()pw()rd list used (standard SMART st()pword list)
               2. is a controlled v(~abulafy used?     Ilo
               3. stenlinin~'
                        ~ stand-ud steinining ~-LIg()ri thins
                              which olles?

                        b. m()i~ph()l()gic(~l (-ulilysis
                              Extremely simple suffix stripper to look words up in W()rdNet. (Checks for
                              olle of 22 suflixes and p()ssil)ly modifies end (~f stem if a matching suffix is
                              found. This was in code I inherited--I don't know the source of the sufrix
                              list, l)ut the list is a sul)set (~f that used l)y SMART, so it prol)al)ly comes
                              fl~()m SolliC "standard" algorithm.)      All words aLso pass through the
                              "triestem" stemmer (~f SMART.          This stemmer was originally hased on
                              I~()vin's CACM article, l)ut has evolved over the years.
               4. telin weighung
                        A tf*idf weight is used fi)r hoth i~uery and document terms, where the weight is
                        further n()rniali~'~ed so that an inner product computation produces the cosine ("tfc"
                        weights using the ternimology of "Term ~eighting Approaches in Automatic Text
                        Retrieval" l)y Silt(~n and l~uckley). A term is counted as appearing in a document
                        (for idf purposes) if it was in the original text ()V If it was added as a synonym. The
                        tt~idf portion of an added term's weight is multiplied hy .8 to produce its final
                        weight.
               5. phr~'L~e disc()veI~
                        ~t. wh(~1t kind of phr(-~se?
                        b. usin~ stitisticLI Ineth('ds
                        c. usintT s~tactic methods
                              W()rdNet contains c(,ll()cati()ns as meml)ers of synonym sets, so some phrases
                              may l)e added as synonyms.          However, such a collocation is assigned a
                              uni(lue concept numl)er and will (~nly match that exact collocation (so I
                              don't consider it to l)e "phrasing"). No other phrasing used.
               6. syntactic p(-u;~in(T Ilo
               7. word sense dis(-unbiguati~~n
                        No specific sense disaIiil)iguati()n procedure used. If a term ~~ccurs in more than one
                        ~(~rdNet syn~~nym set (which, hy definition, means that it is polysemous), the
                        syn~~nynis from all of its senses may potentially he added to the vector.           The


                                                     516

                      algorithm re(luires that at least two original text words agree on a synonym hefore
                      it is added to the vector.
                      The effect (~f this is to do a ~()()~ man's version of sense disaml)iguati()n for the
                      synonyms.
               8. heunstic Ł~~s()ciŁ'ttions
                      a. short definition of these ~sociations
                             W()rdNet synonymy relation only association used.
               9. spelling checking (with inatiull con-ection)  no
               10. spelling correction no
               11. proper i)OUII identificition ~`tlgori~in no
               12. tokenizer (reco(,'nizes d~-~tes, phone numbers, coi~on pattenis)   no
               13. Łu-e the mL~1u~-tlly-indexed terms used?   no
               14. other techniques used to build dŁ'Lta structures (brief description) no

       B.  Statistics on data structures built from FREC text (please fill out each applicable section)
               1. inverted index
                      a. total ~~ount of st()r~~2e (inegŁ-~b ytes) 947 megal)ytes (~f disk storage
                      b. total computer time to build (appr()xilnate number of hours)
                             5 hours to l)uild index given document vectors; document vectors took 37
                             hours t(J l)uild from text. Thus, approximately 42 hours to go from text to
                             inverted index.
                      C. is the ~R~C55 completely ~-1ut()In~ttic? yes
                      d. ~u-e terin positions wi~iii d(Xulnents stored'?
                             No term position information maintained.
                      e. single tCrins olily?
                             Single terms only (although, as stated al)()ve, a single term from WordNet
                             may l)e a collocation such as `electrical_discharge').
               2. n-grŁ-uns, suffix aiTays, siL'nature tiles
                      N-grams and signature tiles not used.        SMART stemmer algorithm incorporates a
                      (static) trie of suffixes.
               3. knowledge bases
                      No knowledge l)ase used other than W()rdNet (descril)ed under I.C.2).

       C. Data built from sources other th~-~ ~e input text
               1. inteni~i]ly-built auxiliai-y tiles Il()~C
               2. externuly-built ~-~uxili~'u-y lile
                      a. type of tile ~~-eebank, \V()rdNet, etc.)   W()rdNet (noun portion only)
                      b. t()tL~l
                      c. total
                      d. type

                          ~~()uI1t of st()r-t~'e (IneLT-Ibytes)  5 megal)ytes
                          number of concepts represented         35,155 syn~~nym sets (67,293 word senses)
                          of represeflt(-iti()Il (fr(-~es, ,~eIn~-~tic nets, rules, etc.)
                           We used only the syn~~nymy relation that W()rdNet contains.        However,
                           W()rdNet contains many other lexical relationships making it similar to a
                           semantic net.

II. Query construction

             (please fill out a section 1;()r etch query colisti-uction method used)
[We sul)mitted oliC set of results; those results were for automatically huilt ad hoc (lueries.]

      A. Autom~-ttic~tlly built queries (ad hoc)
              1. topic fields used
                    C(~ncepts (<con>),  Description (<desc>), Factors (<fac>), Narrative (<narr>),
                    Nationality (<nat>), Title (<title>)
              2. total computer titne to build query (cpu seconds)
                    1 second, ~ average (5~) seconds to index 50 (lueries)


                                            517

                3. which of the f()ll()win~ were used?
                       ~L. telin weighting wi~ weights based on te~s in topics Yes, as descril)ed al)()ve
                       d. word sense dis(UnbigUL~ti()n
                              Ouly ~s (1escril)ed al)()VC (two original (~uery ternis must agree on a synonym
                              to l)e `1(1(led).
                       h. exp~rnsi()n of qUeries using previ()Usly-c()nst1~ucted d~ta struetwe (from part I)
                              (1) which snucture?      ~V()rdNet.
III.  Searching

         A. TotLil computer tilne to se~'uch (cpU secouds)
                1. retrieval tjine (t()tŁil CPU seconds between when a query enters ~e system until a list. of
                       document nunibers ~ue ()btLtined)
                       15 CPU seconds, (P11 average    (756.4 cpu seconds to ~r()CC55 So (lueries)
                2. r~~nking time (totil CPU seconds to sort d(~UInent list)
                       not applical)le: list (~f top 2(H) similarities maintained while searching

         B. Which methods best describe your `n~chine se~nching methods.?
                1. vector 5~(LCC model

         C. VVhat factors ~ue included in your rŁ~nking?
                1. telin frequency
                2. inverse d(~uInent frequency
                4. seln'wtic closeness (L'L~ in selnintic ilet distance) (synonyms)
                9. docullient leng~
                13. word sense frequency
                       (nouns with only ()11C sense in ~V()rdNet get all their synonyms added)

IV.   What machine did YOU coilduct the Fl~iŁ('. experimeut oil? ~un II~X
                [low much 1~AM did it h(~Ive?  64 megal)ytes
                Wh~it w'~s the clock nite of ~e Cl~t J'? 4E)~1Hz

V.    Some systems ~`u-e rese(~Uch prototypes (md others ne c()mmerci~~d.
                To help c()1np~U'e ~ese systems:

                1. [low much "s()ftw~ire engilleering" went into the development of your system?
                       Our system Is a version of SMART with Ilb)dlfled indexing C(Pde. SMART has l)een
                       well-engineered (but its main goal is tleXil)ility, not raw speed). Little time was spent
                       (Pptimizing our Illodjfications.
                2.  (jiveli (~ippr()pfl'L'ite resources. could ~()U~ system be made to ruii f~~ster? By hoW much
                       (estimate)?
                       SMART could pr()l)al)ly l)e made to run s~~mewhat taster it' it were made less
                       tiexible, that is, it' we coded a version that performed only the sorts of runs we made
                       here. I doubt the difference would be dramatic. Preprocessing steps perfi~rmed on
                       W()rdNet could impr(Pve the efficiency (~f the expansion code.
                3. WhIt fe'ttures is Y()U~ system missing th~it it would benefit by if it had them?
                       Incorporating part-~~f-speech tagging s(P that we could kn(Pw it' the term is a noun
                       befiPre looking it up in ~Vi~rdNet should be beneficIal (we didn't do this for TREC
                       because the tagger we have is fairly slow).             In the same vein, a true sense
                       disaml)Iguat(Pr--a way (Pf picking the c(Prrect W()rdNet synonym set--would clearly
                       help, but I d(Pn't kn~~w of a way of doing th at automatically yet (it is part of. our
                       research).


                                                  518                         *u.s. (;.P.O.:1993-341-931:82636