System Summary and Timing

Organization Name: IBM
List of Run ID's: ibmgd1 ibmgd2 ibmge1 ibmge2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
 - Length (in words) of the stopword list: 255
 - Controlled Vocabulary?: no
 - Stemming Algorithm: none
 - Phrase Discovery?:
 - Tokenizer?:

Statistics on Data Structures built from TREC Text
 - Inverted index
   - Run ID: ibmgd1, ibmgd2, ibmge1, ibmge2
   - Total Storage (in MB): 3485
   - Total Computer Time to Build (in hours): 2
   - Automatic Process? (If not, number of manual hours): yes
   - Use of Term Positions?: yes
   - Only Single Terms Used?: yes
 - Clusters
 - N-grams, Suffix arrays, Signature Files
 - Knowledge Bases
 - Use of Manual Labor
 - Special Routing Structures
 - Other Data Structures built from TREC text

Query construction

Automatically Built Queries (Ad-Hoc)
 - Topic Fields Used: desc
 - Average Computer Time to Build Query (in cpu seconds): 0.0126
 - Method used in Query Construction
   - Tokenizer?:
     - Patterns which are Tokenized: Phrases common to many topics such as "To be relevant, a document must..." were removed. The list of phrases to match was manually constructed from the desc fields of topics 51-250. Also, each query term was automatically expanded into a list of terms using suffix expansion.
   - Expansion of Queries using Previously-Constructed Data Structure?:
   - Other: The system automatically constructs additional query terms from pairs of query words that occur within a window of five in the query.
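
The pair construction described under "Other" above can be illustrated with a minimal sketch. It is not the IBM implementation. The summary does not say how the window is measured, whether pairs are ordered, or how stopwords are treated, so the choices here (distance of at most five token positions, unordered pairs, stopwords skipped but still counted as positions) are assumptions. The names `tokenize` and `pair_terms` are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def pair_terms(query, max_distance=5, stopwords=frozenset()):
    """Return the sorted list of unordered word pairs found in the query.

    Two words form a pair when their token positions differ by at most
    max_distance. That reading of "a window of five" is an assumption.
    Stopwords never appear in a pair but still occupy a position.
    """
    tokens = tokenize(query)
    pairs = set()
    for i, left in enumerate(tokens):
        if left in stopwords:
            continue
        last = min(i + max_distance, len(tokens) - 1)
        for j in range(i + 1, last + 1):
            right = tokens[j]
            if right in stopwords or right == left:
                continue
            # Sort the two words so that (a, b) and (b, a) are one pair.
            pairs.add(tuple(sorted((left, right))))
    return sorted(pairs)


if __name__ == "__main__":
    print(pair_terms("oil spill cleanup"))
    # [('cleanup', 'oil'), ('cleanup', 'spill'), ('oil', 'spill')]
```

Each pair returned would then be added to the query as an extra term alongside the single words.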

Manually Constructed Queries (Ad-Hoc)
 - Topic Fields Used: title, desc, narr
 - Average Time to Build Query (in Minutes): 4
 - Type of Query Builder
   - Computer System Expert: yes
 - Tools used to Build Query
   - Knowledge Base Browser?:
   - Other Lexical Tools?:
 - Method used in Query Construction
   - Addition of Terms not Included in Topic?:

Searching

Search Times
 - Run ID: ibmgd1, ibmge1
   - Computer Time to Search (Average per Query, in CPU seconds): 400
 - Run ID: ibmgd2, ibmge2
   - Computer Time to Search (Average per Query, in CPU seconds): 1000

Machine Searching Methods
 - Probabilistic Model?: yes

Factors in Ranking
 - Term Frequency?: yes, both in the document and in the collection as a whole.
 - Proximity of Terms?: yes (runs ibmgd1 and ibmgd2 only).
 - Document Length?: yes

Machine Information
 - Machine Type for TREC Experiment: IBM PowerPC RS/6000 42T
 - Was the Machine Dedicated or Shared: Dedicated
 - Amount of Hard Disk Storage (in MB): 15028
 - Amount of RAM (in MB): 64
 - Clock Rate of CPU (in MHz): 120

System Comparisons
 - Amount of "Software Engineering" which went into the Development of the System: Several years, all non-TREC specific.
 - Given appropriate resources
   - Could your system run faster?: yes
   - By how much (estimate)?: The system is I/O bound.
 - Features the System is Missing that would be beneficial: ability to execute Boolean queries; ability to handle phrases; ability to handle fields within the document.

Significant Areas of System
 - Brief Description of features in your system which you feel impact the system and are not answered by above questions: The system includes Lexical Affinities, terms formed from pairs of words that occur within a distance of 5 words of each other. The system performs morphological expansion at run time instead of stemming at indexing time.
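
The run-time morphological expansion mentioned above can be illustrated with a minimal sketch: the index keeps unstemmed surface forms, and each query term is expanded into the variants that actually occur in the index vocabulary. The summary does not give the expansion rules, so the suffix list, the toy index, and the names `expand_term` and `matching_documents` are hypothetical and only show the general idea.

```python
# Hypothetical suffix list; the rules used by the real system are not
# given in the summary.
SUFFIXES = ("", "s", "es", "ed", "ing")


def expand_term(term, vocabulary, suffixes=SUFFIXES):
    """Return the variants of term (term + suffix) present in vocabulary."""
    candidates = (term + suffix for suffix in suffixes)
    return sorted({c for c in candidates if c in vocabulary})


def matching_documents(term, index):
    """Union the postings of every indexed variant of term.

    index maps an unstemmed surface form to a set of document ids, so
    no stemming is needed when the index is built.
    """
    docs = set()
    for variant in expand_term(term, index.keys()):
        docs |= index[variant]
    return docs


if __name__ == "__main__":
    # Toy inverted index, purely illustrative.
    index = {
        "spill": {1},
        "spills": {2},
        "spilled": {3},
        "spillway": {4},
        "oil": {1, 2},
    }
    print(expand_term("spill", index.keys()))
    # ['spill', 'spilled', 'spills']
    print(sorted(matching_documents("spill", index)))
    # [1, 2, 3]
```

One consequence of expanding at query time is that the expansion rules can change without rebuilding the index, at the price of more postings lookups per query.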