System Summary and Timing

Organization Name: University of Kansas
List of Run ID's: KU1

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 23
- Controlled Vocabulary? : no
- Stemming Algorithm: no
- Morphological Analysis: no
- Term Weighting: yes
- Phrase Discovery? :
- Heuristic Associations (including short definition)? : yes
- Tokenizer? :

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID : KU1
  - Total Storage (in MB): 325
  - Total Computer Time to Build (in hours): 11 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions? : no
  - Only Single Terms Used? : yes
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Run ID : KU1
  - Total Storage (in MB): 196
  - Total Computer Time to Build (in hours): 43 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Manual Labor
  - Number of Concepts Represented: 10022
  - Type of Representation: similarity matrix
  - Auxiliary Files Needed: none
- Special Routing Structures
- Other Data Structures built from TREC text

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: wsj: LP, TEXT; sjm: LEADPARA, TEXT
- Average Computer Time to Build Query (in cpu seconds): 2.4 minutes elapsed time (cpu time unavailable)
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)? : yes
  - Phrase Extraction from Topics? : no
  - Syntactic Parsing of Topics? : no
  - Word Sense Disambiguation? : no
  - Proper Noun Identification Algorithm? : no
  - Tokenizer? : no
  - Heuristic Associations to Add Terms? : yes
  - Expansion of Queries using Previously-Constructed Data Structure? : yes
    - Structure Used: similarity matrix
  - Automatic Addition of Boolean Connectors or Proximity Operators? : no

Searching

Search Times
- Run ID : KU1
- Computer Time to Search (Average per Query, in CPU seconds): 144
- Component Times : query expansion 10, document retrieval 134

Factors in Ranking
- Term Frequency? : yes
- Inverse Document Frequency? : yes
- Other Term Weights? : yes
- Semantic Closeness? : no
- Position in Document? : no
- Syntactic Clues? : no
- Proximity of Terms? : no
- Information Theoretic Weights? : no
- Document Length? : yes
- Percentage of Query Terms which match? : yes
- N-gram Frequency? : no
- Word Specificity? : no
- Word Sense Frequency? : no
- Cluster Distance? : no
- Other: Term similarity between original term and terms added from similarity matrix by automatic expansion

Machine Information
- Machine Type for TREC Experiment: Sun SPARC 10
- Was the Machine Dedicated or Shared: Shared
- Amount of Hard Disk Storage (in MB): 9 GB
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 50

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: modest
- Given appropriate resources
  - Could your system run faster? : yes
  - By how much (estimate)? : 20%
- Features the System is Missing that would be beneficial: disambiguation, browser for viewing term similarity matrix

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: automatic calculation of term similarity based on the contexts in the corpus in which the term instances appear.
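
Illustrative sketch (not the KU1 implementation): the form states only that term similarity is calculated automatically from the corpus contexts in which terms appear, that the result is stored as a similarity matrix, and that the matrix is used to expand queries, with the similarity between an original term and an added term acting as a ranking factor. The Python below shows one way such a pipeline could look. The fixed co-occurrence window, the cosine measure, and every name in it (context_vectors, similarity_matrix, expand_query, window, top_k, min_sim) are assumptions made for illustration and are not taken from the system described.

```python
import math
from collections import Counter, defaultdict


def context_vectors(docs, window=2):
    """Map each term to counts of the terms seen within `window` positions of it.

    ASSUMPTION: a symmetric fixed-size window stands in for "context";
    the summary does not say how contexts were defined.
    """
    vectors = defaultdict(Counter)
    for tokens in docs:
        for i, term in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[term][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Term-by-term similarity, stored as a dict of dicts (diagonal omitted)."""
    terms = sorted(vectors)
    return {
        t: {u: cosine(vectors[t], vectors[u]) for u in terms if u != t}
        for t in terms
    }


def expand_query(query_weights, sim, top_k=2, min_sim=0.1):
    """Add up to `top_k` most similar terms per query term.

    ASSUMPTION: an added term is weighted by (original weight * similarity),
    so that term similarity feeds into ranking as the form describes.
    """
    expanded = dict(query_weights)
    for term, weight in query_weights.items():
        neighbours = sorted(
            sim.get(term, {}).items(), key=lambda kv: kv[1], reverse=True
        )[:top_k]
        for other, score in neighbours:
            if score >= min_sim and other not in query_weights:
                expanded[other] = max(expanded.get(other, 0.0), weight * score)
    return expanded


if __name__ == "__main__":
    # Toy corpus, already tokenized; purely hypothetical data.
    docs = [
        "the bank approved the loan".split(),
        "the bank approved the mortgage".split(),
        "the river flooded the valley".split(),
    ]
    sim = similarity_matrix(context_vectors(docs, window=2))
    print(expand_query({"loan": 1.0}, sim, top_k=1))
```

In this toy corpus "loan" and "mortgage" occur in identical contexts, so the sketch treats them as near-synonyms and adds "mortgage" to a query on "loan". A dense matrix over the 10022 concepts reported above would hold roughly 100 million entries, so a real system would more plausibly keep only the strongest neighbours for each term.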