System Summary and Timing

Organization Name: Open Text Corporation
List of Run ID's: colm1, colm1A, colm2, colm4, colm5

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: configured to 0
- Controlled Vocabulary?: No
- Stemming Algorithm: divided between build and run-time
- Morphological Analysis: disabled
- Term Weighting: No
- Phrase Discovery?: No
- Syntactic Parsing?: No
- Word Sense Disambiguation?: No
- Heuristic Associations (including short definition)?: No
- Spelling Checking (with manual correction)?: No
- Spelling Correction?: No
- Proper Noun Identification Algorithm?: No
- Tokenizer?: Yes
  - Patterns which are tokenized: General Mealy machine specified in configuration file, Single Token Class, User Replaceable
- Manually-Indexed Terms?: No
- Other Techniques for building Data Structures: Multi-tree data model constructed based on configuration information and text parsing. Content and structure schemas applied in the construction of trees.

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: colm1, colm1A, colm2, colm4, colm5
  - Total Storage (in MB): 4,000
  - Total Computer Time to Build (in hours): 6
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions?: Yes
  - Only Single Terms Used?: No, arbitrary fast phrase support
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID: Same as for Inverted Index
  - Type of Structure: other data structures that allow for efficient implementation of hybrid tree operations
  - Total Storage (in MB): Included under inverted index
  - Total Computer Time to Build (in hours): Included under inverted index
  - Automatic Process? (If not, number of manual hours): Yes
  - Brief Description of Method: No data

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Use of Manual Labor
- Externally-built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): thesaurus, lexicon
  - Total Storage (in MB): 2

Query construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: No
- Average Time to Build Query (in Minutes): 0.5-1
- Type of Query Builder
  - Domain Expert: No
  - Computer System Expert: No
- Tools used to Build Query
  - Word Frequency List?: No
  - Knowledge Base Browser?: No
    - Structure Used: None
  - Other Lexical Tools?: No
- Method used in Query Construction
  - Term Weighting?: not enabled
  - Boolean Connectors (AND, OR, NOT)?: Yes
  - Proximity Operators?: Yes
  - Addition of Terms not Included in Topic?: Yes

Searching

Search Times
- Run ID: colm1, colm1A
  - Computer Time to Search (Average per Query, in CPU seconds): Average elapsed time is less than one second
- Run ID: colm2, colm4, colm5
  - Computer Time to Search (Average per Query, in CPU seconds): Average elapsed time is less than 5-10 seconds

Machine Searching Methods
- Probabilistic Model?: Yes
- Boolean Matching?: Yes

Machine Information
- Machine Type for TREC Experiment: DEC alpha server 2100
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 75,000
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 200

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: 300 man years
- Given appropriate resources
  - Could your system run faster?: Yes
  - By how much (estimate)?: factor of 50

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: The data model and data structures of this system allow the concept of a document to be defined as part of the query. This capability to dynamically project text views permits the deployment of systems that are impractical or impossible with other technologies.
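
The idea of defining the document as part of the query can be illustrated with a small sketch. This is not the Open Text implementation, and every name in it (build_index, search, the "sentence" and "article" region types, the sample text) is hypothetical. It assumes a positional inverted index plus a separate table of structural regions, and lets each query choose which region type is treated as the retrieval unit.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(tokens):
    """Map each lowercased term to the sorted word positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)
    return index


def search(index, regions, term, unit):
    """Return the regions of type `unit` that contain `term`.

    `regions` maps a region type to a sorted list of non-overlapping
    (start, end) word spans, both ends inclusive. The region type is
    chosen per query, so the same index can answer at any granularity.
    """
    spans = regions[unit]
    starts = [start for start, _ in spans]
    hits = []
    for pos in index.get(term.lower(), []):
        i = bisect_right(starts, pos) - 1  # last span starting at or before pos
        if i >= 0 and pos <= spans[i][1]:
            if not hits or hits[-1] != spans[i]:  # skip repeated hits in one span
                hits.append(spans[i])
    return hits


# Hypothetical sample text: two "articles", the first holding two "sentences".
tokens = ("trade talks resume today markets react to trade news "
          "weather stays mild").split()
regions = {
    "sentence": [(0, 3), (4, 8), (9, 11)],
    "article": [(0, 8), (9, 11)],
}
index = build_index(tokens)

sentence_hits = search(index, regions, "trade", "sentence")
article_hits = search(index, regions, "trade", "article")
print(sentence_hits)
print(article_hits)
```

The index is built once, while the unit of retrieval is supplied at query time: the same term lookup yields two sentence-level results or one article-level result, depending only on the `unit` argument.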