System Summary and Timing

Organization Name: MultiText Project, Department of Computer Science, University of Waterloo
List of Run ID's: uwgcl1 (ID used for both adhoc and routing)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 0 (no stopword list was used)
- Controlled Vocabulary? : no
- Stemming Algorithm: none
- Term Weighting: none
- Phrase Discovery? :
- Tokenizer? : yes
  - Patterns which are tokenized: The basic token is a sequence of alphanumeric characters, with alphabetic characters mapped to lower case. Tokens of length greater than one that consist entirely of upper-case alphabetic characters were indexed twice, once in upper case and once in lower case. SGML tags are recognized and indexed as such.

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID : uwgcl1
  - Total Storage (in MB): approximately 60% of text size
  - Total Computer Time to Build (in hours): 32 hours (adhoc); 14 hours (routing)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions? : yes
  - Only Single Terms Used? : yes
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Query construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: description
- Average Time to Build Query (in Minutes): 30 minutes
- Type of Query Builder
  - Domain Expert: no
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser? :
  - Other Lexical Tools? :
- Method used in Query Construction
  - Term Weighting? : no
  - Boolean Connectors (AND, OR, NOT)? : yes
  - Proximity Operators? : yes
  - Addition of Terms not Included in Topic? : yes
    - Source of Terms: personal knowledge

Manually Constructed Queries (Routing)
- Topic Fields Used: title, description, narrative, nationality, concepts, definitions
- Average Time to Build Query (in Minutes): 30 minutes
- Type of Query Builder
  - Domain Expert: no
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser? :
  - Other Lexical Tools? :
- Data Used for Building Query from
  - Documents with Relevance Judgments: very limited use
- Method used in Query Construction
  - Term Weighting? : no
  - Boolean Connectors (AND, OR, NOT)? : yes
  - Proximity Operators? : yes
  - Addition of Terms not Included in Topic? : yes
  - Other: GCL containment and ordering operators

Searching

Search Times
- Run ID : uwgcl1
- Computer Time to Search (Average per Query, in CPU seconds): 40 seconds (adhoc, elapsed); 10 seconds (routing, elapsed)
- Component Times : Each query was composed of several sub-queries, each of which was run separately. An average of 1.9 sub-queries per query for the routing run gives an average search time of 5 seconds per sub-query. An average of 2.2 sub-queries per query for the ad-hoc run gives an average search time of 18 seconds per sub-query.

Machine Searching Methods
- Boolean Matching? : yes
- Other: GCL query matching (see main paper)

Factors in Ranking
- Term Frequency? : yes
- Proximity of Terms? : yes
- Other: solution density (see main paper for explanation)

Machine Information
- Machine Type for TREC Experiment: DEC Alpha 2000/300
- Was the Machine Dedicated or Shared: Dedicated
- Amount of Hard Disk Storage (in MB): 10 GB
- Amount of RAM (in MB): 64 MB
- Clock Rate of CPU (in MHz): 150 MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: The base retrieval system is a research prototype. Approximately two weeks of software development was specific to TREC-4.
- Given appropriate resources
  - Could your system run faster? : yes
  - By how much (estimate)? : factor of two
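
The tokenization rules given under "Patterns which are tokenized" can be illustrated with a minimal Python sketch. This is not the MultiText implementation: the function name index_terms, the regular expressions, and the choice to index an SGML tag as its literal text are all assumptions made for illustration, since the summary does not say how tags were recognized or stored.

```python
import re

# Assumption: an SGML tag is "<" or "</" followed by a letter, up to the next ">".
TAG_RE = r"</?[A-Za-z][A-Za-z0-9.\-]*[^<>]*>"
# A basic token is a run of (ASCII) alphanumeric characters.
WORD_RE = r"[A-Za-z0-9]+"
TOKEN_RE = re.compile(f"({TAG_RE})|({WORD_RE})")


def index_terms(text):
    """Return a list of (position, term) pairs to be placed in the index."""
    terms = []
    for position, match in enumerate(TOKEN_RE.finditer(text)):
        tag, word = match.groups()
        if tag is not None:
            # SGML tags are indexed as tags (here: their literal text).
            terms.append((position, tag))
            continue
        # Every word is indexed in lower case.
        terms.append((position, word.lower()))
        # All-upper-case alphabetic tokens longer than one character are
        # indexed a second time in their original upper-case form.
        if len(word) > 1 and word.isalpha() and word.isupper():
            terms.append((position, word))
    return terms


if __name__ == "__main__":
    for position, term in index_terms("<DOC>IBM sells a PC</DOC>"):
        print(position, term)
```

Both forms of a doubly indexed token share one position, so a positional index built this way can match either "IBM" or "ibm" at the same place in the text.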
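
The per-sub-query times under "Component Times" follow from dividing the elapsed time per query by the average number of sub-queries per query. The short Python check below reproduces that arithmetic from the figures reported in the summary; the dictionary layout and the function name seconds_per_subquery are illustrative only.

```python
# Figures reported in the summary: elapsed seconds per query and
# average number of sub-queries per query, for each run type.
runs = {
    "routing": {"seconds_per_query": 10, "subqueries_per_query": 1.9},
    "adhoc": {"seconds_per_query": 40, "subqueries_per_query": 2.2},
}


def seconds_per_subquery(seconds_per_query, subqueries_per_query):
    """Average elapsed search time for a single sub-query."""
    return seconds_per_query / subqueries_per_query


for name, run in runs.items():
    # Rounded to whole seconds, as in the summary.
    print(name, round(seconds_per_subquery(**run)))
# prints:
# routing 5
# adhoc 18
```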