System Summary and Timing

Organization Name: University of Virginia
List of Run ID's: drift1 drift2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 571
- Controlled Vocabulary? : no
- Stemming Algorithm: yes, triestem from SMART v11.0
- Term Weighting: tfc documents, nfx queries
- Phrase Discovery? :
- Tokenizer? :

Statistics on Data Structures built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID : drift1 and drift2
  - Type of Structure: Document vectors
  - Total Storage (in MB): 190
  - Total Computer Time to Build (in hours): 7
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: TREC text is run through Smart to remove stop words and do word stemming. Smart vectors are then converted to Drift format.

Query construction

Automatically Built Queries (Routing)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): less than 1
- Method used in Query Construction
  - Terms Selected From
    - Topics: yes
    - All Training Documents: yes
  - Term Weighting with Weights Based on terms in
    - All Training Documents: yes
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Searching

Search Times
- Run ID : drift2
- Computer Time to Search (Average per Query, in CPU seconds): 150
- Component Times :
  - 2% query broadcast to sites
  - 88% local search
  - 10% result merge

Machine Searching Methods
- Vector Space Model? : yes

Factors in Ranking
- Term Frequency? : yes
- Inverse Document Frequency? : yes
- Document Length? : yes

Machine Information
- Machine Type for TREC Experiment: Sun SPARCserver 20
- Was the Machine Dedicated or Shared: shared CPU, dedicated disk
- Amount of Hard Disk Storage (in MB): 2000
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 50

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: System was built by 1 person, half-time in 6 months.
- Given appropriate resources
  - Could your system run faster? : Yes
  - By how much (estimate)? : Factor of 10-20
- Features the System is Missing that would be beneficial:
  - Hardware: More disk space, dedicated CPU.
  - Software: Inverted index, thesaurus, query expansion, you name it.

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: Drift simulates a distributed IR system. Search time, memory usage, and disk usage all increase as the number of sites in the system increases.
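
The form names three search components (query broadcast to sites, local search, result merge) and a vector space model with tfc document weights and nfx query weights. The Python sketch below illustrates how those pieces could fit together; it is not Drift's actual code. It reads tfc and nfx in the Salton and Buckley sense: raw tf times idf with cosine normalization for documents, and augmented tf times idf without normalization for queries. Every function name, the in-memory `sites` layout, and the use of collection-wide document frequencies at all sites are assumptions made for illustration. The local search scans every document vector, since the form lists an inverted index as a missing feature.

```python
import heapq
import math
from collections import Counter


def idf(term, doc_freq, n_docs):
    """log(N / n) collection-frequency factor; 0.0 for unseen terms."""
    n = doc_freq.get(term, 0)
    return math.log(n_docs / n) if n else 0.0


def tfc_weights(tokens, doc_freq, n_docs):
    """Document weights (tfc): raw tf * idf, then cosine normalization."""
    tf = Counter(tokens)
    raw = {t: f * idf(t, doc_freq, n_docs) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw


def nfx_weights(tokens, doc_freq, n_docs):
    """Query weights (nfx): (0.5 + 0.5 * tf / max_tf) * idf, unnormalized."""
    tf = Counter(tokens)
    if not tf:
        return {}
    max_tf = max(tf.values())
    return {t: (0.5 + 0.5 * f / max_tf) * idf(t, doc_freq, n_docs)
            for t, f in tf.items()}


def local_search(site_docs, query_vec, k):
    """Score every document vector held at one site by inner product."""
    scored = []
    for doc_id, doc_vec in site_docs.items():
        score = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
        if score > 0:
            scored.append((score, doc_id))
    return heapq.nlargest(k, scored)


def broadcast_search(sites, query_vec, k):
    """Send the query to every site, then merge the per-site top-k lists."""
    per_site = [local_search(docs, query_vec, k) for docs in sites]
    return heapq.nlargest(k, (hit for hits in per_site for hit in hits))


# Toy data (hypothetical): already stemmed and stop-word filtered tokens.
corpus = {
    "d1": ["distribut", "retriev", "system"],
    "d2": ["vector", "space", "model"],
    "d3": ["distribut", "index", "site"],
}
# Assumption: document frequencies are computed over the whole collection.
doc_freq = Counter(t for toks in corpus.values() for t in set(toks))
vectors = {d: tfc_weights(toks, doc_freq, len(corpus))
           for d, toks in corpus.items()}
# Two simulated sites, each holding part of the collection.
sites = [{"d1": vectors["d1"]}, {"d2": vectors["d2"], "d3": vectors["d3"]}]
query = nfx_weights(["distribut", "retriev"], doc_freq, len(corpus))
ranking = broadcast_search(sites, query, k=2)
```

In this sketch `ranking` is a list of `(score, doc_id)` pairs, best first. Because every site scans all of its vectors, the local search step dominates the work, which is at least consistent with the 88% share the form reports for local search.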