System Summary and Timing

Organization Name: ETH Zurich, Switzerland
List of Run IDs: ETHI01 (interactive)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 571
- Controlled Vocabulary?: no
- Stemming Algorithm: suffix stripping (Porter, 1980)
- Morphological Analysis: no
- Term Weighting: so-called BM25(2.0, 0.0, infty, 0.75)
- Phrase Discovery?: no
- Syntactic Parsing?: no
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Spelling Checking (with manual correction)?: no
- Spelling Correction?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?: yes
  - Patterns which are tokenized: words
- Manually-Indexed Terms?: no
- Other Techniques for building Data Structures: no

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: ETHI01
  - Total Storage (in MB): 2000
  - Total Computer Time to Build (in hours): 240 for inserting all documents into the system
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Clusters
  - Run ID: none
- N-grams, Suffix arrays, Signature Files
  - Run ID: none
- Knowledge Bases
  - Run ID: none
  - Use of Manual Labor
- Special Routing Structures
  - Run ID: none
- Other Data Structures built from TREC text
  - Run ID: ETHI01
  - Type of Structure: non-inverted files
  - Total Storage (in MB): 1500
  - Total Computer Time to Build (in hours): see above *
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: for fast processing of relevance feedback, the system uses a non-inverted index for updates.
- Other Data Structures built from TREC text
  - Run ID: ETHI01
  - Type of Structure: Document Info
  - Total Storage (in MB): 1100
  - Total Computer Time to Build (in hours): see above *
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: titles of documents to show in the ranked list
- Other Data Structures built from TREC text
  - Run ID: ETHI01
  - Type of Structure: Feature Numbering
  - Total Storage (in MB): 14
  - Total Computer Time to Build (in hours): see above *
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: features are mapped to numbers
- Other Data Structures built from TREC text
  - Run ID: ETHI01
  - Type of Structure: Hidden Markov Models
  - Total Storage (in MB): 0.0005
  - Total Computer Time to Build (in hours): 2
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Hidden Markov Models used for passage retrieval

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): msec
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: feature frequency
  - Phrase Extraction from Topics?: no
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: yes
    - Patterns which are Tokenized: words
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?: yes
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no
  - Other: none

Automatically Built Queries (Routing)
- Average Computer Time to Build Query (in cpu seconds):
- Method used in Query Construction
  - Terms Selected From
  - Term Weighting with Weights Based on terms in
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
    - Patterns which are tokenized (dates, phone numbers, common patterns, etc): words and phrases
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Interactive Queries
- Initial Query Built Automatically or Manually: manually
- Type of Person doing Interaction
  - Domain Expert: 2 out of 13
  - System Expert: 2 out of 13
- Average Time to do Complete Interaction
  - CPU Time (Total CPU Seconds for all Iterations): unknown
  - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): 29
- Average Number of Iterations: ???
- Minimum Number of Iterations: ???
- Maximum Number of Iterations: ???
- What Determines the End of an Iteration: user decision
- Methods used in Interaction
  - Automatic Term Reweighting from Relevant Documents?: no
  - Automatic Query Expansion from Relevant Documents?: yes
    - All Terms in Relevant Documents added: no
    - Only Top X Terms Added (what is X): 20
    - User Selected Terms Added: user selected relevant passages
  - Other Automatic Methods: none
  - Manual Methods
    - Using Individual Judgment (No Set Algorithm)?: yes
    - Following a Given Algorithm (Brief Description)?: no

Searching
- Search Times
  - Run ID: ETHI01
  - Computer Time to Search (Average per Query, in CPU seconds): 1-2 sec
  - Component Times:

Machine Searching Methods
- Vector Space Model?: yes (basic method)
- Probabilistic Model?: yes (passage retrieval based on HMM)
- Cluster Searching?: no
- N-gram Matching?: no
- Boolean Matching?: no
- Fuzzy Logic?: no
- Free Text Scanning?: no
- Neural Networks?: no
- Conceptual Graph Matching?: no
- Other: no

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Other Term Weights?: Query Term Frequency
- Semantic Closeness?: no
- Position in Document?: no
- Syntactic Clues?: yes
- Proximity of Terms?: yes, for passage retrieval
- Information Theoretic Weights?: no
- Document Length?: yes
- Percentage of Query Terms which match?: no
- N-gram Frequency?: no
- Word Specificity?: no
- Word Sense Frequency?: no
- Cluster Distance?: no
- Other: no

Machine Information
- Machine Type for TREC Experiment: SPARC Center 1000
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 10000
- Amount of RAM (in MB): 384
- Clock Rate of CPU (in MHz): 4x50

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: research prototype, not developed exclusively for TREC; redesign of retrieval component: 0.5 person years
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: inserting and building the index: 10 times faster
- Features the System is Missing that would be beneficial:

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: none
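
The term weighting entry above, BM25(2.0, 0.0, infty, 0.75), can be illustrated with a small sketch. It assumes the four values are (k1, k2, k3, b) in the usual Okapi notation, which the form does not state explicitly. Under that reading, k2 = 0 removes the document-length correction term and k3 = infinity reduces the query factor to the raw query term frequency, which would agree with "Query Term Frequency" being listed among the ranking factors. The function and parameter names are illustrative and are not taken from the ETH system.

```python
import math


def bm25_term_weight(tf, qtf, df, n_docs, doc_len, avg_doc_len,
                     k1=2.0, k3=math.inf, b=0.75):
    """Contribution of one matching term under an Okapi-style BM25.

    Assumed reading of BM25(2.0, 0.0, infty, 0.75): k1=2.0, k2=0.0,
    k3=infinity, b=0.75. The k2 correction term is omitted because k2 = 0.
    """
    # Inverse document frequency (Robertson/Sparck Jones form).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Document length normalisation controlled by b.
    big_k = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    doc_part = (k1 + 1.0) * tf / (big_k + tf)
    # With k3 -> infinity, (k3 + 1) * qtf / (k3 + qtf) tends to qtf.
    if math.isinf(k3):
        query_part = float(qtf)
    else:
        query_part = (k3 + 1.0) * qtf / (k3 + qtf)
    return idf * doc_part * query_part


def bm25_score(query_tf, doc_tf, df, n_docs, doc_len, avg_doc_len):
    """Sum the term weights over terms shared by query and document."""
    return sum(
        bm25_term_weight(doc_tf[t], qtf, df[t], n_docs, doc_len, avg_doc_len)
        for t, qtf in query_tf.items()
        if t in doc_tf
    )
```

For a document of average length, a term occurring once gets a document factor of exactly 1, so its weight is simply the query term frequency times the idf.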
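
The interactive entries above (query expansion from user-selected relevant passages, top 20 terms added, no automatic reweighting) can be sketched as follows. The form does not say how candidate terms are ranked, so ranking by raw frequency in the selected passages is purely an assumption made for illustration, as are the function name, the whitespace tokenizer, and the weight of 1 given to added terms. Stemming is omitted for brevity.

```python
from collections import Counter


def expand_query(query_tf, relevant_passages, stopwords, top_x=20):
    """Add up to top_x new terms from user-selected passages to a query.

    Assumption: candidates are ranked by frequency in the passages; the
    actual selection criterion of the system is not given in the form.
    """
    candidates = Counter()
    for passage in relevant_passages:
        for token in passage.lower().split():
            # Skip stopwords and terms already present in the query.
            if token not in stopwords and token not in query_tf:
                candidates[token] += 1

    expanded = dict(query_tf)  # original term frequencies stay unchanged
    for term, _count in candidates.most_common(top_x):
        expanded[term] = 1  # hypothetical weight for an added term
    return expanded
```

Leaving the original query frequencies untouched mirrors the "Automatic Term Reweighting from Relevant Documents?: no" answer; only the set of query terms grows.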