System Summary and Timing

Organization Name: Swiss Federal Institute of Technology (ETH Zurich)
List of Run IDs: ETHal1, ETHas1, ETHme1, ETHru1, ETHFR94, ETHD5N, ETHD20N, ETHD5P, ETHD20P

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 571
- Controlled Vocabulary?: no
- Stemming Algorithm: suffix stripping (Porter, 1980), 4-gram (ETHD5N), 3-gram (ETHD20N)
- Morphological Analysis: no
- Term Weighting: Lnu (ETHal1, ETHas1, ETHme1, ETHFR94, ETHD5N, ETHD20N), co-occurrence of features (ETHru1)
- Phrase Discovery?: yes (ETHal1, ETHas1, ETHme1, ETHru1), no (other runs)
  - Kind of Phrase: feature pairs, same as SMART (ETHal1, ETHas1, ETHme1); co-occurrence at distance 1 (ETHru1)
  - Method Used (statistical, syntactic, other): statistical
- Syntactic Parsing?: no
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Spelling Checking (with manual correction)?: no
- Spelling Correction?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?: no
- Manually-Indexed Terms?: no
- Other Techniques for building Data Structures: co-occurrence of indexing features (ETHru1)

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: ETHal1, ETHas1, ETHme1
  - Total Storage (in MB): 1,900
  - Total Computer Time to Build (in hours): 115
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no
- Inverted index
  - Run ID: ETHru1
  - Total Storage (in MB): 427
  - Total Computer Time to Build (in hours): 90
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no
- Inverted index
  - Run ID: ETHFR94
  - Total Storage (in MB): 200
  - Total Computer Time to Build (in hours): 11
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no
- Inverted index
  - Run ID: ETHD5N, ETHD5P
  - Total Storage (in MB): 200
  - Total Computer Time to Build (in hours): 12
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: ETHD20N, ETHD20P
  - Total Computer Time to Build (in hours): 325
  - Automatic Process? (If not, number of manual hours): 23
  - Use of Term Positions?: no
  - Only Single Terms Used?: no
- Clusters
  - Run ID: none
- N-grams, Suffix arrays, Signature Files
  - Run ID: none
- Knowledge Bases
  - Run ID: none
  - Use of Manual Labor
- Special Routing Structures
  - Run ID: ETHru1
  - Type of Structure: Contingency Tables
  - Total Storage (in MB): 350
  - Total Computer Time to Build (in hours): 30
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Contingency Tables for each indexing feature fulfilling certain requirements (see paper)
- Special Routing Structures
  - Run ID: ETHru1
  - Type of Structure: Matrices of co-occurrences
  - Total Storage (in MB): less than 1
  - Total Computer Time to Build (in hours): 40
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Statistics about co-occurrences of indexing features
- Other Data Structures built from TREC text
  - Run ID: ETHas1, ETHal1, ETHme1
  - Type of Structure: Hash Tables
  - Total Storage (in MB): 156
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: bi-directional mapping docid - number, feature - number
- Other Data Structures built from TREC text
  - Run ID: ETHas1, ETHal1, ETHme1
  - Type of Structure: Array
  - Total Storage (in MB): 48
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: array storing document length
- Other Data Structures built from TREC text
  - Run ID: ETHru1
  - Type of Structure: Hash Tables
  - Total Storage (in MB): 38
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: bi-directional mapping docid - number, feature - number
- Other Data Structures built from TREC text
  - Run ID: ETHru1
  - Type of Structure: Array
  - Total Storage (in MB): 15
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: array storing document length
- Other Data Structures built from TREC text
  - Run ID: ETHFR94
  - Total Storage (in MB): 126
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: bi-directional mapping docid - number, feature - number
- Other Data Structures built from TREC text
  - Run ID: ETHFR94
  - Type of Structure: Array
  - Total Storage (in MB): 30
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: array storing document length
- Other Data Structures built from TREC text
  - Run ID: ETHD5N, ETHD5P
  - Type of Structure: Hash Tables
  - Total Storage (in MB): 6
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: bi-directional mapping docid - number, feature - number
- Other Data Structures built from TREC text
  - Run ID: ETHD5N, ETHD5P
  - Type of Structure: Array
  - Total Storage (in MB): 2
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: array storing document length
- Other Data Structures built from TREC text
  - Run ID: ETHD20N, ETHD20P
  - Type of Structure: Hash Tables
  - Total Storage (in MB): 81
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: bi-directional mapping docid - number, feature - number
- Other Data Structures built from TREC text
  - Run ID: ETHD20N, ETHD20P
  - Type of Structure: Array
  - Total Storage (in MB): 30
  - Total Computer Time to Build (in hours): in parallel with inverted index
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: array storing document length
- Other Data Structures built from TREC text
  - Run ID: ETHal1, ETHas1, ETHme1, ETHru1
  - Type of Structure: Hash Table
  - Total Storage (in MB): 5.4
  - Total Computer Time to Build (in hours): 7
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Hash Table for Phrases and Document Frequencies

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): dictionary (British English, American English)
  - Total Storage (in MB): less than 1
  - Number of Concepts Represented: 313
  - Type of Representation: dictionary
  - Total Computer Time to Build (in hours): 0
  - Total Manual Time to Build (in hours): 1
  - Use of Manual Labor
    - Mostly Machine Built with Manual Correction: yes
- Externally-built Auxiliary File

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: description (ETHas1), all (ETHal1)
- Average Computer Time to Build Query (in cpu seconds): 0
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: yes
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: no
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?:
    - Structure Used: dictionary
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no
  - Other: no

Automatically Built Queries (Routing)
- Topic Fields Used: none
- Average Computer Time to Build Query (in cpu seconds): time to build data structures for ETHru1 / 50 = (30*3600 + 40*3600) / 50 = 5040
- Method used in Query Construction
  - Terms Selected From
    - Topics: no
    - All Training Documents: no
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on terms in
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: yes
  - Phrase Extraction from
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: yes
  - Syntactic Parsing
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: no
  - Word Sense Disambiguation using
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: no
  - Proper Noun Identification Algorithm from
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: no
  - Tokenizer
    - Patterns which are tokenized (dates, phone numbers, common patterns, etc.): no
    - from Topics: no
    - from All Training Documents: no
    - from Documents with Relevance Judgments: no
  - Heuristic Associations to Add Terms from
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: no
  - Expansion of Queries using Previously-Constructed Data Structure:
    - Structure Used: contingency tables, co-occurrence matrices
  - Automatic Addition of Boolean connectors or Proximity Operators using information from
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: no

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Time to Build Query (in Minutes): 5
- Type of Query Builder
  - Domain Expert: no
  - Computer System Expert: yes
- Tools used to Build Query
  - Word Frequency List?: no
  - Knowledge Base Browser?: no
  - Other Lexical Tools?: no
- Method used in Query Construction
  - Term Weighting?: no
  - Boolean Connectors (AND, OR, NOT)?: no
  - Proximity Operators?: no
  - Addition of Terms not Included in Topic?: yes (only if feedback)
    - Source of Terms: IR index

Searching

Search Times
- Run ID: ETHas1
  - Computer Time to Search (Average per Query, in CPU seconds): 600
  - Component Times: base run 120, Rocchio 480
- Run ID: ETHal1
  - Computer Time to Search (Average per Query, in CPU seconds): 660
  - Component Times: base run 180, Rocchio 480
- Run ID: ETHme1
  - Computer Time to Search (Average per Query, in CPU seconds): ???
  - Component Times: ???
- Run ID: ETHru1
  - Computer Time to Search (Average per Query, in CPU seconds): 130
  - Component Times: Lnu.ltn 120, rsv(i) from data structures + combination 10
- Run ID: ETHFR94
  - Computer Time to Search (Average per Query, in CPU seconds): 40
- Run ID: ETHD5N
  - Computer Time to Search (Average per Query, in CPU seconds): 50
- Run ID: ETHD20N
  - Computer Time to Search (Average per Query, in CPU seconds): 65
- Run ID: ETHD5P
  - Computer Time to Search (Average per Query, in CPU seconds): 2000
  - Component Times: base run 50, reranking 1950
- Run ID: ETHD20P
  - Computer Time to Search (Average per Query, in CPU seconds): 2015
  - Component Times: base run 65, reranking 1950

Machine Searching Methods
- Vector Space Model?: yes (ETHal1, ETHas1, ETHme1, ETHFR94, ETHD5N, ETHD20N, ETHD5P, ETHD20P)
- Probabilistic Model?: yes (ETHru1)
- Cluster Searching?: no
- N-gram Matching?: yes (ETHD5N, ETHD20N)
- Boolean Matching?: no
- Fuzzy Logic?: no
- Free Text Scanning?: no
- Neural Networks?: no
- Conceptual Graph Matching?: no

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Other Term Weights?: probabilistic weighting (ETHru1)
- Semantic Closeness?: no
- Position in Document?: yes (ETHru1)
- Syntactic Clues?: no
- Proximity of Terms?: yes (ETHru1)
- Information Theoretic Weights?: no
- Document Length?: yes
- Percentage of Query Terms which match?: no
- N-gram Frequency?: yes (ETHD5N, ETHD20N)
- Word Specificity?: no
- Word Sense Frequency?: no
- Cluster Distance?: no
- Other: no

Machine Information
- Machine Type for TREC Experiment: SPARC Center 1000
- Was the Machine Dedicated or Shared: dedicated (2 weeks), otherwise shared
- Amount of Hard Disk Storage (in MB): 20,000
- Amount of RAM (in MB): 384
- Clock Rate of CPU (in MHz): 4 x 50

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: years (?)
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 10 times
- Features the System is Missing that would be beneficial: speed-up for query evaluation, administration of position information (routing)

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: For routing, certain steps in building the data structures used S-PLUS.
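
The timing figures in this summary can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: it recomputes the average routing query-construction time from the two ETHru1 build times (30 and 40 hours) divided by 50, and sums the component search times for the runs that list them. The numbers are copied from the form; the variable and function names are hypothetical, and reading the divisor 50 as the number of routing queries is an assumption.

```python
# Figures are copied from the summary; all names are illustrative only
# and are not part of the ETH system.

# Build times (hours) of the two special routing structures for ETHru1.
ROUTING_BUILD_HOURS = {"contingency tables": 30, "co-occurrence matrices": 40}

# Divisor used in the form; assumed here to be the number of routing queries.
QUERY_COUNT = 50

# Component search times (CPU seconds per query) for runs that report them.
SEARCH_COMPONENTS = {
    "ETHas1": {"base run": 120, "Rocchio": 480},
    "ETHru1": {"Lnu.ltn": 120, "rsv(i) + combination": 10},
    "ETHD5P": {"base run": 50, "reranking": 1950},
    "ETHD20P": {"base run": 65, "reranking": 1950},
}

# Totals as reported in the "Search Times" entries.
REPORTED_TOTALS = {"ETHas1": 600, "ETHru1": 130, "ETHD5P": 2000, "ETHD20P": 2015}


def routing_seconds_per_query(build_hours, query_count):
    """Average CPU seconds per routing query: total build time / query count."""
    return sum(build_hours.values()) * 3600 / query_count


def component_totals(components):
    """Sum the component times of each run."""
    return {run: sum(parts.values()) for run, parts in components.items()}


if __name__ == "__main__":
    print(routing_seconds_per_query(ROUTING_BUILD_HOURS, QUERY_COUNT))  # 5040.0
    for run, total in component_totals(SEARCH_COMPONENTS).items():
        print(run, total, "matches reported:", total == REPORTED_TOTALS[run])
```

Under these assumptions the routing figure works out to 5040 CPU seconds per query, and each listed component breakdown adds up to its reported total.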