System Summary and Timing

Organization Name: Siemens Corporate Research, Inc.
List of Run ID's: siems1, siems2, siems3

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 571
- Stemming Algorithm: standard SMART
- Term Weighting: yes (cosine-normalized tf)
- Phrase Discovery?: yes
  - Kind of Phrase: two words
  - Method Used (statistical, syntactic, other): statistical
- Tokenizer?:

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: siems1
  - Total Storage (in MB): 727 MB
  - Total Computer Time to Build (in hours): 13.75
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no
- Inverted index
  - Run ID: siems2, siems3
  - Total Storage (in MB): 10 individual indexes totaling 955 MB
  - Total Computer Time to Build (in hours): 20 hours total
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: no
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): collection specific
  - Type of File (thesaurus, knowledge base, lexicon, etc.):
      siems2: training query clusters
      siems3: ranks of relevant docs in training queries
  - Total Storage (in MB):
      siems2: ~0.8 per db
      siems3: ~1 per db
  - Total Computer Time to Build (in hours):
      siems2: ~22 hours to do training query retrieval + ~0.75 to cluster
      siems3: ~22 hours to do training query retrieval + ~0.5 to make query collection
  - Use of Manual Labor
- Externally-built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): list of phrases generated by Cornell from disk 1
  - Total Storage (in MB): 2
  - Number of Concepts Represented: 158099

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: description
- Average Computer Time to Build Query (in cpu seconds):
    siems1: ~0.3 CPU secs
    siems2: 0.2 CPU secs average per query per database
    siems3: 0.5 CPU secs average per query (in Query db)
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: yes
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:

Searching

- Search Times
  - Run ID: siems1
  - Computer Time to Search (Average per Query, in CPU seconds): 74
  - Component Times: search consists of:
      run initial query
      get relevance assessments
      construct expanded query
      run expanded query
    (did not time individual pieces, sorry)
- Search Times
  - Run ID: siems2
  - Computer Time to Search (Average per Query, in CPU seconds): 73 (assuming parallel searching of dbs)
  - Component Times: 72 CPU secs average per query per database searching, plus 0.9 CPU secs average per query for actual merging
- Search Times
  - Run ID: siems3
  - Computer Time to Search (Average per Query, in CPU seconds): 103 (assuming parallel searching of dbs)
  - Component Times: 72 CPU secs average per query per database searching, plus 31 CPU secs average per query for actual merging

Machine Searching Methods
- Vector Space Model?: yes
- Cluster Searching?: only in siems2 (for query clusters)

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Document Length?: yes (cosine)
- Other: probabilistic creation of ranked set for merging runs siems2 and siems3. Document for next rank is randomly selected from dbs, with selection biased by the number of documents in each db remaining to be added to final ranking.

Machine Information
- Machine Type for TREC Experiment: SPARC-10/41
- Was the Machine Dedicated or Shared: mostly dedicated
- Amount of Hard Disk Storage (in MB): ~13,000
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 40 MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: fusion code completely experimental. Retrieval done by SMART, a well-tuned research prototype.
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: at least halved for MRDD fusion: need to approximate optimization problem
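
The probabilistic merging noted under Factors in Ranking (the next-ranked document is drawn from a randomly chosen db, with the choice biased by how many documents each db still has to contribute) might be sketched as follows. This is an illustrative sketch only, not the experimental fusion code: the function name `merge_rankings`, the `quotas` argument (how many documents each db is to contribute), and the data layout are all assumptions, and the form does not say how the per-db counts were derived.

```python
import random


def merge_rankings(ranked_lists, quotas, rng=None):
    """Merge per-db rankings into one ranking by biased random selection.

    ranked_lists: dict mapping db name -> list of doc ids, best first.
    quotas: dict mapping db name -> number of docs that db should
            contribute (hypothetical input; assumed to be given).
    Returns a list of (db, doc_id) pairs in final rank order.
    """
    rng = rng or random.Random()
    # A db cannot contribute more documents than it actually retrieved.
    remaining = {
        db: min(quotas.get(db, 0), len(docs))
        for db, docs in ranked_lists.items()
    }
    next_pos = {db: 0 for db in ranked_lists}
    merged = []
    while True:
        live = [db for db, n in remaining.items() if n > 0]
        if not live:
            break
        # Selection is biased by the number of documents each db
        # still has to add to the final ranking.
        weights = [remaining[db] for db in live]
        db = rng.choices(live, weights=weights, k=1)[0]
        # Take that db's best not-yet-used document.
        merged.append((db, ranked_lists[db][next_pos[db]]))
        next_pos[db] += 1
        remaining[db] -= 1
    return merged
```

Within each db the original rank order is preserved; only the interleaving across dbs is random, so a db with a larger remaining count tends to place its documents earlier in the merged list.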