System Summary and Timing

Organization Name: Rutgers University
List of Run ID's: rutfspt; rutfglob; All responses are for Sub-run: DL2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 0
- Controlled Vocabulary?: no
- Stemming Algorithm: none
- Morphological Analysis: none
- Term Weighting: none
- Phrase Discovery?: yes, actually "string discovery"
  - Method Used (statistical, syntactic, other): statistical: dictionary of n-grams built up using LZ78 method.
- Syntactic Parsing?: no
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Spelling Checking (with manual correction)?: no
- Spelling Correction?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?:
  - Patterns which are tokenized: no
- Manually-Indexed Terms?: no
- Other Techniques for building Data Structures: no

Statistics on Data Structures built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
  - Run ID: {{KB}}
  - Total Storage (in MB): 26 (compressed)
  - Total Computer Time to Build (in hours): 0.57
  - Automatic Process? (If not, number of manual hours): Automatic
  - Brief Description of Method: statistical: dictionaries of n-grams built up using the Ziv-Lempel-Welch method (LZW). Each dictionary is generated from all judged relevant documents for a single topic.
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Automatically Built Queries (Routing)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): 0
- Method used in Query Construction
  - Terms Selected From
    - Topics: All
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on terms in
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from
  - Other: non-ASCII characters removed; all white space replaced by spaces.

Searching

Search Times
- Run ID: {{KB}}
- Computer Time to Search (Average per Query, in CPU seconds): 0.01
- Component Times: 16 CPU hours

Machine Searching Methods
- Vector Space Model?: no
- Probabilistic Model?: no
- Cluster Searching?: no
- N-gram Matching?: yes. N-gram matching uses a greedy heuristic (LZW) to find a small number of n-grams from a particular dictionary that parse the entire query document. Ranking is ordered by the ratio (original size in characters)/(final size in n-grams).
- Boolean Matching?: no
- Fuzzy Logic?: no
- Free Text Scanning?: no
- Neural Networks?: no
- Conceptual Graph Matching?: no

Factors in Ranking
- Term Frequency?: no
- Inverse Document Frequency?: no
- Other Term Weights?: no
- Semantic Closeness?: no
- Position in Document?: no
- Syntactic Clues?: no
- Proximity of Terms?: yes: n-grams are permitted to straddle word boundaries or even include multiple words.
- Information Theoretic Weights?: no
- Document Length?: yes, rankings are normalized by query document length.
- Percentage of Query Terms which match?: no
- N-gram Frequency?: no
- Word Specificity?: no
- Word Sense Frequency?: no
- Cluster Distance?: no
- Other: Ranking is ordered by the ratio (original size in characters)/(final size in n-grams).

Machine Information
- Machine Type for TREC Experiment: SuperSPARC II (both)
- Was the Machine Dedicated or Shared: shared (both)
- Amount of Hard Disk Storage (in MB): 1024 (machine 1), 914 (machine 2)
- Amount of RAM (in MB): 483 (machine 1), 168 (machine 2)
- Clock Rate of CPU (in MHz): 60 (machine 1), 75 (machine 2)

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: ~100 staff-hours
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 2-3 times
- Features the System is Missing that would be beneficial: stripping subsumed n-grams; a stop list.
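
Illustrative sketch (not the submitted system's code): the answers above describe per-topic n-gram dictionaries built by a Ziv-Lempel method, a greedy parse of each query document against one dictionary, and ranking by the ratio (original size in characters)/(final size in n-grams). The Python below shows one way these pieces could fit together. All function names are hypothetical. The dictionary growth follows the LZ78 rule (LZW differs in seeding the dictionary with single characters), and counting an unmatched character as one n-gram is an assumption, since the form does not say how such characters are handled.

```python
import re


def normalize(text):
    """Drop non-ASCII characters, then turn every whitespace character into a space."""
    ascii_only = "".join(ch for ch in text if ord(ch) < 128)
    return re.sub(r"\s", " ", ascii_only)


def build_dictionary(texts):
    """Grow a set of character strings (n-grams) with an LZ78-style rule."""
    dictionary = set()
    for text in texts:
        phrase = ""
        for ch in text:
            candidate = phrase + ch
            if candidate in dictionary:
                phrase = candidate          # known string: keep extending it
            else:
                dictionary.add(candidate)   # new string: learn it
                phrase = ""                 # and start a fresh phrase
    return dictionary


def count_ngrams(text, dictionary):
    """Greedily parse text, taking the longest dictionary match at each position."""
    max_len = max(map(len, dictionary), default=1)
    count = 0
    i = 0
    while i < len(text):
        step = 1  # assumption: an unmatched character counts as one n-gram
        for length in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + length] in dictionary:
                step = length
                break
        i += step
        count += 1
    return count


def score(text, dictionary):
    """Ratio (original size in characters) / (final size in n-grams)."""
    return len(text) / count_ngrams(text, dictionary) if text else 0.0


def rank(documents, dictionary):
    """Sort (doc_id, score) pairs from highest to lowest ratio."""
    scored = [(doc_id, score(normalize(body), dictionary))
              for doc_id, body in documents.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: one "relevant document" for a topic, two documents to be routed.
topic_dictionary = build_dictionary([normalize("abababab")])
ranking = rank({"d1": "ababab", "d2": "xyzxyz"}, topic_dictionary)
```

In this toy run, "d1" parses into fewer, longer n-grams than "d2" and therefore ranks first. Dividing by the number of n-grams is what normalizes the score by document length.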