System Summary and Timing

Organization Name: APLab, SCILS, Rutgers
List of Run ID's: rutscn20

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: no
- Controlled Vocabulary?: no
- Stemming Algorithm:
  - Morphological Analysis: no
- Term Weighting: no
- Phrase Discovery?:
  - Kind of Phrase: no
  - Method Used (statistical, syntactic, other): no
- Syntactic Parsing?: no
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Spelling Checking (with manual correction)?: no
- Spelling Correction?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?:
  - Patterns which are tokenized: no
- Manually-Indexed Terms?: no
- Other Techniques for building Data Structures: 5-grams

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: no
- Clusters
  - Run ID: no
- N-grams, Suffix arrays, Signature Files
  - Run ID: rutscn20
  - Total Storage (in MB): no
  - Total Computer Time to Build (in hours): no database built
  - Automatic Process? (If not, number of manual hours): n/a
  - Brief Description of Method: scanning
- Knowledge Bases
  - Run ID: no
  - Use of Manual Labor
- Special Routing Structures
  - Run ID: no
- Other Data Structures built from TREC text
  - Run ID: no

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): no
  - Use of Manual Labor
- Externally-built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): no

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: no
- Method used in Query Construction
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:

Automatically Built Queries (Routing)
- Topic Fields Used: no
- Method used in Query Construction
  - Terms Selected From
  - Term Weighting with Weights Based on terms in
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Time to Build Query (in Minutes): 2
- Type of Query Builder
- Tools used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method used in Query Construction
  - Addition of Terms not Included in Topic?:
  - Other: manual elimination of all stop words and non-content words

Searching

Search Times
- Run ID: rutscn20
- Computer Time to Search (Average per Query, in CPU seconds): 18,000 using nawk
- Component Times: automated construction of scanning script 0%, scanning 100%

Machine Searching Methods
- N-gram Matching?: yes
- Free Text Scanning?: yes

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: no
- Document Length?: yes
- N-gram Frequency?: yes
- Other: Partial match 5-grams, with any 4 of 5 characters correct.

Machine Information
- Machine Type for TREC Experiment: Sun SparcStation 20
- Was the Machine Dedicated or Shared: mostly dedicated
- Amount of Hard Disk Storage (in MB): 9,000
- Amount of RAM (in MB): 48
- Clock Rate of CPU (in MHz): 50

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: a few hours
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 100 times probably
- Features the System is Missing that would be beneficial: The list is endless.

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: Partial match 5-grams to provide some level of robustness against data corruption.
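
The partial-match rule reported above (a 5-gram counts as a hit when any 4 of its 5 characters are correct) can be illustrated with a short sketch. This is not the original nawk scanning script, whose details the form does not give. It is a minimal Python illustration under stated assumptions: the function names `ngrams`, `partial_match`, and `score` are hypothetical, matching is position by position, and dividing the hit count by the number of 5-character windows is only one possible stand-in for the "Document Length" ranking factor.

```python
def ngrams(text, n=5):
    """Return all overlapping character n-grams of text (empty if text is shorter than n)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def partial_match(a, b, min_agree=4):
    """True if equal-length strings a and b agree in at least min_agree positions."""
    return sum(x == y for x, y in zip(a, b)) >= min_agree


def score(query_terms, document, n=5, min_agree=4):
    """Scan the document once and return the fraction of n-character windows
    that partially match some query n-gram.

    Assumption: normalizing by the number of windows approximates a
    document-length factor; the original weighting is not specified.
    """
    doc = document.lower()
    query_grams = {g for term in query_terms for g in ngrams(term.lower(), n)}
    windows = ngrams(doc, n)
    if not windows or not query_grams:
        return 0.0
    hits = sum(
        1
        for w in windows
        if any(partial_match(w, g, min_agree) for g in query_grams)
    )
    return hits / len(windows)
```

With a corrupted document such as "infornation retrieval" and the query term "information", the 4-of-5 rule still credits the windows that span the damaged character, whereas exact matching (`min_agree=5`) credits only the windows that avoid it. That is the robustness against data corruption the form describes.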