System Summary and Timing

Organization Name: New York University / General Electric
List of Run ID's: nyuge1, nyuge2, nyuge3, nyuge4

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 365
- Controlled Vocabulary? : no
- Stemming Algorithm: yes, lexicon
- Morphological Analysis: partial
- Term Weighting: yes
- Phrase Discovery? : yes
  - Kind of Phrase: syntactic
  - Method Used (statistical, syntactic, other): syntactic with statistical disambiguation
- Syntactic Parsing? : yes
- Word Sense Disambiguation? : no
- Heuristic Associations (including short definition)? : no
- Spelling Checking (with manual correction)? : no
- Spelling Correction? : no
- Proper Noun Identification Algorithm? : yes
- Tokenizer? : yes
  - Patterns which are tokenized: names, fixed phrases
- Manually-Indexed Terms? : no
- Other Techniques for building Data Structures: none

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID : nyuge1 and nyuge2
  - Total Storage (in MB): 323
  - Total Computer Time to Build (in hours): 7.7
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions? : no
  - Only Single Terms Used? : no
- Inverted index
  - Run ID : nyuge3 and nyuge4
  - Total Storage (in MB): 643
  - Total Computer Time to Build (in hours): 15.3
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions? : no
  - Only Single Terms Used? : no
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Use of Manual Labor
- Externally-built Auxiliary File

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): 2
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)? : yes
  - Phrase Extraction from Topics? : yes
  - Syntactic Parsing of Topics? : yes
  - Word Sense Disambiguation? : no
  - Proper Noun Identification Algorithm? : yes
  - Tokenizer? : yes
    - Patterns which are Tokenized: names, fixed phrases
  - Heuristic Associations to Add Terms? : no
  - Expansion of Queries using Previously-Constructed Data Structure? : no
  - Automatic Addition of Boolean Connectors or Proximity Operators? : no

Automatically Built Queries (Routing)
- Topic Fields Used: desc, con
- Method used in Query Construction
  - Terms Selected From
    - Topics: yes
    - All Training Documents: no
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on terms in
    - Topics: yes
    - All Training Documents: no
    - Documents with Relevance Judgments: yes
  - Phrase Extraction from
    - Topics: yes
    - All Training Documents: no
    - Documents with Relevance Judgments: yes
  - Syntactic Parsing
    - Topics: yes
    - All Training Documents: no
    - Documents with Relevance Judgments: yes
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
    - Topics: yes
    - All Training Documents: no
    - Documents with Relevance Judgments: yes
  - Tokenizer
    - Patterns which are tokenized (dates, phone numbers, common patterns, etc): names, fixed phrases
    - from Topics: yes
    - from All Training Documents: no
    - from Documents with Relevance Judgments: yes
  - Heuristic Associations to Add Terms from
    - Topics: no
    - All Training Documents: no
    - Documents with Relevance Judgments: no
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Time to Build Query (in Minutes): 2
- Type of Query Builder
  - Domain Expert: no
  - Computer System Expert: yes
- Tools used to Build Query
  - Word Frequency List? : no
  - Knowledge Base Browser? : no
  - Other Lexical Tools? :
- Method used in Query Construction
  - Term Weighting? : yes
  - Boolean Connectors (AND, OR, NOT)? : no
  - Proximity Operators? : no
  - Addition of Terms not Included in Topic? : yes
    - Source of Terms: builder's imagination

Interactive Queries
- Initial Query Built Automatically or Manually: automatically
- Type of Person doing Interaction
  - Domain Expert: no
  - System Expert: yes
- Average Time to do Complete Interaction
  - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): 20
- Average Number of Iterations: 2
- Average Number of Documents Examined per Iteration: 20
- Minimum Number of Iterations: 2
- Maximum Number of Iterations: 2
- What Determines the End of an Iteration: exactly 2 iterations
- Methods used in Interaction
  - Automatic Term Reweighting from Relevant Documents? : no
  - Automatic Query Expansion from Relevant Documents? : no
    - All Terms in Relevant Documents added: no
    - Only Top X Terms Added (what is X): no
    - User Selected Terms Added: yes
  - Manual Methods
    - Using Individual Judgment (No Set Algorithm)? : yes

Searching

Search Times
- Run ID : nyuge
- Computer Time to Search (Average per Query, in CPU seconds): 60

Machine Searching Methods
- Vector Space Model? : yes
- Probabilistic Model? : no

Factors in Ranking
- Term Frequency? : yes
- Inverse Document Frequency? : yes
- Syntactic Clues? : phrases and names weighted differently
- Document Length? : yes

Machine Information
- Machine Type for TREC Experiment: Sun SparcStation 10
- Was the Machine Dedicated or Shared: dedicated
- Amount of Hard Disk Storage (in MB): 4000
- Amount of RAM (in MB): 128

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: substantial
- Given appropriate resources
  - Could your system run faster? : yes
  - By how much (estimate)? : 8-10 times
- Features the System is Missing that would be beneficial: user interface, term position information, automatic feedback
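
Illustrative note on the ranking factors reported above (vector space model, term frequency, inverse document frequency, document length, and phrases and names weighted differently from single words): the summary does not give the weighting formula, so the following is only a minimal sketch of how such factors are commonly combined. The log-scaled tf, the idf form, the cosine length normalization, and the TYPE_BOOST multipliers are all assumptions made for illustration, not the nyuge system's actual scheme. As in the summary, the index stores no term positions and holds both single terms and phrases.

```python
import math
from collections import Counter

# Hypothetical multipliers: the summary says only that phrases and names
# are weighted differently, not by how much.
TYPE_BOOST = {"word": 1.0, "phrase": 2.0, "name": 1.5}


def build_index(docs):
    """Build a position-free inverted index over (term, kind) keys.

    docs maps a document id to a list of (term, kind) pairs, where kind
    is "word", "phrase", or "name". Returns (index, norms, idf).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))
    # Assumed idf form: log(N / df) + 1, so terms in every document still count.
    idf = {key: math.log(n_docs / df) + 1.0 for key, df in doc_freq.items()}

    index = {}  # (term, kind) -> {doc_id: weight}
    norms = {}  # doc_id -> Euclidean length of the document vector
    for doc_id, terms in docs.items():
        weights = {}
        for key, count in Counter(terms).items():
            tf = 1.0 + math.log(count)  # assumed log-scaled term frequency
            weights[key] = tf * idf[key] * TYPE_BOOST[key[1]]
        norms[doc_id] = math.sqrt(sum(w * w for w in weights.values()))
        for key, weight in weights.items():
            index.setdefault(key, {})[doc_id] = weight
    return index, norms, idf


def rank(query_terms, index, norms, idf):
    """Score documents by a length-normalized inner product with the query."""
    scores = Counter()
    for key in query_terms:
        if key not in index:
            continue
        query_weight = idf[key] * TYPE_BOOST[key[1]]
        for doc_id, doc_weight in index[key].items():
            scores[doc_id] += query_weight * doc_weight / norms[doc_id]
    # Highest score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    docs = {
        "d1": [("retrieval", "word"), ("information retrieval", "phrase"),
               ("index", "word")],
        "d2": [("index", "word"), ("index", "word"), ("storage", "word")],
        "d3": [("General Electric", "name"), ("storage", "word")],
    }
    index, norms, idf = build_index(docs)
    query = [("information retrieval", "phrase"), ("index", "word")]
    for doc_id, score in rank(query, index, norms, idf):
        print(doc_id, round(score, 3))
```

In this toy collection the document matching both the phrase and the single term ranks above the document that only repeats the single term, which is the kind of effect a higher phrase weight is meant to produce.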