System Summary and Timing

Organization Names:
  - GE Corporate Research & Development
  - School of Communication, Information and Library Studies, Rutgers University
  - Lockheed Martin Corporation
  - Department of Computer Science, New York University

List of Run IDs: genrl1, genrl2, genrl3, genrl4, genrl5, genrl6, genlp1, genlp2, genlp3, genlp4, sbase1, sbase2

Construction of Indices, Knowledge Bases, and Other Data Structures

Methods Used to Build Data Structures
  - Length (in words) of the stopword list: 429
  - Controlled Vocabulary?: no
  - Stemming Algorithm: yes, lexicon-based
  - Morphological Analysis: partial
  - Term Weighting: yes, tf.idf (see the weighting sketch at the end of this section)
  - Phrase Discovery?: yes
    - Kind of Phrase: syntactic
    - Method Used (statistical, syntactic, other): lexical noun phrases; syntactic pairs with statistical disambiguation
  - Syntactic Parsing?: yes (for pairs)
  - Word Sense Disambiguation?: no
  - Heuristic Associations (including short definition)?: no
  - Spelling Checking (with manual correction)?: no
  - Spelling Correction?: no
  - Proper Noun Identification Algorithm?: yes
  - Tokenizer?: yes
    - Patterns which are Tokenized: phrases
  - Manually-Indexed Terms?: no
  - Other Techniques for Building Data Structures: none

Statistics on Data Structures Built from TREC Text
  - Inverted index
    - Run ID: adhoc (genrl[1234])
      - Total Storage (in MB): 1,732 (4 streams total)
      - Total Computer Time to Build (in hours): 14 (4 streams total)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no
    - Run ID: routing (genrl[56])
      - Total Storage (in MB): 1,395 (4 streams total)
      - Total Computer Time to Build (in hours): 3.5 (4 streams total)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no
    - Run ID: adhoc NLP track (genlp[14])
      - Total Storage (in MB): 1,268 (4 streams total)
      - Total Computer Time to Build (in hours): 3 (4 streams total)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no
    - Run ID: adhoc NLP track (genlp[23]); sbase[12]
      - Total Storage (in MB): 768
      - Total Computer Time to Build (in hours): 2.3
  - Clusters
  - N-grams, Suffix Arrays, Signature Files
  - Knowledge Bases
  - Special Routing Structures

Data Built from Sources Other than the Input Text
  - Internally-built Auxiliary File
    - Domain (independent or specific): independent
    - Type of File (thesaurus, knowledge base, lexicon, etc.): a list of hyphenated words extracted from the corpus
    - Total Storage (in MB): 0.78
    - Number of Concepts Represented: no
    - Type of Representation: no
    - Total Computer Time to Build (in hours): 0.14
    - Total Computer Time to Modify for TREC (if already built): 0
    - Total Manual Time to Build (in hours): 0
    - Total Manual Time to Modify for TREC (if already built): 0
    - Use of Manual Labor
  - Externally-built Auxiliary File
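The indexes reported above store tf.idf weights over stemmed single terms, pairs, and phrases, split across several streams. The following is a minimal sketch of one way such a weighted inverted index can be built, assuming a log-tf times idf weight and a plain Python dictionary as the postings structure; the multi-stream layout, the exact SMART weighting variant, and the function names are illustrative assumptions, not the implementation used for these runs.

    # Hypothetical sketch only: a single-stream, tf.idf-weighted inverted index.
    import math
    from collections import Counter, defaultdict

    def build_index(docs):
        """docs: dict mapping doc_id -> list of already-stemmed terms."""
        postings = defaultdict(dict)                 # term -> {doc_id: raw tf}
        for doc_id, terms in docs.items():
            for term, tf in Counter(terms).items():
                postings[term][doc_id] = tf

        n_docs = len(docs)
        index = {}
        for term, plist in postings.items():
            idf = math.log(n_docs / len(plist))      # inverse document frequency
            index[term] = {doc_id: (1.0 + math.log(tf)) * idf   # log-tf * idf weight
                           for doc_id, tf in plist.items()}
        return index

    if __name__ == "__main__":
        sample = {"d1": ["joint", "venture", "joint"],
                  "d2": ["venture", "capital"]}
        print(build_index(sample))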
Query Construction

Automatically Built Queries (Ad-Hoc)
  - Topic Fields Used: DESC field for genrl1; all fields for genrl[234]
  - Average Computer Time to Build Query (in CPU seconds): 5
  - Method Used in Query Construction
    - Term Weighting (weights based on terms in topics)?: yes
    - Phrase Extraction from Topics?: yes
    - Syntactic Parsing of Topics?: yes
    - Word Sense Disambiguation?: no
    - Proper Noun Identification Algorithm?: yes
    - Tokenizer?: yes
      - Patterns which are Tokenized: phrases, names
    - Heuristic Associations to Add Terms?: no
    - Expansion of Queries using Previously-Constructed Data Structure?: no
    - Automatic Addition of Boolean Connectors or Proximity Operators?: no

Automatically Built Queries (Routing)
  - Topic Fields Used: all fields for genrl[56]
  - Average Computer Time to Build Query (in CPU seconds): 473
  - Method Used in Query Construction (see the routing-weight sketch at the end of this section)
    - Terms Selected from
      - Topics: yes
      - All Training Documents: no
      - Only Documents with Relevance Judgments: a subset of them
    - Term Weighting with Weights Based on Terms in
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Phrase Extraction from
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Syntactic Parsing of
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Word Sense Disambiguation using
      - Topics: no
      - All Training Documents: no
      - Documents with Relevance Judgments: no
    - Proper Noun Identification Algorithm from
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Tokenizer
      - Patterns which are Tokenized (dates, phone numbers, common patterns, etc.): phrases, names
      - from Topics: yes
      - from All Training Documents: no
      - from Documents with Relevance Judgments: yes
    - Heuristic Associations to Add Terms from
      - Topics: no
      - All Training Documents: no
      - Documents with Relevance Judgments: no
    - Expansion of Queries using Previously-Constructed Data Structure
      - Structure Used: no
    - Automatic Addition of Boolean Connectors or Proximity Operators using Information from
      - Topics: no
      - All Training Documents: no
      - Documents with Relevance Judgments: no

Manually Constructed Queries (Ad-Hoc)
  - Topic Fields Used: all
  - Average Time to Build Query (in minutes): 60
  - Type of Query Builder
    - Domain Expert: no
    - Computer System Expert: no
  - Tools Used to Build Query
    - Word Frequency List?: no
    - Knowledge Base Browser?: no
      - Structure Used: no
    - Other Lexical Tools?: ComLex
  - Method Used in Query Construction
    - Term Weighting?: yes
    - Boolean Connectors (AND, OR, NOT)?: no
    - Proximity Operators?: no
    - Addition of Terms not Included in Topic?: yes
      - Source of Terms: top 10 retrieved documents
    - Other: no

Interactive Queries (Ad-Hoc, Manual)
  - Initial Query Built Automatically or Manually: automatically
  - Type of Person Doing Interaction
    - Domain Expert: no
    - System Expert: no
  - Average Time to do Complete Interaction
    - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): 60
    - Average Number of Iterations: 2
    - Average Number of Documents Examined per Iteration: 10
    - Minimum Number of Iterations: 2
    - Maximum Number of Iterations: 3 (for about 10 hard queries)
    - What Determines the End of an Iteration: out of time
  - Methods Used in Interaction
    - Automatic Term Reweighting from Relevant Documents?: no
    - Automatic Query Expansion from Relevant Documents?: no
      - All Terms in Relevant Documents Added: no
      - Only Top X Terms Added (what is X): no
      - User-Selected Terms Added: yes
    - Other Automatic Methods: no
    - Manual Methods
      - Using Individual Judgment (No Set Algorithm)?: yes
      - Following a Given Algorithm (Brief Description)?: no
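For the routing runs, term weights are drawn from the topics and from a subset of the documents with relevance judgments, as noted above. The sketch below shows a generic Rocchio-style way of combining those sources into a weighted routing query; the alpha/beta/gamma constants, the bag-of-words representation, and the function name are illustrative assumptions rather than the formula actually used for genrl[56].

    # Hypothetical Rocchio-style routing query: topic terms reinforced by
    # relevance-judged training documents.  Constants are illustrative only.
    from collections import Counter

    def routing_query(topic_terms, relevant_docs, nonrelevant_docs,
                      alpha=1.0, beta=0.75, gamma=0.15):
        """topic_terms: list of terms; *_docs: lists of term lists."""
        query = Counter()
        for term, tf in Counter(topic_terms).items():
            query[term] += alpha * tf                # start from the topic

        for doc in relevant_docs:                    # reward terms seen in relevant docs
            for term, tf in Counter(doc).items():
                query[term] += beta * tf / max(len(relevant_docs), 1)

        for doc in nonrelevant_docs:                 # penalize terms from non-relevant docs
            for term, tf in Counter(doc).items():
                query[term] -= gamma * tf / max(len(nonrelevant_docs), 1)

        return {term: weight for term, weight in query.items() if weight > 0}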
Searching

Search Times (Average Computer Time to Search per Query, in CPU seconds)
  - genrl1: 8
  - genrl2: 10
  - genrl3: 90
  - genrl4: 100
  - genrl5: 42
  - genlp1: 5
  - genlp2: 6
  - genlp3: 8
  - genlp4: 9
  - sbase1: 4
  - sbase2: 6

Machine Searching Methods
  - Vector Space Model?: yes (see the scoring sketch at the end of this summary)
  - Probabilistic Model?: yes (genrl6 only)
  - Cluster Searching?: no
  - N-gram Matching?: yes; SMART bi-gram matching is used in the NLP track runs (sbase[12], genlp[23])
  - Boolean Matching?: no
  - Fuzzy Logic?: no
  - Free Text Scanning?: no
  - Neural Networks?: no
  - Conceptual Graph Matching?: no
  - Other: no

Factors in Ranking
  - Term Frequency?: yes
  - Inverse Document Frequency?: yes
  - Other Term Weights?: no
  - Semantic Closeness?: no
  - Position in Document?: no
  - Syntactic Clues?: yes; pairs (2 words) and phrases (up to 7 words)
  - Proximity of Terms?: yes (bi-grams)
  - Information Theoretic Weights?: no
  - Document Length?: yes
  - Percentage of Query Terms which Match?: no
  - N-gram Frequency?: no
  - Word Specificity?: no
  - Word Sense Frequency?: yes (the weight of a term having a single sense is increased)
  - Cluster Distance?: no
  - Other: no

Machine Information
  - Machine Type for TREC Experiment: SPARC 1000 server
  - Was the Machine Dedicated or Shared: shared (it is the file server)
  - Amount of Hard Disk Storage (in MB): 12,000
  - Amount of RAM (in MB): 2,000
  - Clock Rate of CPU (in MHz): 40

System Comparisons
  - Amount of "Software Engineering" which Went into the Development of the System: none
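The runs above rank documents with a vector-space model whose main factors are term frequency, inverse document frequency, and document length. The sketch below illustrates that kind of scoring against an index like the one sketched earlier: a dot product of query and document weights followed by a simple length penalty. The normalization shown is an assumption for illustration, not the exact length-normalization variant used in these runs.

    # Hypothetical vector-space ranking: dot product of tf.idf weights with a
    # simple document-length penalty.
    def rank(index, query_weights, doc_lengths):
        """index: term -> {doc_id: weight}; query_weights: term -> weight;
        doc_lengths: doc_id -> document length in terms."""
        scores = {}
        for term, q_weight in query_weights.items():
            for doc_id, d_weight in index.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + q_weight * d_weight
        # divide by length so long documents are not favored outright
        return sorted(((score / doc_lengths[doc_id], doc_id)
                       for doc_id, score in scores.items()), reverse=True)

In a multi-stream setup such as the one implied by the index statistics above, each stream (stems, pairs, phrases, names) would contribute a separately weighted score that is then merged into the final ranking.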