System Summary and Timing

Organization Name: George Mason University

List of Run ID's:
- English: gmu1 (manual), gmu2 (automatic)
- Corrupted: gmuc0, gmuc10
- Spanish: gmumanual, gmuauto

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 144
- Controlled Vocabulary? : no
- Stemming Algorithm: none
- Morphological Analysis: no
- Term Weighting: yes, tf-idf
- Phrase Discovery? : yes
  - Kind of Phrase: two adjacent terms, not separated by stop terms or punctuation
  - Method Used (statistical, syntactic, other): syntactic
- Syntactic Parsing? : no
- Word Sense Disambiguation? : no
- Heuristic Associations (including short definition)? : no
- Spelling Checking (with manual correction)? : no
- Spelling Correction? : no
- Proper Noun Identification Algorithm? : no
- Tokenizer? : no
- Manually-Indexed Terms? : no

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID : gmu1, gmu2
  - Total Storage (in MB): 248.3
  - Total Computer Time to Build (in hours): 2:52:15
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions? : no
  - Only Single Terms Used? : no
- Clusters
- N-grams, Suffix arrays, Signature Files
  - Run ID : gmuc0, gmuc10, gmuman, gmuauto
  - Automatic Process? (If not, number of manual hours): no
  - Brief Description of Method: For corrupted data, 4-grams were used with automatic query reduction based on term frequency across the entire document collection. For Spanish, 5-grams were used with no query reduction.

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): 60
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)? : yes
  - Phrase Extraction from Topics? : yes
  - Syntactic Parsing of Topics? : no
  - Word Sense Disambiguation? : no
  - Proper Noun Identification Algorithm? : no
  - Tokenizer? :
    - Patterns which are Tokenized: no
  - Heuristic Associations to Add Terms? : no
  - Structure Used: none
  - Automatic Addition of Boolean Connectors or Proximity Operators? : no

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Time to Build Query (in Minutes): 10
- Type of Query Builder
  - Domain Expert: no
  - Computer System Expert: yes
- Tools used to Build Query
  - Word Frequency List? : no
  - Knowledge Base Browser? :
    - Structure Used: no
  - Other Lexical Tools? :
- Method used in Query Construction
  - Term Weighting? : yes
  - Boolean Connectors (AND, OR, NOT)? : yes
  - Proximity Operators? : no
  - Addition of Terms not Included in Topic? : yes
    - Source of Terms: none

Searching

Search Times
- Run ID : gmu2
- Computer Time to Search (Average per Query, in CPU seconds): approximately 60

Machine Searching Methods
- Vector Space Model? : yes
- N-gram Matching? : yes
- Boolean Matching? : yes

Factors in Ranking
- Term Frequency? : yes
- Inverse Document Frequency? : yes
- Document Length? : yes (normalization)
- Percentage of Query Terms which match? : no
- N-gram Frequency? : yes
- Word Specificity? : no
- Word Sense Frequency? : no

Machine Information: gmuc0, gmuc10, gmu1, gmuauto
- Machine Type for TREC Experiment: SUN Sparc 2000, 18 processor
- Was the Machine Dedicated or Shared: dedicated
- Amount of Hard Disk Storage (in MB): 54000
- Amount of RAM (in MB): 2000

Machine Information: gmu1, gmuman
- Machine Type for TREC Experiment: AT&T DBC-1012 Database Machine
- Was the Machine Dedicated or Shared: dedicated
- Amount of Hard Disk Storage (in MB): 25000

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: 1 person year for the IR system, 1 person year for the relational system
- Given appropriate resources
  - Could your system run faster? : yes
  - By how much (estimate)? : IR, 50 percent; relational, 25 percent
- Features the System is Missing that would be beneficial: The IR prototype is not currently parallelized; both systems lack passage-based retrieval and relevance feedback.
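
The n-gram method described under "N-grams, Suffix arrays, Signature Files" (4-grams with query reduction for corrupted text, 5-grams without reduction for Spanish) can be illustrated with a minimal sketch. It is not the system's actual code. The function names, the sample documents, and the `max_df_ratio` cutoff are hypothetical: the summary says only that query reduction was "based on term frequency across the entire document collection", so a document-frequency threshold stands in for whatever rule was really used. The tf-idf weighting and length normalization follow the entries under "Factors in Ranking", here assumed to take the form of cosine similarity.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return overlapping character n-grams of lowercased, space-normalized text."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def build_index(docs, n):
    """Count n-grams per document and the document frequency of each n-gram."""
    doc_counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter()
    for counts in doc_counts:
        df.update(counts.keys())
    return doc_counts, df


def reduce_query(query_grams, df, num_docs, max_df_ratio):
    """Drop query n-grams found in more than max_df_ratio of the documents.

    Assumed reduction rule; the source does not give the exact criterion.
    """
    return [g for g in query_grams if df.get(g, 0) / num_docs <= max_df_ratio]


def rank(query, docs, n=4, max_df_ratio=None):
    """Rank documents by cosine similarity of tf-idf weighted n-gram vectors."""
    doc_counts, df = build_index(docs, n)
    num_docs = len(docs)
    grams = char_ngrams(query, n)
    if max_df_ratio is not None:
        grams = reduce_query(grams, df, num_docs, max_df_ratio)

    def weights(counts):
        # tf * idf; n-grams absent from the collection are skipped
        return {g: tf * math.log(num_docs / df[g])
                for g, tf in counts.items() if df.get(g)}

    q_vec = weights(Counter(grams))
    q_norm = math.sqrt(sum(w * w for w in q_vec.values()))
    scores = []
    for doc_id, counts in enumerate(doc_counts):
        d_vec = weights(counts)
        d_norm = math.sqrt(sum(w * w for w in d_vec.values()))
        dot = sum(w * d_vec.get(g, 0.0) for g, w in q_vec.items())
        score = dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
        scores.append((score, doc_id))
    # Highest score first; ties broken by document position
    return sorted(scores, key=lambda s: (-s[0], s[1]))


# Hypothetical three-document collection; the third simulates corrupted text.
docs = [
    "information retrieval with n-grams",
    "relational database machine",
    "informaton retreival with corrupted text",
]
ranked = rank("information retrieval", docs, n=4)
print([doc_id for _, doc_id in ranked])
```

Because matching is done on character sequences rather than whole words, the misspelled third document still shares several 4-grams with the query, which is the usual motivation for n-gram matching on corrupted input. Passing `n=5` and leaving `max_df_ratio` unset corresponds to the Spanish configuration described above.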