System Summary and Timing

Organization Name: OIT/GMU/NCR

List of Run ID's:
- English: gmu96au1, gmu96au2, gmu96ma1, gmu96ma2
- Spanish: gmu96sp1, gmu96sp2
- Chinese: gmu96ca1, gmu96ca2, gmu96cm1, gmu96cm2
- Confusion: gmu96v00, vmu96v10, gmu96v20, gmu96v01, gmu96v11, gmu96v21
- Large: gmu96lg4

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 144
- Controlled Vocabulary?: no
- Stemming Algorithm:
- Morphological Analysis: NO
- Term Weighting: tf-idf
- Phrase Discovery?:
  - Kind of Phrase: Yes, any two terms that were not separated by a punctuation mark or a stop term.
- Syntactic Parsing?: No
- Word Sense Disambiguation?: No
- Heuristic Associations (including short definition)?: No
- Spelling Checking (with manual correction)?: No
- Spelling Correction?: No
- Proper Noun Identification Algorithm?: No
- Tokenizer?:
  - Patterns which are tokenized: No
- Manually-Indexed Terms?: No

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: gmu96au2
  - Total Storage (in MB): 500
  - Total Computer Time to Build (in hours): 2.4
  - Automatic Process? (If not, number of manual hours): Y
  - Use of Term Positions?: No
  - Only Single Terms Used?: No
- Clusters
- N-grams, Suffix arrays, Signature Files
  - Run ID: gmu96v00, vmu96v10, gmu96v20, gmu96v01, gmu96v11, gmu96v21
  - Brief Description of Method: 4-grams were used that spanned words
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: DESC and NARRATIVE
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: Yes
  - Phrase Extraction from Topics?: Yes
  - Syntactic Parsing of Topics?: No
  - Word Sense Disambiguation?: No
  - Proper Noun Identification Algorithm?: No
  - Tokenizer?:
    - Patterns which are Tokenized: No
  - Heuristic Associations to Add Terms?: No
  - Expansion of Queries using Previously-Constructed Data Structure?:
  - Other: Yes, automatic relevance feedback was used for English, Spanish and corrupted data.

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: DESC and NARR used for gmu96ma1 and gmu96ma2
- Average Time to Build Query (in Minutes): 10 minutes
- Type of Query Builder
  - Computer System Expert: Yes
- Tools used to Build Query
  - Word Frequency List?: Yes
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method used in Query Construction
  - Term Weighting?: Yes
  - Boolean Connectors (AND, OR, NOT)?: Yes
  - Proximity Operators?: No
  - Addition of Terms not Included in Topic?: Yes
    - Source of Terms: Manual sources, thesaurus, etc.
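
The form describes two text units besides single terms: two-term phrases (any two terms not separated by a punctuation mark or a stop term) and 4-grams that span words. The following is a minimal sketch of how such units might be extracted, not the system's actual code. The function names, the small stop list (the real 144-word list is not given), the lowercasing, and the choice to keep a single space inside n-grams that cross a word boundary are all assumptions made for illustration.

```python
import re

# Hypothetical stop list; the system's actual 144-word list is not given.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}


def two_term_phrases(text, stop_words=STOP_WORDS):
    """Return adjacent term pairs not separated by punctuation or a stop word."""
    phrases = []
    prev = None
    # Each match is either a word (group 1) or a single punctuation mark.
    for match in re.finditer(r"(\w+)|[^\w\s]", text.lower()):
        word = match.group(1)
        if word is None or word in stop_words:
            prev = None  # punctuation or a stop word breaks the phrase
            continue
        if prev is not None:
            phrases.append((prev, word))
        prev = word
    return phrases


def char_ngrams(text, n=4):
    """Return overlapping character n-grams that may cross word boundaries."""
    # Assumption: whitespace runs collapse to one space, which is kept
    # inside the n-grams so that they can span adjacent words.
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]
```

For example, `two_term_phrases("The retrieval of text documents, using inverted index files.")` yields the pairs (text, documents), (using, inverted), (inverted, index), and (index, files); "retrieval" forms no phrase because a stop word sits on each side of it.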

Searching

Search Times
- Run ID: gmu96au2
- Computer Time to Search (Average per Query, in CPU seconds): 265.28

Machine Searching Methods
- Vector Space Model?: Yes
- N-gram Matching?: Yes

Factors in Ranking
- Term Frequency?: Yes
- Inverse Document Frequency?: Yes
- Document Length?: Yes
- N-gram Frequency?: Yes

Machine Information
- Machine Type for TREC Experiment: The English, Chinese, and confusion track runs were done on a single-processor Intel Pentium machine. A second English run and the Spanish runs were done on a 4-processor DBC-1012.
- Was the Machine Dedicated or Shared: Dedicated
- Amount of Hard Disk Storage (in MB): 4 GB
- Amount of RAM (in MB): 62 MB

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: Yes, 1 person-year for each prototype
- Given appropriate resources
  - Could your system run faster?: Yes
  - By how much (estimate)?: The IR prototype could run 10-20 percent faster; the relational prototype could be improved by several orders of magnitude by adding processors. All initial results have shown the system to be scalable.

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: Ability to run on multiple processors. Also, it was not easy to add information about our different variations using relevance feedback.
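
The answers above list a vector space model with term frequency, inverse document frequency, and document length as ranking factors, but give no formula. As an illustration only, the sketch below uses one common combination: tf-idf weights with cosine similarity, where the cosine denominator supplies the length normalization. The function names, the `log(N / df)` form of idf, and the cosine normalization are assumptions, not the system's documented method.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build tf-idf vectors for docs, where each doc is a list of terms."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(doc).items()}
        for doc in docs
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, docs):
    """Return (score, doc_index) pairs sorted by descending score."""
    vectors, idf = tfidf_vectors(docs)
    # Query terms unseen in the collection get zero weight.
    q = {term: tf * idf.get(term, 0.0) for term, tf in Counter(query).items()}
    scores = [(cosine(q, vec), i) for i, vec in enumerate(vectors)]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


docs = [
    ["text", "retrieval", "system"],
    ["database", "system"],
    ["text", "text", "index"],
]
ranked = rank(["text", "index"], docs)
```

In this toy collection the third document ranks first because it matches both query terms, and the second document scores zero because it shares no term with the query. The same scoring would apply unchanged if the terms were two-term phrases or 4-grams rather than single words.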