System Summary and Timing

Organization Name: National Security Agency
List of Run ID's: ACQADH (ad-hoc), ACQROU (routing), ACQUNC (uncorrupted), ACQC10 (10% corruption), ACQC20 (20% corruption), ACQINT (interactive), ACQSPA (Spanish), ACQHPr (high precision filtering), ACQHRe (high recall filtering), ACQMID (mid-performance filtering)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 0
- Controlled Vocabulary? : none
- Stemming Algorithm:
- Term Weighting: yes
- Phrase Discovery? :
- Tokenizer? :

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID : ACQADH, ACQUNC, ACQC10, ACQC20, ACQINT, ACQSPA, ACQHPr, ACQHRe, ACQMID, ACQROU
  - Total Storage (in MB): less than 10
  - Total Computer Time to Build (in hours): 0.1 to 0.2
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions? : no
  - Only Single Terms Used? : yes
- Clusters
- N-grams, Suffix arrays, Signature Files
  - Run ID : ACQADH, ACQUNC, ACQC10, ACQC20, ACQINT, ACQSPA, ACQHPr, ACQHRe, ACQMID, ACQROU
  - Total Storage (in MB): less than 10
  - Total Computer Time to Build (in hours): less than 2
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: the n-grams in each document were tallied
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Query construction

Automatically Built Queries (Routing)
- Topic Fields Used: none
- Average Computer Time to Build Query (in cpu seconds): 10
- Method used in Query Construction
  - Terms Selected From
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on terms in
    - Documents with Relevance Judgments: yes
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: description
- Average Time to Build Query (in Minutes): 2 minutes
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser? :
  - Other Lexical Tools? :
- Method used in Query Construction
  - Addition of Terms not Included in Topic? : yes
    - Source of Terms: general knowledge of user

Interactive Queries
- Initial Query Built Automatically or Manually: manually, as with adhoc
- Type of Person doing Interaction
  - System Expert: yes
- Average Time to do Complete Interaction
  - CPU Time (Total CPU Seconds for all Iterations): less than 10
  - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): less than 1
- Methods used in Interaction
  - Automatic Query Expansion from Relevant Documents? :
  - Other Automatic Methods: ACQUAINTANCE returned the 1000 top scoring documents from the database for each topic. These scores and documents were handed to the PARENTAGE information visualization system, which arranged the documents into clusters of most related documents, and labelled each cluster with the words and phrases which best characterized the significant elements of that cluster. By rapidly scanning the labels on the clusters, those groups of documents most closely matching the topic description were identified.
  - Manual Methods

Searching

Search Times
- Run ID : ACQADH, ACQUNC, ACQC10, ACQC20, ACQSPA, ACQHPr, ACQHRe, ACQMID, ACQROU
- Computer Time to Search (Average per Query, in CPU seconds): roughly 30 seconds per query

Machine Searching Methods
- Vector Space Model? : yes
- Probabilistic Model? : yes
- N-gram Matching? : yes
- Other: n-gram frequencies are offset according to population means

Factors in Ranking
- Term Frequency? : yes, where terms are n-grams
- N-gram Frequency? : yes

Machine Information
- Machine Type for TREC Experiment: CRAY C 90
- Was the Machine Dedicated or Shared: heavily shared
- Amount of Hard Disk Storage (in MB): 4,000
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 250

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: one-half man year
- Given appropriate resources
  - Could your system run faster? : yes
  - By how much (estimate)? : one to three orders of magnitude with improved algorithm design and hardware implementation
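
Illustrative sketch (not part of the submitted summary): the entries above describe tallying the n-grams in each document, offsetting n-gram frequencies according to population means, and ranking under a vector space model. The Python below shows one way those three steps could fit together. Everything specific in it is an assumption made for illustration: the use of character n-grams, the length n = 5, the subtraction of the collection-wide mean profile as the "offset", the cosine measure, and all function names. The summary does not state any of these details.

```python
import math
from collections import Counter

N = 5  # assumed n-gram length; the summary does not give a value


def ngram_profile(text, n=N):
    """Tally overlapping character n-grams and return relative frequencies."""
    text = " ".join(text.lower().split())  # fold case, collapse whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def mean_profile(profiles):
    """Average relative frequency of each n-gram over the collection."""
    sums = Counter()
    for p in profiles:
        for g, f in p.items():
            sums[g] += f
    return {g: f / len(profiles) for g, f in sums.items()}


def offset(profile, mean):
    """Subtract the population mean (assumed reading of 'offset')."""
    keys = set(profile) | set(mean)
    return {g: profile.get(g, 0.0) - mean.get(g, 0.0) for g in keys}


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(g, 0.0) for g, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs, n=N, top_k=1000):
    """Return (score, doc_index) pairs, best first, for the top_k documents."""
    profiles = [ngram_profile(d, n) for d in docs]
    mean = mean_profile(profiles)
    q = offset(ngram_profile(query, n), mean)
    scored = [(cosine(q, offset(p, mean)), i) for i, p in enumerate(profiles)]
    return sorted(scored, reverse=True)[:top_k]
```

A call such as `rank(topic_text, documents)` returns up to 1000 scored document indices, mirroring the "1000 top scoring documents" figure reported for the interactive run. Subtracting the mean profile before comparing is meant to discount n-grams that are common across the whole collection, which is one plausible reason a system with no stopword list and no stemming could still rank usefully.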