System Summary and Timing

Organization Name: Logicon
List of Run ID's: losPA2, losPA3

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 580
- Controlled Vocabulary? : No
- Stemming Algorithm: No
- Term Weighting: No
- Phrase Discovery? :
- Tokenizer? : Yes
  - Patterns which are tokenized: All alphanumeric strings were tokenized. Only alphabetic tokens of length > 1 were kept. The following were then eliminated: stopwords; tokens serving as SGML tags; tokens occurring <= 2 times on each CDROM.

Statistics on Data Structures built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID : losPA2, losPA3
  - Type of Structure: Corpus word-frequency table
  - Total Storage (in MB): 15
  - Total Computer Time to Build (in hours): 9
  - Automatic Process? (If not, number of manual hours): Yes
  - Brief Description of Method: Process analyzed all documents on all 3 CDROMs. Each document was tokenized, and a count was kept with each token indicating the number of documents in which it occurred.

Query construction

Automatically Built Queries (Routing)
- Topic Fields Used: None
- Average Computer Time to Build Query (in cpu seconds): CPU time was not measured. Average wall-clock time was 324 seconds.
- Method used in Query Construction
  - Terms Selected From
    - Only Documents with Relevance Judgments: Yes
  - Term Weighting with Weights Based on terms in
    - All Training Documents: Yes
    - Documents with Relevance Judgments: Yes
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
    - Patterns which are tokenized (dates, phone numbers, common patterns, etc): Same as tokenizing rules for Data Structures.
    - from Documents with Relevance Judgments: Yes
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from
    - Documents with Relevance Judgments: Yes. Any 2 of the 10 "most descriptive" tokens were specified in an AND condition for document selection. The 1000 "most descriptive" tokens were then used for document ranking.

Searching

Search Times
- Run ID : losPA2, losPA3
- Computer Time to Search (Average per Query, in CPU seconds): CPU time was not measured. Average wall-clock time was 0.45 seconds per document, or 0.009 seconds per profile per document.

Machine Searching Methods
- Boolean Matching? : Yes
- Free Text Scanning? : Yes. Search engine was the Logicon Message Dissemination System (LMDS), a COTS product.

Factors in Ranking
- Term Frequency? : Yes. Term frequency in the set of relevant training documents vs. the set of all training documents.
- Proximity of Terms? : Passage-level searching was employed as part of the scoring algorithm. A passage consisted of 100 contiguous non-stopwords, and each new passage began 50 non-stopwords after the beginning of the previous passage.
- Document Length? : Yes. Score from term weights was multiplied by the ratio of unique query tokens in document or passage to total unique tokens in document or passage.

Machine Information
- Machine Type for TREC Experiment: Sun SPARCstation 10
- Was the Machine Dedicated or Shared: Dedicated
- Amount of Hard Disk Storage (in MB): 5,000 MB
- Amount of RAM (in MB): 32 MB
- Clock Rate of CPU (in MHz): 50 MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: The search engine was the Logicon Message Dissemination System (LMDS), a COTS product. The query creation and document ranking software were research prototypes integrated via its API.
- Given appropriate resources
  - Could your system run faster? : Yes
  - By how much (estimate)? : 50 percent
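The tokenizing rules and the corpus word-frequency table described under "Construction of Indices" might be sketched roughly as follows. This is only an illustration, not the Logicon implementation: the function names, the tiny stopword and SGML tag sets, and the lowercasing step are assumptions, and the "<= 2 times" cutoff is applied here to a single collection rather than to each CDROM separately.

```python
import re
from collections import Counter

# Hypothetical stand-ins: the real system used a 580-word stopword list
# and the SGML tag names that occur in the TREC collections.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "was"}
SGML_TAGS = {"doc", "docno", "text"}


def tokenize(text):
    """Split text into alphanumeric strings, then filter them.

    Keeps only alphabetic tokens longer than one character that are
    neither stopwords nor SGML tag names. Lowercasing is an assumption.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [
        t for t in tokens
        if t.isalpha() and len(t) > 1
        and t not in STOPWORDS and t not in SGML_TAGS
    ]


def build_doc_freq(documents, min_occurrences=3):
    """Return {token: number of documents in which the token occurs}.

    Tokens occurring fewer than min_occurrences times overall are dropped,
    which mirrors the "<= 2 times" rule for one collection (assumption).
    """
    total = Counter()      # total occurrences of each token
    doc_freq = Counter()   # number of documents containing each token
    for text in documents:
        tokens = tokenize(text)
        total.update(tokens)
        doc_freq.update(set(tokens))
    return {t: n for t, n in doc_freq.items() if total[t] >= min_occurrences}
```

Calling `build_doc_freq` on an iterable of raw document strings yields the per-token document counts; a per-CDROM variant would run the cutoff once per disc before merging.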
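The selection and passage-level ranking steps described under "Query construction" and "Factors in Ranking" can be illustrated with a small sketch. The form does not give the term-weight formula, how the last partial passage is handled, or how passage scores are combined, so the sketch takes weights as an input, lets the final passage be shorter than 100 tokens, and uses the best passage score as the document score. All function names are hypothetical, and the input is assumed to be a list of non-stopword tokens.

```python
def passages(tokens, size=100, step=50):
    """Yield overlapping passages of `size` tokens, each new one starting
    `step` tokens after the start of the previous one.

    Assumption: the final passage may be shorter than `size`.
    """
    if not tokens:
        return
    start = 0
    while True:
        yield tokens[start:start + size]
        if start + size >= len(tokens):
            break
        start += step


def is_selected(tokens, top_terms, required=2):
    """Boolean selection: at least `required` of the top terms are present
    (the form states any 2 of the 10 "most descriptive" tokens)."""
    return len(set(tokens) & set(top_terms)) >= required


def score(tokens, weights):
    """Sum of weights of matched query tokens, multiplied by the ratio of
    unique query tokens to total unique tokens in the span.

    Summing the weights is an assumption; the form only states the ratio.
    """
    unique = set(tokens)
    if not unique:
        return 0.0
    matched = {t for t in unique if t in weights}
    return sum(weights[t] for t in matched) * len(matched) / len(unique)


def document_score(tokens, weights):
    """Best passage score for a document (combination rule is an assumption)."""
    return max((score(p, weights) for p in passages(tokens)), default=0.0)
```

In use, a document would first be checked with `is_selected` against the 10 top tokens of a profile and, if it passes, ranked with `document_score` using weights for up to 1000 tokens.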