System Summary and Timing

Organization Name: NMSU/CRL
List of Run IDs: nmsuc1, nmsuc2, nmsuc3

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: English 571, Spanish 379
- Controlled Vocabulary?: No
- Stemming Algorithm: Spanish implementation of the Porter algorithm
- Morphological Analysis: No
- Term Weighting: TF-IDF style
- Phrase Discovery?: No
  - Kind of Phrase: Not applicable
  - Method Used (statistical, syntactic, other): Not applicable
- Syntactic Parsing?: No
- Word Sense Disambiguation?: Yes; for nmsuc3, corpus disambiguation was applied to translation equivalents
- Heuristic Associations (including short definition)?: No
- Spelling Checking (with manual correction)?: No
- Spelling Correction?: Fuzzy term expansion may serve as spelling correction
- Proper Noun Identification Algorithm?: No
- Tokenizer?: English/Spanish word tokenizer
  - Patterns which are tokenized: Words
- Manually-Indexed Terms?: No
- Other Techniques for Building Data Structures: None

Statistics on Data Structures Built from TREC Text
- Inverted index
  - Run ID: nmsuc1, nmsuc2, nmsuc3
  - Total Storage (in MB): 146.7
  - Total Computer Time to Build (in hours): 1.3
  - Automatic Process? (If not, number of manual hours): Fully automatic
  - Use of Term Positions?: No
  - Only Single Terms Used?: Yes
- Clusters
- N-grams, Suffix Arrays, Signature Files
  - Run ID: nmsuc1, nmsuc2, nmsuc3
  - Total Storage (in MB): 42.8
  - Total Computer Time to Build (in hours): 1.3 (simultaneous with inverted index)
  - Automatic Process?
    (If not, number of manual hours): Fully automatic
  - Brief Description of Method: Compressed bit vectors of document-term contents; not actually used by the system for the reported results
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures Built from TREC Text
  - Run ID: nmsuc1, nmsuc2, nmsuc3
  - Type of Structure: B-tree words table
  - Total Storage (in MB): 6
  - Total Computer Time to Build (in hours): 1.3 (simultaneous with inverted index)
  - Automatic Process? (If not, number of manual hours): Yes
  - Brief Description of Method: Standard B-tree mapping words to word numbers

Data Built from Sources Other than the Input Text
- Internally-Built Auxiliary File
  - Use of Manual Labor
- Externally-Built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): Collins Bilingual Dictionary
  - Total Storage (in MB): 1.2
  - Number of Concepts Represented: 24,000
  - Type of Representation: Lexical entries

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: nmsuc1 used S-DESC only; nmsuc2 and nmsuc3 used E-DESC only
- Average Computer Time to Build Query (in CPU seconds): 1.2
- Method Used in Query Construction
  - Term Weighting (weights based on terms in topics)?: Yes
  - Phrase Extraction from Topics?: No
  - Syntactic Parsing of Topics?: No
  - Word Sense Disambiguation?: Yes
  - Proper Noun Identification Algorithm?: No
  - Tokenizer?: Yes, words
  - Heuristic Associations to Add Terms?: Fuzzy expansion for nmsuc2 and nmsuc3, for terms not in the bilingual dictionary
  - Expansion of Queries Using Previously-Constructed Data Structure?: nmsuc2 and nmsuc3 used the bilingual dictionary for translation
    - Structure Used: Bilingual dictionary
  - Automatic Addition of Boolean Connectors or Proximity Operators?: No
  - Other: None

Searching

Search Times
- Run ID: nmsuc1
- Computer Time to Search (Average per Query, in CPU seconds): 0.08 on an UltraSPARC machine

Searching Methods
- Vector Space Model?: Yes
- Probabilistic Model?: No
- Cluster Searching?: No
- N-gram Matching?: No
- Boolean Matching?: No
- Fuzzy Logic?: No
- Free Text Scanning?: No
- Neural Networks?: No
- Conceptual Graph Matching?: No
- Other: None

Factors in Ranking
- Term Frequency?: Yes
- Inverse Document Frequency?: Yes
- Other Term Weights?: No
- Semantic Closeness?: No
- Position in Document?: No
- Syntactic Clues?: No
- Proximity of Terms?: No
- Information Theoretic Weights?: No
- Document Length?: No
- Percentage of Query Terms which Match?: No
- N-gram Frequency?: No
- Word Specificity?: No
- Word Sense Frequency?: No
- Cluster Distance?: No
- Other: None

Machine Information
- Machine Type for TREC Experiment: Sun SPARCstation 5 and Sun UltraSPARC
- Was the Machine Dedicated or Shared: Shared
- Amount of Hard Disk Storage (in MB): 4096 (4 GB)
- Amount of RAM (in MB): 154
- Clock Rate of CPU (in MHz): 190 (??) for the UltraSPARC

System Comparisons
- Amount of "Software Engineering" which Went into the Development of the System: 1 person-month
- Given appropriate resources, could your system run faster?: Yes
  - By how much (estimate)?: 500%
- Features the System is Missing that Would Be Beneficial: A faster document-score sort algorithm; heavier reliance on system memory
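The indexing pipeline answered above (word tokenization, stopword removal, single-term inverted index with TF-IDF-style weighting, no positions) can be sketched as follows. This is a toy illustration, not the CRL code: the corpus, stopword list, and function names are invented, and the real system also applied a Spanish Porter-style stemmer, which is omitted here.

```python
import math
from collections import Counter, defaultdict

# Tiny stand-in stopword list; the actual lists had 571 English
# and 379 Spanish entries.
STOPWORDS = {"the", "a", "of", "and"}

def tokenize(text):
    """Lowercase word tokenizer (stand-in for the English/Spanish tokenizer)."""
    return [w for w in text.lower().split()
            if w.isalpha() and w not in STOPWORDS]

def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}; single terms only,
    no term positions, matching the form's answers."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def tf_idf(index, term, doc_id, n_docs):
    """TF-IDF-style weight: tf * log(N / df)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return postings[doc_id] * math.log(n_docs / len(postings))

docs = {1: "trade agreement between mexico and spain",
        2: "spain signs trade pact"}
index = build_inverted_index(docs)
```

A term appearing in every document (here "trade") gets weight zero under this scheme, while rarer terms ("mexico") score positively.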
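For the cross-language runs, the form reports dictionary-based query translation (Collins bilingual dictionary) with fuzzy expansion as a fallback for out-of-dictionary terms. A minimal sketch of that idea, with an invented two-entry dictionary and Python's `difflib` standing in for whatever fuzzy matcher the system actually used:

```python
import difflib

# Toy stand-in for the Collins bilingual dictionary (Spanish -> English);
# the real resource held about 24,000 lexical entries.
BILINGUAL = {"comercio": ["trade", "commerce"],
             "acuerdo": ["agreement", "accord"]}

def fuzzy_expand(term, vocabulary, cutoff=0.8):
    """Fallback for terms missing from the dictionary: return close
    headword matches, which also catches minor misspellings."""
    return difflib.get_close_matches(term, list(vocabulary), n=3, cutoff=cutoff)

def translate_query(terms):
    """Translate each source-language term via the dictionary; fall back
    to fuzzy expansion against the headwords when a term is missing."""
    translated = []
    for t in terms:
        if t in BILINGUAL:
            translated.extend(BILINGUAL[t])
        else:
            for match in fuzzy_expand(t, BILINGUAL):
                translated.extend(BILINGUAL[match])
    return translated
```

Here a query term like "acuerdos" (absent from the toy dictionary) fuzzy-matches the headword "acuerdo" and still contributes its translations, while a term with no close headword contributes nothing.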
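The search side reduces, per the answers above, to a vector-space match where term frequency and inverse document frequency are the only ranking factors (no length normalization, proximity, or positions). A sketch of such a scorer, with an invented toy index:

```python
import math
from collections import Counter

def rank(query_terms, index, n_docs):
    """Score documents by summed TF-IDF over query terms and return them
    best-first; tf and idf are the only factors, per the form above."""
    scores = Counter()
    for term in set(query_terms):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return scores.most_common()

# Toy index: term -> {doc_id: term frequency}, assuming 10 documents total.
index = {"trade": {1: 2, 2: 1}, "spain": {2: 3}}
ranking = rank(["trade", "spain"], index, n_docs=10)
```

Document 2 outranks document 1 here because it matches both query terms and "spain" is the rarer, higher-idf term.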