System Summary and Timing

Organization Name: FS Consulting
List of Run IDs: fsclt1, fsclt2

Construction of Indices, Knowledge Bases, and Other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: 626; an initial list of 377 stop words is used and added to during the indexing process
- Controlled Vocabulary?: no
- Stemming Algorithm: plural stemmer (one can alternatively select no stemming or the Porter stemmer); a sketch of a plural stemmer appears at the end of the Query Construction section below
- Morphological Analysis: no
- Term Weighting: applied at search time, tf.idf
- Phrase Discovery?: no
- Syntactic Parsing?: no
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Spelling Checking (with manual correction)?: no
- Spelling Correction?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?: no
- Manually-Indexed Terms?: no
- Other Techniques for Building Data Structures: no

Statistics on Data Structures Built from TREC Text
- Inverted index
  - Run ID: all runs
  - Total Storage (in MB): 800
  - Total Computer Time to Build (in hours): 14
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
  - Only Single Terms Used?: no
- Clusters: none
- N-grams, Suffix Arrays, Signature Files: none
- Knowledge Bases: none
- Use of Manual Labor: none
- Special Routing Structures: none
- Other Data Structures Built from TREC Text: none

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: not applicable
- Average Computer Time to Build Query (in CPU seconds): approx. 1 second per query
- Method Used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes, see the 'Other' item below
  - Phrase Extraction from Topics?: no
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: no
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries Using a Previously-Constructed Data Structure?:
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no
  - Other: automatic queries were built from the result set generated by the manually built (ad-hoc) queries. The ten most important words, ranked by tf.idf, were extracted from the top three documents (combined into one) and added to the original queries as relevance feedback terms. This did not add documents to the result set, but it did rerank the set; a sketch of this step follows at the end of this section.

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: Description
- Average Time to Build Query (in minutes): 2
- Type of Query Builder
  - Domain Expert: no; we simulated an 'average' or 'naive' searcher
  - Computer System Expert: no
- Tools Used to Build Query
  - Word Frequency List?: no, but a stop word list was consulted
  - Knowledge Base Browser?: no
  - Other Lexical Tools?: no
- Method Used in Query Construction
  - Term Weighting?: no
  - Boolean Connectors (AND, OR, NOT)?: yes
  - Proximity Operators?: yes
  - Addition of Terms Not Included in Topic?: yes
  - Source of Terms: the searcher was allowed to expand on terms present in the topic when it made sense; for example, 'nanny' was added as a search term for a topic about 'au-pairs'. Wildcards and soundex could also be applied to search terms.
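The report does not spell out the plural stemmer's rules. The sketch below is a minimal illustration, assuming the usual light-weight English plural-suffix stripping that distinguishes such a stemmer from the more aggressive Porter stemmer; the function name plural_stem and the specific rules are illustrative, not taken from the system.

    def plural_stem(word):
        """Illustrative plural stemmer: strips common English plural
        suffixes only; deliberately simpler than the Porter stemmer."""
        word = word.lower()
        if word.endswith("ies") and len(word) > 4:
            return word[:-3] + "y"      # 'ponies' -> 'pony'
        if word.endswith("es") and len(word) > 3 and word[-3] in "sxz":
            return word[:-2]            # 'boxes' -> 'box'
        if word.endswith("s") and not word.endswith("ss"):
            return word[:-1]            # 'terms' -> 'term'
        return word                     # 'glass' stays 'glass'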
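A minimal sketch of the relevance feedback step described under 'Other' above, assuming tokenized documents and a precomputed document-frequency table; feedback_terms, df, and n_docs are illustrative names, and the exact tf.idf variant the system used is not stated.

    from collections import Counter
    import math

    def feedback_terms(top_docs, df, n_docs, k=10):
        """Pick the k highest-tf.idf words from the top-ranked documents,
        with the documents combined into a single bag of words."""
        combined = Counter()
        for tokens in top_docs:          # top_docs: list of token lists
            combined.update(tokens)
        tfidf = {t: tf * math.log(n_docs / df.get(t, 1))
                 for t, tf in combined.items()}
        return sorted(tfidf, key=tfidf.get, reverse=True)[:k]

    # Expand the original query with the feedback terms and search again;
    # as noted above, the result set stays the same but its ranking changes.
    # expanded_query = original_terms + feedback_terms(top_three, df, n_docs)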
Searching

Search Times
- Run ID: fsclt1
  - Computer Time to Search (Average per Query, in CPU seconds): 18
  - Component Times: searching and ranking, 5 seconds; downloading and saving of results by the client, 13 seconds. (Note: these are 'real' times; the total run of 50 questions took 900 seconds and includes server startup, searching and ranking, transmission of results to the client, and saving of those results to a file.)
- Run ID: fsclt2
  - Computer Time to Search (Average per Query, in CPU seconds): 36
  - Component Times: searching and ranking, 23 seconds; downloading and saving of results by the client, 13 seconds. (Note: these are 'real' times; the total run of 50 questions took 1800 seconds and includes server startup, searching and ranking, transmission of results to the client, and saving of those results to a file.)

Machine Searching Methods
- Vector Space Model?: no
- Probabilistic Model?: yes
- Cluster Searching?: no
- N-gram Matching?: no
- Boolean Matching?: yes
- Fuzzy Logic?: no
- Free Text Scanning?: no
- Neural Networks?: no
- Conceptual Graph Matching?: no
- Other: phrase, soundex, and wildcard searching are all possible

Factors in Ranking (a sketch combining the factors marked 'yes' follows at the end of this section)
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Other Term Weights?: no
- Semantic Closeness?: no
- Position in Document?: no
- Syntactic Clues?: no
- Proximity of Terms?: no
- Information Theoretic Weights?: no
- Document Length?: yes
- Percentage of Query Terms Which Match?: yes
- N-gram Frequency?: no
- Word Specificity?: no
- Word Sense Frequency?: no
- Cluster Distance?: no

Machine Information
- Machine Type for TREC Experiment: SPARCstation 5
- Was the Machine Dedicated or Shared: dedicated
- Amount of Hard Disk Storage (in GB): 4
- Amount of RAM (in MB): 32
- Clock Rate of CPU (in MHz): 70

System Comparisons
- Amount of "Software Engineering" That Went into the Development of the System: two person-years
- Given Appropriate Resources
  - Could your system run faster?: yes
  - By how much (estimate)?: the speedup is tied to the speed of the hardware and the amount of memory available for indexing and searching
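The report names four ranking factors (term frequency, inverse document frequency, document length, and percentage of matching query terms) but not the formula that combines them. The sketch below shows one plausible combination under assumed normalizations; rank_score and all parameter names are illustrative, not the system's actual scoring function.

    import math

    def rank_score(query_terms, doc_tf, doc_len, avg_len, df, n_docs):
        """Illustrative ranking: tf.idf summed over matching terms,
        normalized by relative document length, then scaled by the
        fraction of query terms the document matches."""
        matched = [t for t in query_terms if t in doc_tf]
        if not matched:
            return 0.0
        tfidf = sum(doc_tf[t] * math.log(n_docs / df[t]) for t in matched)
        tfidf /= doc_len / avg_len                  # document-length factor
        coverage = len(matched) / len(query_terms)  # % of query terms matched
        return tfidf * coverage

Dividing by relative document length keeps long documents from dominating, and the coverage factor favours documents that match more of the query, corresponding to the four 'yes' entries under Factors in Ranking.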