System Summary and Timing

Organization Name: University of California, San Diego
List of Run ID's: sdmix1, sdmix2, sdmix3

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 572
- Stemming Algorithm: SMART triestem
- Term Weighting: Term-count and log-normalized tf-idf
- Phrase Discovery?:
- Tokenizer?:
- Other Techniques for building Data Structures: Latent Semantic Indexing; Optimization of a ranking function based on relevance

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: sdmix1, sdmix2
  - Total Storage (in MB): 438
  - Total Computer Time to Build (in hours): 55
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: sdmix3
  - Total Storage (in MB): 372
  - Total Computer Time to Build (in hours): 14.5
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID: sdmix1, sdmix2, sdmix3
  - Type of Structure: LSI projection matrix
  - Total Storage (in MB): 82
  - Total Computer Time to Build (in hours): 66
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Used SVDPACK to generate first 300 singular values/vectors for all training documents in training set
- Other Data Structures built from TREC text
  - Run ID: sdmix1, sdmix2, sdmix3
  - Type of Structure: Weights on linear mixture of three experts
  - Total Storage (in MB): 0.000012
  - Total Computer Time to Build (in hours): 48
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Optimized rank-order statistic objective function using relevance feedback and a linear model of combining two vector-space experts and one LSI expert.
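
The linear mixture of three experts described above can be illustrated with a small sketch. It is not the system's code: the form does not give the exact rank-order statistic or the optimizer, so the sketch assumes a pairwise criterion (the fraction of relevant/non-relevant document pairs ranked in the right order) and a brute-force grid search over weights that sum to 1. All function names and the toy scores are hypothetical.

```python
from itertools import product


def mix_scores(expert_scores, weights):
    """Combine each document's per-expert scores with a linear mixture."""
    return [sum(w * s for w, s in zip(weights, doc)) for doc in expert_scores]


def pairwise_rank_agreement(scores, relevant):
    """Fraction of (relevant, non-relevant) pairs where the relevant one scores higher.

    Assumed stand-in for the rank-order statistic named in the form.
    """
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    if not pos or not neg:
        return 0.0
    correct = sum(1 for p in pos for n in neg if p > n)
    return correct / (len(pos) * len(neg))


def grid_search_weights(expert_scores, relevant, steps=10):
    """Try every weight triple on a coarse grid (illustrative optimizer only)."""
    best_weights, best_value = None, -1.0
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        weights = (i / steps, j / steps, (steps - i - j) / steps)
        value = pairwise_rank_agreement(mix_scores(expert_scores, weights), relevant)
        if value > best_value:
            best_weights, best_value = weights, value
    return best_weights, best_value


# Toy data (hypothetical): one row per document, holding scores from
# (vector-space expert A, vector-space expert B, LSI expert).
toy_scores = [
    (0.9, 0.2, 0.1),
    (0.1, 0.3, 0.8),
    (0.5, 0.4, 0.2),
    (0.6, 0.1, 0.1),
]
toy_relevant = [True, True, False, False]

weights, agreement = grid_search_weights(toy_scores, toy_relevant)
```

In this toy set no single expert orders all pairs correctly, while a mixture of the first and third experts does, which is the motivation for learning the weights from relevance judgments. The three learned weights are also consistent with the very small storage figure reported for this structure.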

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: all or DESC only
- Average Computer Time to Build Query (in cpu seconds): 0.2
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:
  - Other: ignored negated phrases; LSI projection to 300 dimensions

Automatically Built Queries (Routing)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): 0.2
- Method used in Query Construction
  - Terms Selected From
    - All Training Documents: yes
  - Term Weighting with Weights Based on terms in
    - Topics: yes
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from
  - Other: ignored negated phrases; LSI projection to 300 dimensions

Searching

Search Times
- Run ID: sdmix1, sdmix2, sdmix3
- Computer Time to Search (Average per Query, in CPU seconds): 480

Machine Searching Methods
- Vector Space Model?: yes
- Other: Latent Semantic Indexing; linear mixture of experts

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Other Term Weights?: yes (boolean)
- Other: Relative weighting of three different experts based on relevance

Machine Information
- Machine Type for TREC Experiment: Sun Sparc 10
- Was the Machine Dedicated or Shared: Shared
- Amount of Hard Disk Storage (in MB): 1360
- Amount of RAM (in MB): 83
- Clock Rate of CPU (in MHz): ???
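
As a rough illustration of the ranking factors listed under Searching (term frequency, inverse document frequency, vector space model), the sketch below ranks a few toy documents against a query. The form does not state the exact weighting formula, so the sketch assumes one common log-normalized variant, (1 + ln tf) * ln(N / df), with cosine similarity; stopword removal, stemming, and the LSI projection are omitted. The function names and the toy collection are hypothetical.

```python
import math
from collections import Counter


def log_tfidf_vector(tokens, doc_freq, num_docs):
    """Build a sparse vector with (1 + ln tf) * ln(N / df) weights (assumed formula)."""
    vector = {}
    for term, tf in Counter(tokens).items():
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term never seen in the collection, so it carries no weight
        vector[term] = (1.0 + math.log(tf)) * math.log(num_docs / df)
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy collection (hypothetical), already tokenized.
docs = {
    "d1": "latent semantic indexing of text".split(),
    "d2": "routing queries with relevance feedback".split(),
    "d3": "semantic text retrieval with queries".split(),
}

# Document frequency: number of documents containing each term.
doc_freq = Counter(term for tokens in docs.values() for term in set(tokens))
doc_vectors = {
    name: log_tfidf_vector(tokens, doc_freq, len(docs))
    for name, tokens in docs.items()
}

query = log_tfidf_vector("latent semantic indexing".split(), doc_freq, len(docs))
ranking = sorted(docs, key=lambda name: cosine(query, doc_vectors[name]), reverse=True)
```

Here `ranking` places the document sharing all three query terms first, the one sharing a single common term second, and the one with no overlap last. In the system described by this form, scores of this kind would come from each vector-space expert and be combined with the LSI expert's scores.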

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: minimal
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 2-10 times
- Features the System is Missing that would be beneficial: system is currently only experimental and not easily usable

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: Reported times are generally real (wall-clock) time, not CPU time. Furthermore, many of these operations were slowed by up to a factor of ten by network traffic and a slow disk.