System Summary and Timing

Organization Name: City University
List of Run ID's: city96a1, city96a2, city96r1, city96r2, city96f1, city96f2, city96f3, city96vlc1, city96vlc2, [more]

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list:
- a1, a2, r1, r2, f1, f2, f3, vlc1, vlc2 ...
- Controlled Vocabulary?: No
- Stemming Algorithm: Modified Porter with spelling normalization
- Term Weighting: No
- Phrase Discovery?: No
- Syntactic Parsing?: No
- Word Sense Disambiguation?: No
- Heuristic Associations (including short definition)?: No
- Spelling Checking (with manual correction)?: No
- Spelling Correction?: No
- Proper Noun Identification Algorithm?: No
- Tokenizer?: No
- Manually-Indexed Terms?: No
- Other Techniques for building Data Structures: None

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: a1, a2, vlc1 ...
  - Total Storage (in MB): 1015
  - Total Computer Time to Build (in hours): about 8
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions?: Yes
  - Only Single Terms Used?: Mainly
- Inverted index
  - Run ID: r1, r2, f1, f2, f3
  - Total Storage (in MB): 692
  - Total Computer Time to Build (in hours): about 6
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions?: Yes
  - Only Single Terms Used?: Mainly
- Inverted index
  - Run ID: vlc2
  - Total Storage (in MB): 2200
  - Total Computer Time to Build (in hours): about 18
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions?: Yes
  - Only Single Terms Used?: Mainly
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): Somewhat angled towards American data
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Contains synonym classes, go phrases, stop- and semi-stop terms, prefixes
  - Total Storage (in MB): <<1
  - Number of Concepts Represented: about 1000
  - Type of Representation: Lookup table
  - Total Manual Time to Build (in hours): Not known; modifications are made occasionally
  - Use of Manual Labor
    - Other: Yes
- Externally-built Auxiliary File

Query construction

Automatically Built Queries (Ad-Hoc)
- city96a1
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): several minutes
- Method used in Query Construction
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:
  - Other: Queries built from terms extracted from documents retrieved by a pilot search using the topic statement. Weights based on statistics of occurrence in documents retrieved in pilot search, in topic statement, and in the collection.

Automatically Built Queries (Ad-Hoc)
- city96a2
- Topic Fields Used: DESC
- Average Computer Time to Build Query (in cpu seconds): several minutes
- Method used in Query Construction
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:
  - Other: Queries built from terms extracted from documents retrieved by a pilot search using the topic statement. Weights based on statistics of occurrence in documents retrieved in pilot search, in topic statement, and in the collection.
Automatically Built Queries (Ad-Hoc)
- city96vlc1, city96vlc2
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): several minutes
- Method used in Query Construction
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:
  - Other: Queries built from terms extracted from documents retrieved by a pilot search using the topic statement. Weights based on statistics of occurrence in documents retrieved in pilot search, in topic statement, and in the collection.

Automatically Built Queries (Routing)
- Topic Fields Used: all (but only to modify weights)
- Average Computer Time to Build Query (in cpu seconds): About an hour
- Method used in Query Construction
  - Terms Selected From
    - Only Documents with Relevance Judgments: All known relevant documents
  - Term Weighting with Weights Based on terms in
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Interactive Queries
- Initial Query Built Automatically or Manually: Manually
- Type of Person doing Interaction
- Average Time to do Complete Interaction
  - CPU Time (Total CPU Seconds for all Iterations): Unknown
  - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): 20
- Average Number of Iterations: 3.6
- Average Number of Documents Examined per Iteration: 11.3
- Minimum Number of Iterations: 1
- Maximum Number of Iterations: 7
- What Determines the End of an Iteration: a user-initiated search on the current query formulation
- Methods used in Interaction
  - Automatic Term Reweighting from Relevant Documents?: Yes
  - Automatic Query Expansion from Relevant Documents?:
    - All Terms in Relevant Documents added: Yes, but the user only ever sees the top T terms, T <= 20
    - Only Top X Terms Added (what is X): 20
    - User Selected Terms Added: The top U user terms (U <= 20) are always included in the query
  - Manual Methods
    - Using Individual Judgment (No Set Algorithm)?: Yes

Searching

Search Times
- Run ID: a1, a2, r1, r2, vlc1, vlc2, f1, f2, f3
- Computer Time to Search (Average per Query, in CPU seconds): About 180 (SS20) or 270 (SS10)
- Component Times: Not known, but most of the time goes on passage determination and weighting.

Machine Searching Methods
- Probabilistic Model?: More or less

Factors in Ranking
- Term Frequency?: Yes
- Inverse Document Frequency?: Yes
- Other Term Weights?: from relevance information (routing) or top documents retrieved (ad hoc)

Machine Information
- Machine Type for TREC Experiment: Suns SS10 (2) and SS20
- Was the Machine Dedicated or Shared: Nearly dedicated
- Amount of Hard Disk Storage (in MB): About 25 GB
- Amount of RAM (in MB): 320, 32, 160 resp.
- Clock Rate of CPU (in MHz): Not known; they're not top-of-the-range

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: Lots of hacking
- Given appropriate resources
  - Could your system run faster?: Yes
  - By how much (estimate)?: With a few gigabytes of core perhaps X4
- Features the System is Missing that would be beneficial: Intelligence, or at least common sense.

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: All runs used search-time passage determination and searching. This is nearly always slightly beneficial.
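
Illustrative note on the ad hoc query construction described above: the form says queries are built from terms extracted from documents retrieved by a pilot search, with weights based on occurrence statistics in those documents and in the collection. It does not give the formula. The sketch below is one possible reading, assuming the top pilot documents are treated as if relevant and terms are weighted with the well-known Robertson/Sparck Jones relevance weight. The function names, the `r * w` selection value, and the cutoff of 20 terms are assumptions made for illustration, not details taken from the form.

```python
import math
from collections import Counter


def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones relevance weight with 0.5 smoothing.

    r: pilot documents containing the term
    R: number of pilot documents (assumed relevant)
    n: collection documents containing the term
    N: number of documents in the collection
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))


def expand_query(pilot_docs, doc_freq, num_docs, max_terms=20):
    """Pick query terms from pilot-search documents (hypothetical helper).

    pilot_docs: list of token lists, one per top-ranked pilot document
    doc_freq:   term -> number of collection documents containing it
    Returns up to max_terms (term, weight) pairs, best first.
    """
    R = len(pilot_docs)
    r_counts = Counter()
    for doc in pilot_docs:
        r_counts.update(set(doc))  # count each term once per document

    scored = []
    for term, r in r_counts.items():
        w = rsj_weight(r, R, doc_freq[term], num_docs)
        # Assumed selection value: favour terms that are both frequent in
        # the pilot set and highly weighted.
        scored.append((r * w, term, w))

    scored.sort(key=lambda item: (-item[0], item[1]))
    return [(term, w) for _, term, w in scored[:max_terms]]
```

A call such as `expand_query(pilot_docs, doc_freq, num_docs)` returns the candidate terms with their weights; a real system would also fold in the topic-statement statistics that the form mentions, which this sketch leaves out.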
System Summary and Timing

Organization Name: City University
List of Run ID's: city96c1 and city96c2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Stemming Algorithm:
- Phrase Discovery?:
  - Kind of Phrase: yes
  - Method Used (statistical, syntactic, other): other
- Tokenizer?:
- Other Techniques for building Data Structures: Chinese word segmentation

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: city96c1
  - Total Storage (in MB): 666.7MB
  - Total Computer Time to Build (in hours): 56 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
- Inverted index
  - Run ID: city96c2
  - Total Storage (in MB): 1027.6MB
  - Total Computer Time to Build (in hours): 16 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: yes
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID: city96c1 and city96c2
  - Type of Structure: mapping table
  - Total Storage (in MB): 3.7MB
  - Total Computer Time to Build (in hours): 1 hour
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: maps a document's name to an integer

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Chinese word segmentation lexicon
  - Total Storage (in MB): 0.61MB
  - Number of Concepts Represented: 70,000
  - Total Computer Time to Build (in hours): 18 hours
  - Total Manual Time to Build (in hours): 720 hours or more
  - Use of Manual Labor
- Externally-built Auxiliary File

Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: all
- Average Computer Time to Build Query (in cpu seconds): several minutes
- Method used in Query Construction
  - Phrase Extraction from Topics?: yes
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?:

Searching

Search Times
- Run ID: city96c1
- Computer Time to Search (Average per Query, in CPU seconds): 1683.6
- Component Times: Not known, but most of the time goes on accessing the inverted file, sorting and weighting.

Search Times
- Run ID: city96c2
- Computer Time to Search (Average per Query, in CPU seconds): 9546.1
- Component Times: Not known, but most of the time goes on accessing the inverted file, sorting and weighting.
Machine Searching Methods
- Probabilistic Model?: yes

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Position in Document?: yes
- Document Length?: yes
- Other: within-document term frequency, within-query term frequency, the average document length, the number of query terms

Machine Information
- Machine Type for TREC Experiment: SGI Challenge L
- Was the Machine Dedicated or Shared: Shared
- Amount of Hard Disk Storage (in MB): Total: 17 GB; I can use 3896.5MB space temporarily
- Amount of RAM (in MB): 512MB
- Clock Rate of CPU (in MHz): 150MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: a lot
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 5-6 times
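
Illustrative note on the ranking factors listed above: the form names within-document term frequency, within-query term frequency, inverse document frequency, document length and average document length, but not how they are combined. A BM25-style function is one widely used probabilistic way to combine exactly these quantities, so the sketch below uses it as a stand-in. The function name and the parameter values `k1`, `b` and `k3` are assumptions chosen for illustration; the form does not report them, and the "Position in Document" and "number of query terms" factors are not modelled here.

```python
import math


def bm25_score(query_tf, doc_tf, doc_len, avg_doc_len, doc_freq, num_docs,
               k1=1.2, b=0.75, k3=7.0):
    """BM25-style score of one document for one query (illustrative only).

    query_tf: term -> within-query term frequency
    doc_tf:   term -> within-document term frequency
    doc_freq: term -> number of collection documents containing the term
    """
    # Length normalisation: longer-than-average documents get a larger K,
    # which damps the contribution of their term frequencies.
    K = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)

    score = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # term absent from the document contributes nothing
        n = doc_freq[term]
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        doc_part = (k1 + 1.0) * tf / (K + tf)
        query_part = (k3 + 1.0) * qtf / (k3 + qtf)
        score += idf * doc_part * query_part
    return score
```

With these defaults, a document that repeats a query term more often scores higher, while a much longer document with the same term count scores lower, which is the behaviour the listed document-length factors suggest.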