System Summary and Timing

Organization Name: U. of California, Berkeley
List of Run ID's: Brkly13, Brkly14, Brkly15, Brkly16, Brkly17, Brkly18, BrklySP5, BrklySP6, BrklyCH1, BrklyCH2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 572 (English), 444 (Chinese)
- Stemming Algorithm:
  English: SMART system (version 10) stemmer
  Spanish: The stemming algorithm removes standard endings from words (gender, tense, etc.). It also resolves the spelling of irregular verbs to the standard (infinitive) spelling.
- Term Weighting: Weights determined from frequency statistics by logistic regression.
- Phrase Discovery?: no
- Syntactic Parsing?: no
- Word Sense Disambiguation?: no
- Heuristic Associations (including short definition)?: no
- Proper Noun Identification Algorithm?: no
- Tokenizer?: no
- Manually-Indexed Terms?: /* Chinese? */

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: Brkly13, Brkly14
  - Total Storage (in MB): 156
  - Total Computer Time to Build (in hours): 3
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: Brkly15, Brkly16, Brkly17, Brkly18
  - Total Storage (in MB): 621
  - Total Computer Time to Build (in hours): 10
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: BrklySP5, BrklySP6
  - Total Storage (in MB): 129
  - Total Computer Time to Build (in hours): 20
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Inverted index
  - Run ID: BrklyCH1, BrklyCH2
  - Total Storage (in MB): 170
  - Total Computer Time to Build (in hours): 5
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
  - Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Use of Manual Labor
- Externally-built Auxiliary File
  - Type of File (Treebank, WordNet, etc.): dictionary (Chinese)
  - Total Storage (in MB): 0.8
  - Number of Concepts Represented: 90,000
  - Type of Representation: list of words

Query construction

Automatically Built Queries (Ad-Hoc)
- Run ID: Brkly13
- Topic Fields Used: desc
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: no
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: no
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?: no
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no

Automatically Built Queries (Ad-Hoc)
- Run ID: Brkly14
- Topic Fields Used: title, desc, narr
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: no
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: no
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?: no
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no

Automatically Built Queries (Ad-Hoc)
- Run ID: BrklySP5
- Topic Fields Used: desc
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: no
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: no
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?: no
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no

Automatically Built Queries (Ad-Hoc)
- Run ID: BrklyCH1
- Topic Fields Used: C-title, C-desc, and C-narr
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: no
  - Syntactic Parsing of Topics?: no
  - Word Sense Disambiguation?: no
  - Proper Noun Identification Algorithm?: no
  - Tokenizer?: no
  - Heuristic Associations to Add Terms?: no
  - Expansion of Queries using Previously-Constructed Data Structure?: no
  - Automatic Addition of Boolean Connectors or Proximity Operators?: no

Automatically Built Queries (Routing)
- Topic Fields Used: dom, title, desc, narr, con, def, nat, time
- Average Computer Time to Build Query (in cpu seconds): 100
- Method used in Query Construction
  - Terms Selected From
    - Topics: yes
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on terms in
    - Topics: yes
    - Documents with Relevance Judgments: yes
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Run ID: Brkly17, Brkly18
- Topic Fields Used: title, desc, narr
- Average Time to Build Query (in Minutes): 90
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method used in Query Construction
  - Term Weighting?: yes
  - Addition of Terms not Included in Topic?: yes
    - Source of Terms: test collection & parallel collections

Manually Constructed Queries (Ad-Hoc)
- Run ID: BrklyCH2
- Topic Fields Used: C-title, C-desc, C-narr
- Average Time to Build Query (in Minutes): 160
- Type of Query Builder
  - Computer System Expert: yes
- Tools used to Build Query
  - Word Frequency List?: yes
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method used in Query Construction
  - Term Weighting?: yes
  - Addition of Terms not Included in Topic?: yes
    - Source of Terms: test collection

Searching

Search Times
- Run ID: Brkly13, Brkly14
- Computer Time to Search (Average per Query, in CPU seconds): 9.382
- Component Times:
  User time: 7.331 seconds/query
  System time: 2.050 seconds/query

Search Times
- Run ID: Brkly15
- Computer Time to Search (Average per Query, in CPU seconds): 39
- Component Times:
  User time: 38.1 seconds/query
  System time: 0.9 seconds/query

Search Times
- Run ID: Brkly16
- Computer Time to Search (Average per Query, in CPU seconds): 39
- Component Times:
  User time: 38.1 seconds/query
  System time: 0.9 seconds/query

Search Times
- Run ID: BrklyCH1, BrklyCH2
- Computer Time to Search (Average per Query, in CPU seconds): 20
- Component Times:
  User time: 18.56 seconds/query
  System time: 1.43 seconds/query

Machine Searching Methods
- Probabilistic Model?: yes

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Document Length?: yes

Machine Information
- Machine Type for TREC Experiment: Ultrasparc
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 2,000
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 166
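
Illustrative note on Term Weighting and Factors in Ranking

The form states only that weights are determined from frequency statistics by logistic regression, under a probabilistic model that uses term frequency, inverse document frequency, and document length. The Python sketch below shows one way such a ranking could be organized. It is a minimal illustration under assumptions: the feature definitions, the coefficient values in COEFFS, and every function name are hypothetical placeholders, not the formula or the fitted values used in the Brkly runs.

```python
import math
from collections import Counter

# Hypothetical coefficients. In practice these would be fit by logistic
# regression on documents with relevance judgments; they are NOT the
# values used in the Brkly runs.
COEFFS = {"intercept": -3.5, "tf": 1.2, "idf": 0.8, "length": -0.3}


def features(query_terms, doc_terms, doc_freq, n_docs):
    """Return (tf, idf, length) statistics, or None if no query term matches.

    tf and idf are summed over the distinct query terms found in the document.
    """
    counts = Counter(doc_terms)
    matches = [t for t in set(query_terms) if counts[t] > 0]
    if not matches:
        return None
    tf = sum(math.log(1 + counts[t]) for t in matches)
    idf = sum(math.log(n_docs / doc_freq[t]) for t in matches)
    length = math.log(len(doc_terms))
    return tf, idf, length


def relevance_probability(query_terms, doc_terms, doc_freq, n_docs):
    """Map the features to a probability through the logistic function."""
    f = features(query_terms, doc_terms, doc_freq, n_docs)
    if f is None:
        return 0.0  # documents sharing no term with the query are not scored
    tf, idf, length = f
    log_odds = (COEFFS["intercept"] + COEFFS["tf"] * tf
                + COEFFS["idf"] * idf + COEFFS["length"] * length)
    return 1.0 / (1.0 + math.exp(-log_odds))


def rank(query_terms, docs):
    """Rank documents (doc_id -> list of single terms) by estimated probability."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    scored = [(relevance_probability(query_terms, terms, doc_freq, n_docs), doc_id)
              for doc_id, terms in docs.items()]
    return sorted(scored, reverse=True)


# Toy collection, invented for illustration only.
docs = {
    "d1": ["logistic", "regression", "ranking", "regression"],
    "d2": ["chinese", "dictionary", "words"],
    "d3": ["regression", "stopword", "list", "stemming", "index"],
}
ranked = rank(["logistic", "regression"], docs)
for probability, doc_id in ranked:
    print(doc_id, round(probability, 3))
```

The sketch uses single terms only and no term positions, in line with the inverted-index entries above. Stopword removal and stemming are omitted for brevity.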