System Summary and Timing

Organization Name: Cornell University
List of Run ID's: CrnlAE CrnlAL CrnlRE CrnlRL CrnlI1 CrnlI2 CrnlBc10 CrnlB CrnlSE CrnlSV

Construction of Indices, Knowledge Bases, and other Data Structures

  Methods Used to Build Data Structures
    - Length (in words) of the stopword list: 592
    - Controlled Vocabulary?: no
    - Stemming Algorithm: SMART Lovins-based stemmer
    - Term Weighting: yes, tf * idf with new document normalization
    - Phrase Discovery?: yes
      - Kind of Phrase: two adjacent non-stopwords occurring 25 times in Disk 1
      - Method Used (statistical, syntactic, other): statistical
    - Proper Noun Identification Algorithm?: yes
    - Tokenizer?: yes, but results not used

  Statistics on Data Structures Built from TREC Text

    - Inverted index
      - Run ID: CrnlAE CrnlAL CrnlI1 CrnlI2
      - Total Storage (in MB): 731
      - Total Computer Time to Build (in hours): under 3.5
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no

    - Inverted index
      - Run ID: CrnlRE CrnlRL
      - Total Storage (in MB): ??
      - Total Computer Time to Build (in hours): 3

    - Inverted index
      - Run ID: CrnlSE
      - Total Storage (in MB): 85
      - Total Computer Time to Build (in hours): under 0.5
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: yes

    - Inverted index
      - Run ID: CrnlBc10
      - Total Storage (in MB): 610
      - Total Computer Time to Build (in hours): 1.2 (14 hours elapsed)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: yes

    - Clusters
    - N-grams, Suffix arrays, Signature Files
    - Knowledge Bases
    - Use of Manual Labor
    - Special Routing Structures

    - Other Data Structures built from TREC text
      - Run ID: CrnlBc10
      - Type of Structure: trie representation of the corruption dictionary
      - Total Storage (in MB): 400
      - Total Computer Time to Build (in hours): 4 elapsed hours (but lots of paging)
      - Automatic Process? (If not, number of manual hours): yes
      - Brief Description of Method: sort the words in the dictionary and form a trie

    - Other Data Structures built from TREC text
      - Run ID: CrnlAE CrnlAL CrnlI1 CrnlI2 CrnlRE CrnlRL CrnlBc10
      - Type of Structure: list of top-occurring phrases in Disk 1
      - Total Storage (in MB): 1
      - Total Computer Time to Build (in hours): 4 elapsed hours
      - Automatic Process? (If not, number of manual hours): yes
      - Brief Description of Method: index and sort phrases by frequency, keeping those occurring in 25 documents

  Data Built from Sources Other than the Input Text
    - Internally-built Auxiliary File
    - Use of Manual Labor
    - Externally-built Auxiliary File

Query Construction

  Automatically Built Queries (Ad-Hoc)
    - Topic Fields Used: all
    - Average Computer Time to Build Query (in CPU seconds): 0.02 seconds/query for the Pass 1 query. Running a retrieval, reindexing the top 20 documents, and expanding the query using relevance feedback took 0.95 CPU seconds/query (this expansion step is sketched after this section).
    - Method Used in Query Construction
      - Term Weighting (weights based on terms in topics)?: yes
      - Phrase Extraction from Topics?: yes
      - Proper Noun Identification Algorithm?: not used
      - Tokenizer?:
        - Patterns which are Tokenized: not used
      - Expansion of Queries using Previously-Constructed Data Structure?:
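  The ad-hoc expansion step above (run the Pass 1 query, reindex the top 20 retrieved documents, then reweight and expand the query) follows the usual relevance-feedback pattern. The sketch below is a minimal illustration of that pattern, not the exact SMART formulas: the Rocchio-style alpha/beta constants and the term-selection rule are assumptions, and the 60-term cap mirrors the limit reported for the interactive runs.

    from collections import Counter

    def expand_query(query_weights, top_docs, max_new_terms=60,
                     alpha=1.0, beta=0.5):
        """Rocchio-style blind-feedback expansion (illustrative sketch only).

        query_weights : dict of term -> weight for the Pass 1 query
        top_docs      : list of token lists for the top-ranked documents
                        (e.g. the top 20 from the Pass 1 retrieval)
        max_new_terms : cap on the number of terms added to the query
        """
        # Average term frequency across the assumed-relevant documents.
        centroid = Counter()
        for doc in top_docs:
            centroid.update(doc)
        n = max(len(top_docs), 1)

        # Keep (and later reweight) the original query terms.
        expanded = {t: alpha * w for t, w in query_weights.items()}

        # Add the highest-weighted new terms found in the top documents.
        candidates = [(tf / n, t) for t, tf in centroid.items()
                      if t not in query_weights]
        for avg_tf, term in sorted(candidates, reverse=True)[:max_new_terms]:
            expanded[term] = beta * avg_tf

        # Boost original query terms that also occur in the top documents.
        for term in query_weights:
            expanded[term] += beta * centroid[term] / n

        return expanded

  In the actual runs the document terms would carry tf * idf weights rather than raw frequencies; the sketch only shows the flow of running Pass 1, reindexing the top documents, and reweighting/expanding the query.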
  Automatically Built Queries (Routing)
    - Topic Fields Used: all
    - Average Computer Time to Build Query (in CPU seconds):
      - CrnlRE: 0.02 seconds to index the original query, 48 seconds/query to gather relevance statistics (amortized and negligible in an operational environment), and 240 seconds/query to form the final query (almost all DFO)
      - CrnlRL: 0.02 seconds to index the original query, 48 seconds/query to gather statistics (amortized and negligible in an operational environment), and 116 seconds/query to form the final query
    - Method Used in Query Construction
      - Terms Selected From
        - Topics: yes
        - All Training Documents: yes
      - Term Weighting with Weights Based on Terms in
        - Topics: yes
        - All Training Documents: yes
        - Documents with Relevance Judgments: yes
      - Phrase Extraction from
        - Topics: yes (normal adjacency phrases for CrnlRE and CrnlRL)
        - All Training Documents: yes
        - Documents with Relevance Judgments: CrnlRL added information about pairs of terms occurring close together
      - Syntactic Parsing
      - Word Sense Disambiguation using
      - Proper Noun Identification Algorithm from
      - Tokenizer
      - Heuristic Associations to Add Terms from
      - Expansion of Queries using Previously-Constructed Data Structure:
      - Automatic Addition of Boolean Connectors or Proximity Operators using information from

  Interactive Queries
    - Initial Query Built Automatically or Manually: automatically
    - Type of Person Doing Interaction
      - Domain Expert: no
      - System Expert: yes
    - Average Time to do Complete Interaction
      - CPU Time (Total CPU Seconds for all Iterations): 14 seconds/query
      - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): CrnlI1: 18, CrnlI2: 14
    - Average Number of Iterations: CrnlI1: 4, CrnlI2: 6.3
    - Average Number of Documents Examined per Iteration: 10
    - Minimum Number of Iterations: 2
    - Maximum Number of Iterations: 9
    - What Determines the End of an Iteration: the user looked at all 10 documents presented
    - Methods Used in Interaction
      - Automatic Term Reweighting from Relevant Documents?: yes
      - Automatic Query Expansion from Relevant Documents?: yes
        - Only Top X Terms Added (what is X): 60
      - Other Automatic Methods: none; the only user input allowed is the relevance judgment
      - Manual Methods

Searching

  Search Times
    - Run ID: CrnlAE
    - Computer Time to Search (Average per Query, in CPU seconds): under 28 seconds/query
    - Component Times: 1 second/query for the search, 27 seconds/query for keeping track of the top 1000 documents

  Search Times
    - Run ID: CrnlAL

  Search Times
    - Run ID: CrnlSV
    - Computer Time to Search (Average per Query, in CPU seconds): 0.05 for the search plus 22 to keep track of the top 1000 documents

  Machine Searching Methods
    - Vector Space Model?: yes
    - Probabilistic Model?: some components

  Factors in Ranking
    - Term Frequency?: yes
    - Inverse Document Frequency?: yes
    - Other Term Weights?: yes; e.g., CrnlAE does an initial retrieval and weights terms according to how often they occur in the top 20 documents. The ITL runs (ad-hoc and routing) used the distance between terms in both query and document.
    - Proximity of Terms?: yes
    - Document Length?: yes (see the weighting sketch at the end of this summary)

Machine Information
  - Machine Type for TREC Experiment: Sun SPARC 20/512
  - Was the Machine Dedicated or Shared: dedicated
  - Amount of Hard Disk Storage (in MB): 27,000
  - Amount of RAM (in MB): 192

System Comparisons
  - Amount of "Software Engineering" which went into the Development of the System: several years, mostly re-engineering
  - Given appropriate resources
    - Could your system run faster?: yes
    - By how much (estimate)?: ??
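  For reference on the ranking factors listed above (term frequency, inverse document frequency, and document length), a length-normalized tf * idf weighting with an inner-product score is sketched below. This is a generic vector-space sketch: the log damping and the cosine normalization shown are stand-ins, since the form only states that a new document normalization was used without specifying it.

    import math
    from collections import Counter

    def tfidf_vector(doc_terms, df, num_docs):
        """Length-normalized tf * idf vector for one document (generic sketch).

        doc_terms : list of stemmed, stopped tokens for the document
        df        : dict of term -> document frequency over the collection
        num_docs  : total number of documents in the collection
        """
        tf = Counter(doc_terms)
        # Log-dampened tf times idf; the particular "new" document
        # normalization used in these runs is not given in the form, so a
        # standard cosine length normalization stands in for it here.
        weights = {t: (1.0 + math.log(f)) * math.log(num_docs / df[t])
                   for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        return {t: w / norm for t, w in weights.items()}

    def score(query_vec, doc_vec):
        """Inner-product similarity used to rank documents against a query."""
        return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())

  Documents would then be ranked by score(query_vec, doc_vec), with the query vector built in the same way from the topic (and feedback) terms.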