System Summary and Timing

Organization Name: City University
List of Run ID's: citya1, citym1, cityr1, cityr2, cityi1

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list:
  - citya1 citym1 cityi1
  - cityr1 cityr2
- Controlled Vocabulary? : No
- Stemming Algorithm: Modified Porter with spelling normalization
- Morphological Analysis: No
- Term Weighting: No
- Phrase Discovery? : No
- Syntactic Parsing? : No
- Word Sense Disambiguation? : No
- Heuristic Associations (including short definition)? : No
- Spelling Checking (with manual correction)? : No
- Spelling Correction? : No
- Proper Noun Identification Algorithm? : No
- Tokenizer? : No
- Manually-Indexed Terms? : No
- Other Techniques for building Data Structures: None

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID : citya1 citym1 cityi1
  - Total Storage (in MB): 935
  - Total Computer Time to Build (in hours): ~20
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions? : Yes
  - Only Single Terms Used? : Mainly
- Inverted index
  - Run ID : cityr1 cityr2
  - Total Storage (in MB): 395
  - Total Computer Time to Build (in hours): ~8
  - Automatic Process? (If not, number of manual hours): Yes
  - Use of Term Positions? : Yes
  - Only Single Terms Used? : Mainly
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): Somewhat angled towards American data
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Contains synonym classes, go phrases, stop- and semi-stop terms, and prefixes
  - Total Storage (in MB): << 1
  - Number of Concepts Represented: about 900
  - Type of Representation: Lookup table
  - Total Manual Time to Build (in hours): Not known; built piecemeal
  - Use of Manual Labor
    - Other: Yes
- Externally-built Auxiliary File

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: DESCRIPTION
- Average Computer Time to Build Query (in cpu seconds): Several minutes
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)? : See below
  - Tokenizer? :
  - Expansion of Queries using Previously-Constructed Data Structure? :
  - Other: Queries are built from terms extracted from documents retrieved by a pilot search using the topic statement. Weights are based on statistics of occurrence in the documents retrieved by the pilot search, in the topic statement, and in the collection. See text (and the illustrative sketch below).
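The pilot-search expansion summarised under "Other" above is described fully in the text. As a rough illustration only, the following minimal sketch (in Python) shows one common formulation of this kind of pseudo-relevance expansion, assuming a Robertson-Sparck Jones term weight and an r*w selection value for ranking candidate terms; the function names, the cut-offs top_r and keep, and the exact combination of statistics are assumptions made for the example and are not the parameters used in the City runs.

    import math
    from collections import Counter

    def rsj_weight(r, R, n, N):
        # Robertson-Sparck Jones relevance weight with the usual 0.5
        # corrections: r = pseudo-relevant docs containing the term,
        # R = pseudo-relevant docs, n = collection docs containing the
        # term, N = docs in the collection.
        return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                        ((n - r + 0.5) * (R - r + 0.5)))

    def expand_query(topic_terms, pilot_ranking, doc_freq, N,
                     top_r=20, keep=40):
        # pilot_ranking: ranked documents from the pilot search, each a
        # list of (stemmed) terms; doc_freq: term -> number of collection
        # documents containing it.  Returns (term, weight) pairs.
        pseudo_rel = pilot_ranking[:top_r]
        R = len(pseudo_rel)
        r = Counter(t for doc in pseudo_rel for t in set(doc))
        candidates = set(r) | set(topic_terms)

        scored = []
        for t in candidates:
            w = rsj_weight(r[t], R, doc_freq.get(t, 0), N)
            scored.append((t, w, r[t] * w))   # selection value r(t) * w(t)
        scored.sort(key=lambda x: x[2], reverse=True)

        # keep the best candidates, plus any topic term that fell below
        chosen = scored[:keep] + [s for s in scored[keep:]
                                  if s[0] in set(topic_terms)]
        return [(t, w) for t, w, _ in chosen]

Ranking candidates by r*w rather than by w alone favours terms that are both discriminating and common in the pilot-retrieved documents, which is the usual motivation for a selection value of this form.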
Automatically Built Queries (Routing)
- Topic Fields Used: TCND (but only to modify weights)
- Average Computer Time to Build Query (in cpu seconds): ~2 hours
- Method used in Query Construction
  - Terms Selected From
    - Only Documents with Relevance Judgments: All documents with positive relevance judgments
  - Term Weighting with Weights Based on terms in
    - Topics: The weights of topic terms were modified if they occurred more than once in the topic statement.
    - Documents with Relevance Judgments: All documents with positive relevance judgments
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: DESCRIPTION
- Average Time to Build Query (in Minutes): Not recorded; a few minutes.
- Type of Query Builder
  - Domain Expert: No
  - Computer System Expert: Possibly
- Tools used to Build Query
  - Knowledge Base Browser? :
  - Other Lexical Tools? :
- Method used in Query Construction
  - Addition of Terms not Included in Topic? : Yes
    - Source of Terms: See below
  - Other: Procedure the same as for the automatic ad-hoc queries, except that some query terms were manually removed, both from the pilot query and from the final query resulting from expansion.

Interactive Queries
- Initial Query Built Automatically or Manually: Manually
- Type of Person doing Interaction
- Average Time to do Complete Interaction
  - CPU Time (Total CPU Seconds for all Iterations): Unknown
  - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): 30
- Average Number of Iterations: 4.2
- Average Number of Documents Examined per Iteration: 7.5
- Minimum Number of Iterations: 1
- Maximum Number of Iterations: 8
- What Determines the End of an Iteration: A user-initiated search on the current query formulation
- Methods used in Interaction
  - Automatic Term Reweighting from Relevant Documents? : Yes
  - Automatic Query Expansion from Relevant Documents? : Yes
    - All Terms in Relevant Documents added: Yes, but only the top 50 of the ranked list of reweighted terms are made available to the user
    - Only Top X Terms Added (what is X): 20, after possible removal of terms by the user
    - User Selected Terms Added: User could remove terms
  - Manual Methods
    - Using Individual Judgment (No Set Algorithm)? : Yes

Searching

Search Times
- Run ID : citya1 citym1 cityr1 cityr2
  - Computer Time to Search (Average per Query, in CPU seconds): < 60
  - Component Times : Usually about 20 seconds (mainly disk wait) to fetch the 1000 documents (we don't have a separate file of DOCNOs); the rest is initializing and searching
- Run ID : cityi1
  - Computer Time to Search (Average per Query, in CPU seconds): Unknown
  - Component Times : Unknown

Machine Searching Methods
- Probabilistic Model? : Yes (see the illustrative sketch after the ranking factors below)

Factors in Ranking
- Term Frequency? : Yes
- Inverse Document Frequency? : Yes
- Other Term Weights? : From relevance or pseudo-relevance information when available.
- Document Length? : Yes
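The factors listed above (within-document term frequency, inverse document frequency and document length, combined in a probabilistic model) are of the kind captured by BM-style weighting functions. The following is a minimal sketch only, assuming a BM25-style combination with common default constants k1 and b; the exact function and parameter values used in these runs are not specified here.

    import math

    def bm25_style_score(query_terms, doc_tf, doc_len, avg_doc_len,
                         doc_freq, N, k1=1.2, b=0.75):
        # doc_tf: term -> frequency in this document (or sub-document);
        # doc_freq: term -> number of documents containing the term;
        # N: documents in the collection; k1, b: common default tuning
        # constants, not those of the City runs.
        K = k1 * ((1 - b) + b * doc_len / avg_doc_len)   # length normalisation
        score = 0.0
        for t in query_terms:
            tf = doc_tf.get(t, 0)
            if tf == 0:
                continue
            n = doc_freq.get(t, 1)
            idf = math.log((N - n + 0.5) / (n + 0.5))    # RSJ-style idf
            score += idf * tf * (k1 + 1) / (K + tf)
        return score

In this family of functions the term-frequency component saturates as tf grows, and the length normalisation prevents long documents from being favoured merely for being long.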
Machine Information
- Machine Type for TREC Experiment: SS10, IPX, 4/330, IPC; all running SunOS 4.1.3
- Was the Machine Dedicated or Shared: SS10 dedicated, others shared
- Amount of Hard Disk Storage (in MB): About 20 GB
- Amount of RAM (in MB): 256, 48, 40, 8 (respectively)
- Clock Rate of CPU (in MHz): Various; don't know

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: A lot, over the years; more hacking together than software engineering.
- Given appropriate resources
  - Could your system run faster? : Yes
  - By how much (estimate)? : Up to two orders of magnitude, given the hardware. With current hardware, possibly a factor of 2.
- Features the System is Missing that would be beneficial: Might include (1) better parsing, including phrase determination and/or recognition; (2) a better model for processing sub-documents: there are a number of bodges in the way we do this at present, and we need better methods for determining the parameters (of which there are at least 8) and some way of using information on more than one sub-document of a given document.

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: All our runs made use of a facility for searching sub-documents (a rough sketch of the general idea is given below).
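The sub-document facility is not specified further in this summary. As a rough sketch of the general idea only, the following splits a document into overlapping, fixed-length sub-documents and scores the whole document by its best sub-document; the window and step sizes are arbitrary illustrative values and stand in for the much larger set of parameters (at least 8) used by the actual facility.

    def split_into_subdocs(terms, window=200, step=100):
        # Split a document (a list of terms) into overlapping fixed-length
        # sub-documents; window and step are illustrative values only.
        if len(terms) <= window:
            return [terms]
        return [terms[i:i + window] for i in range(0, len(terms) - step, step)]

    def best_subdoc_score(doc_terms, score_fn):
        # Score a document by the best score of any of its sub-documents.
        # score_fn maps a list of terms to a number (e.g. a wrapper round
        # the BM25-style function sketched earlier).
        return max(score_fn(sub) for sub in split_into_subdocs(doc_terms))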