System Summary and Timing

Organization Name: InTEXT Systems
List of Run IDs: INTXT2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 2010
- Controlled Vocabulary? : No
- Stemming Algorithm: Modified Lovins
- Morphological Analysis: Yes
- Term Weighting: Document by document, based on frequency and phrase length
- Phrase Discovery? : Yes
- Kind of Phrase: Multiple word delimited by Stop Word
- Method Used (statistical, syntactic, other): Statistical
- Syntactic Parsing? : No
- Word Sense Disambiguation? : No
- Heuristic Associations (including short definition)? : No
- Spelling Checking (with manual correction)? : No
- Spelling Correction? : No
- Proper Noun Identification Algorithm? : Yes
- Tokenizer? : Space and punctuation recognition
- Patterns which are tokenized: Compound proper nouns
- Manually-Indexed Terms? : No
- Other Techniques for building Data Structures: Term selection based on weight and morphology using InTEXT Precision software. PreciseScope documents built with omitted words replaced by noise words.

Statistics on Data Structures built from TREC Text
- Inverted index
- Run ID : INTXT2
- Total Storage (in MB): 600
- Total Computer Time to Build (in hours): 80
- Automatic Process? (If not, number of manual hours): Yes. Using InTEXT Retrieval Engine
- Use of Term Positions? : Yes
- Only Single Terms Used? : No
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
- Run ID : INTXT2
- Type of Structure: PreciseScope documents
- Total Storage (in MB): 250
- Total Computer Time to Build (in hours): 40
- Automatic Process? (If not, number of manual hours): Yes
- Brief Description of Method: Each document is analysed; words and phrases are selected for indexing as outlined above, and PreciseScope documents and weighted keyword lists are created
- Other Data Structures built from TREC text
- Run ID : INTXT2
- Type of Structure: Weighted keywords and phrases (generated but not used in tests)
- Total Storage (in MB): 100
- Total Computer Time to Build (in hours): Contained in PreciseScope times
- Automatic Process? (If not, number of manual hours): Yes
- Brief Description of Method: See above.

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
- Use of Manual Labor
- Externally-built Auxiliary File

Query construction

Automatically Built Queries (Ad-Hoc)
- Method used in Query Construction
- Tokenizer? :
- Expansion of Queries using Previously-Constructed Data Structure?:

Automatically Built Queries (Routing)
- Method used in Query Construction
- Terms Selected From
- Term Weighting with Weights Based on terms in
- Phrase Extraction from
- Syntactic Parsing
- Word Sense Disambiguation using
- Proper Noun Identification Algorithm from
- Tokenizer
- Heuristic Associations to Add Terms from
- Expansion of Queries using Previously-Constructed Data Structure:
- Automatic Addition of Boolean connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: None
- Average Time to Build Query (in Minutes): 30. Highly variable, mainly due to lack of knowledge of US current affairs.
- Type of Query Builder
- Computer System Expert: Yes
- Tools used to Build Query
- Word Frequency List? : No
- Knowledge Base Browser? :
- Other Lexical Tools? :
- Method used in Query Construction
- Term Weighting? : Yes. Minimal. Usually one group of alternative terms was mandatory
- Boolean Connectors (AND, OR, NOT)? : AND, OR for mandatory terms.
- Proximity Operators? : Yes. Phrase and same paragraph
- Addition of Terms not Included in Topic? : Yes
- Source of Terms: General knowledge and research

Manually Constructed Queries (Routing)
- Type of Query Builder
- Tools used to Build Query
- Knowledge Base Browser? :
- Other Lexical Tools? :
- Data Used for Building Query from
- Method used in Query Construction
- Addition of Terms not Included in Topic? :

Interactive Queries
- Type of Person doing Interaction
- Average Time to do Complete Interaction
- Methods used in Interaction
- Automatic Query Expansion from Relevant Documents? :
- Manual Methods

Searching

Search Times
- Run ID : INTXT2
- Computer Time to Search (Average per Query, in CPU seconds): 120

Machine Searching Methods
- Boolean Matching? : Yes, in part

Factors in Ranking
- Term Frequency? : Yes
- Inverse Document Frequency? : No
- Other Term Weights? : Generated automatically by determining their discriminating power in query or defined manually.
- Position in Document? : Yes
- Proximity of Terms? : Yes. Phrases and paragraphs
- Percentage of Query Terms which match? : Yes

Machine Information
- Machine Type for TREC Experiment: Pentium
- Was the Machine Dedicated or Shared: Yes
- Amount of Hard Disk Storage (in MB): 4000
- Amount of RAM (in MB): 16
- Clock Rate of CPU (in MHz): 90

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: Both InTEXT Retrieval Engine and InTEXT Precision are commercial products based on an elapsed development time of 10+ years. Precise figures not available.
- Given appropriate resources
- Could your system run faster? : Yes. Elimination of intermediate files, multi-threading
- By how much (estimate)? : 5-10
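
Illustrative sketch (not part of the submitted form). The entries under "Methods Used to build Data Structures" describe phrases as multiple words delimited by stop words, with term weights based on frequency and phrase length. The form does not give the actual InTEXT formula, so the Python below is only a minimal sketch under stated assumptions: the stop word list is a tiny hypothetical stand-in for the 2010-word list, the weight is assumed to be within-document frequency multiplied by phrase length, and the function names `candidate_phrases` and `weight_phrases` are invented for illustration. Stemming, morphological analysis, and proper noun identification are left out.

```python
import re
from collections import Counter

# Hypothetical stand-in for the 2010-word stop list, which the form does not reproduce.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to",
              "for", "is", "are", "was", "by", "with"}


def candidate_phrases(text, stop_words=STOP_WORDS):
    """Return runs of consecutive non-stop words as tuples.

    A stop word or a punctuation mark ends the current run.
    """
    phrases = []
    # Punctuation also delimits phrases, so cut the text into
    # punctuation-free segments before looking at individual words.
    for segment in re.split(r"[^\w\s'-]+", text.lower()):
        run = []
        for word in segment.split():
            if word in stop_words:
                if run:
                    phrases.append(tuple(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.append(tuple(run))
    return phrases


def weight_phrases(text, stop_words=STOP_WORDS):
    """Weight each phrase within one document.

    Assumed formula: occurrences in the document * number of words in the phrase.
    """
    counts = Counter(candidate_phrases(text, stop_words))
    return {" ".join(phrase): n * len(phrase) for phrase, n in counts.items()}
```

Calling `weight_phrases` on one document at a time mirrors the "document by document" weighting noted in the form. Multiplying by phrase length is one simple way to favour longer phrases, and any monotone function of length could be substituted.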