System Summary and Timing

Organization Name: CLARITECH Corporation
List of Run ID's: CLTHES, CLCLUS (manual ad-hoc experiments)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: N/A
- Controlled Vocabulary?: N/A
- Stemming Algorithm: N/A
- Morphological Analysis: A comprehensive inflectional morphology is used to produce word roots. Participles are retained in surface form (although normalization is possible). No derivational morphology is used.
- Term Weighting:
  1) TF*IDF over phrases is used for retrieval.
  2) An importance coefficient is applied to TF*IDF for query terms.
  3) A combination of statistics, linguistic-structure analysis, and heuristics is used for thesaurus extraction and term cluster generation; the statistical measures include term frequency and distribution.
- Phrase Discovery?: Yes
  - Kind of Phrase: Simplex noun phrases -- not including post-nominal appositive, prepositional, or participial phrases or relative clauses
  - Method Used (statistical, syntactic, other): A deterministic, rule-based parser nominates linguistic-constituent structure; a filter retains only simplex noun phrases for indexing purposes.
- Syntactic Parsing?: Yes (see above).
- Word Sense Disambiguation?: The parser grammar includes heuristics for syntactic category disambiguation.
- Heuristic Associations (including short definition)?: No.
- Spelling Checking (with manual correction)?: No.
- Spelling Correction?: No.
- Proper Noun Identification Algorithm?: Words not identified in the lexicon (about 80,000 root forms of English) are assumed to be "candidate proper nouns". The grammar accommodates structure that includes proper names; this technique does not require case-sensitive clues (e.g., capitalization) to be effective.
- Tokenizer?: None.
- Manually-Indexed Terms?: No.
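The simplex-noun-phrase filter can be illustrated with a toy sketch. This is not CLARIT's deterministic rule-based parser; it is a minimal stand-in that keeps determiner/adjective/noun runs from a POS-tagged token stream and drops anything that would start post-nominal structure (prepositions, relative pronouns, verbs). The tag set and the tagging itself are assumptions for illustration only.

```python
def simplex_noun_phrases(tagged_tokens):
    """Collect maximal DET?/ADJ*/NOUN+ runs; any other tag (PREP, VERB,
    relative pronoun, ...) terminates the current candidate phrase."""
    phrases, current = [], []

    def flush():
        # keep only runs that actually contain a noun head
        if any(tag == "NOUN" for _, tag in current):
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in ("ADJ", "NOUN") or (tag == "DET" and not current):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases

tagged = [("the", "DET"), ("joint", "ADJ"), ("venture", "NOUN"),
          ("in", "PREP"), ("China", "NOUN"), ("was", "VERB"),
          ("profitable", "ADJ")]
print(simplex_noun_phrases(tagged))  # -> ['the joint venture', 'China']
```

Note how "the joint venture in China" is split: the prepositional phrase "in China" is not attached, matching the stated exclusion of post-nominal modifiers from simplex noun phrases.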
- Other Techniques for Building Data Structures:
  1) Thesaurus discovery -- which we use for query-vector augmentation -- involves the identification of core characteristic terminology over a document set. The process ranks all terms according to scores along several parameters and then selects the subset of terminology that optimizes the scores.
  2) Term cluster discovery -- which we use for query-vector augmentation -- involves the identification of seed terminology from the database and the generation of term clusters associated with the seed terms. The process involves iterative discovery of terminology sets based on several parameters. Terms in the clusters are ranked on these parameters.
  3) Document windows -- documents are decomposed into fixed-length subdocuments, 12 sentences long on average. These windows are used in thesaurus extraction and as a component in the document similarity calculation.

Statistics on Data Structures Built from TREC Text

- Inverted Index
  We built inverted indices for the individual TREC5 databases. Average indexing time for an individual TREC5 database (between 184 MB and 253 MB in size -- the FinTimes data was divided into three databases: FinTimes92, FinTimes93, and FinTimes94) was 1.12 hours.
  - Run ID: CLTHES
  - Total Storage (in MB): 1,339
  - Total Computer Time to Build (in hours): 10
  - Automatic Process? (If not, number of manual hours): Yes.
  - Use of Term Positions?: No.
  - Only Single Terms Used?: All terms used.
- Clusters (Terminology Clusters)
  For each individual TREC database we automatically identified clusters of terminology and created inverted indices for databases of the discovered term clusters. Indices for cluster databases can be built simultaneously with the document database indices or as a post-process. Term clustering generates 4 additional CLARIT index files, which increases the storage space required for the indices. We used term clusters to augment queries with related terminology from the target corpus.
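The document-window decomposition described above can be sketched as follows. The window size of 12 sentences matches the reported average; the naive period-based sentence splitter is an assumption made only to keep the sketch self-contained.

```python
def document_windows(text, window_size=12):
    """Decompose a document into fixed-length subdocuments ("windows")
    of `window_size` sentences each; the last window may be shorter.
    Sentence segmentation here is a naive split on '.' for illustration."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [". ".join(sentences[i:i + window_size]) + "."
            for i in range(0, len(sentences), window_size)]

# A 30-sentence document yields three windows of 12, 12, and 6 sentences.
doc = ". ".join("Sentence %d" % i for i in range(30)) + "."
print(len(document_windows(doc)))  # -> 3
```

Scoring windows rather than whole documents is what lets the system extract the feedback thesaurus from the "best-scoring subdocuments" and use window similarity as a component of document similarity.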
  - Run ID: CLCLUS
  - Total Storage (in MB): 1,366 (an index that allows search over both the document database and the cluster terminology database)
  - Total Computer Time to Build (in hours): 4.5 (for terminology clusters only)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: Term clustering is based on an iterative procedure that starts with "seed" terminology automatically selected from the target corpus based on term distribution, and uses several database parameters to identify terminology related to a given seed term. Terms in a cluster are ranked based on the calculated term scores.
- N-grams, Suffix Arrays, Signature Files: None.
- Knowledge Bases: None.
- Use of Manual Labor
- Special Routing Structures: N/A.
- Other Data Structures Built from TREC Text
  - Run ID: CLTHES, CLCLUS
  - Type of Structure: Automatic feedback thesaurus
  - Total Storage (in MB): not stored
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: Initial natural-language ad-hoc queries were automatically augmented using a first-order thesaurus extracted from the 40 best-scoring subdocuments retrieved from the database that also satisfied the query constraints. The top 50% of the thesaurus terms for each query were used for augmentation.

Data Built from Sources Other than the Input Text

- Internally-Built Auxiliary File
  - Type of File (thesaurus, knowledge base, lexicon, etc.): lexicon
  - Total Storage (in MB): 1.5
  - Number of Concepts Represented: close to 80,000 root forms
  - Type of Representation: root/syntactic-category pairs
  - Total Computer Time to Build (in hours): N/A
  - Total Computer Time to Modify for TREC (if already built): N/A
  - Total Manual Time to Build (in hours): The CLARIT lexicon was manually constructed using word lists extracted from on-line sources during the early phases of the CLARIT research project (1988--89).
  - Total Manual Time to Modify for TREC (if already built): No modification was required.
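The automatic feedback step -- extract a thesaurus from the best-scoring subdocuments, then add its top 50% of terms to the query -- can be sketched as below. The simple frequency-based term scoring is an assumption for illustration; CLARIT ranks thesaurus terms along several parameters that are not specified here.

```python
from collections import Counter

def augment_query(query_terms, top_subdocs, keep_fraction=0.5):
    """Rank the terms occurring in the top-scoring subdocuments (here by
    raw frequency, as a stand-in for CLARIT's multi-parameter scores),
    keep the top `keep_fraction` of the ranked list, and append those
    terms that are not already in the query."""
    counts = Counter(t for doc in top_subdocs for t in doc)
    ranked = [t for t, _ in counts.most_common()]
    keep = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return list(query_terms) + [t for t in keep if t not in query_terms]

subdocs = [["trade", "tariff", "trade"], ["tariff", "export"]]
print(augment_query(["export"], subdocs))  # -> ['export', 'trade']
```

In the actual runs, the subdocuments fed to this step are the 40 best-scoring windows that also satisfy the query's Boolean-type constraints, so the augmentation terminology is drawn only from constraint-consistent material.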
- Use of Manual Labor
  - Mostly Manually Built Using Special Interface: Yes.
- Externally-Built Auxiliary File

Query Construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: Title, description, narrative.
- Average Time to Build Query (in Minutes): 30 -- using CLARIT Interactive Retrieval
- Type of Query Builder
  - Domain Expert: No.
  - Computer System Expert: Yes.
- Tools Used to Build Query
  - Word Frequency List?: No.
  - Knowledge Base Browser?: No.
  - Other Lexical Tools?: CLARIT Retrieval System -- users used the CLARIT Interactive system to formulate manual queries for individual topics. CLARIT manual queries consist of two parts: a natural language query and a set of simple, Boolean-type constraints.
- Method Used in Query Construction
  - Term Weighting?: Topic terms are weighted using TF*IDF * Importance, where the Importance coefficient for query source terminology is assigned manually.
  - Boolean Connectors (AND, OR, NOT)?: Constraints in the CLARIT queries have the form of simple Boolean queries. They are used as filters over the set of documents retrieved for the NLP query.
  - Proximity Operators?: No.
  - Addition of Terms not Included in Topic?:
  - Other: CLARIT Thesaurus and Terminology Clusters. In the CLTHES run, users created queries by iteratively searching over the target database and selecting terms from the thesauri created from retrieved documents, or terms from the documents themselves. In the CLCLUS run, users iteratively performed retrieval over both the document and the corresponding term cluster databases and selected terms from related term clusters or from the documents themselves. Manual queries (NL query plus constraints) were submitted for batch-mode retrieval over the target database. The best retrieved subdocuments (those satisfying the constraints) were used for automatic augmentation of the queries. A CLARIT thesaurus was extracted from the 40 best-scoring subdocuments, and the top 50% of thesaurus terms were added to the query.
The final query included the augmented natural language query and the set of constraints formulated by the user. The sets of constraints for the initial manual query and the augmented query were not necessarily the same; both were formulated by the user during the interactive creation of the initial queries.

Searching

Search Times
- Run ID: CLTHES
- Computer Time to Search (Average per Query, in CPU seconds): 14
- Component Times: Retrieval for manually constructed queries was done in batch mode. Manual queries were stored in parsed form, so the batch processing did not include query parsing.
  Initial Search: 5
  Automatic Feedback: 2
  Final Search: 7

Search Times
- Run ID: CLCLUS
- Computer Time to Search (Average per Query, in CPU seconds): 16
- Component Times:
  Initial Search: 5
  Automatic Feedback: 2
  Final Search: 9

Machine Searching Methods
- Vector Space Model?: Yes -- using a cosine distance measure. The vector space does not fix the document-vector length component of the cosine distance formula. Rather, the length of any document vector is allowed to vary depending on the terms present in the query; only the terms in the query vector are considered 'active' for any given distance calculation, and all other terms in the document are ignored.
- Other: Document windows used for feedback and final retrieval are subjected to filtering based on the Boolean-type constraints.

Factors in Ranking
- Term Frequency?: TF_AUG = (0.5 + 0.5 * TF / MAX_TF)
- Inverse Document Frequency?: IDF = Log_2(Number of docs in the corpus / Number of docs containing the term) + 1
- Other Term Weights?: An importance coefficient is manually assigned to initial query terms (see above).
- Semantic Closeness?: No.
- Position in Document?: No.
- Syntactic Clues?: No.
- Proximity of Terms?: No.
- Information Theoretic Weights?: No.
- Document Length?: Only as this is implicitly captured in the cosine distance measure.
- Percentage of Query Terms which Match?: Only as this is implicitly captured in the cosine distance measure.
- N-gram Frequency?: No.
- Word Specificity?: No.
- Word Sense Frequency?: No.
- Cluster Distance?: No.
- Other: The final list of retrieved documents is the combined list of documents retrieved with and without constraints: the set of documents retrieved with the constraints was supplemented by documents that were retrieved but did not satisfy the constraints.

Machine Information
- Machine Type for TREC Experiment: DecAlpha 600
- Was the Machine Dedicated or Shared: Dedicated.
- Amount of Hard Disk Storage (in MB): 36
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 133.33 (?)

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: The system used for the TREC5 experiments is a commercially available retrieval system from CLARITECH Corporation.
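The factors listed under "Factors in Ranking" and "Machine Searching Methods" can be combined into a minimal scoring sketch: the augmented term frequency TF_AUG, the IDF formula, a manually assigned importance coefficient on query terms, and a cosine measure in which only the query's terms are 'active' (all other document terms are ignored). The data layout (dictionaries of weights) is an assumption; the formulas themselves are taken from the section above.

```python
import math

def tf_aug(tf, max_tf):
    # Augmented term frequency, as stated above: TF_AUG = 0.5 + 0.5*TF/MAX_TF
    return 0.5 + 0.5 * tf / max_tf

def idf(n_docs, doc_freq):
    # IDF = log2(docs in corpus / docs containing term) + 1
    return math.log2(n_docs / doc_freq) + 1

def score(query, doc_tf, n_docs, doc_freqs):
    """query: {term: manual importance coefficient}
    doc_tf: {term: raw term frequency in the document (or window)}.
    Only the query's terms enter the cosine; the document-vector length
    therefore varies with the query, as described above."""
    max_tf = max(doc_tf.values())
    q, d = {}, {}
    for term, importance in query.items():
        w_idf = idf(n_docs, doc_freqs.get(term, 1))
        q[term] = importance * w_idf
        if term in doc_tf:  # only query terms are 'active'
            d[term] = tf_aug(doc_tf[term], max_tf) * w_idf
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values())) *
            math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

Because document terms outside the query contribute nothing to the document-vector norm, a document that matches the single query term perfectly scores 1.0 regardless of how many other terms it contains -- the behavior attributed above to the variable-length document vector.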