System Summary and Timing

Organization Name: CLARITECH Corporation
List of Run ID's: CLTHES, CLCLUS (manual ad-hoc experiments)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: N/A
- Controlled Vocabulary?: N/A
- Stemming Algorithm: N/A
- Morphological Analysis: A comprehensive inflectional morphology is used to produce word roots. Participles are retained in surface form (although normalization is possible). No derivational morphology is used.
- Term Weighting:
  1) TF*IDF over phrases is used for retrieval.
  2) An importance coefficient is applied to TF*IDF for query terms.
  3) A combination of statistics, linguistic-structure analysis, and heuristics is used for thesaurus extraction and term cluster generation; the statistical measures include term frequency and distribution.
- Phrase Discovery?: Yes
  - Kind of Phrase: Simplex noun phrases -- not including post-nominal appositive, prepositional, or participial phrases or relative clauses
  - Method Used (statistical, syntactic, other): A deterministic, rule-based parser nominates linguistic-constituent structure; a filter retains only simplex noun phrases for indexing purposes.
- Syntactic Parsing?: Yes (see above).
- Word Sense Disambiguation?: The parser grammar includes heuristics for syntactic category disambiguation.
- Heuristic Associations (including short definition)?: No.
- Spelling Checking (with manual correction)?: No.
- Spelling Correction?: No.
- Proper Noun Identification Algorithm?: Words not identified in the lexicon (about 80,000 root forms of English) are assumed to be "candidate proper nouns". The grammar accommodates structure that includes proper names; this technique does not require case-sensitive clues (e.g., capitalization) to be effective.
- Tokenizer?: None.
- Manually-Indexed Terms?: No.
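The simplex-noun-phrase filter can be illustrated with a toy sketch. This is not CLARIT's deterministic rule-based parser; it is a minimal stand-in that keeps determiner/adjective/noun runs from a POS-tagged token stream and drops anything that would start post-nominal structure (prepositions, relative pronouns, verbs). The tag set and the tagging itself are assumptions for illustration only.

```python
def simplex_noun_phrases(tagged_tokens):
    """Collect maximal DET?/ADJ*/NOUN+ runs; any other tag (PREP, VERB,
    relative pronoun, ...) terminates the current candidate phrase."""
    phrases, current = [], []

    def flush():
        # keep only runs that actually contain a noun head
        if any(tag == "NOUN" for _, tag in current):
            phrases.append(" ".join(word for word, _ in current))
        current.clear()

    for word, tag in tagged_tokens:
        if tag in ("ADJ", "NOUN") or (tag == "DET" and not current):
            current.append((word, tag))
        else:
            flush()
    flush()
    return phrases

tagged = [("the", "DET"), ("joint", "ADJ"), ("venture", "NOUN"),
          ("in", "PREP"), ("China", "NOUN"), ("was", "VERB"),
          ("profitable", "ADJ")]
print(simplex_noun_phrases(tagged))  # -> ['the joint venture', 'China']
```

Note how "the joint venture in China" is split: the prepositional phrase "in China" is not attached, matching the stated exclusion of post-nominal modifiers from simplex noun phrases.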
- Other Techniques for Building Data Structures:
  1) Thesaurus discovery -- which we use for query-vector augmentation -- involves the identification of core characteristic terminology over a document set. The process ranks all terms according to scores along several parameters and then selects the subset of terminology that optimizes the scores.
  2) Term cluster discovery -- which we use for query-vector augmentation -- involves the identification of seed terminology from the database and the generation of term clusters associated with the seed terms. The process involves iterative discovery of terminology sets based on several parameters. Terms in the clusters are ranked on these parameters.
  3) Document windows -- documents are decomposed into fixed-length subdocuments, 12 sentences long on average. These windows are used in thesaurus extraction and as a component in the document similarity calculation.

Statistics on Data Structures Built from TREC Text

- Inverted Index
  We built inverted indices for the individual TREC5 databases. Average indexing time for an individual TREC5 database (between 184 MB and 253 MB in size -- the FinTimes data was divided into three databases: FinTimes92, FinTimes93, and FinTimes94) was 1.12 hours.
  - Run ID: CLTHES
  - Total Storage (in MB): 1,339
  - Total Computer Time to Build (in hours): 10
  - Automatic Process? (If not, number of manual hours): Yes.
  - Use of Term Positions?: No.
  - Only Single Terms Used?: All terms used.
- Clusters (Terminology Clusters)
  For each individual TREC database we automatically identified clusters of terminology and created inverted indices for databases of the discovered term clusters. Indices for cluster databases can be built simultaneously with the document database indices or as a post-process. Term clustering generates 4 additional CLARIT index files, which increases the storage space required for the indices. We used term clusters to augment queries with related terminology from the target corpus.
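The document-window decomposition described above can be sketched as follows. The window size of 12 sentences matches the reported average; the naive period-based sentence splitter is an assumption made only to keep the sketch self-contained.

```python
def document_windows(text, window_size=12):
    """Decompose a document into fixed-length subdocuments ("windows")
    of `window_size` sentences each; the last window may be shorter.
    Sentence segmentation here is a naive split on '.' for illustration."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [". ".join(sentences[i:i + window_size]) + "."
            for i in range(0, len(sentences), window_size)]

# A 30-sentence document yields three windows of 12, 12, and 6 sentences.
doc = ". ".join("Sentence %d" % i for i in range(30)) + "."
print(len(document_windows(doc)))  # -> 3
```

Scoring windows rather than whole documents is what lets the system extract the feedback thesaurus from the "best-scoring subdocuments" and use window similarity as a component of document similarity.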
  - Run ID: CLCLUS
  - Total Storage (in MB): 1,366 (an index that allows search over both the document database and the cluster terminology database)
  - Total Computer Time to Build (in hours): 4.5 (for terminology clusters only)
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: Term clustering is based on an iterative procedure that starts with "seed" terminology automatically selected from the target corpus based on term distribution, and uses several database parameters to identify terminology related to a given seed term. Terms in a cluster are ranked based on the calculated term scores.
- N-grams, Suffix Arrays, Signature Files: None.
- Knowledge Bases: None.
- Use of Manual Labor
- Special Routing Structures: N/A.
- Other Data Structures Built from TREC Text
  - Run ID: CLTHES, CLCLUS
  - Type of Structure: Automatic feedback thesaurus
  - Total Storage (in MB): not stored
  - Automatic Process? (If not, number of manual hours): Yes.
  - Brief Description of Method: Initial natural-language ad-hoc queries were automatically augmented using a first-order thesaurus extracted from the 40 best-scoring subdocuments retrieved from the database that also satisfied the query constraints. The top 50% of the thesaurus terms for each query were used for augmentation.

Data Built from Sources Other than the Input Text

- Internally-Built Auxiliary File
  - Type of File (thesaurus, knowledge base, lexicon, etc.): lexicon
  - Total Storage (in MB): 1.5
  - Number of Concepts Represented: close to 80,000 root forms
  - Type of Representation: root/syntactic-category pairs
  - Total Computer Time to Build (in hours): N/A
  - Total Computer Time to Modify for TREC (if already built): N/A
  - Total Manual Time to Build (in hours): The CLARIT lexicon was manually constructed using word lists extracted from on-line sources during the early phases of the CLARIT research project (1988--89).
  - Total Manual Time to Modify for TREC (if already built): No modification was required.
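The automatic feedback step -- extract a thesaurus from the best-scoring subdocuments, then add its top 50% of terms to the query -- can be sketched as below. The simple frequency-based term scoring is an assumption for illustration; CLARIT ranks thesaurus terms along several parameters that are not specified here.

```python
from collections import Counter

def augment_query(query_terms, top_subdocs, keep_fraction=0.5):
    """Rank the terms occurring in the top-scoring subdocuments (here by
    raw frequency, as a stand-in for CLARIT's multi-parameter scores),
    keep the top `keep_fraction` of the ranked list, and append those
    terms that are not already in the query."""
    counts = Counter(t for doc in top_subdocs for t in doc)
    ranked = [t for t, _ in counts.most_common()]
    keep = ranked[:max(1, int(len(ranked) * keep_fraction))]
    return list(query_terms) + [t for t in keep if t not in query_terms]

subdocs = [["trade", "tariff", "trade"], ["tariff", "export"]]
print(augment_query(["export"], subdocs))  # -> ['export', 'trade']
```

In the actual runs, the subdocuments fed to this step are the 40 best-scoring windows that also satisfy the query's Boolean-type constraints, so the augmentation terminology is drawn only from constraint-consistent material.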
- Use of Manual Labor
  - Mostly Manually Built Using Special Interface: Yes.
- Externally-Built Auxiliary File

Query Construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: Title, description, narrative.
- Average Time to Build Query (in Minutes): 30 -- using CLARIT Interactive Retrieval
- Type of Query Builder
  - Domain Expert: No.
  - Computer System Expert: Yes.
- Tools Used to Build Query
  - Word Frequency List?: No.
  - Knowledge Base Browser?: No.
  - Other Lexical Tools?: CLARIT Retrieval System -- users used the CLARIT Interactive system to formulate manual queries for individual topics. CLARIT manual queries consist of two parts: a natural language query and a set of simple, Boolean-type constraints.
- Method Used in Query Construction
  - Term Weighting?: Topic terms are weighted using TF*IDF * Importance, where the Importance coefficient for query source terminology is assigned manually.
  - Boolean Connectors (AND, OR, NOT)?: Constraints in the CLARIT queries have the form of simple Boolean queries. They are used as filters over the set of documents retrieved for the NLP query.
  - Proximity Operators?: No.
  - Addition of Terms not Included in Topic?:
  - Other: CLARIT Thesaurus and Terminology Clusters. In the CLTHES run, users created queries by iteratively searching over the target database and selecting terms from the thesauri created from retrieved documents, or terms from the documents themselves. In the CLCLUS run, users iteratively performed retrieval over both the document and the corresponding term cluster databases and selected terms from related term clusters or from the documents themselves. Manual queries (NL query plus constraints) were submitted for batch-mode retrieval over the target database. The best retrieved subdocuments (those satisfying the constraints) were used for automatic augmentation of the queries. A CLARIT thesaurus was extracted from the 40 best-scoring subdocuments, and the top 50% of thesaurus terms were added to the query.
The final query included the augmented natural language query and the set of constraints formulated by the user. The sets of constraints for the initial manual query and the augmented query were not necessarily the same; both were formulated by the user during the interactive creation of the initial queries.

Searching

Search Times
- Run ID: CLTHES
- Computer Time to Search (Average per Query, in CPU seconds): 14
- Component Times: Retrieval for manually constructed queries was done in batch mode. Manual queries were stored in parsed form, so the batch processing did not include query parsing.
  Initial Search: 5
  Automatic Feedback: 2
  Final Search: 7

Search Times
- Run ID: CLCLUS
- Computer Time to Search (Average per Query, in CPU seconds): 16
- Component Times:
  Initial Search: 5
  Automatic Feedback: 2
  Final Search: 9

Machine Searching Methods
- Vector Space Model?: Yes -- using a cosine distance measure. The vector space does not fix the document-vector length component of the cosine distance formula. Rather, the length of any document vector is allowed to vary depending on the terms present in the query; only the terms in the query vector are considered 'active' for any given distance calculation, and all other terms in the document are ignored.
- Other: Document windows used for feedback and final retrieval are subjected to filtering based on the Boolean-type constraints.

Factors in Ranking
- Term Frequency?: TF_AUG = (0.5 + 0.5 * TF / MAX_TF)
- Inverse Document Frequency?: IDF = Log_2(Number of docs in the corpus / Number of docs containing the term) + 1
- Other Term Weights?: An importance coefficient is manually assigned to initial query terms (see above).
- Semantic Closeness?: No.
- Position in Document?: No.
- Syntactic Clues?: No.
- Proximity of Terms?: No.
- Information Theoretic Weights?: No.
- Document Length?: Only as this is implicitly captured in the cosine distance measure.
- Percentage of Query Terms which Match?: Only as this is implicitly captured in the cosine distance measure.
- N-gram Frequency?: No.
- Word Specificity?: No.
- Word Sense Frequency?: No.
- Cluster Distance?: No.
- Other: The final list of retrieved documents is the combined list of documents retrieved with and without constraints: the set of documents retrieved with the constraints was supplemented by documents that were retrieved but did not satisfy the constraints.

Machine Information
- Machine Type for TREC Experiment: DecAlpha 600
- Was the Machine Dedicated or Shared: Dedicated.
- Amount of Hard Disk Storage (in MB): 36
- Amount of RAM (in MB): 128
- Clock Rate of CPU (in MHz): 133.33 (?)

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: The system used for the TREC5 experiments is a commercially available retrieval system from CLARITECH Corporation.
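The factors listed under "Factors in Ranking" and "Machine Searching Methods" can be combined into a minimal scoring sketch: the augmented term frequency TF_AUG, the IDF formula, a manually assigned importance coefficient on query terms, and a cosine measure in which only the query's terms are 'active' (all other document terms are ignored). The data layout (dictionaries of weights) is an assumption; the formulas themselves are taken from the section above.

```python
import math

def tf_aug(tf, max_tf):
    # Augmented term frequency, as stated above: TF_AUG = 0.5 + 0.5*TF/MAX_TF
    return 0.5 + 0.5 * tf / max_tf

def idf(n_docs, doc_freq):
    # IDF = log2(docs in corpus / docs containing term) + 1
    return math.log2(n_docs / doc_freq) + 1

def score(query, doc_tf, n_docs, doc_freqs):
    """query: {term: manual importance coefficient}
    doc_tf: {term: raw term frequency in the document (or window)}.
    Only the query's terms enter the cosine; the document-vector length
    therefore varies with the query, as described above."""
    max_tf = max(doc_tf.values())
    q, d = {}, {}
    for term, importance in query.items():
        w_idf = idf(n_docs, doc_freqs.get(term, 1))
        q[term] = importance * w_idf
        if term in doc_tf:  # only query terms are 'active'
            d[term] = tf_aug(doc_tf[term], max_tf) * w_idf
    dot = sum(q[t] * d.get(t, 0.0) for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values())) *
            math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0
```

Because document terms outside the query contribute nothing to the document-vector norm, a document that matches the single query term perfectly scores 1.0 regardless of how many other terms it contains -- the behavior attributed above to the variable-length document vector.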