System Summary and Timing
  Organization Name: InTEXT Systems
  List of Run ID's: INTXT2

  Construction of Indices, Knowledge Bases, and other Data Structures 

    Methods Used to build Data Structures 

    - Length (in words) of the stopword list: 2010 
    - Controlled Vocabulary? : No 
    - Stemming Algorithm:   Modified Lovins            
      - Morphological Analysis: Yes 
    - Term Weighting: Document by document, based on frequency and 
      phrase length 
    -  Phrase Discovery? :  Yes            
      - Kind of Phrase: Multiple word delimited by Stop Word 
      - Method Used (statistical, syntactic, other): Statistical 
    -  Syntactic Parsing? : No 
    -  Word Sense Disambiguation? : No 
    -  Heuristic Associations (including short definition)? : No 
    -  Spelling Checking (with manual correction)? : No 
    -  Spelling Correction? : No  
    -  Proper Noun Identification Algorithm? : Yes 
    -  Tokenizer? :  Space and punctuation recognition            
      - Patterns which are tokenized: Compound proper nouns 
    -  Manually-Indexed Terms? : No 
    -  Other Techniques for building Data Structures: Term selection 
       based on weight and morphology using InTEXT Precision software. 
       PreciseScope documents built with omitted words replaced by noise words. 
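
    The phrase discovery and term weighting entries above can be illustrated
    with a minimal sketch. It is not the InTEXT Precision implementation: the
    stopword set, the function names, and the weighting formula (frequency
    times phrase length) are assumptions made for illustration only. The form
    states only that phrases are delimited by stop words and that weights
    depend on frequency and phrase length.

```python
import re
from collections import Counter

# Hypothetical stopword set; the system's actual 2010-word list is not given.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for", "by", "on", "with"}

def candidate_phrases(text):
    """Return runs of consecutive non-stopwords; stopwords and punctuation act as delimiters."""
    phrases = []
    # Punctuation (other than apostrophes and hyphens) also ends a phrase.
    for segment in re.split(r"[^\w\s'-]+", text.lower()):
        current = []
        for token in segment.split():
            if token in STOPWORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(token)
        if current:
            phrases.append(tuple(current))
    return phrases

def weight_phrases(text):
    """Assumed per-document weight: occurrence count multiplied by phrase length in words."""
    counts = Counter(candidate_phrases(text))
    return {" ".join(phrase): n * len(phrase) for phrase, n in counts.items()}

weights = weight_phrases("The inverted index is large. An inverted index of terms.")
```

    In this sketch a two-word phrase that occurs twice outweighs a single word
    that occurs once, which is one simple way to combine frequency and phrase
    length.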

    Statistics on Data Structures built from TREC Text

    - Inverted index           
      - Run ID : INTXT2 
      - Total Storage (in MB): 600 
      - Total Computer Time to Build (in hours): 80 
      - Automatic Process? (If not, number of manual hours): Yes. Using InTEXT 
        Retrieval Engine 
      - Use of Term Positions? : Yes 
      - Only Single Terms Used? : No 
    - Clusters           
    - N-grams, Suffix arrays, Signature Files           
    - Knowledge Bases            
      - Use of Manual Labor                  
    - Special Routing Structures           
    - Other Data Structures built from TREC text           
      - Run ID : INTXT2 
      - Type of Structure: PreciseScope documents 
      - Total Storage (in MB): 250 
      - Total Computer Time to Build (in hours): 40 
      - Automatic Process? (If not, number of manual hours): Yes 
      - Brief Description of Method: Each document is analysed; words and phrases 
        are selected for indexing as outlined above, and PreciseScope documents 
        and weighted keyword lists are created 
    - Other Data Structures built from TREC text           
      - Run ID : INTXT2 
      - Type of Structure: Weighted keywords and phrases (generated but not used 
        in tests) 
      - Total Storage (in MB): 100 
      - Total Computer Time to Build (in hours): Contained in PreciseScope times 
      - Automatic Process? (If not, number of manual hours): Yes 
      - Brief Description of Method: See above. 

    Data Built from Sources Other than the Input Text

    -  Internally-built Auxiliary File            
      - Use of Manual Labor                   
    -  Externally-built Auxiliary File            

  Query construction

    Automatically Built Queries (Ad-Hoc)

    - Method used in Query Construction          
      - Tokenizer? :                 
      - Expansion of Queries using Previously-Constructed Data Structure?:              

    Automatically Built Queries (Routing)

    - Method used in Query Construction          
      - Terms Selected From            
      - Term Weighting with Weights Based on terms in            
      - Phrase Extraction from            
      - Syntactic Parsing            
      - Word Sense Disambiguation using            
      - Proper Noun Identification Algorithm from            
      - Tokenizer             
      - Heuristic Associations to Add Terms from            
      - Expansion of Queries using Previously-Constructed Data Structure:              
      - Automatic Addition of Boolean connectors or Proximity Operators using 
        information from             

    Manually Constructed Queries (Ad-Hoc)

    - Topic Fields Used: None 
    - Average Time to Build Query (in Minutes): 30. Very variable. (Mainly due 
      to lack of knowledge of US current affairs) 
    - Type of Query Builder          
      - Computer System Expert: Yes 
    - Tools used to Build Query          
      - Word Frequency List? : No 
      - Knowledge Base Browser? :                 
      - Other Lexical Tools? :                
    - Method used in Query Construction          
      - Term Weighting? : Yes. Minimal. Usually one group of alternative terms 
        was mandatory 
      - Boolean Connectors (AND, OR, NOT)? : AND, OR for mandatory terms.
      - Proximity Operators? : Yes. Phrase and same paragraph 
      - Addition of Terms not Included in Topic? : Yes              
        - Source of Terms: General knowledge and research 
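
    The query construction described above (one mandatory group of alternative
    terms, plus phrase and same-paragraph proximity) can be sketched as a
    simple matching check. This is a hypothetical illustration, not the InTEXT
    Retrieval Engine query language: the function names, the blank-line
    paragraph convention, and the parameters are all assumptions.

```python
import re

def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def has_phrase(words, phrase):
    """True if the words of `phrase` occur consecutively in the token list."""
    target = tokens(phrase)
    n = len(target)
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))

def document_matches(document, mandatory_alternatives, same_paragraph=()):
    """Hypothetical query check.

    - At least one entry of `mandatory_alternatives` (an OR group) must occur.
    - If `same_paragraph` is given, all of its terms must co-occur in one paragraph.
    Paragraphs are assumed to be separated by blank lines.
    """
    paragraphs = [tokens(p) for p in re.split(r"\n\s*\n", document)]
    if not any(has_phrase(p, alt) for p in paragraphs for alt in mandatory_alternatives):
        return False
    if same_paragraph:
        return any(all(has_phrase(p, term) for term in same_paragraph) for p in paragraphs)
    return True

doc = "The Federal Reserve raised interest rates.\n\nMarkets fell on the news about rates."
```

    With this document, a query whose mandatory group is
    ["federal reserve", "central bank"] matches, because one alternative is
    present; a same-paragraph constraint on "markets" and "rates" is satisfied
    by the second paragraph.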

    Manually Constructed Queries (Routing)

    - Type of Query Builder          
    - Tools used to Build Query          
      - Knowledge Base Browser? :                 
      - Other Lexical Tools? :               
    - Data Used for Building Query from           
    - Method used in Query Construction          
      - Addition of Terms not Included in Topic? :               

    Interactive Queries

    - Type of Person doing Interaction            
    - Average Time to do Complete Interaction            
    - Methods used in Interaction         
      - Automatic Query Expansion from Relevant Documents? :                 
      - Manual Methods               

  Searching

    Search Times

      - Run ID : INTXT2 
      - Computer Time to Search (Average per Query, in CPU seconds): 120

    Machine Searching Methods

    - Machine Searching Methods       
      - Boolean Matching? : Yes, in part 

    Factors in Ranking

    - Factors in Ranking       
      - Term Frequency? : Yes 
      - Inverse Document Frequency? : No 
      - Other Term Weights? : Generated automatically by determining their 
        discriminating power in query or defined manually. 
      - Position in Document? : Yes 
      - Proximity of Terms? : Yes. Phrases and paragraphs 
      - Percentage of Query Terms which match? : Yes 


    Machine Information

    - Machine Type for TREC Experiment: Pentium 
    - Was the Machine Dedicated or Shared: Yes 
    - Amount of Hard Disk Storage (in MB): 4000 
    - Amount of RAM (in MB): 16 
    - Clock Rate of CPU (in MHz): 90 

    System Comparisons 

    - Amount of "Software Engineering" which went into the Development of 
      the System: Both InTEXT Retrieval Engine and InTEXT Precision are
      commercial products with an elapsed development time of 10+ years.
      Precise figures are not available. 
    - Given appropriate resources            
      - Could your system run faster? : Yes. Elimination of intermediate           
        files, multi-threading 
      - By how much (estimate)? : 5-10