System Summary and Timing
  Organization Name: Siemens Corporate Research, Inc.
  List of Run ID's: siems1, siems2, siems3

  Construction of Indices, Knowledge Bases, and other Data Structures 

    Methods Used to build Data Structures 

    - Length (in words) of the stopword list:  571 
    - Stemming Algorithm:   standard SMART            
    - Term Weighting:  yes (cosine-normalized tf) 
    -  Phrase Discovery? : yes              
      - Kind of Phrase:  two words 
      - Method Used (statistical, syntactic, other): statistical    
    -  Tokenizer? :              
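
    The "cosine-normalized tf" weighting reported above can be illustrated
    with a minimal sketch. It assumes raw term counts as the tf component
    (the form does not say which tf variant was used) and takes tokens that
    are already stopped and stemmed; the function name is hypothetical and
    is not part of SMART.

```python
import math
from collections import Counter

def cosine_normalized_tf(tokens):
    """Map each term to its raw count divided by the vector's Euclidean length.

    Assumption: raw tf is used; other tf variants (e.g. log-scaled) would
    only change how the counts are computed before normalization.
    """
    tf = Counter(tokens)
    norm = math.sqrt(sum(count * count for count in tf.values()))
    if norm == 0:
        return {}  # empty document: nothing to weight
    return {term: count / norm for term, count in tf.items()}

weights = cosine_normalized_tf(["fusion", "fusion", "index"])
```

    After normalization every document vector has unit length, so the inner
    product of a query vector and a document vector is their cosine similarity.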

    Statistics on Data Structures built from TREC Text

    - Inverted index           
      - Run ID :  siems1 
      - Total Storage (in MB):  727 MB 
      - Total Computer Time to Build (in hours):  13.75 
      - Automatic Process? (If not, number of manual hours):  yes 
      - Use of Term Positions? : no 
      - Only Single Terms Used? :  no 
    - Inverted index           
      - Run ID :  siems2, siems3 
      - Total Storage (in MB): 10 individual indexes totaling 955 MB           
      - Total Computer Time to Build (in hours):  20 hours total 
      - Automatic Process? (If not, number of manual hours):  yes 
      - Use of Term Positions? : no 
      - Only Single Terms Used? :  no 
    - Clusters           
    - N-grams, Suffix arrays, Signature Files           
    - Knowledge Bases            
      - Use of Manual Labor                  
    - Special Routing Structures           
    - Other Data Structures built from TREC text           

    Data Built from Sources Other than the Input Text

    -  Internally-built Auxiliary File            
      - Domain (independent or specific): collection specific 
      - Type of File (thesaurus, knowledge base, lexicon, etc.): 
        siems2: training query clusters 
        siems3: ranks of relevant docs in training queries    
      - Total Storage (in MB): 
        siems2: ~0.8 per db 
        siems3: ~1 per db 
      - Total Computer Time to Build (in hours): 
        siems2: ~22 for training query retrieval + ~0.75 to cluster 
        siems3: ~22 for training query retrieval + ~0.5 to make query collection 
      - Use of Manual Labor                   
    -  Externally-built Auxiliary File            
      - Type of File (Treebank, WordNet, etc.): list of phrases generated by 
        Cornell from disk 1 
      - Total Storage (in MB): 2 
      - Number of Concepts Represented:  158099 

  Query construction

    Automatically Built Queries (Ad-Hoc)

    - Topic Fields Used:  description 
    - Average Computer Time to Build Query (in cpu seconds): 
      siems1: ~0.3 CPU secs 
      siems2: 0.2 CPU secs average per query per database 
      siems3: 0.5 CPU secs average per query (in Query db) 
    - Method used in Query Construction          
      - Term Weighting (weights based on terms in topics)? :  yes 
      - Phrase Extraction from Topics? : yes 
      - Tokenizer? :                 
      - Expansion of Queries using Previously-Constructed Data Structure? :              
  Searching

    Search Times

      - Run ID :  siems1 
      - Computer Time to Search (Average per Query, in CPU seconds): 74 
      - Component Times : search consists of: run initial query, get relevance 
        assessments, construct expanded query, run expanded query (did not time 
        individual pieces, sorry)    
    -  Search Times             
      - Run ID :  siems2 
      - Computer Time to Search (Average per Query, in CPU seconds): 73 
        (assuming parallel searching of dbs) 
      - Component Times : 72 CPU secs average per query per database searching 
        plus .9 CPU secs average per query for actual merging    
    -  Search Times             
      - Run ID :  siems3 
      - Computer Time to Search (Average per Query, in CPU seconds): 103 
        (assuming parallel searching of dbs) 
      - Component Times : 72 CPU secs average per query per database searching 
        plus 31 CPU secs average per query for actual merging    
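
    The per-query totals for siems2 and siems3 follow from the component
    times under the stated assumption that all databases are searched in
    parallel. A small arithmetic check (the function name is hypothetical):

```python
def parallel_search_time(per_db_search_secs, merge_secs):
    """Per-query CPU time when every database is searched in parallel.

    With parallel searching, the per-database search time is paid once,
    then the merging time is added on top.
    """
    return per_db_search_secs + merge_secs

siems2_total = parallel_search_time(72, 0.9)  # reported as 73 after rounding
siems3_total = parallel_search_time(72, 31)   # reported as 103
```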

    Machine Searching Methods

      - Vector Space Model? :  yes 
      - Cluster Searching? :  only in siems2 (for query clusters) 

    Factors in Ranking

      - Term Frequency? :  yes 
      - Inverse Document Frequency? :  yes 
      - Document Length? :  yes (cosine) 
      - Other:  probabilistic creation of the merged ranked set in runs siems2 
        and siems3.  The document for each next rank is taken from a randomly 
        selected db, with selection biased by the number of documents in each 
        db that remain to be added to the final ranking. 
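
    The probabilistic merging step described under "Other" can be sketched
    as follows. This is an illustration under stated assumptions, not the
    experimental fusion code: it assumes each db has already been assigned
    a number of documents to contribute (the `quotas` argument), and all
    names here are hypothetical.

```python
import random

def probabilistic_merge(rankings, quotas, rng=None):
    """Merge per-db rankings into one ranked list.

    rankings: {db: [doc ids in that db's retrieval order]}
    quotas:   {db: number of docs the db contributes} (assumed given)
    At each rank a db is drawn with probability proportional to the number
    of its documents still to be added; its next-best document is appended.
    """
    rng = rng or random.Random()
    remaining = {db: min(quotas.get(db, 0), len(docs))
                 for db, docs in rankings.items()}
    next_pos = {db: 0 for db in rankings}
    merged = []
    while True:
        candidates = [db for db, n in remaining.items() if n > 0]
        if not candidates:
            break
        db = rng.choices(candidates,
                         weights=[remaining[d] for d in candidates], k=1)[0]
        merged.append(rankings[db][next_pos[db]])
        next_pos[db] += 1
        remaining[db] -= 1
    return merged

example = probabilistic_merge(
    {"A": ["a1", "a2", "a3"], "B": ["b1", "b2"]},
    {"A": 2, "B": 1},
    rng=random.Random(0),
)
```

    Documents from the same db keep their original relative order; only the
    interleaving across dbs is random.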

    Machine Information

    - Machine Type for TREC Experiment:  SPARC-10/41 
    - Was the Machine Dedicated or Shared:  mostly dedicated 
    - Amount of Hard Disk Storage (in MB):  ~ 13,000 
    - Amount of RAM (in MB):  128 
    - Clock Rate of CPU (in MHz):  40 MHz 

    System Comparisons 

    - Amount of "Software Engineering" which went into the Development of the 
      System:  fusion code completely experimental.  Retrieval done by SMART, 
      a well-tuned research prototype.
    - Given appropriate resources            
      - Could your system run faster? :  yes 
      - By how much (estimate)? :  time at least halved for MRDD fusion; would 
        need to approximate the optimization problem