System Summary and Timing
  Organization Name: Rutgers University
  List of Run ID's: rutfspt, rutfglob. All responses are for Sub-run: DL2

  Construction of Indices, Knowledge Bases, and other Data Structures 

    Methods Used to build Data Structures 

    - Length (in words) of the stopword list:  0
    - Controlled Vocabulary?: no 
    - Stemming Algorithm:   none            
      - Morphological Analysis:  none
    - Term Weighting:  none
    -  Phrase Discovery?: yes - actually "string discovery"            
      - Method Used (statistical, syntactic, other): statistical: dictionary 
of n-grams built up using LZ78 method.
    -  Syntactic Parsing?:  no
    -  Word Sense Disambiguation?: no
    -  Heuristic Associations (including short definition)?: no 
    -  Spelling Checking (with manual correction)?: no 
    -  Spelling Correction?: no 
    -  Proper Noun Identification Algorithm?: no  
    -  Tokenizer?:              
      - Patterns which are tokenized:  no 
    -  Manually-Indexed Terms?: no 
    -  Other Techniques for building Data Structures: no  

    Statistics on Data Structures built from TREC Text

    - Inverted index            
    - Clusters            
    - N-grams, Suffix arrays, Signature Files           
      - Run ID: {{KB}} 
      - Total Storage (in MB): 26 (compressed) 
      - Total Computer Time to Build (in hours): 0.57 
      - Automatic Process? (If not, number of manual hours):  Automatic
      - Brief Description of Method: statistical: dictionaries of n-grams 
built up using the Lempel-Ziv-Welch (LZW) method.  Each dictionary is 
generated from all judged relevant documents for a single topic.
    - Knowledge Bases            
      - Use of Manual Labor                  
    - Special Routing Structures           
    - Other Data Structures built from TREC text           
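
    Illustration (not part of the submitted system): the dictionary
    construction described above could be sketched as follows. This is a
    minimal sketch under stated assumptions. The form names both LZ78 and
    LZW; the sketch uses LZW-style phrase growth. The function name
    build_lzw_dictionary and the use of a plain Python set are hypothetical
    choices made for illustration only.

```python
def build_lzw_dictionary(documents):
    """Collect character n-grams by an LZW-style parse of the given texts.

    `documents` stands in for the judged relevant documents of one topic
    (assumption: each is already a plain string).
    """
    dictionary = set()
    for text in documents:
        # LZW starts from the single-character alphabet; here the alphabet
        # is seeded from the characters that actually occur in the text.
        dictionary.update(text)
        current = ""
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                # Keep extending the longest phrase already known.
                current = candidate
            else:
                # New phrase: known prefix plus one more character.
                dictionary.add(candidate)
                current = ch
    return dictionary
```

    For the toy input ["abababab"] this yields the n-grams a, b, ab, ba,
    aba and abab. One dictionary would be built per topic.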

    Automatically Built Queries (Routing)

    - Topic Fields Used: all  
    - Average Computer Time to Build Query (in cpu seconds): 0
    - Method used in Query Construction          
      - Terms Selected From            
        - Topics: All 
        - Only Documents with Relevance Judgments: yes 
      - Term Weighting with Weights Based on terms in            
      - Phrase Extraction from            
      - Syntactic Parsing            
      - Word Sense Disambiguation using            
      - Proper Noun Identification Algorithm from            
      - Tokenizer             
      - Heuristic Associations to Add Terms from            
      - Expansion of Queries using Previously-Constructed Data Structure:              
      - Automatic Addition of Boolean connectors or Proximity Operators using 
information from 
      - Other: non-ASCII characters removed; all white space replaced by spaces.
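
    Illustration (not part of the submitted system): the text clean-up
    named under "Other" could be sketched as follows. The function name
    normalize is hypothetical, and the sketch assumes each white-space
    character is replaced one-for-one by a space, with no collapsing of
    runs; the form does not say which was done.

```python
def normalize(text):
    """Drop non-ASCII characters, then turn every white-space char into a space."""
    # Keep only code points in the 7-bit ASCII range.
    ascii_only = "".join(ch for ch in text if ord(ch) < 128)
    # Tabs, newlines and other white space each become a single space.
    return "".join(" " if ch.isspace() else ch for ch in ascii_only)
```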

  Searching

    Search Times

      - Run ID:  {{KB}}
      - Computer Time to Search (Average per Query, in CPU seconds): 0.01 
      - Component Times:  16 CPU hours 

    Machine Searching Methods

      - Vector Space Model?:  no
      - Probabilistic Model?:  no
      - Cluster Searching?:  no
      - N-gram Matching?:  yes. N-gram matching uses a greedy heuristic (LZW) to 
find a small number of n-grams from a particular dictionary that parse the 
entire query document.  Ranking is ordered by the ratio (original size in 
characters)/(final size in n-grams).
      - Boolean Matching?:  no
      - Fuzzy Logic?:   no
      - Free Text Scanning?: no 
      - Neural Networks?:  no
      - Conceptual Graph Matching?: no 
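
    Illustration (not part of the submitted system): the greedy n-gram
    parse and the ranking ratio described under N-gram Matching could be
    sketched as follows. The names greedy_parse_count, score and
    rank_documents are hypothetical. The sketch assumes longest-match-first
    parsing, and assumes that a character not covered by the dictionary is
    counted as one n-gram of its own; the form does not specify either
    detail.

```python
def greedy_parse_count(text, dictionary):
    """Number of n-grams used when parsing `text` greedily, longest match first."""
    max_len = max((len(g) for g in dictionary), default=1)
    count = 0
    i = 0
    while i < len(text):
        step = 1  # assumption: an unknown character is consumed on its own
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                step = length
                break
        i += step
        count += 1
    return count


def score(text, dictionary):
    """(original size in characters) / (final size in n-grams)."""
    if not text:
        return 0.0
    return len(text) / greedy_parse_count(text, dictionary)


def rank_documents(documents, dictionary):
    """Order documents for one topic, best-compressing document first."""
    return sorted(documents, key=lambda d: score(d, dictionary), reverse=True)
```

    A document that is covered by a few long n-grams from a topic's
    dictionary gets a high ratio, so it ranks above a document that has to
    be parsed almost character by character.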

    Factors in Ranking

      - Term Frequency?: no 
      - Inverse Document Frequency?: no 
      - Other Term Weights?: no  
      - Semantic Closeness?: no 
      - Position in Document?: no 
      - Syntactic Clues?:  no
      - Proximity of Terms?: yes: n-grams are permitted to straddle word 
boundaries or even include multiple words. 
      - Information Theoretic Weights?: no 
      - Document Length?: yes, rankings are normalized by query document length.
      - Percentage of Query Terms which match?: no 
      - N-gram Frequency?: no 
      - Word Specificity?: no 
      - Word Sense Frequency?: no 
      - Cluster Distance?: no 
      - Other:  Ranking is ordered by the ratio (original size in characters)/
(final size in n-grams).   

    Machine Information

    - Machine Type for TREC Experiment: SuperSPARC II (both)
    - Was the Machine Dedicated or Shared: shared (both)
    - Amount of Hard Disk Storage (in MB): 1024 (machine 1), 914 
(machine 2) 
    - Amount of RAM (in MB):  483 (machine 1), 168 (machine 2)
    - Clock Rate of CPU (in MHz):  60 (machine 1), 75 (machine 2)

    System Comparisons 

    - Amount of "Software Engineering" which went into the Development of the 
System:  ~100 staff-hours
    - Given appropriate resources            
      - Could your system run faster?:  yes 
      - By how much (estimate)?:  2-3 times
    - Features the System is Missing that would be beneficial: stripping 
of subsumed n-grams; a stop list.