System Summary and Timing
  Organization Name: George Mason University
  List of Run IDs: English: gmu1 (manual), gmu2 (automatic)
                   Corrupted: gmuc0, gmuc10
                   Spanish: gmumanual, gmuauto

  Construction of Indices, Knowledge Bases, and other Data Structures 

    Methods Used to build Data Structures 

    - Length (in words) of the stopword list: 144
    - Controlled Vocabulary? : no
    - Stemming Algorithm: none
      - Morphological Analysis: no
    - Term Weighting: yes, tf-idf
    - Phrase Discovery? : yes
      - Kind of Phrase: two adjacent terms, not separated by stop terms or
        punctuation
      - Method Used (statistical, syntactic, other): syntactic
    - Syntactic Parsing? : no
    - Word Sense Disambiguation? : no
    - Heuristic Associations (including short definition)? : no
    - Spelling Checking (with manual correction)? : no
    - Spelling Correction? : no
    - Proper Noun Identification Algorithm? : no
    - Tokenizer? : no
    - Manually-Indexed Terms? : no
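
    The phrase rule above (two adjacent terms, not separated by stop terms
    or punctuation) can be sketched as follows. This is a minimal
    illustration, not the system's actual code: the function name
    adjacent_phrases and the short STOPWORDS set are hypothetical, and the
    real 144-word stopword list is not given in this summary.

```python
import re

# Hypothetical stand-in for the 144-word stopword list (not given in the source).
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for"}

def adjacent_phrases(text, stopwords=STOPWORDS):
    """Return two-term phrases whose terms are directly adjacent,
    with no stop term or punctuation between them."""
    phrases = []
    # Split on punctuation first so that no phrase spans a punctuation mark.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        prev = None  # previous non-stop term in the current unbroken run
        for word in segment.split():
            if word in stopwords:
                prev = None  # a stop term breaks adjacency
                continue
            if prev is not None:
                phrases.append((prev, word))
            prev = word
    return phrases
```

    For example, adjacent_phrases("The retrieval of information systems,
    database machines") pairs "information" with "systems" and "database"
    with "machines", but not "retrieval" with "information" (separated by a
    stop term) or "systems" with "database" (separated by punctuation).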

    Statistics on Data Structures built from TREC Text

    - Inverted index           
      - Run ID : gmu1, gmu2 
      - Total Storage (in MB):  248.3 
      - Total Computer Time to Build (in hours): 2:52:15  
      - Automatic Process? (If not, number of manual hours): yes 
      - Use of Term Positions? : no 
      - Only Single Terms Used? : no 
    - Clusters           
    - N-grams, Suffix arrays, Signature Files           
      - Run ID : gmuc0, gmuc10, gmuman, gmuauto  
      - Automatic Process? (If not, number of manual hours): no 
      - Brief Description of Method: For corrupted data, 4-grams were used with
        automatic query reduction based on term frequency across the entire
        document collection. For Spanish, 5-grams were used with no query 
        reduction.           
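
    A minimal sketch of character n-gram extraction with frequency-based
    query reduction follows. The summary does not say how the reduction
    threshold was chosen or which n-grams were dropped, so the rule used here
    (discard query n-grams whose collection-wide count exceeds max_count) is
    an assumption, as are all function and parameter names.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return the overlapping character n-grams of text (case-folded,
    runs of whitespace collapsed to single spaces)."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def collection_frequencies(documents, n):
    """Count every n-gram occurrence across the whole document collection."""
    counts = Counter()
    for doc in documents:
        counts.update(char_ngrams(doc, n))
    return counts

def reduce_query(query, counts, n, max_count):
    """Assumed reduction rule: keep only query n-grams that occur at most
    max_count times in the collection, dropping very common ones."""
    return [g for g in char_ngrams(query, n) if counts[g] <= max_count]
```

    Under these assumptions, the corrupted-data runs would call the helpers
    with n=4 and some max_count, while the Spanish runs would use
    char_ngrams(query, 5) directly with no reduction step.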

  Query construction

    Automatically Built Queries (Ad-Hoc)

    - Topic Fields Used: all  
    - Average Computer Time to Build Query (in cpu seconds): 60  
    - Method used in Query Construction          
      - Term Weighting (weights based on terms in topics)? : yes 
      - Phrase Extraction from Topics? : yes 
      - Syntactic Parsing of Topics? : no 
      - Word Sense Disambiguation? :  no 
      - Proper Noun Identification Algorithm? : no 
      - Tokenizer? :
        - Patterns which are Tokenized: no
      - Heuristic Associations to Add Terms? : no  
        -  Structure Used:  none 
      - Automatic Addition of Boolean Connectors or Proximity Operators? : no 

    Manually Constructed Queries (Ad-Hoc)

    - Topic Fields Used:  all 
    - Average Time to Build Query (in Minutes): 10 
    - Type of Query Builder          
      - Domain Expert: no 
      - Computer System Expert: yes 
    - Tools used to Build Query          
      - Word Frequency List? : no 
      - Knowledge Base Browser? :                 
        - Structure Used: no 
      - Other Lexical Tools? :                
    - Method used in Query Construction          
      - Term Weighting? : yes 
      - Boolean Connectors (AND, OR, NOT)? : yes 
      - Proximity Operators? : no 
      - Addition of Terms not Included in Topic? : yes              
        - Source of Terms: none 

  Searching

    Search Times

      - Run ID : gmu2  
      - Computer Time to Search (Average per Query, in CPU seconds): 
        approximately 60 

    Machine Searching Methods

      - Vector Space Model? : yes  
      - N-gram Matching? : yes  
      - Boolean Matching? : yes

    Factors in Ranking

      - Term Frequency? : yes 
      - Inverse Document Frequency? : yes 
      - Document Length? :  yes (normalization)
      - Percentage of Query Terms which match? : no 
      - N-gram Frequency? : yes 
      - Word Specificity? : no  
      - Word Sense Frequency? : no 
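
    The ranking factors marked "yes" above (term frequency, inverse document
    frequency, document length normalization) are the ingredients of standard
    vector-space cosine ranking. The sketch below illustrates that general
    technique; the exact tf-idf variant and normalization used in the runs
    are not stated in this summary, so the formulas (raw tf, idf = log(N/df),
    cosine normalization) and all names are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists. Returns (vectors, idf), where each
    vector maps a term to tf * idf and idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log(n / d) for t, d in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity of two sparse vectors; dividing by both vector
    norms is what normalizes for document length."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query_tokens, docs):
    """Return document indices ordered from best to worst match."""
    vectors, idf = tfidf_vectors(docs)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query_tokens).items()}
    scored = [(cosine(query, vec), i) for i, vec in enumerate(vectors)]
    # Sort by descending score; ties are broken by document index.
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]
```

    The same functions apply unchanged when the "terms" are character n-grams
    rather than words, which is one way n-gram frequency can enter the
    ranking.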

    Machine Information: gmuc0, gmuc10, gmu1, gmuauto

    - Machine Type for TREC Experiment: SUN Sparc 2000, 18 processors
    - Was the Machine Dedicated or Shared:  dedicated 
    - Amount of Hard Disk Storage (in MB):  54000 
    - Amount of RAM (in MB): 2000 

    Machine Information: gmu1, gmuman

    - Machine Type for TREC Experiment: AT&T DBC-1012 Database Machine
    - Was the Machine Dedicated or Shared:  dedicated 
    - Amount of Hard Disk Storage (in MB): 25000 

    System Comparisons 

    - Amount of "Software Engineering" which went into the Development of the
      System: 1 person-year for the IR system, 1 person-year for the
      relational system
    - Given appropriate resources            
      - Could your system run faster? : yes  
      - By how much (estimate)? : IR system, 50 percent; relational system,
        25 percent
    - Features the System is Missing that would be beneficial: The IR
      prototype is currently not parallelized; both systems lack
      passage-based retrieval and relevance feedback.