System Summary and Timing

Organization Names:
  - GE Corporate Research & Development
  - School of Communication, Information and Library Studies, Rutgers University
  - Lockheed Martin Corporation
  - Department of Computer Science, New York University

List of Run IDs: genrl1, genrl2, genrl3, genrl4, genrl5, genrl6, genlp1, genlp2, genlp3, genlp4, sbase1, sbase2

Construction of Indices, Knowledge Bases, and Other Data Structures

Methods Used to Build Data Structures
  - Length (in words) of the stopword list: 429
  - Controlled Vocabulary?: no
  - Stemming Algorithm: yes, lexicon-based
  - Morphological Analysis: partial
  - Term Weighting: yes, tf.idf (see the weighting sketch at the end of this section)
  - Phrase Discovery?: yes
    - Kind of Phrase: syntactic
    - Method Used (statistical, syntactic, other): lexical noun phrases; syntactic pairs with statistical disambiguation
  - Syntactic Parsing?: yes (for pairs)
  - Word Sense Disambiguation?: no
  - Heuristic Associations (including short definition)?: no
  - Spelling Checking (with manual correction)?: no
  - Spelling Correction?: no
  - Proper Noun Identification Algorithm?: yes
  - Tokenizer?: yes
    - Patterns which are Tokenized: phrases
  - Manually-Indexed Terms?: no
  - Other Techniques for Building Data Structures: none

Statistics on Data Structures Built from TREC Text
  - Inverted index
    - Run ID: adhoc (genrl[1234])
      - Total Storage (in MB): 1,732 (4 streams total)
      - Total Computer Time to Build (in hours): 14 (4 streams total)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no
    - Run ID: routing (genrl[56])
      - Total Storage (in MB): 1,395 (4 streams total)
      - Total Computer Time to Build (in hours): 3.5 (4 streams total)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no
    - Run ID: adhoc NLP track (genlp[14])
      - Total Storage (in MB): 1,268 (4 streams total)
      - Total Computer Time to Build (in hours): 3 (4 streams total)
      - Automatic Process? (If not, number of manual hours): yes
      - Use of Term Positions?: no
      - Only Single Terms Used?: no
    - Run ID: adhoc NLP track (genlp[23]); sbase[12]
      - Total Storage (in MB): 768
      - Total Computer Time to Build (in hours): 2.3
  - Clusters
  - N-grams, Suffix Arrays, Signature Files
  - Knowledge Bases
  - Special Routing Structures

Data Built from Sources Other than the Input Text
  - Internally-built Auxiliary File
    - Domain (independent or specific): independent
    - Type of File (thesaurus, knowledge base, lexicon, etc.): a list of hyphenated words extracted from the corpus
    - Total Storage (in MB): 0.78
    - Number of Concepts Represented: no
    - Type of Representation: no
    - Total Computer Time to Build (in hours): 0.14
    - Total Computer Time to Modify for TREC (if already built): 0
    - Total Manual Time to Build (in hours): 0
    - Total Manual Time to Modify for TREC (if already built): 0
    - Use of Manual Labor
  - Externally-built Auxiliary File
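The indexes reported above store tf.idf weights over stemmed single terms, pairs, and phrases, split across several streams. The following is a minimal sketch of one way such a weighted inverted index can be built, assuming a log-tf times idf weight and a plain Python dictionary as the postings structure; the multi-stream layout, the exact SMART weighting variant, and the function names are illustrative assumptions, not the implementation used for these runs.

    # Hypothetical sketch only: a single-stream, tf.idf-weighted inverted index.
    import math
    from collections import Counter, defaultdict

    def build_index(docs):
        """docs: dict mapping doc_id -> list of already-stemmed terms."""
        postings = defaultdict(dict)                 # term -> {doc_id: raw tf}
        for doc_id, terms in docs.items():
            for term, tf in Counter(terms).items():
                postings[term][doc_id] = tf

        n_docs = len(docs)
        index = {}
        for term, plist in postings.items():
            idf = math.log(n_docs / len(plist))      # inverse document frequency
            index[term] = {doc_id: (1.0 + math.log(tf)) * idf   # log-tf * idf weight
                           for doc_id, tf in plist.items()}
        return index

    if __name__ == "__main__":
        sample = {"d1": ["joint", "venture", "joint"],
                  "d2": ["venture", "capital"]}
        print(build_index(sample))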
Query Construction

Automatically Built Queries (Ad-Hoc)
  - Topic Fields Used: DESC field for genrl1; all fields for genrl[234]
  - Average Computer Time to Build Query (in CPU seconds): 5
  - Method Used in Query Construction
    - Term Weighting (weights based on terms in topics)?: yes
    - Phrase Extraction from Topics?: yes
    - Syntactic Parsing of Topics?: yes
    - Word Sense Disambiguation?: no
    - Proper Noun Identification Algorithm?: yes
    - Tokenizer?: yes
      - Patterns which are Tokenized: phrases, names
    - Heuristic Associations to Add Terms?: no
    - Expansion of Queries using Previously-Constructed Data Structure?: no
    - Automatic Addition of Boolean Connectors or Proximity Operators?: no

Automatically Built Queries (Routing)
  - Topic Fields Used: all fields for genrl[56]
  - Average Computer Time to Build Query (in CPU seconds): 473
  - Method Used in Query Construction (see the routing-weight sketch at the end of this section)
    - Terms Selected from
      - Topics: yes
      - All Training Documents: no
      - Only Documents with Relevance Judgments: a subset of them
    - Term Weighting with Weights Based on Terms in
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Phrase Extraction from
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Syntactic Parsing of
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Word Sense Disambiguation using
      - Topics: no
      - All Training Documents: no
      - Documents with Relevance Judgments: no
    - Proper Noun Identification Algorithm from
      - Topics: yes
      - All Training Documents: no
      - Documents with Relevance Judgments: yes
    - Tokenizer
      - Patterns which are Tokenized (dates, phone numbers, common patterns, etc.): phrases, names
      - from Topics: yes
      - from All Training Documents: no
      - from Documents with Relevance Judgments: yes
    - Heuristic Associations to Add Terms from
      - Topics: no
      - All Training Documents: no
      - Documents with Relevance Judgments: no
    - Expansion of Queries using Previously-Constructed Data Structure
      - Structure Used: no
    - Automatic Addition of Boolean Connectors or Proximity Operators using Information from
      - Topics: no
      - All Training Documents: no
      - Documents with Relevance Judgments: no

Manually Constructed Queries (Ad-Hoc)
  - Topic Fields Used: all
  - Average Time to Build Query (in minutes): 60
  - Type of Query Builder
    - Domain Expert: no
    - Computer System Expert: no
  - Tools Used to Build Query
    - Word Frequency List?: no
    - Knowledge Base Browser?: no
      - Structure Used: no
    - Other Lexical Tools?: ComLex
  - Method Used in Query Construction
    - Term Weighting?: yes
    - Boolean Connectors (AND, OR, NOT)?: no
    - Proximity Operators?: no
    - Addition of Terms not Included in Topic?: yes
      - Source of Terms: top 10 retrieved documents
    - Other: no

Interactive Queries (Ad-Hoc, Manual)
  - Initial Query Built Automatically or Manually: automatically
  - Type of Person Doing Interaction
    - Domain Expert: no
    - System Expert: no
  - Average Time to do Complete Interaction
    - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): 60
    - Average Number of Iterations: 2
    - Average Number of Documents Examined per Iteration: 10
    - Minimum Number of Iterations: 2
    - Maximum Number of Iterations: 3 (for about 10 hard queries)
    - What Determines the End of an Iteration: out of time
  - Methods Used in Interaction
    - Automatic Term Reweighting from Relevant Documents?: no
    - Automatic Query Expansion from Relevant Documents?: no
      - All Terms in Relevant Documents Added: no
      - Only Top X Terms Added (what is X): no
      - User-Selected Terms Added: yes
    - Other Automatic Methods: no
    - Manual Methods
      - Using Individual Judgment (No Set Algorithm)?: yes
      - Following a Given Algorithm (Brief Description)?: no
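For the routing runs, term weights are drawn from the topics and from a subset of the documents with relevance judgments, as noted above. The sketch below shows a generic Rocchio-style way of combining those sources into a weighted routing query; the alpha/beta/gamma constants, the bag-of-words representation, and the function name are illustrative assumptions rather than the formula actually used for genrl[56].

    # Hypothetical Rocchio-style routing query: topic terms reinforced by
    # relevance-judged training documents.  Constants are illustrative only.
    from collections import Counter

    def routing_query(topic_terms, relevant_docs, nonrelevant_docs,
                      alpha=1.0, beta=0.75, gamma=0.15):
        """topic_terms: list of terms; *_docs: lists of term lists."""
        query = Counter()
        for term, tf in Counter(topic_terms).items():
            query[term] += alpha * tf                # start from the topic

        for doc in relevant_docs:                    # reward terms seen in relevant docs
            for term, tf in Counter(doc).items():
                query[term] += beta * tf / max(len(relevant_docs), 1)

        for doc in nonrelevant_docs:                 # penalize terms from non-relevant docs
            for term, tf in Counter(doc).items():
                query[term] -= gamma * tf / max(len(nonrelevant_docs), 1)

        return {term: weight for term, weight in query.items() if weight > 0}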
Searching

Search Times (Average Computer Time to Search per Query, in CPU seconds)
  - genrl1: 8
  - genrl2: 10
  - genrl3: 90
  - genrl4: 100
  - genrl5: 42
  - genlp1: 5
  - genlp2: 6
  - genlp3: 8
  - genlp4: 9
  - sbase1: 4
  - sbase2: 6

Machine Searching Methods
  - Vector Space Model?: yes (see the scoring sketch at the end of this summary)
  - Probabilistic Model?: yes (genrl6 only)
  - Cluster Searching?: no
  - N-gram Matching?: yes; SMART bi-gram matching is used in the NLP track runs (sbase[12], genlp[23])
  - Boolean Matching?: no
  - Fuzzy Logic?: no
  - Free Text Scanning?: no
  - Neural Networks?: no
  - Conceptual Graph Matching?: no
  - Other: no

Factors in Ranking
  - Term Frequency?: yes
  - Inverse Document Frequency?: yes
  - Other Term Weights?: no
  - Semantic Closeness?: no
  - Position in Document?: no
  - Syntactic Clues?: yes; pairs (2 words) and phrases (up to 7 words)
  - Proximity of Terms?: yes (bi-grams)
  - Information Theoretic Weights?: no
  - Document Length?: yes
  - Percentage of Query Terms which Match?: no
  - N-gram Frequency?: no
  - Word Specificity?: no
  - Word Sense Frequency?: yes (the weight of a term having a single sense is increased)
  - Cluster Distance?: no
  - Other: no

Machine Information
  - Machine Type for TREC Experiment: SPARC 1000 server
  - Was the Machine Dedicated or Shared: shared (it is the file server)
  - Amount of Hard Disk Storage (in MB): 12,000
  - Amount of RAM (in MB): 2,000
  - Clock Rate of CPU (in MHz): 40

System Comparisons
  - Amount of "Software Engineering" which Went into the Development of the System: none
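The runs above rank documents with a vector-space model whose main factors are term frequency, inverse document frequency, and document length. The sketch below illustrates that kind of scoring against an index like the one sketched earlier: a dot product of query and document weights followed by a simple length penalty. The normalization shown is an assumption for illustration, not the exact length-normalization variant used in these runs.

    # Hypothetical vector-space ranking: dot product of tf.idf weights with a
    # simple document-length penalty.
    def rank(index, query_weights, doc_lengths):
        """index: term -> {doc_id: weight}; query_weights: term -> weight;
        doc_lengths: doc_id -> document length in terms."""
        scores = {}
        for term, q_weight in query_weights.items():
            for doc_id, d_weight in index.get(term, {}).items():
                scores[doc_id] = scores.get(doc_id, 0.0) + q_weight * d_weight
        # divide by length so long documents are not favored outright
        return sorted(((score / doc_lengths[doc_id], doc_id)
                       for doc_id, score in scores.items()), reverse=True)

In a multi-stream setup such as the one implied by the index statistics above, each stream (stems, pairs, phrases, names) would contribute a separately weighted score that is then merged into the final ranking.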