System Summary and Timing

Organization Name: Queens College, City University of New York
List of Run IDs: pircsAAS, pircsAAL, pircsAM1, pircsAM2, pircsCw, pircsCwc, pircsRG0, pircsRG6

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to Build Data Structures
- Length (in words) of the stopword list: English - 630; Chinese - 2000
- Stemming Algorithm: Porter's algorithm
- Term Weighting: yes
- Phrase Discovery?:
  - Kind of Phrase: adjacent 2-word
  - Method Used (statistical, syntactic, other): statistical
- Tokenizer?:
- Manually-Indexed Terms?: yes for pircsAM1 and pircsAM2
- Other Techniques for building Data Structures:

Statistics on Data Structures Built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
  - Run ID: pircsRG0, pircsRG6
  - Type of Structure: network of linked NODE and EDGE files capturing the query expansion terms and learned weights, built dynamically
  - Total Storage (in MB): about 90 MB for 10 queries per 500 MB textbase
  - Total Computer Time to Build (in hours): 2
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: built from direct files of queries and documents and known relevant-document information
- Other Data Structures built from TREC text
  - Run ID: pircsAAS, pircsAAL, pircsAM1, pircsAM2, pircsCw, pircsCwc
  - Type of Structure: compressed, truncated direct file; network of linked NODE and EDGE files built at query time
  - Total Storage (in MB): direct file - about 80 MB per 500 MB raw text; network - about 60 MB for 10 queries per 500 MB textbase
  - Total Computer Time to Build (in hours): direct file - about 20 minutes; network - about 5 minutes per 10 queries
  - Automatic Process?
    (If not, number of manual hours): yes
  - Brief Description of Method: built from direct files of queries and documents

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): English stopwords
  - Total Storage (in MB): 0.004
  - Number of Concepts Represented: 630
  - Type of Representation: array
  - Total Computer Time to Modify for TREC (if already built): none; already exists
  - Use of Manual Labor
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): English adjacent 2-word phrases
  - Total Storage (in MB): 1
  - Number of Concepts Represented: 80134
  - Type of Representation: array
  - Total Computer Time to Build (in hours): already exists
  - Total Computer Time to Modify for TREC (if already built):
  - Use of Manual Labor
- Internally-built Auxiliary File
  - Domain (independent or specific): independent
  - Type of File (thesaurus, knowledge base, lexicon, etc.): Chinese stopword and lexicon list
  - Total Storage (in MB): 0.04
  - Number of Concepts Represented: about 15000
  - Type of Representation: array
  - Total Computer Time to Build (in hours): 0.5
  - Total Manual Time to Build (in hours): 200
  - Use of Manual Labor
    - Initial Core Manually Built to "bootstrap" for Completely Machine-Built Completion: yes; initial core of about 2000 entries
- Externally-built Auxiliary File

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: pircsAAS - desc; pircsAAL, pircsCw, pircsCwc - all sections
- Average Computer Time to Build Query (in CPU seconds): 3
- Method Used in Query Construction
  - Term Weighting (weights based on terms in topics)?: yes
  - Phrase Extraction from Topics?: English 2-word
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?: yes
    - Structure Used: network
  - Automatic Addition of Boolean Connectors or Proximity
    Operators?:

Automatically Built Queries (Routing)
- Topic Fields Used: title, desc, narr, con
- Average Computer Time to Build Query: 4 hours
- Method Used in Query Construction
  - Terms Selected From
    - Topics: yes
    - All Training Documents: training documents are selected via a genetic algorithm from the top retrieved documents with relevance judgments
    - Only Documents with Relevance Judgments: yes
  - Term Weighting with Weights Based on Terms in
    - Documents with Relevance Judgments: yes
  - Phrase Extraction from
    - Topics: yes
    - Documents with Relevance Judgments: yes
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
    - Structure Used: network
  - Automatic Addition of Boolean Connectors or Proximity Operators using information from

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: pircsAM1 - desc; pircsAM2 - all
- Average Time to Build Query (in minutes): pircsAM1 - about 30; pircsAM2 - about 3 hours for 50 queries
- Type of Query Builder
  - Computer System Expert: yes
- Tools Used to Build Query
  - Knowledge Base Browser?:
  - Other Lexical Tools?:
- Method Used in Query Construction
  - Term Weighting?: yes
  - Addition of Terms not Included in Topic?:
    - Source of Terms: pircsAM2 - human knowledge

Searching

Search Times
- Run ID: pircsAAS, pircsAAL, pircsAM1, pircsAM2
- Computer Time to Search (Average per Query, in CPU seconds): about 108 on 2 GB using 2-stage retrieval
  - Component Times: 1st stage - 48; 2nd stage - 60

Search Times
- Run ID: pircsRG0, pircsRG6
- Computer Time to Search (Average per Query, in CPU seconds): about 8 minutes clock time for pircsRG0 and pircsRG6
  - Component Times: build network - 8 min (per 10 queries); retrieval - 80 min (per 10 queries); sort, merge, and reformat results - 2 min

Machine Searching Methods
- Probabilistic Model?: yes
- Boolean Matching?: yes
- Neural Networks?: yes

Factors in Ranking
- Term Frequency?: yes
- Other Term Weights?: yes; within-document term frequency, inverse collection term frequency
- Proximity of Terms?: yes; 2-word phrases
- Document Length?: yes

Machine Information
- Machine Type for TREC Experiment: 2 x Sparc 10 model 30
- Was the Machine Dedicated or Shared: dedicated
- Amount of Hard Disk Storage (in MB): 13000
- Amount of RAM (in MB): 128

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: not much change from TREC-4
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: probably half the time
- Features the System is Missing that would be beneficial: more specificity in representation; ability to differentiate contexts

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by the above questions:
  1. Handles variable-length documents by segmenting them into subdocuments of about 550 words.
  2. Trains with subdocuments in routing.
  3. Training documents for routing are selected by a genetic-algorithm search.
  4. The system does not build a full inverted file.
  5. The system retrieves from subcollections and combines the ranked lists from each into one final retrieval list; the subcollections are served by one master lexicon.
  6. Can combine multiple retrieval methods.
  7. Supports Chinese text processing and retrieval in GB encoding.
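The phrase answers above (adjacent 2-word phrases, discovered statistically, with a stopword list) can be illustrated with a minimal sketch. The stopword list, frequency threshold, and function names here are illustrative assumptions, not the PIRCS implementation:

```python
# Sketch of statistical adjacent 2-word phrase discovery: count pairs of
# adjacent non-stopword terms across a collection and keep frequent ones.
from collections import Counter

STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}  # tiny stand-in list

def adjacent_pairs(tokens):
    """Yield adjacent 2-word candidates whose members are not stopwords."""
    for w1, w2 in zip(tokens, tokens[1:]):
        if w1 not in STOPWORDS and w2 not in STOPWORDS:
            yield (w1, w2)

def discover_phrases(docs, min_freq=2):
    """Keep adjacent pairs occurring at least min_freq times across docs."""
    counts = Counter()
    for doc in docs:
        counts.update(adjacent_pairs(doc.lower().split()))
    return {pair for pair, n in counts.items() if n >= min_freq}
```

A real phrase file built this way would be held as a sorted array, as the auxiliary-file answers above indicate.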
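The ranking factors listed above (within-document term frequency, inverse collection term frequency, document length) might be combined as in the following sketch. The exact PIRCS activation formulas are not given in this summary, so the normalization below is a generic assumption, with the 550-word subdocument size reused as the average-length constant:

```python
# Generic tf-icf scoring sketch, assuming a simple saturating length
# normalization; not the actual PIRCS weighting formula.
import math

def term_weight(tf, doc_len, coll_freq, coll_size, avg_len=550.0):
    """Weight one query term in one (sub)document."""
    norm_tf = tf / (tf + doc_len / avg_len)            # length-normalized tf
    icf = math.log((coll_size + 1) / (coll_freq + 1))  # inverse collection freq
    return norm_tf * icf

def score(query_terms, doc_tf, doc_len, coll_freqs, coll_size):
    """Sum per-term weights over the query terms."""
    return sum(
        term_weight(doc_tf.get(t, 0), doc_len, coll_freqs.get(t, 1), coll_size)
        for t in query_terms
    )
```

Under this sketch a document scores higher when it matches query terms more often and when those terms are rare in the collection, and long documents are damped by the length term.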
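Features 1 and 5 above (segmenting documents into ~550-word subdocuments, then combining ranked lists into one final list) can be sketched as follows. Taking the best-scoring subdocument as the parent document's score is one plausible combination rule; the summary does not specify how PIRCS merges subdocument scores:

```python
# Sketch of subdocument segmentation and rank combination; the max-over-
# subdocuments rule is an illustrative assumption.
def segment(tokens, size=550):
    """Split a document's token list into subdocuments of about `size` words."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)] or [[]]

def rank_documents(docs, score_fn):
    """Score every subdocument and rank parent documents by their
    best-scoring subdocument."""
    ranked = []
    for doc_id, tokens in docs.items():
        best = max(score_fn(sub) for sub in segment(tokens))
        ranked.append((doc_id, best))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked
```

The same merge step would apply across subcollections: each subcollection returns its own ranked list, and the lists are sorted together into one final retrieval list.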