System Summary and Timing
  Organization Name: Queens College, City University of New York
  List of Run ID's: Pircs1, Pircs2, PircsL, PircsC

  Construction of Indices, Knowledge Bases, and other Data Structures

    Methods Used to build Data Structures 

    - Length (in words) of the stopword list: 630 
    - Stemming Algorithm: Porter's algorithm            
    - Term Weighting: yes 
    -  Phrase Discovery? :            
      - Kind of Phrase: 2-word 
      - Method Used (statistical, syntactic, other): statistical           
    -  Tokenizer? :            
    -  Manually-Indexed Terms? : yes for pircs2 
    -  Other Techniques for building Data Structures:
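
The form states only that 2-word phrases are found statistically, after stopword removal and stemming. The sketch below shows one possible reading: count adjacent word pairs and keep the frequent ones. Everything in it is an assumption made for illustration: the tiny `STOPWORDS` set, the `min_count` threshold, the names `tokenize` and `discover_phrases`, and the identity `stem` placeholder that stands in for Porter's algorithm. It is not the system's actual procedure.

```python
from collections import Counter

# Hypothetical stand-in; the system's real stopword list has 630 entries.
STOPWORDS = {"the", "of", "a", "an", "in", "and", "to", "is"}


def tokenize(text, stem=lambda w: w):
    """Lowercase, split on whitespace, strip punctuation, drop stopwords, stem."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        if word and word not in STOPWORDS:
            tokens.append(stem(word))  # identity by default, not Porter
    return tokens


def discover_phrases(documents, min_count=2, stem=lambda w: w):
    """Return adjacent word pairs seen at least min_count times overall."""
    pair_counts = Counter()
    for text in documents:
        tokens = tokenize(text, stem)
        # Pairs are formed after stopword removal (an assumption).
        pair_counts.update(zip(tokens, tokens[1:]))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


docs = [
    "The interest rate of the central bank is rising.",
    "A rising interest rate hurts the housing market.",
]
phrases = discover_phrases(docs)
print(phrases)
```

A real Porter stemmer could be passed in through the `stem` argument without changing the rest of the sketch.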

    Statistics on Data Structures built from TREC Text

    - Inverted index           
    - Clusters           
    - N-grams, Suffix arrays, Signature Files           
    - Knowledge Bases           
      - Use of Manual Labor                
    - Special Routing Structures           
      - Run ID : pircsL, pircsC 
      - Type of Structure: Network of linked NODE and EDGE files, built 
        dynamically, capturing the query expansion terms and learnt weights
      - Total Storage (in MB): about 90MB for 10 queries per 500MB textbase 
      - Total Computer Time to Build (in hours): 2 
      - Automatic Process? (If not, number of manual hours): yes 
      - Brief Description of Method: built from direct files of queries and 
        documents, plus information on known relevant documents
    - Other Data Structures built from TREC text           
      - Run ID : Pircs1, pircs2, pircsL, pircsC 
      - Type of Structure: Compressed, truncated direct file; network of linked
        NODE and EDGE files built during query time
      - Total Storage (in MB): direct file - about 80MB per 500MB raw text; 
        network - about 60MB for 10 queries per 500MB textbase
      - Total Computer Time to Build (in hours): direct file - about 20 
        minutes; network - about 5 minutes per 10 queries 
      - Automatic Process? (If not, number of manual hours): Yes 
      - Brief Description of Method: built from direct files of queries and 
        documents

    Data Built from Sources Other than the Input Text

    -  Internally-built Auxiliary File            
      - Domain (independent or specific): independent 
      - Type of File (thesaurus, knowledge base, lexicon, etc.): Stop words 
      - Total Storage (in MB): .004 
      - Number of Concepts Represented: 630 
      - Type of Representation: array 
      - Total Computer Time to Modify for TREC (if already built): none; 
        file already exists 
      - Use of Manual Labor                 
    -  Internally-built Auxiliary File            
      - Domain (independent or specific): independent 
      - Type of File (thesaurus, knowledge base, lexicon, etc.): 2-word Phrases 
      - Total Storage (in MB): .005 
      - Number of Concepts Represented: 55599 
      - Type of Representation: array 
      - Total Computer Time to Build (in hours): already exists 
      - Total Computer Time to Modify for TREC (if already built): 
      - Use of Manual Labor                 
    -  Externally-built Auxiliary File            

  Query construction

    Automatically Built Queries (Ad-Hoc)

    - Topic Fields Used: desc 
    - Average Computer Time to Build Query (in cpu seconds): 3 sec 
    - Method used in Query Construction         
      - Term Weighting (weights based on terms in topics)? : yes 
      - Phrase Extraction from Topics? : yes, 2-word 
      - Tokenizer? :              
      - Expansion of Queries using Previously-Constructed Data Structure? : yes 
        -  Structure Used: Network 
      - Automatic Addition of Boolean Connectors or Proximity Operators? : 

    Automatically Built Queries (Routing)

    - Topic Fields Used: title, desc, narr, con 
    - Average Computer Time to Build Query (in cpu seconds): 18 
    - Method used in Query Construction         
      - Terms Selected From            
        - Topics: yes 
        - Only Documents with Relevance Judgments: yes 
      - Term Weighting with Weights Based on terms in            
        - Documents with Relevance Judgments: yes 
      - Phrase Extraction from            
        - Topics: yes 
        - Documents with Relevance Judgments: yes 
      - Syntactic Parsing            
      - Word Sense Disambiguation using            
      - Proper Noun Identification Algorithm from            
      - Tokenizer            
      - Heuristic Associations to Add Terms from            
      - Expansion of Queries using Previously-Constructed Data Structure:      
        -  Structure Used: Network 
      - Automatic Addition of Boolean connectors or Proximity Operators using 
        information from            

    Manually Constructed Queries (Ad-Hoc)

    - Topic Fields Used: desc 
    - Average Time to Build Query (in Minutes): about 3 hrs for 50 queries 
      (roughly 3.6 minutes per query) 
    - Type of Query Builder          
      - Computer System Expert: yes 
    - Tools used to Build Query          
      - Knowledge Base Browser? :              
      - Other Lexical Tools? :              
    - Method used in Query Construction         
      - Term Weighting? : yes 
      - Addition of Terms not Included in Topic? :              
        - Source of Terms: human knowledge 

  Searching

    Search Times

      - Run ID : PircsL, PircsC 
      - Computer Time to Search (Average per Query, in CPU seconds): about 4 
        min clock time for pircsL; about 3 times this for pircsC. 
      - Component Times : build network 4 min (per 10 queries); retrieval 
        33 min (per 10 queries); sort, merge, and reformat results 3 min 

    Machine Searching Methods

      - Probabilistic Model? : yes 
      - Boolean Matching? : yes 
      - Neural Networks? : yes 

    Factors in Ranking

      - Term Frequency? : yes 
      - Other Term Weights? : yes, within-doc term frequency, inverse 
        collection term frequency. 
      - Proximity of Terms? : yes, 2-word phrases 
      - Document Length? : yes 
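
The ranking factors listed above (within-document term frequency, inverse collection term frequency, and document length) can be combined in many ways, and the form does not give the formula the system uses. The sketch below is one minimal illustration only: term frequency divided by document length, multiplied by the log of total collection tokens over the term's collection count. The names `ictf` and `score` and the toy data are assumptions, not the system's actual weighting.

```python
import math
from collections import Counter


def ictf(term, collection_counts, total_tokens):
    """Inverse collection term frequency: log(total tokens / term occurrences)."""
    return math.log(total_tokens / collection_counts[term])


def score(query_terms, doc_tokens, collection_counts, total_tokens):
    """Sum over query terms of (tf / document length) * ictf."""
    tf = Counter(doc_tokens)
    doc_len = len(doc_tokens)
    total = 0.0
    for term in query_terms:
        # Skip terms absent from the document or the collection.
        if tf[term] and collection_counts[term]:
            weight = ictf(term, collection_counts, total_tokens)
            total += (tf[term] / doc_len) * weight
    return total


# Toy collection of two already-tokenized documents (illustrative only).
docs = [
    ["interest", "rate", "rise"],
    ["housing", "market", "rate", "rate"],
]
collection_counts = Counter(tok for doc in docs for tok in doc)
total_tokens = sum(collection_counts.values())

query = ["interest", "rate"]
scores = [score(query, doc, collection_counts, total_tokens) for doc in docs]
print(scores)
```

In this toy collection the first document ranks higher, because it matches the rarer term "interest" while the second matches only the common term "rate".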

    Machine Information

    - Machine Type for TREC Experiment: Sparc 10 model 30  
    - Was the Machine Dedicated or Shared: Dedicated 
    - Amount of Hard Disk Storage (in MB): 7000 
    - Amount of RAM (in MB): 128 

    System Comparisons

    - Amount of "Software Engineering" which went into the Development of the 
      System: some improvements in space and time efficiency were made 
    - Given appropriate resources           
      - Could your system run faster? : yes 
      - By how much (estimate)? : probably in about half the time. 
    - Features the System is Missing that would be beneficial: ability to 
      differentiate contexts 

    Significant Areas of System

    - Brief Description of features in your system which you feel impact the 
      system and are not answered by above questions:
      1. handles documents of varied length by segmenting them into 
         subdocuments of about 550 words;
      2. trains with subdocuments in routing;
      3. does not build a full inverted file;
      4. retrieves from subcollections and combines the ranked lists from 
         each into one final retrieval list; the subcollections are served 
         by one master lexicon;
      5. can combine multiple retrieval methods.
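
Items 1 and 4 above can be illustrated with a short sketch. It assumes a fixed cut every 550 words and assumes that scores from different subcollections are directly comparable, so that the ranked lists can be merged by score. The form states neither detail, and the names `segment` and `merge_ranked_lists`, the `#` suffix on subdocument IDs, and `top_k` are hypothetical.

```python
import heapq
from itertools import islice


def segment(doc_id, words, size=550):
    """Split a word list into consecutive subdocuments of at most `size` words."""
    return [
        (f"{doc_id}#{i // size}", words[i:i + size])
        for i in range(0, len(words), size)
    ]


def merge_ranked_lists(ranked_lists, top_k=1000):
    """Merge lists of (score, doc_id), each already sorted by descending score."""
    # heapq.merge expects ascending order, so sort on the negated score.
    merged = heapq.merge(*ranked_lists, key=lambda pair: -pair[0])
    return list(islice(merged, top_k))


# A 1200-word document yields subdocuments of 550, 550, and 100 words.
subdocs = segment("d1", ["w"] * 1200)
print([(sub_id, len(words)) for sub_id, words in subdocs])

# Two subcollections, each returning its own ranked list.
final = merge_ranked_lists(
    [[(0.9, "a"), (0.4, "b")], [(0.7, "c"), (0.1, "d")]],
    top_k=3,
)
print(final)
```

If scores were not comparable across subcollections, some normalization step would be needed before merging; the single master lexicon mentioned in item 4 may be what makes a direct merge reasonable, but that is a guess.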