System Summary and Timing
  Organization Name: Computer Technology Institute (CTI)
  List of Run ID's: Ctifr1  Ctifr2

  Construction of Indices, Knowledge Bases, and other Data Structures 

    Methods Used to build Data Structures 

    - Length (in words) of the stopword list: 429 
    - Stemming Algorithm: Porter's stemming algorithm              
    - Term Weighting: Inverse Document Frequency method 
    -  Phrase Discovery?: yes             
      - Kind of Phrase: pairs of words within a sentence 
      - Method Used (statistical, syntactic, other): statistical 
    -  Tokenizer?:              
    -  Other Techniques for building Data Structures: Document clustering
suited to the automatic extraction of small, tight thesaurus classes 
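
The phrase discovery entry above (pairs of words within a sentence, chosen
statistically) can be pictured with the following minimal sketch. It is an
illustration only, not the FIRE implementation: the sentence splitter, the
helper names and the min_count threshold are assumptions, and the actual
statistical criterion used by the system is not stated in this form.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_pairs(text, stopwords=frozenset()):
    """Yield, for each sentence, the set of unordered word pairs found in it."""
    for sentence in re.split(r"[.!?]+", text):
        words = [w for w in re.findall(r"[a-z]+", sentence.lower())
                 if w not in stopwords]
        # Sorting makes every pair canonical: (a, b) with a < b.
        yield set(combinations(sorted(set(words)), 2))


def discover_phrases(documents, min_count=2, stopwords=frozenset()):
    """Keep word pairs that co-occur in at least min_count sentences.

    min_count is a hypothetical threshold chosen for illustration.
    """
    counts = Counter()
    for doc in documents:
        for pairs in sentence_pairs(doc, stopwords):
            counts.update(pairs)
    return {pair: n for pair, n in counts.items() if n >= min_count}


docs = [
    "Information retrieval is fun. Parallel information retrieval is fast.",
    "Text retrieval systems.",
]
phrases = discover_phrases(docs, min_count=2, stopwords=frozenset({"is"}))
print(phrases)
```

In a full pipeline the words would first pass through the stopword list and
Porter's stemmer mentioned above; both steps are left out here to keep the
sketch short.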

    Statistics on Data Structures built from TREC Text

    - Inverted index           
      - Run ID: Ctifr1 
      - Total Storage (in MB): 220 MB 
      - Total Computer Time to Build (in hours): 2.5 hours 
      - Automatic Process? (If not, number of manual hours): yes 
      - Only Single Terms Used?: no (phrases and automatically constructed
thesaurus classes are also used)
    - Inverted index           
      - Run ID: Ctifr2 
      - Total Storage (in MB): 180 MB 
      - Total Computer Time to Build (in hours): 2 hours 
      - Automatic Process? (If not, number of manual hours): yes 
      - Only Single Terms Used?: no (phrases and automatically constructed
thesaurus classes are also used)
    - Clusters           
    - N-grams, Suffix arrays, Signature Files           
    - Knowledge Bases            
      - Use of Manual Labor                  
    - Special Routing Structures           
    - Other Data Structures built from TREC text           
      - Run ID: Ctifr1 
      - Type of Structure: Automatically constructed thesaurus 
      - Total Storage (in MB): 20 MB 
      - Total Computer Time to Build (in hours): 3 hours 
      - Automatic Process? (If not, number of manual hours): yes 
      - Brief Description of Method: Our automatic thesaurus construction
approach is based on statistical document clustering. First, the cosine
similarities between all pairs of documents are computed. Next, a set of
document clusters is extracted using a connected-components-based algorithm.
Each cluster is expected to contain documents that are relevant to each other.
Finally, a thesaurus class is extracted for each cluster, consisting of the
low-frequency terms (if any) common to all the documents in that cluster. Each
thesaurus class is then statistically weighted (frequency-based), separately
for each document in the corresponding cluster, and the final weighted classes
are organized as an inverted index (together with the single terms and phrases
referred to above).
    - Other Data Structures built from TREC text           
      - Run ID: Ctifr2 
      - Type of Structure: Automatically constructed thesaurus 
      - Total Storage (in MB): 14 MB 
      - Total Computer Time to Build (in hours): 2.5 hours 
      - Automatic Process? (If not, number of manual hours): yes 
      - Brief Description of Method: The same as described above for Ctifr1. The
Ctifr2 run differs only in the parameter values used. Specifically, Ctifr2 uses
parameter values (size of document clusters, threshold for identifying
low-frequency terms, etc.) that lead to fewer and tighter thesaurus classes
than in the Ctifr1 run.
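
The thesaurus construction steps described above (pairwise cosine similarities,
connected-components clustering, common low-frequency terms per cluster) can be
sketched roughly as follows. This is a toy illustration under stated
assumptions, not the authors' code: raw term counts stand in for the document
vectors, and the parameter names sim_threshold, max_df and max_cluster_size are
hypothetical stand-ins for the tunable values that differ between Ctifr1 and
Ctifr2. The per-document weighting of each class is omitted.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def connected_components(n, edges):
    """Group nodes 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def thesaurus_classes(docs, sim_threshold=0.5, max_df=2, max_cluster_size=5):
    """Return (cluster, class_terms) pairs; docs are lists of tokens.

    All three thresholds are hypothetical values chosen for illustration.
    """
    vectors = [Counter(d) for d in docs]
    df = Counter(t for v in vectors for t in v)  # document frequency
    n = len(docs)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if cosine(vectors[i], vectors[j]) >= sim_threshold]
    classes = []
    for cluster in connected_components(n, edges):
        if not 2 <= len(cluster) <= max_cluster_size:
            continue
        common = set.intersection(*(set(vectors[i]) for i in cluster))
        low_freq = sorted(t for t in common if df[t] <= max_df)
        if low_freq:  # a cluster may have no common low-frequency terms
            classes.append((cluster, low_freq))
    return classes


docs = [
    ["parallel", "transputer", "search"],
    ["parallel", "transputer", "index"],
    ["cooking", "recipe", "pasta"],
    ["search", "index", "query"],
]
classes = thesaurus_classes(docs)
print(classes)
```

Raising sim_threshold or lowering max_df and max_cluster_size in this sketch
yields fewer and tighter classes, which is the direction of the Ctifr2 settings
described above.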

  Query construction

    Automatically Built Queries (Ad-Hoc)

    - Topic Fields Used: Description fields 
    - Average Computer Time to Build Query (in cpu seconds): 0.05 seconds 
    - Method used in Query Construction          
      - Term Weighting (weights based on terms in topics)?: no 
      - Phrase Extraction from Topics?: yes (using the same statistical
method as described for the documents) 
      - Tokenizer?:                 
      - Expansion of Queries using Previously-Constructed Data Structure?: yes
      -  Structure Used:  Automatically constructed thesaurus classes 

    Searching

    Search Times

      - Run ID: Ctifr1 
      - Computer Time to Search (Average per Query, in CPU seconds): 
        - 0.8 seconds for parallel searching on the GCel3/512 Parsytec machine
        - 60 seconds for serial searching on the SPARC 10 station 
      - Component Times (for parallel searching on the GCel3/512 machine):
        NOTE 1: the TREC Category B data fit into the main memory of the 512
        transputers of the GCel, so there is no disk I/O overhead.
        - query construction time: 0.05 sec.
        - local scoring time: 0.2 sec.
        - local ranking time: 0.25 sec.
        - merging time: 0.1 sec.
        - communication times: 0.2 sec.
        NOTE 2: similar times should hold for larger collections, provided that
        they fit into the total main memory of the GCel machine (up to 2 GB).
        Approximate estimates can be obtained by scaling the local scoring and
        local ranking times (0.45 sec. in total) by the corresponding multiple;
        the query construction time, the merging time and the communication
        times would remain constant.
      -  Search Times             
      - Run ID: Ctifr2 
      - Computer Time to Search (Average per Query, in CPU seconds): Slightly
better than the times reported above for Ctifr1, owing to the smaller index
data 
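
As a worked example of the Ctifr1 component times and of NOTE 2 above, the
sketch below sums the five reported components and applies the scaling rule
from NOTE 2. The function name and the scale parameter (collection size
relative to Category B) are assumptions introduced for illustration, and the
result is only the rough estimate that NOTE 2 describes, valid while the
collection still fits in the 2 GB of main memory.

```python
# Component times reported above for Ctifr1 on the GCel3/512 (seconds).
COMPONENT_TIMES = {
    "query construction": 0.05,
    "local scoring": 0.20,
    "local ranking": 0.25,
    "merging": 0.10,
    "communication": 0.20,
}

# Per NOTE 2, only these components grow with the collection size.
SCALING_COMPONENTS = {"local scoring", "local ranking"}


def estimated_search_time(scale=1.0):
    """Estimate seconds per query for a collection `scale` times Category B."""
    return sum(
        t * (scale if name in SCALING_COMPONENTS else 1.0)
        for name, t in COMPONENT_TIMES.items()
    )


print(round(estimated_search_time(1.0), 2))  # Category B itself
print(round(estimated_search_time(2.0), 2))  # a collection twice as large
```

With scale = 1 the components add up to the 0.8 seconds reported for parallel
searching; with scale = 2 the estimate is 0.35 + 2 x 0.45 = 1.25 seconds.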

    Machine Searching Methods

      - Vector Space Model?: yes 

    Factors in Ranking

      - Term Frequency?: yes 
      - Inverse Document Frequency?: yes 
      - Document Length?: yes 
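
A minimal sketch of how the three ranking factors above can combine in a vector
space model is given below. The exact weighting formula of the system is not
stated in this form, so the common tf x log(N/df) weight with cosine length
normalization is assumed here; query terms are left unweighted, in line with
the "Term Weighting ... no" answer under Query construction. The function names
are hypothetical.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Return length-normalized tf-idf vectors; docs are lists of tokens."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(n / c) for t, c in df.items()}
    vectors = []
    for d in docs:
        weights = {t: tf * idf[t] for t, tf in Counter(d).items()}
        # Dividing by the vector norm accounts for document length.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


def rank(query, vectors, k=1000):
    """Rank documents by the summed weights of the distinct query terms."""
    terms = set(query)  # unweighted query terms
    scored = []
    for doc_id, vec in enumerate(vectors):
        score = sum(vec.get(t, 0.0) for t in terms)
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


docs = [
    ["parallel", "search", "search"],
    ["parallel", "index"],
    ["recipe", "pasta"],
]
vectors = build_vectors(docs)
print(rank(["search", "parallel"], vectors))
```

The default k = 1000 mirrors the size of the retrieved document sets mentioned
elsewhere in this summary.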

    Machine Information

    - Machine Type for TREC Experiment:
      - Sun/Sparc Station 10 (for both preprocessing and searching)
      - GCel3/512 Parsytec Parallel Machine (only for searching)
    - Was the Machine Dedicated or Shared:
      - Sparc 10: shared
      - GCel3/512: dedicated during experiment time
    - Amount of Hard Disk Storage (in MB): 2 GB 
    - Amount of RAM (in MB):
      - Sparc 10: 32 MB
      - GCel3/512: 2 GB total memory capacity across the 512 transputers
    - Clock Rate of CPU (in MHz): 110 MHz 

    System Comparisons 

    - Given appropriate resources            
      - Could your system run faster?: yes (concerning the preprocessing phase) 
      - By how much (estimate)?: 50% (preprocessing) 
    - Features the System is Missing that would be beneficial:  
      1. Syntactic parsing of documents and syntactic phrase indexing
      2. Tokenizer for dates, phone numbers in documents
      3. Externally built auxiliary files (thesaurus)
      4. Advanced automatic query extraction methods
         - appropriate weighting of query terms
         - syntactic query parsing and tokenizer
         - query expansion via auxiliary files
         - relevance interaction methods (our system already has a relevance
feedback method, but we did not have enough time to use it for the TREC
experiment)
         - other query construction methods (for TREC)     
                     
    Significant Areas of System

    - Brief Description of features in your system which you feel impact the 
system and are not answered by above questions:  
      1. Our work for TREC'96 was divided into two parts.  
         - The first part was testing our conventional serial processing FIRE
system on the large TREC collection.
         - The second, main part (not indicated above) was the integration of
our corresponding parallel system (PFIRE), which runs on the GCel3/512 machine,
in such a way that it could manage TREC collections. This involved appropriate
data distribution algorithms to balance the whole shared collection,
appropriate progressive merging methods for large RD sets (1000 docs) to keep
the overhead to a minimum, improvements to the synchronization of our
binary-tree VSM-based algorithm to further reduce the total communication
times, etc. The goal was to obtain very good speed-up measurements and very
fast average response times per query, and to prepare our system to perform
very fast searching on even larger text collections.
      2. Our group was new to TREC (Category B) and 
         - a lot of effort was needed to manage the text collection in our
systems (serial and parallel): many changes had to be made, we had not estimated
the required system resources correctly, and we had not understood the
philosophy behind the TREC documents and topics from the beginning.
         - we had planned a number of additional IR features for TREC'96, but in
the end we did not have the necessary time because of the difficulties
mentioned above.
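
The progressive merging of large retrieved document sets mentioned in item 1
above can be pictured with the following serial simulation of a binary-tree
merge. It is a sketch under assumptions, not PFIRE code: each inner list stands
for the locally ranked (score, doc_id) hits of one processor, already sorted by
decreasing score, and the function names are hypothetical. Truncating to the
top k at every tree level is one plausible way to keep the merging overhead
low; the form does not say how PFIRE actually does it.

```python
import heapq
from itertools import islice


def merge_top_k(left, right, k):
    """Merge two lists sorted by decreasing score and keep the best k hits."""
    merged = heapq.merge(left, right, key=lambda hit: -hit[0])
    return list(islice(merged, k))


def tree_merge(local_lists, k=1000):
    """Merge per-processor rankings pairwise, level by level, up a binary tree."""
    level = [hits[:k] for hits in local_lists]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                next_level.append(merge_top_k(level[i], level[i + 1], k))
            else:
                next_level.append(level[i])  # odd node passes through unchanged
        level = next_level
    return level[0] if level else []


local_lists = [
    [(0.9, "d1"), (0.4, "d2")],
    [(0.8, "d3"), (0.1, "d4")],
    [(0.7, "d5")],
]
top = tree_merge(local_lists, k=3)
print(top)
```

With p processors the tree has about log2(p) levels, and no node ever forwards
more than k hits to its parent.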