System Summary and Timing

Organization Name: Computer Technology Institute (CTI)
List of Run ID's: Ctifr1 Ctifr2

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 429
- Stemming Algorithm: Porter's stemming algorithm
- Term Weighting: Inverse Document Frequency method
- Phrase Discovery?: yes
  - Kind of Phrase: pairs of words within a sentence
  - Method Used (statistical, syntactic, other): statistical
- Tokenizer?:
- Other Techniques for building Data Structures: Document clustering appropriate for automatic extraction of small and tight thesaurus classes

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: Ctifr1
  - Total Storage (in MB): 220 MB
  - Total Computer Time to Build (in hours): 2.5 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Only Single Terms Used?: no (phrases and automatically constructed thesaurus classes are also used)
- Inverted index
  - Run ID: Ctifr2
  - Total Storage (in MB): 180 MB
  - Total Computer Time to Build (in hours): 2 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Only Single Terms Used?: no (phrases and automatically constructed thesaurus classes are also used)
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID: Ctifr1
  - Type of Structure: Automatically constructed thesaurus
  - Total Storage (in MB): 20 MB
  - Total Computer Time to Build (in hours): 3 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Our automatic thesaurus construction approach is based on statistical document clustering. Initially, the cosine similarities among all the documents are computed. Afterwards, a set of document clusters is extracted using a Connected Components based algorithm.
These document clusters are expected to contain documents that are relevant to each other. Finally, for each such cluster, a thesaurus class is extracted, consisting of the low-frequency terms (if any) that are common to all the documents in that cluster. Each thesaurus class is then statistically weighted (frequency-based), separately for each document in the corresponding cluster, and the final weighted classes are organized as an inverted index (together with the single terms and phrases referred to above).
- Other Data Structures built from TREC text
  - Run ID: Ctifr2
  - Type of Structure: Automatically constructed thesaurus
  - Total Storage (in MB): 14 MB
  - Total Computer Time to Build (in hours): 2.5 hours
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: The same as described above. The difference in the Ctifr2 run lies in the parameter values used. Specifically, in Ctifr2 we use parameter values (size of document clusters, threshold for identification of the low-frequency terms, etc.) that lead to fewer and tighter thesaurus classes (compared to the Ctifr1 run).
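
The thesaurus construction steps described for the two runs (pairwise cosine similarities, connected-components clustering, extraction of the common low-frequency terms) can be illustrated with a minimal Python sketch. It is not the FIRE implementation: the function names, the raw term-count vectors, and the parameters sim_threshold, max_doc_freq and max_cluster_size are assumptions made for this example, standing in for the cluster-size and low-frequency thresholds mentioned above. The per-document weighting of each class is omitted.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def connected_components(n, edges):
    """Group node ids 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in edges:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def thesaurus_classes(docs, sim_threshold=0.5, max_doc_freq=2,
                      max_cluster_size=5):
    """Return (cluster, class_terms) pairs; all thresholds are hypothetical."""
    vectors = [Counter(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for v in vectors for t in v)
    # Link two documents when their cosine similarity reaches the threshold.
    edges = [(i, j) for i, j in combinations(range(len(docs)), 2)
             if cosine(vectors[i], vectors[j]) >= sim_threshold]
    classes = []
    for cluster in connected_components(len(docs), edges):
        if not 2 <= len(cluster) <= max_cluster_size:
            continue  # keep only small, tight clusters
        common = set.intersection(*(set(vectors[i]) for i in cluster))
        low_freq = sorted(t for t in common if doc_freq[t] <= max_doc_freq)
        if low_freq:  # a class exists only if such common terms exist
            classes.append((sorted(cluster), low_freq))
    return classes


docs = [
    ["parallel", "transputer", "search", "index"],
    ["parallel", "transputer", "merge", "index"],
    ["thesaurus", "cluster", "index"],
    ["stemming", "stopword", "index"],
]
print(thesaurus_classes(docs))
```

In this toy collection only the first two documents are similar enough to be linked, and the term "index" is excluded from their class because it occurs in every document. Lowering max_cluster_size or max_doc_freq in the sketch would yield fewer and tighter classes, which is the kind of parameter change that distinguishes Ctifr2 from Ctifr1.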
Query construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: Description fields
- Average Computer Time to Build Query (in cpu seconds): 0.05 seconds
- Method used in Query Construction
  - Term Weighting (weights based on terms in topics)?: no
  - Phrase Extraction from Topics?: yes (using the same statistical method as for the documents)
  - Tokenizer?:
  - Expansion of Queries using Previously-Constructed Data Structure?: yes
    - Structure Used: Automatically constructed thesaurus classes

Searching

Search Times
- Run ID: Ctifr1
- Computer Time to Search (Average per Query, in CPU seconds):
  - 0.8 seconds for parallel searching on the GCel3/512 Parsytec machine
  - 60 seconds for serial searching on a SPARC 10 station
- Component Times (for parallel searching on the GCel3/512 machine; NOTE 1: the TREC data for Category B fit into the main memory of the GCel3/512 transputers, so there is no disk I/O overhead):
  - query construction time: 0.05 sec.
  - local scoring time: 0.2 sec.
  - local ranking time: 0.25 sec.
  - merging time: 0.1 sec.
  - communication times: 0.2 sec.
  (NOTE 2: similar times could hold for larger collections, provided that they fit into the total main memory of the GCel machine -- up to 2 GB -- approximate estimates could be given by the corresponding multiples of the local scoring and local ranking times -- 0.45 sec.
-- the query construction time, the merging time and the communication times would remain constant)

Search Times
- Run ID: Ctifr2
- Computer Time to Search (Average per Query, in CPU seconds): Slightly better than the times given above for Ctifr1, due to the smaller index data

Machine Searching Methods
- Vector Space Model?: yes

Factors in Ranking
- Term Frequency?: yes
- Inverse Document Frequency?: yes
- Document Length?: yes

Machine Information
- Machine Type for TREC Experiment:
  - Sun/Sparc Station 10 (for both preprocessing and searching)
  - GCel3/512 Parsytec Parallel Machine (only for searching)
- Was the Machine Dedicated or Shared:
  - Sparc 10: shared
  - GCel3/512: dedicated during experiment time
- Amount of Hard Disk Storage (in MB): 2 GB
- Amount of RAM (in MB):
  - Sparc 10: 32 MB
  - GCel3/512: 2 GB total memory capacity across the 512 transputers
- Clock Rate of CPU (in MHz): 110 MHz

System Comparisons
- Given appropriate resources
  - Could your system run faster?: yes (concerning the preprocessing phase)
  - By how much (estimate)?: 50% (preprocessing)
- Features the System is Missing that would be beneficial:
  1. Syntactic parsing of documents and syntactic phrase indexing
  2. Tokenizer for dates and phone numbers in documents
  3. Externally built auxiliary files (thesaurus)
  4. Advanced automatic query extraction methods
     - appropriate query term weighting
     - syntactic query parsing and tokenizer
     - query expansion via auxiliary files
     - relevance interaction methods (our system already uses a relevance feedback method, but we did not have enough time to use it for the TREC experiment)
     - other query construction methods (for TREC)

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions:
  1. Our work for TREC'96 was divided into two parts.
     - The first part was testing our conventional serial-processing FIRE system on the large TREC collection.
     - The second, main part (not indicated above) was the integration of our corresponding parallel system (PFIRE), which runs on the GCel3/512 machine, in such a way that it could manage TREC collections. This involved appropriate data distribution algorithms to balance the whole shared collection, appropriate progressive merging methods for large RD sets (1000 docs) to keep the overhead to a minimum, improvements to the synchronization of our binary-tree VSM-based algorithm to further reduce the total communication times, etc. The goal was to obtain very good speed-up measurements and very fast average response times per query, and to prepare our system to manage and search even larger text collections very quickly.
  2. Our group was new to TREC (Category B), and
     - a lot of effort was needed to manage the text collection for our systems (serial and parallel): many changes had to be made, we had not estimated the required system resources correctly, and we had not understood the philosophy behind the TREC documents and topics from the beginning.
     - we had planned a number of additional IR features for TREC'96, but in the end we did not have the necessary time because of the difficulties mentioned above.
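
The local ranking and progressive, binary-tree merging steps mentioned above can be illustrated with a small serial Python simulation. This is only a sketch under stated assumptions, not the PFIRE code: the function names, the use of plain dictionaries as per-node score tables, and the truncation of every intermediate list to the k best entries are choices made for this example.

```python
import heapq


def local_top_k(scores, k):
    """Local ranking: the k best (score, doc_id) pairs of one partition, best first."""
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))


def merge_pair(left, right, k):
    """Merge two descending-sorted lists and keep only the k best entries."""
    merged = heapq.merge(left, right, reverse=True)
    return [item for _, item in zip(range(k), merged)]


def tree_merge(local_lists, k):
    """Pairwise (binary-tree) merge: each round roughly halves the number of lists."""
    level = list(local_lists)
    while len(level) > 1:
        nxt = [merge_pair(level[i], level[i + 1], k)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired list moves up to the next round unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0] if level else []


# Hypothetical per-node score tables (doc_id -> similarity to the query).
partitions = [
    {"d1": 0.9, "d2": 0.1},
    {"d3": 0.7, "d4": 0.8},
    {"d5": 0.3},
]
k = 3
local_lists = [local_top_k(p, k) for p in partitions]
print(tree_merge(local_lists, k))
```

Because every intermediate list is cut to k entries, no node in this sketch ever passes more than k (score, doc_id) pairs up the tree, which is one plausible way to keep merging and communication costs independent of the collection size.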