System Summary and Timing

Organization Name: University of North Carolina
List of Run ID's: uncis1, uncis2 (both are Category B manual ad hoc runs)

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Length (in words) of the stopword list: 571
- Controlled Vocabulary?: no
- Stemming Algorithm: Modified Lovins (SMART v. 11.0)
- Morphological Analysis: no
- Term Weighting: SMART's "lnc" weights for document term weights. SMART's "ltc" weights for query term weights for initial ranking.
- Phrase Discovery?:
- Tokenizer?:

Statistics on Data Structures built from TREC Text
- Inverted index
  - Run ID: uncis1, uncis2
  - Total Storage (in MB): 184
  - Total Computer Time to Build (in hours): approx. 1.3 hours total (Estimated 53 min. to pre-process documents for SMART using SPARCcenter 1000. 25.5 min. for SMART to index documents and topics [here, used SPARCcenter but time is from Sun Ultra].)
  - Automatic Process? (If not, number of manual hours): yes
  - Use of Term Positions?: no
  - Only Single Terms Used?: yes
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
- Special Routing Structures
- Other Data Structures built from TREC text
  - Run ID: uncis1, uncis2
  - Type of Structure: Sequential document index for each topic
  - Total Storage (in MB): Average of 13.6 MB per topic
  - Total Computer Time to Build (in hours): Average of 11 minutes per topic. (Time does not include some pre-processing. Total time: ?? - Greater than an hour on SPARCcenter 1000.)
  - Automatic Process? (If not, number of manual hours): yes
  - Brief Description of Method: Inverted index created by SMART used for initial ranking of documents only. A sequential document index is made for each topic consisting of top 5000 documents of initial ranking. Documents in index are sorted in increasing order of rank (i.e., 1, 2, 3, ...).
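The initial ranking and the per-topic sequential index described above can be illustrated with a minimal sketch. It is not the SMART v. 11.0 code used for the runs: it assumes the standard reading of SMART's weighting triples ("l" = 1 + ln tf, "t" = ln(N/df), "n" = no idf factor, "c" = cosine normalization), assumes documents and topics arrive already tokenized, stopped, and stemmed, and uses hypothetical function names (`lnc_weights`, `ltc_weights`, `build_topic_index`).

```python
import math
from collections import Counter


def lnc_weights(tokens):
    """Document weights, SMART "lnc": log tf, no idf, cosine normalization."""
    tf = Counter(tokens)
    raw = {t: 1.0 + math.log(f) for t, f in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}


def ltc_weights(tokens, doc_freq, n_docs):
    """Query weights, SMART "ltc": log tf times idf, cosine normalization."""
    tf = Counter(tokens)
    raw = {
        t: (1.0 + math.log(f)) * math.log(n_docs / doc_freq[t])
        for t, f in tf.items()
        if doc_freq.get(t, 0) > 0  # terms absent from the collection are skipped
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm > 0 else {}


def build_topic_index(query_tokens, docs, k=5000):
    """Rank all documents for one topic and keep the top k, ordered by rank.

    docs maps a document id to its token list. The result is a list of
    (rank, doc_id, score) tuples with rank 1, 2, 3, ... (a stand-in for the
    sequential document index; the on-disk format is not modeled here).
    """
    n_docs = len(docs)
    doc_freq = Counter(t for tokens in docs.values() for t in set(tokens))
    q = ltc_weights(query_tokens, doc_freq, n_docs)
    scored = []
    for doc_id, tokens in docs.items():
        d = lnc_weights(tokens)
        score = sum(w * d.get(t, 0.0) for t, w in q.items())  # inner product
        scored.append((doc_id, score))
    # Highest score first; ties are broken by document id (an assumption).
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return [
        (rank, doc_id, score)
        for rank, (doc_id, score) in enumerate(scored[:k], start=1)
    ]
```

Calling `build_topic_index(topic_tokens, docs)` once per topic yields one rank-ordered list per topic, which later feedback iterations can scan sequentially instead of going back to the inverted index.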
Query construction

Interactive Queries
- Initial Query Built Automatically or Manually: automatically (DESC field used), using SMART v. 11.0
- Type of Person doing Interaction
  - Domain Expert: no
  - System Expert: yes
- Average Time to do Complete Interaction
  - CPU Time (Total CPU Seconds for all Iterations):
    uncis1 - estimated at approx. 125 minutes (involves reading from and writing to files)
    uncis2 - real time no more than 5 seconds
  - Clock Time from Initial Construction of Query to Completion of Final Query (in minutes): approx. 50 minutes, but varied radically
- Average Number of Iterations: An "iteration" is the examination of one set of retrieved documents. The documents were ranked once more after the last set was examined, so a search with 3 "iterations," for example, produces 4 "rankings."
    uncis1 - 2.28
    uncis2 - 2.32
- Average Number of Documents Examined per Iteration: Documents retrieved in a previous iteration were not examined again and are therefore not included in the average.
    uncis1 - 23.42, but varied radically
    uncis2 - approx. 30, but varied radically
- Minimum Number of Iterations: (no relevant documents found in initial ranking)
- Maximum Number of Iterations: uncis1 - 4; uncis2 - 5
- What Determines the End of an Iteration: No further benefit anticipated.
- Methods used in Interaction
  - Automatic Term Reweighting from Relevant Documents?: yes
  - Automatic Query Expansion from Relevant Documents?: yes
    - All Terms in Relevant Documents added: yes
      uncis1 - all terms also added from selected non-relevant documents
  - Manual Methods: The only manual intervention is the relevance assessments.

Searching

Search Times
- Run ID: uncis1, uncis2
- Computer Time to Search (Average per Query, in CPU seconds): 24 seconds (for SMART initial ranking); time includes writing to files
- Component Times: ??
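The feedback step listed under "Methods used in Interaction" (all terms from relevant documents added and reweighted, plus, for uncis1, terms from selected non-relevant documents) can be sketched with a Rocchio-style update. This is an assumption made for illustration: the summary does not state which feedback formula was used, and the function name `rocchio_expand`, the coefficients `alpha`, `beta`, `gamma`, and the choice to drop non-positive weights are all hypothetical.

```python
from collections import defaultdict


def rocchio_expand(query, relevant, nonrelevant=(),
                   alpha=1.0, beta=1.0, gamma=0.5):
    """Return an expanded, reweighted query vector (term -> weight).

    query       : dict of term -> weight for the current query
    relevant    : list of document vectors judged relevant
    nonrelevant : list of document vectors judged non-relevant
    The coefficients are illustrative defaults, not values from the runs.
    """
    new_query = defaultdict(float)
    for term, weight in query.items():
        new_query[term] += alpha * weight
    # Every term of every relevant document enters the query.
    for vec in relevant:
        for term, weight in vec.items():
            new_query[term] += beta * weight / len(relevant)
    # Terms from selected non-relevant documents are added with negative sign.
    for vec in nonrelevant:
        for term, weight in vec.items():
            new_query[term] -= gamma * weight / len(nonrelevant)
    # Assumption: terms whose weight ends up non-positive are dropped.
    return {t: w for t, w in new_query.items() if w > 0}
```

Each iteration would call this with the documents judged since the previous ranking and then re-rank the topic's stored documents with the returned vector.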
Machine Searching Methods
- Vector Space Model?: yes - initial ranking; uncis1 - yes for feedback iterations
- Probabilistic Model?: uncis2 - yes for feedback iterations

Factors in Ranking
- Term Frequency?: yes (within-document frequency in "lnc" weights; within-query frequency in "ltc" weights)
- Inverse Document Frequency?: yes (in "ltc" weights)
- Other Term Weights?: uncis2 - relevance term weights. Also see "Interactive Queries" section.
- Document Length?: yes (cosine normalization)

Machine Information
- Machine Type for TREC Experiment: Sun Ultra running Solaris 2.5 (Some initial work was done on a SPARCcenter 1000 running Solaris 2.5. All times, however, are from the Sun Ultra unless otherwise stated. All feedback iterations were done on the Sun Ultra. The Ultra was substantially faster than the SPARCcenter 1000.)
- Was the Machine Dedicated or Shared: Shared (but we were primary users)
- Amount of Hard Disk Storage (in MB): Ultra - 14 GB (partitioned).
- Amount of RAM (in MB): Ultra - 128 MB (SPARCcenter - 512 MB)
- Clock Rate of CPU (in MHz): Ultra - 173 MHz

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: about 9 months
- Given appropriate resources
  - Could your system run faster?: Yes.
  - By how much (estimate)?: ??
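The "relevance term weights" listed for uncis2 under Factors in Ranking are not specified further in this summary. As one common instance of such a weight in probabilistic retrieval, the sketch below uses the Robertson/Sparck Jones formula with the usual 0.5 correction; treating it as the uncis2 weight is an assumption, and the names `rsj_weight` and `score_document` are hypothetical.

```python
import math


def rsj_weight(r, R, n, N):
    """Robertson/Sparck Jones relevance weight for one term.

    r : judged-relevant documents containing the term
    R : judged-relevant documents in total
    n : documents in the collection containing the term
    N : documents in the collection
    """
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (R - r + 0.5) * (n - r + 0.5)
    return math.log(numerator / denominator)


def score_document(doc_terms, term_weights):
    """Sum the relevance weights of the query terms present in the document."""
    present = set(doc_terms)
    return sum(w for t, w in term_weights.items() if t in present)
```

A term that occurs in every judged-relevant document but rarely elsewhere receives a large positive weight, while a term distributed evenly between relevant and non-relevant documents receives a weight near zero.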