System Summary and Timing

Name: University of Illinois
List of Run IDs: ispFa, ispFb, ispFc, ispaR

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures: Multidimensional "information space" constructed by principal components analysis (PCA) of a term-by-term co-occurrence matrix. Terms for the co-occurrence matrix are selected from the center of the frequency distribution of terms in the corpus. (Sketches of this construction appear after this section.)
- Length (in words) of the stopword list: 1500
- Controlled Vocabulary?: no
- Stemming Algorithm: terms with 2 or fewer letters are dropped; terms with more than 8 characters are truncated at 8 characters; terms ending in 's' or 'es' are stemmed. (See the tokenizer sketch after this section.)
- Morphological Analysis: none
- Term Weighting: none
- Phrase Discovery?:
- Tokenizer?:
  - Patterns which are tokenized: any non-alphabetic character (anything outside a-z) is dropped, including all punctuation, numbers, and tags

Statistics on Data Structures built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
  - Initial Core Manually Built to "bootstrap" for Completely Machine-Built Completion: Yes. The only non-automatic portion is the selection of terms to use. My system was limited to about 2200 unique terms for the co-occurrence and PCA, so I experimented with different lower and upper frequency values to choose a cutoff for a particular collection. For example, the Ziff documents might have included terms with raw frequencies between 500 and 1800.
- Number of Concepts Represented: about 1800 to 2200 per information space. Note that "concepts" are single-word terms taken directly from the corpora.
- Type of Representation: metric multidimensional space
- Auxiliary Files Needed: none other than the stoplist; the document files themselves were used for retrieval (as opposed to building an index of the document files)

Special Routing Structures
- Run ID: ispaR
- Type of Structure: metric multidimensional space
- Total Storage (in MB): the co-occurrence matrix is about 60MB (plain ASCII); the multidimensional space coordinates file is about 1.5MB
- Total Computer Time to Build (in hours): 50 (28 to build the co-occurrence matrix on an HP 735 workstation with 256MB RAM; 2 to extract the information space via PCA on a Convex C-220 with 512MB RAM; another 20 to examine the routing documents and rank them back on the HP workstation)
- Automatic Process? (If not, number of manual hours): Yes
- Brief Description of Method: All training documents were examined. Co-occurrence scores for the pre-identified terms were calculated. A principal components analysis was performed on the co-occurrence matrix to build an information space. Then the routing documents were located in the space at the centroid of the space terms they contained. Routing documents were ranked according to their distance from the query in the space (the query was located the same way as any other document); the closest 1000 documents to the query were retrieved.

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: field tags were stripped. The entire query description (without tags, and processed to remove punctuation and non-alphabetic characters) was used as an ad-hoc query. Processing was entirely automatic.
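The tokenization and stemming rules above are simple enough to state as code. A minimal sketch in Python follows; the function name, the lowercasing step, and the exact ordering of the rules are my assumptions, since the report lists the rules without specifying their order.

    import re

    def tokenize_and_stem(text):
        """Sketch of the tokenizer/stemmer rules described above."""
        # Any non-alphabetic character (punctuation, digits, tag markup)
        # is dropped and acts as a token boundary; lowercasing is assumed.
        words = re.split(r"[^a-z]+", text.lower())
        terms = []
        for w in words:
            if len(w) <= 2:        # terms with 2 or fewer letters are dropped
                continue
            if len(w) > 8:         # terms longer than 8 characters are truncated
                w = w[:8]
            if w.endswith("es"):   # terms ending in 'es' ...
                w = w[:-2]
            elif w.endswith("s"):  # ... or 's' are stemmed
                w = w[:-1]
            terms.append(w)
        return terms

Truncation is applied before suffix stripping here because that is the order in which the rules are listed; the reverse order would yield slightly different terms.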
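The co-occurrence/PCA construction itself might look like the sketch below. This is not the author's code: document-level co-occurrence counting, SVD-based PCA, and the number of output dimensions are all assumptions; the report specifies only that PCA is applied to a term-by-term co-occurrence matrix over roughly 1800 to 2200 mid-frequency terms (the 500-1800 frequency cutoffs are the Ziff example quoted above).

    import numpy as np
    from collections import Counter

    def select_terms(doc_term_lists, lo=500, hi=1800):
        # Keep terms from the middle of the raw-frequency distribution.
        freq = Counter(t for doc in doc_term_lists for t in doc)
        return sorted(t for t, f in freq.items() if lo <= f <= hi)

    def information_space(doc_term_lists, terms, dims=10):
        # Build the term-by-term co-occurrence matrix, then extract
        # term coordinates by PCA (centering followed by SVD).
        index = {t: i for i, t in enumerate(terms)}
        n = len(terms)
        cooc = np.zeros((n, n))
        for doc in doc_term_lists:
            present = sorted({index[t] for t in doc if t in index})
            for a in present:              # count each co-occurring term pair
                for b in present:
                    if a != b:
                        cooc[a, b] += 1
        centered = cooc - cooc.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        coords = centered @ vt[:dims].T    # term coordinates in the space
        return coords, index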
- Average Computer Time to Build Query (in CPU seconds): close to none
- Method used in Query Construction
  - Tokenizer?:
- Expansion of Queries using Previously-Constructed Data Structure?:

Automatically Built Queries (Routing)
- Topic Fields Used: same as for ad-hoc
- Method used in Query Construction
  - Terms Selected From
    - Topics: all routing topics
    - Only Documents with Relevance Judgments: Yes; in fact, only documents with POSITIVE relevance judgments (documents that had been judged relevant)
  - Term Weighting with Weights Based on terms in
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Searching

Search Times
- Run ID: ispaR
- Computer Time to Search (Average per Query, in CPU seconds): 3600
- Component Times:
  - Build information space: 80%
  - Locate documents in the space: 15%
  - Assess distance from documents to query: 4.5%
  - Rank distances to determine output: 0.5%

Machine Searching Methods
- Other: geometric distance (I've assumed you meant to include an 'Other' category here)

Factors in Ranking
- Other: geometric distance between terms or between documents in the information space. Queries are treated as documents. Distances between terms and documents are not used. (This ranking is sketched at the end of this summary.)

Machine Information
- Machine Type for TREC Experiment: HP 735 workstation, mostly (a Convex mainframe was also used for one phase)
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 6000
- Amount of RAM (in MB): 256
- Clock Rate of CPU (in MHz): 60

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: 150 hours specifically for the TREC-5 experiments
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 1,000 to 100,000 times faster. This estimate is based on my techniques being at a comparable level of computational complexity to Latent Semantic Indexing.
- Features the System is Missing that would be beneficial: standards for meta-data (e.g., to make archives of information spaces useful in the future); a unified interface (instead of 8-10 separate programs); mathematical techniques for "folding in" new data into an information space, rather than re-calculating it

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by the above questions:
  Positive features: the information space is useful for visualization. It is also useful as a thesaurus, perhaps in conjunction with other methods for IR.
  Negative features: the co-occurrence process is unwieldy, and the number of terms that may be included is too small (however, I estimate that at least a 5x increase, from 2K to 10K terms, is possible with only slightly better hardware and coding). The PCA technique used to extract the information space is not yet feasible in a reasonable time except on fairly heavy-duty machines.
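The centroid placement and distance ranking described under "Brief Description of Method" and "Factors in Ranking" might be sketched as follows. Euclidean distance and the helper names are my assumptions; the report says only that documents and queries are points in a metric multidimensional space and are ranked by geometric distance.

    import numpy as np

    def locate(term_list, coords, index):
        # A document (or a query -- queries are treated as documents) sits
        # at the centroid of the coordinates of the space terms it contains.
        rows = [coords[index[t]] for t in term_list if t in index]
        return np.mean(rows, axis=0) if rows else None

    def rank(query_terms, docs, coords, index, k=1000):
        # Rank documents by distance from the query point; the closest
        # 1000 were retrieved in the official runs.  'docs' maps document
        # ids to term lists (a hypothetical layout for this sketch).
        q = locate(query_terms, coords, index)
        scored = []
        for doc_id, terms in docs.items():
            p = locate(terms, coords, index)
            if p is not None:
                scored.append((float(np.linalg.norm(p - q)), doc_id))
        scored.sort()                      # smallest distance first
        return [doc_id for _, doc_id in scored[:k]]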