System Summary and Timing

Name: University of Illinois
List of Run IDs: ispFa, ispFb, ispFc, ispaR

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures: Multidimensional "information space" constructed by principal components analysis (PCA) of a term-by-term co-occurrence matrix. Terms for the co-occurrence matrix are selected from the center of the frequency distribution of terms in the corpus. (Sketches of this construction appear after this section.)
- Length (in words) of the stopword list: 1500
- Controlled Vocabulary?: no
- Stemming Algorithm: terms with 2 or fewer letters are dropped; terms with more than 8 characters are truncated at 8 characters; terms ending in 's' or 'es' are stemmed. (See the tokenizer sketch after this section.)
- Morphological Analysis: none
- Term Weighting: none
- Phrase Discovery?:
- Tokenizer?:
  - Patterns which are tokenized: any non-alphabetic character (anything outside a-z) is dropped, including all punctuation, numbers, and tags

Statistics on Data Structures built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Use of Manual Labor
  - Initial Core Manually Built to "bootstrap" for Completely Machine-Built Completion: Yes. The only non-automatic portion is the selection of terms to use. My system was limited to about 2200 unique terms for the co-occurrence and PCA, so I experimented with different lower and upper frequency values to choose a cutoff for a particular collection. For example, the Ziff documents might have included terms with raw frequencies between 500 and 1800.
- Number of Concepts Represented: about 1800 to 2200 per information space. Note that "concepts" are single-word terms taken directly from the corpora.
- Type of Representation: metric multidimensional space
- Auxiliary Files Needed: none other than the stoplist; the document files themselves were used for retrieval (as opposed to building an index of the document files)

Special Routing Structures
- Run ID: ispaR
- Type of Structure: metric multidimensional space
- Total Storage (in MB): the co-occurrence matrix is about 60MB (plain ASCII); the multidimensional space coordinates file is about 1.5MB
- Total Computer Time to Build (in hours): 50 (28 to build the co-occurrence matrix on an HP 735 workstation with 256MB RAM; 2 to extract the information space via PCA on a Convex C-220 with 512MB RAM; another 20 to examine the routing documents and rank them back on the HP workstation)
- Automatic Process? (If not, number of manual hours): Yes
- Brief Description of Method: All training documents were examined. Co-occurrence scores for the pre-identified terms were calculated. A principal components analysis was performed on the co-occurrence matrix to build an information space. Then the routing documents were located in the space at the centroid of the space terms they contained. Routing documents were ranked according to their distance from the query in the space (the query was located the same way as any other document); the closest 1000 documents to the query were retrieved.

Query Construction

Automatically Built Queries (Ad-Hoc)
- Topic Fields Used: field tags were stripped. The entire query description (without tags, and processed to remove punctuation and non-alphabetic characters) was used as an ad-hoc query. Processing was entirely automatic.
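The tokenization and stemming rules above are simple enough to state as code. A minimal sketch in Python follows; the function name, the lowercasing step, and the exact ordering of the rules are my assumptions, since the report lists the rules without specifying their order.

    import re

    def tokenize_and_stem(text):
        """Sketch of the tokenizer/stemmer rules described above."""
        # Any non-alphabetic character (punctuation, digits, tag markup)
        # is dropped and acts as a token boundary; lowercasing is assumed.
        words = re.split(r"[^a-z]+", text.lower())
        terms = []
        for w in words:
            if len(w) <= 2:        # terms with 2 or fewer letters are dropped
                continue
            if len(w) > 8:         # terms longer than 8 characters are truncated
                w = w[:8]
            if w.endswith("es"):   # terms ending in 'es' ...
                w = w[:-2]
            elif w.endswith("s"):  # ... or 's' are stemmed
                w = w[:-1]
            terms.append(w)
        return terms

Truncation is applied before suffix stripping here because that is the order in which the rules are listed; the reverse order would yield slightly different terms.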
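The co-occurrence/PCA construction itself might look like the sketch below. This is not the author's code: document-level co-occurrence counting, SVD-based PCA, and the number of output dimensions are all assumptions; the report specifies only that PCA is applied to a term-by-term co-occurrence matrix over roughly 1800 to 2200 mid-frequency terms (the 500-1800 frequency cutoffs are the Ziff example quoted above).

    import numpy as np
    from collections import Counter

    def select_terms(doc_term_lists, lo=500, hi=1800):
        # Keep terms from the middle of the raw-frequency distribution.
        freq = Counter(t for doc in doc_term_lists for t in doc)
        return sorted(t for t, f in freq.items() if lo <= f <= hi)

    def information_space(doc_term_lists, terms, dims=10):
        # Build the term-by-term co-occurrence matrix, then extract
        # term coordinates by PCA (centering followed by SVD).
        index = {t: i for i, t in enumerate(terms)}
        n = len(terms)
        cooc = np.zeros((n, n))
        for doc in doc_term_lists:
            present = sorted({index[t] for t in doc if t in index})
            for a in present:              # count each co-occurring term pair
                for b in present:
                    if a != b:
                        cooc[a, b] += 1
        centered = cooc - cooc.mean(axis=0)
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        coords = centered @ vt[:dims].T    # term coordinates in the space
        return coords, index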
- Average Computer Time to Build Query (in CPU seconds): close to none
- Method used in Query Construction
  - Tokenizer?:
- Expansion of Queries using Previously-Constructed Data Structure?:

Automatically Built Queries (Routing)
- Topic Fields Used: same as for ad-hoc
- Method used in Query Construction
  - Terms Selected From
    - Topics: all routing topics
    - Only Documents with Relevance Judgments: Yes; in fact, only documents with POSITIVE relevance judgments (documents that had been judged relevant)
  - Term Weighting with Weights Based on terms in
  - Phrase Extraction from
  - Syntactic Parsing
  - Word Sense Disambiguation using
  - Proper Noun Identification Algorithm from
  - Tokenizer
  - Heuristic Associations to Add Terms from
  - Expansion of Queries using Previously-Constructed Data Structure:
  - Automatic Addition of Boolean connectors or Proximity Operators using information from

Searching

Search Times
- Run ID: ispaR
- Computer Time to Search (Average per Query, in CPU seconds): 3600
- Component Times:
  - Build information space: 80%
  - Locate documents in the space: 15%
  - Assess distance from documents to query: 4.5%
  - Rank distances to determine output: 0.5%

Machine Searching Methods
- Other: geometric distance (I've assumed you meant to include an 'Other' category here)

Factors in Ranking
- Other: geometric distance between terms or between documents in the information space. Queries are treated as documents. Distances between terms and documents are not used. (This ranking is sketched at the end of this summary.)

Machine Information
- Machine Type for TREC Experiment: HP 735 workstation, mostly (a Convex mainframe was also used for one phase)
- Was the Machine Dedicated or Shared: shared
- Amount of Hard Disk Storage (in MB): 6000
- Amount of RAM (in MB): 256
- Clock Rate of CPU (in MHz): 60

System Comparisons
- Amount of "Software Engineering" which went into the Development of the System: 150 hours specifically for the TREC-5 experiments
- Given appropriate resources
  - Could your system run faster?: yes
  - By how much (estimate)?: 1,000 to 100,000 times faster. This estimate is based on my techniques being at a comparable level of computational complexity to Latent Semantic Indexing.
- Features the System is Missing that would be beneficial: standards for meta-data (e.g., to make archives of information spaces useful in the future); a unified interface (instead of 8-10 separate programs); mathematical techniques for "folding in" new data into an information space, rather than re-calculating it

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by the above questions:
  Positive features: the information space is useful for visualization. It is also useful as a thesaurus, perhaps in conjunction with other methods for IR.
  Negative features: the co-occurrence process is unwieldy, and the number of terms that may be included is too small (however, I estimate that at least a 5x increase, from 2K to 10K terms, is possible with only slightly better hardware and coding). The PCA technique used to extract the information space is not yet feasible in a reasonable time except on fairly heavy-duty machines.
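The centroid placement and distance ranking described under "Brief Description of Method" and "Factors in Ranking" might be sketched as follows. Euclidean distance and the helper names are my assumptions; the report says only that documents and queries are points in a metric multidimensional space and are ranked by geometric distance.

    import numpy as np

    def locate(term_list, coords, index):
        # A document (or a query -- queries are treated as documents) sits
        # at the centroid of the coordinates of the space terms it contains.
        rows = [coords[index[t]] for t in term_list if t in index]
        return np.mean(rows, axis=0) if rows else None

    def rank(query_terms, docs, coords, index, k=1000):
        # Rank documents by distance from the query point; the closest
        # 1000 were retrieved in the official runs.  'docs' maps document
        # ids to term lists (a hypothetical layout for this sketch).
        q = locate(query_terms, coords, index)
        scored = []
        for doc_id, terms in docs.items():
            p = locate(terms, coords, index)
            if p is not None:
                scored.append((float(np.linalg.norm(p - q)), doc_id))
        scored.sort()                      # smallest distance first
        return [doc_id for _, doc_id in scored[:k]]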