System Summary and Timing
  Organization Name: University of Central Florida
  List of Run ID's: UCF100 (routing run).  UCFSP0 (a run on Spanish Topics 1-25).
     UCFSP1 (ADHOC run on TREC-4 Spanish Topics 26-50, submitted for evaluation).
     UCFSP2 (extra ADHOC run on Spanish Topics 26-50). 

  Construction of Indices, Knowledge Bases, and other Data Structures 

    Methods Used to build Data Structures 

    - Controlled Vocabulary? :  Yes. 
    - Stemming Algorithm:   No, but for the Spanish runs (UCFSP0, UCFSP1, and
         UCFSP2) an auxiliary program was written to "expand" infinitives into 
         listings of other possible verb forms, and to similarly produce various 
         forms of certain adjectives, when given in their masculine singular form. 
    -  Phrase Discovery? :  Yes, but it is a manual process when training a
         filter for routing runs.  During training, known relevant and not-relevant
         documents are read and phrases that appear useful are collected.  Phrases 
         with similar meaning are then manually constructed and added to the collection.  
         For the Spanish ADHOC runs, phrases were sometimes extracted from the document
         collection by "grepping" single lines containing key words.  Lists of phrases
         with similar meaning were then constructed manually.  
    - Kind of Phrase:  Any kind of phrase (a sequence of words) that could be
         useful.  For example, "chief executive officer", "due to", "back to committee",
         "plan that would insure Americans", and "shut down trading".  
    - Method Used (statistical, syntactic, other):  Manual observation of viewed
         training text. 
    - Word Sense Disambiguation? :  Yes, but only for the Spanish run UCFSP2,
         where lists of words were developed whose presence would indicate that
         other search words were being used out of context.
    - Tokenizer? :      
    - Manually-Indexed Terms? :  Each topic has its own knowledge base which is
         derived from an Entity Relationship (ER) schema for the topic.  For each topic,
         the knowledge base primarily takes the form of one or more lists (files).  There
         are two types of files.  There is a synonym file for each structure component of
         the ER schema, and there is a domain file for each attribute specified in the ER
         schema.  A phrase (a sequence of words) can be an entry in a domain or synonym
         file.  Different forms of an entry (such as carry, carries, carried,...) are also
         put in these files.  These files are initially built from information found in a
         dictionary, a thesaurus, or any specialized reference source.  For routing runs,
         training text is manually viewed to make modifications to these files.  The
         knowledge base for a topic also includes an information (INF) file.  The INF
         file specifies the size of a window for evaluating text, along with the
         importance of the individual domain and synonym files for determining relevancy
         of the text in the window. 
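
  The "expansion" of Spanish infinitives and masculine singular adjectives
  mentioned under Stemming Algorithm could be sketched as follows.  This is a
  minimal illustration, not the auxiliary program itself: the "/v" and "/a"
  markers, the function names, and the small table of regular endings are all
  assumptions made for the example, and irregular verbs are not handled.

```python
# Sketch only: markers "/v" (infinitive) and "/a" (adjective) are hypothetical.
# Regular present-tense endings plus participle and gerund, per verb class.
VERB_ENDINGS = {
    "ar": ["o", "as", "a", "amos", "an", "ado", "ando"],
    "er": ["o", "es", "e", "emos", "en", "ido", "iendo"],
    "ir": ["o", "es", "e", "imos", "en", "ido", "iendo"],
}


def expand_verb(infinitive):
    """Return the infinitive followed by some regular forms built from its stem."""
    stem, ending = infinitive[:-2], infinitive[-2:]
    if ending not in VERB_ENDINGS:
        return [infinitive]
    return [infinitive] + [stem + e for e in VERB_ENDINGS[ending]]


def expand_adjective(adjective):
    """Return masculine/feminine, singular/plural forms of an adjective ending in -o."""
    if not adjective.endswith("o"):
        return [adjective]
    stem = adjective[:-1]
    return [adjective, stem + "a", stem + "os", stem + "as"]


def expand_entry(entry):
    """Expand one synonym or domain file entry if it carries a marker."""
    if entry.endswith("/v"):
        return expand_verb(entry[:-2])
    if entry.endswith("/a"):
        return expand_adjective(entry[:-2])
    return [entry]


print(expand_entry("hablar/v"))
print(expand_entry("rojo/a"))
```

  Unmarked entries pass through unchanged, so a list can mix expanded and
  literal entries.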

  Statistics on Data Structures built from TREC Text

  - Inverted index     
  - Clusters     
  - N-grams, Suffix arrays, Signature Files     
  - Knowledge Bases    
  - Run ID :  UCF100 (ROUTING).  UCFSP0, UCFSP1, and UCFSP2 (Spanish ADHOC). 
  - Use of Manual Labor      
  - Special Routing Structures     
  - Run ID :  UCF100.  UCFSP0, UCFSP1, and UCFSP2 are Spanish ADHOC runs, but
       also use the data structure and method described below.     
  - Type of Structure:  Hash table in memory to store entries in the synonym and
       domain files of a particular topic filter before beginning input text scan. 
  - Total Storage (in MB):  Insignificant. 
  - Total Computer Time to Build (in hours):  A few seconds. 
  - Automatic Process? (If not, number of manual hours):  Yes. 
  - Brief Description of Method:  Before a filter starts scanning input text
       documents, its synonym and domain files are read and each entry is placed in a
       memory resident hash table. 
  - Other Data Structures built from TREC text     
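
  The memory-resident hash table described above could be built along these
  lines.  This is a sketch under stated assumptions: one entry per line in each
  file, case-insensitive matching, and the file names "officer.syn" and
  "action.dom" are invented for the example.

```python
def build_table(files):
    """Map each entry (word or phrase) to the set of files that list it.

    `files` maps a file name to an iterable of lines, one entry per line
    (an assumed layout; the real file format is not specified here).
    """
    table = {}
    for name, lines in files.items():
        for line in lines:
            # Normalize case and whitespace so phrases compare consistently.
            entry = " ".join(line.lower().split())
            if entry:
                table.setdefault(entry, set()).add(name)
    return table


table = build_table({
    "officer.syn": ["chief executive officer\n", "CEO\n"],
    "action.dom": ["shut down trading\n"],
})
print(table["chief executive officer"])
```

  Keeping a set of file names per entry lets the same word or phrase belong to
  more than one synonym or domain file.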

  Data Built from Sources Other than the Input Text

  - Internally-built Auxiliary File    
  - Domain (independent or specific):  Domain specific.  A set of synonym and
       domain files is built for each topic. 
  - Type of File (thesaurus, knowledge base, lexicon, etc.):  Each file is a list
       of words or phrases (a sequence of words).  A synonym file is constructed for a
       component of an ER schema.  A domain file is constructed for an attribute in an
       ER schema.  Alternate forms of words or phrases are also placed in these files.
  - Total Storage (in MB):  For the fifty routing topics, total storage for the
       synonym and domain files was 606K.  For the Spanish runs (25 topics each), total
       storage for synonym and domain files was about 200K for the UCFSP0 run, and about
       600K apiece for the UCFSP1 and UCFSP2 runs.     
  - Number of Concepts Represented:  The concepts represented by a filter's
       synonym and domain files are ER schema entities, attributes, relationships,
       roles, subset predicates, specializations, generalizations, and categories.  
  - Type of Representation:  An Entity Relationship (ER) Schema for a topic.    
  - Total Manual Time to Build (in hours):  Manually building the synonym and
       domain files for a single topic took from three to fifty hours; the
       average was twenty hours.  About twenty filters were done in a rush.  We
       feel that about forty filters are still not as good as they could be if we
       had had more time. 
  - Use of Manual Labor       
  - Mostly Manually Built using Special Interface:  Yes, the files were manually
       built using only an editor.  Initially, some files were established using
       reference material such as a dictionary, a thesaurus, or any specific reference
       book.  For routing runs, the files were later modified after viewing training
       text.  A few special interfaces were used during the training process. 
  - Internally-built Auxiliary File    
  - Domain (independent or specific):  Domain specific.  An information (INF)
       file is built for each topic. 
  - Type of File (thesaurus, knowledge base, lexicon, etc.):  The INF file
       specifies the insertion criteria for a topic's ER schema.  It represents a
       statement of what is relevant to the topic. 
  - Total Storage (in MB):  Small, about 100 bytes each. 
  - Number of Concepts Represented:  The INF file specifies the size of a sliding
       window (the number of words) used to determine membership in specified
       combinations of synonym and domain files.  The importance of each synonym and
       domain file is also indicated in the INF file. 
  - Type of Representation:  The insertion criteria for an Entity Relationship
       schema.
  - Total Manual Time to Build (in hours):  A few minutes to establish one, but
       an hour or so of wait time to see how good the INF file was for the filter.  This
       year we tried to determine the best possible window size and the best domain and
       synonym file weights for each filter, but we did not have enough time.  For
       Spanish ADHOC runs, a few minutes to establish one, since no training is allowed.
  - Use of Manual Labor       
  - Mostly Manually Built using Special Interface:  An INF file is manually built
       using only an editor.  For routing runs, an INF file is usually modified after
       viewing training text.  In an INF file, the window size and weights of individual
       synonym and domain files were also modified by observing successive performance
       evaluations over training text.  This was done (when we had time) to obtain
       optimum performance of a filter over the training text.  We did not have enough
       time to build optimum filters.  We did not use all of the training data and we
       were rushed to finish for about twenty topics. 
  -  Externally-built Auxiliary File    

  Query construction

  Manually Constructed Queries (Ad-Hoc)

  - Topic Fields Used: All. 
  - Average Time to Build Query (in Minutes):  This is the time to sketch an ER
      schema for a topic (typically, a few minutes for short Spanish descriptions) plus
      the time to build synonym and domain files for the schema (average is ten hours)
      plus a few minutes to create the INF file for the topic. 
  - Type of Query Builder    
  - Domain Expert:  An undergraduate Computer Science student doing independent
      study researched and constructed all Spanish ADHOC queries. Several of this
      student's Mexican-American friends were consulted for information on Mexican
      dialog, politics, sports, culture, etc.  
  - Computer System Expert:  The same basic system was used for Spanish ADHOC as
      for routing.  The undergraduate student who constructed the Spanish queries
      modified the system to handle Spanish. 
  - Tools used to Build Query    
  - Knowledge Base Browser? :       
  - Other Lexical Tools? :  A lexical analyzer was used to recognize specially
      marked Spanish infinitives and adjectives in synonym and domain lists and expand
      these parts of speech to other forms.      
  - Method used in Query Construction    
  - Term Weighting? :  Yes. As with our routing experiments, weight is assigned
      to each synonym and domain file. 
  - Boolean Connectors (AND, OR, NOT)? :  Yes.  For the Spanish ADHOC runs, the
      filter uses OR logic when selecting the highest weight from weights generated by
      a series of weight patterns.  Negative weights in weight patterns implement a
      form of NOT.  Only the UCFSP2 run uses multiple patterns and negative weights.   
  - Proximity Operators? : Yes.  As in our routing experiments, a sliding window 
      of user-specified size (number of words) is used. 
  - Addition of Terms not Included in Topic? :  Yes, many!      
  - Source of Terms:  Any kind of reference material.  Some phrases were found by
      grepping single lines of text containing key words out of the document
      collection.    
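
  The OR and NOT behavior described under Boolean Connectors could be sketched
  as follows.  This is one reading of the description, not the filter's actual
  code: a "weight pattern" is modeled as a mapping from file name to integer
  weight, a pattern's score is the sum of the weights of the files matched in
  the window, and the file names are invented for the example.

```python
def pattern_score(pattern, matched_files):
    """Sum the weights of the pattern's files that matched in the window."""
    return sum(weight for name, weight in pattern.items() if name in matched_files)


def window_score(patterns, matched_files):
    """OR logic: take the highest score produced by any single pattern."""
    return max(pattern_score(p, matched_files) for p in patterns)


# Hypothetical patterns for one topic.  The negative weight acts as a form
# of NOT: a match in "out_of_context.syn" pulls the first pattern's score down.
patterns = [
    {"banco.syn": 5, "prestamo.dom": 3, "out_of_context.syn": -8},
    {"finanzas.syn": 4},
]

print(window_score(patterns, {"banco.syn", "prestamo.dom"}))
print(window_score(patterns, {"banco.syn", "out_of_context.syn"}))
```

  Taking the maximum rather than the sum means a window is judged by its best
  matching pattern, so unrelated patterns do not accumulate weight.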

  Manually Constructed Queries (Routing)

  - Topic Fields Used:  All. 
  - Average Time to Build Query (in Minutes):  This is the time to sketch an ER
      schema for a topic (this should be about one hour for topic descriptions like
      those for Topic 001 through Topic 200) plus the time to build synonym and domain
      files for the schema (average time was twenty hours) plus a few minutes to create
      the INF file for the topic.  For the fifty ROUTING queries, we started drawing ER
      diagrams in March.  There were close to twenty students doing filters for the
      ROUTING topics.  Explaining ER diagrams to each student was more difficult than
      anticipated.  By late April, we were not drawing ER diagrams.  The synonym and
      domain file concepts were still used because the students understood their 
      purpose and it helped them decide on what to search for in a filter. 
  - Type of Query Builder    
  - Computer System Expert:  The person constructing the synonym and domain files
      for a topic was an undergraduate student in a Computer Science independent study
      course. 
  - Tools used to Build Query    
  - Knowledge Base Browser? :       
  - Other Lexical Tools? :     
  - Data Used for Building Query from     
  - All Training Documents:  No, we used training documents from just the Vol 1
      and Vol 2 CDs.  We did not use Vol 3 training documents because not all of the
      ROUTING topics had training documents on Vol 3.  This was a mistake: the Vol 3
      training documents were probably the best, and part of the Vol 3 CD (the Ziff
      directory) was used for the ROUTING document collection, so it would have been
      extremely beneficial to train on the Vol 3 Ziff directory.  All of our filters
      for ROUTING topics in the range of Topic 051 through Topic 150 performed at or
      below the median.  These were the topics that had training documents in the Ziff
      directory of the Vol 3 CD, and we feel our filter performance was adversely
      affected because we did not use those training documents. 
  - Documents with Relevance Judgments: Yes. 
  - Other Sources:  Hardcopy references (such as a dictionary, a thesaurus, or a 
      specialized reference book)  were used.  During training, some documents were
      retrieved that had no definite relevance judgment, so these documents were read
      and used if the student felt they were relevant. 
  - Method used in Query Construction    
  - Term Weighting? :  A weight can be assigned to each synonym file and each
      domain file of a filter. 
  - Boolean Connectors (AND, OR, NOT)? :  A form of AND and NOT is used when a
      combination of synonym and domain files is specified.  A form of OR is used when
      different combinations of synonym and domain files are listed. 
  - Proximity Operators? :  The sliding window (number of words) to evaluate
      relevancy.
  - Addition of Terms not Included in Topic? :  Yes!      
  - Source of Terms:  Any kind of reference material and viewed training text.

  Searching

  Search Times

  - Search Times     
  - Run ID :  UCF100 (a ROUTING run). 
  - Computer Time to Search (Average per Query, in CPU seconds):  Since each
      routing query was a true filter that scanned across the entire document
      collection, we kept track of wall clock time.  To train filters across Vol 1 and
      Vol 2 document collections, we trained with Vol 1 in a CD drive and Vol 2 copied 
      to a hard drive.  Typically, four filters were run at once for an elapsed time of
      eight hours.  These were runs made on a Sun SPARCserver 690MP (4 processors).  We
      also trained filters by running them on RISC machines which accessed a hard drive
      copy of Vol 1 on a 386 PC running Linux and the hard drive copy of Vol 2 on the
      SPARCserver.  Only one filter was run per RISC machine.  If only one RISC machine
      was activated, a filter training run took nine hours.  Two RISC filters took 13
      hours.  Three RISC filters took 18 hours to finish.  For the run across the
      ROUTING document collection (stored in compressed form on a hard drive), five
      filters were run simultaneously and it took 3.5 hours for them to finish.  The
      runs were made on the SPARCserver.  The time varied depending on other use of the
      SPARCserver.  For the Spanish ADHOC runs no record was kept of CPU time, but it
      generally took about half an hour of real time to run from one to four filters
      simultaneously across the Spanish text using the SPARCserver, provided other
      network traffic was light. 
  - Component Times :  No component times were recorded; the filter is simply run
      across the document collection. 

  Machine Searching Methods

  - Machine Searching Methods   
  - Boolean Matching? :  Somewhat. 
  - Free Text Scanning? :  Yes. 
  - Other:  A window (number of words) was moved across a document collection,
      and the words in the window were evaluated against the insertion criteria for
      the Entity Relationship (ER) schema of a topic description.  This could be
      regarded as a form of Conceptual Graph Matching.
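
  The window scan described above could be sketched as follows.  This is a
  simplified illustration under several assumptions: text is tokenized on
  whitespace only, a window's score is the sum of the weights of the files
  matched inside it, and a document is scored by its best window.  The table
  contents, file names, and weights are invented for the example.

```python
def files_in_window(window, table, max_len):
    """Collect the files whose entries (words or phrases) occur in the window."""
    found = set()
    for i in range(len(window)):
        for n in range(1, max_len + 1):
            if i + n > len(window):
                break
            phrase = " ".join(window[i:i + n])
            found |= table.get(phrase, set())
    return found


def best_window_score(text, table, weights, size):
    """Slide a window of `size` words over the text and return the top score."""
    tokens = text.lower().split()  # naive tokenizer, punctuation not handled
    max_len = max((len(entry.split()) for entry in table), default=1)
    best = 0
    for start in range(max(1, len(tokens) - size + 1)):
        window = tokens[start:start + size]
        matched = files_in_window(window, table, max_len)
        score = sum(weights.get(name, 0) for name in matched)
        best = max(best, score)
    return best


table = {
    "chief executive officer": {"officer.syn"},
    "resigned": {"action.dom"},
}
weights = {"officer.syn": 3, "action.dom": 2}
text = "the chief executive officer of the firm resigned on monday"

print(best_window_score(text, table, weights, 8))  # both files fit in one window
print(best_window_score(text, table, weights, 4))  # window too small for both
```

  The window size controls how close the matched terms must be: with a window
  of four words the phrase and the verb never fall in the same window, so only
  one file's weight is counted at a time.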

  Factors in Ranking

  - Factors in Ranking   
  - Term Frequency? :  Yes. 
  - Other Term Weights? :  Yes.  Each synonym or domain file can be assigned an
      integer "importance" determined by optimum performance over training text.  We
      did not have enough time to determine optimum numbers.   
  - Position in Document? :  Yes, sliding window of words to evaluate. 
  - Proximity of Terms? :  Yes, sliding window of words to evaluate. 
  - Other:  1.  Number of synonym and domain files for a filter.  2.  Local 
      evaluations (in the window) and a global evaluation of the entire document are
      used.  3.  Multiple combinations of synonym and domain files are allowed for a
      filter.        

 

  Machine Information

  - Machine Type for TREC Experiment:
      For training:
        1. The Vol. 1 CD was copied to the hard drive of a PC running Linux (a
           public domain version of Unix) and functioning as an NFS node.
        2. The Vol. 2 CD was copied to the hard drive of a SPARCserver 690MP
           (4 processors).
        3. Students ran filters and viewed training text from 32 RISC 6000
           machines across a network.
      For the UCF100 ROUTING run:
        1. The ROUTING document collection was placed on the hard drive of a
           SPARCserver 690MP (4 processors).
        2. Final filter runs were made on the SPARCserver 690MP.
      For UCFSP0, UCFSP1, UCFSP2 (Spanish ADHOC runs):
        1. The Spanish text was copied onto the hard drive of the SPARCserver
           690MP.
        2. Final runs were made on the SPARCserver 690MP.
  - Was the Machine Dedicated or Shared:  Shared, except for the NFS node running
      Linux. 
  - Amount of Hard Disk Storage (in MB):  We had access to 1000 MB on the NFS
      node, and 1000 MB on the SPARCserver. 
  - Amount of RAM (in MB):  16 MB on each of the 32 RISC 6000 machines. 16 MB on
      the NFS node. 128 MB on the SPARCserver. 
  - Clock Rate of CPU (in MHz):  33 MHz for the NFS node.  Not known for the RISC
      6000 machines.  Not known for the SPARCserver 690MP. 

  System Comparisons 

  - Amount of "Software Engineering" which went into the Development of the
      Filter System during 1994:
        80 hours: Purchase and install hardware and establish network access.
        160 hours: Design, code, and test the basic filter scanner.
        40 hours: Design, code, and test a few utilities for training.
        40 hours: Figure out the rules for drawing an "atomic" ER diagram.
      ROUTING:
        1000 hours: Establish synonym and domain files for the routing topics
           for the UCF100 run.
      Spanish ADHOC:
        200 hours: Develop lex program for Spanish verb expansion.
        2 hours: Modify filter to handle special Spanish characters.
        40 hours: Modify filter to choose between multiple patterns, rather than
           summing pattern weights, and to allow for negative weighting.
        500 hours: Establish ER schemas and related synonym and domain files for
           Spanish topics (50 in all).
        40 hours: Index Spanish documents and implement utility for viewing
           specific documents.  (Documents were read when doing the UCFSP0 run
           for Topic SP1 through Topic SP25.)
  - Given appropriate resources    
  - Could your system run faster? :  Yes. 
  - By how much (estimate)? :  It is a function of how many machines are
      available for running a filter, and how much traffic the network will tolerate.
      It might be possible to put a filter on each processor of a machine like the
      MASPAR, and in four iterations, filter the documents on a CD in about four
      minutes. 
  - Features the System is Missing that would be beneficial:
        1. A human-computer dialog interface to automate the development of an
           ER "atomic" schema from a person with a search request.
        2. Access to electronic dictionaries, thesauri, and reference material
           for initial filter construction from the ER schema.
        3. Utility programs to help train the filters using training documents
           and relevancy judgments.
        4. An interface for filter modification during interactive queries. 

  Significant Areas of System

  - Brief Description of features in your system which you feel impact the system
    and are not answered by above questions: