System Summary and Timing

Organization Name: University of Central Florida

List of Run ID's:
UCF100 (routing run).
UCFSP0 (a run on Spanish Topics 1-25).
UCFSP1 (ADHOC run on TREC-4 Spanish Topics 26-50, submitted for evaluation).
UCFSP2 (extra ADHOC run on Spanish Topics 26-50).

Construction of Indices, Knowledge Bases, and other Data Structures

Methods Used to build Data Structures
- Controlled Vocabulary? : Yes.
- Stemming Algorithm: No. For the Spanish runs (UCFSP0, UCFSP1, and UCFSP2), however, an auxiliary program was written to "expand" infinitives into lists of other possible verb forms, and similarly to produce the various forms of certain adjectives given in their masculine singular form.
- Phrase Discovery? : Yes, but it is a manual process when training a filter for routing runs. During training, known relevant and not-relevant documents are read and phrases that appear useful are collected. Phrases with similar meaning are then manually constructed and added to the collection. For the Spanish ADHOC runs, phrases were sometimes extracted from the document collection by "grepping" single lines containing key words. Lists of phrases with similar meaning were then constructed manually.
- Kind of Phrase: Any kind of phrase (a sequence of words) that could be useful. For example, "chief executive officer", "due to", "back to committee", "plan that would insure Americans", and "shut down trading".
- Method Used (statistical, syntactic, other): Manual observation of viewed training text.
- Word Sense Disambiguation? : Yes, but only for the Spanish run UCFSP2, for which lists of words were developed whose presence appeared to indicate that other search words were being taken out of context.
- Tokenizer? :
- Manually-Indexed Terms? : Each topic has its own knowledge base, which is derived from an Entity Relationship (ER) schema for the topic. For each topic, the knowledge base primarily takes the form of one or more lists (files). There are two types of files.
There is a synonym file for each structure component of the ER schema, and there is a domain file for each attribute specified in the ER schema. A phrase (a sequence of words) can be an entry in a domain or synonym file. Different forms of an entry (such as carry, carries, carried,...) are also put in these files. These files are initially built from information found in a dictionary, a thesaurus, or any specialized reference source. For routing runs, training text is manually viewed to make modifications to these files. The knowledge base for a topic also includes an additional information (INF) file. The INF file specifies the size of a window for evaluating text, along with the importance of the individual domain and synonym files in determining the relevancy of the text in the window.

Statistics on Data Structures built from TREC Text
- Inverted index
- Clusters
- N-grams, Suffix arrays, Signature Files
- Knowledge Bases
- Run ID : UCF100 (ROUTING). UCFSP0, UCFSP1, and UCFSP2 (Spanish ADHOC).
- Use of Manual Labor
- Special Routing Structures
- Run ID : UCF100. UCFSP0, UCFSP1, and UCFSP2 are Spanish ADHOC runs, but they also use the data structure and method described below.
- Type of Structure: An in-memory hash table that stores the entries of the synonym and domain files of a particular topic filter before the scan of the input text begins.
- Total Storage (in MB): Insignificant.
- Total Computer Time to Build (in hours): A few seconds.
- Automatic Process? (If not, number of manual hours): Yes.
- Brief Description of Method: Before a filter starts scanning input text documents, its synonym and domain files are read and each entry is placed in a memory-resident hash table.
- Other Data Structures built from TREC text

Data Built from Sources Other than the Input Text
- Internally-built Auxiliary File
- Domain (independent or specific): Domain specific; a set of synonym and domain files is built for each topic.
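The memory-resident hash table described above under Special Routing Structures might be sketched in Python as follows. This is only a minimal illustration, not the actual filter code: the file names, the helper names `build_phrase_table` and `match_at`, the lower-casing of entries, and the longest-match lookup are all assumptions made for the example.

```python
from collections import defaultdict

def build_phrase_table(files):
    """Load synonym/domain entries into a hash table (hypothetical helper).

    files maps a file name to its list of entries (words or phrases).
    Returns (table, max_len): table maps a tuple of lower-cased words to the
    set of file names listing that entry; max_len is the longest entry length
    in words.
    """
    table = defaultdict(set)
    max_len = 1
    for name, entries in files.items():
        for entry in entries:
            key = tuple(entry.lower().split())
            if not key:
                continue  # skip blank lines in a file
            table[key].add(name)
            max_len = max(max_len, len(key))
    return dict(table), max_len

def match_at(tokens, i, table, max_len):
    """Return (length, file names) for the longest entry starting at tokens[i].

    Returns (0, set()) when no entry starts at that position. Longest-match
    is an assumption; the source does not say how overlaps are resolved.
    """
    for n in range(min(max_len, len(tokens) - i), 0, -1):
        key = tuple(tokens[i:i + n])
        if key in table:
            return n, table[key]
    return 0, set()

# Hypothetical files for one topic filter, using phrases quoted in this form.
example_files = {
    "executive.syn": ["chief executive officer", "CEO"],
    "cause.dom": ["due to"],
}
example_table, example_max_len = build_phrase_table(example_files)
example_tokens = "the chief executive officer resigned due to losses".split()
```

Keying the table on word tuples lets a single lookup handle both single words and multi-word phrases, which matches the statement that a phrase can be an entry in either kind of file.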
- Type of File (thesaurus, knowledge base, lexicon, etc.): Each file is a list of words or phrases (a sequence of words). A synonym file is constructed for a component of an ER schema. A domain file is constructed for an attribute in an ER schema. Alternate forms of words or phrases are also placed in these files.
- Total Storage (in MB): For the fifty routing topics, total storage for the synonym and domain files was 606K. For the Spanish runs (25 topics each), total storage for synonym and domain files was about 200K for the UCFSP0 run, and about 600K apiece for the UCFSP1 and UCFSP2 runs.
- Number of Concepts Represented: The concepts represented by a filter's synonym and domain files are ER schema entities, attributes, relationships, roles, subset predicates, specializations, generalizations, and categories.
- Type of Representation: An Entity Relationship (ER) Schema for a topic.
- Total Manual Time to Build (in hours): Manually building the synonym and domain files for a single topic took from three to fifty hours; the average was twenty hours. About twenty filters were done in a rush. We feel that about forty filters are still not as good as they could have been with more time.
- Use of Manual Labor
- Mostly Manually Built using Special Interface: Yes, the files were manually built using only an editor. Initially, some files were established using reference material such as a dictionary, a thesaurus, or any specific reference book. For routing runs, the files were later modified after viewing training text. A few special interfaces were used during the training process.
- Internally-built Auxiliary File
- Domain (independent or specific): Domain specific, an information (INF) file. One is built for each topic.
- Type of File (thesaurus, knowledge base, lexicon, etc.): The INF file specifies the insertion criteria for a topic's ER schema. It represents a statement of what is relevant to the topic.
- Total Storage (in MB): Small, about 100 bytes each.
- Number of Concepts Represented: The INF file specifies the size of a sliding window (the number of words) used to determine membership in specified combinations of synonym and domain files. The importance of each synonym and domain file is also indicated in the INF file.
- Type of Representation: The insertion criteria for an Entity Relationship schema.
- Total Manual Time to Build (in hours): A few minutes to establish one, but an hour or so of wait time to see how good the INF file was for the filter. This year we tried to determine the best possible window size and the best domain and synonym file weights for a filter, but we still did not have enough time. For Spanish ADHOC runs, a few minutes to establish one, since no training is allowed.
- Use of Manual Labor
- Mostly Manually Built using Special Interface: An INF file is manually built using only an editor. For routing runs, an INF file is usually modified after viewing training text. The window size and the weights of individual synonym and domain files in an INF file were also modified by observing successive performance evaluations over training text. This was done (when we had time) to obtain optimum performance of a filter over the training text. We did not have enough time to build optimum filters: we did not use all of the training data, and we were rushed to finish about twenty topics.
- Externally-built Auxiliary File

Query construction

Manually Constructed Queries (Ad-Hoc)
- Topic Fields Used: All.
- Average Time to Build Query (in Minutes): This is the time to sketch an ER schema for a topic (typically a few minutes for short Spanish descriptions), plus the time to build synonym and domain files for the schema (ten hours on average), plus a few minutes to create the INF file for the topic.
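The role of the INF file (a window size plus an importance weight for each synonym and domain file) might be illustrated with the Python sketch below. It is a simplified illustration under stated assumptions, not the filter's actual scoring rule: the function name `score_document`, the file names, the restriction to single-word entries, and the choice to sum the weights of the distinct files hit in a window and keep the best window are all hypothetical.

```python
def score_document(tokens, term_files, weights, window_size):
    """Slide a fixed-size window over tokens and return the best window score.

    term_files maps a word to the synonym/domain files that list it.
    weights maps a file name to its integer importance (as an INF file would).
    A window's score is the sum of the weights of the distinct files that have
    at least one entry in the window (an assumed scoring rule).
    """
    best = 0
    last_start = max(1, len(tokens) - window_size + 1)
    for start in range(last_start):
        window = tokens[start:start + window_size]
        hit_files = {f for tok in window for f in term_files.get(tok, ())}
        score = sum(weights.get(f, 0) for f in hit_files)
        best = max(best, score)
    return best

# Hypothetical INF-style settings and file contents for one topic.
example_weights = {"merger.syn": 3, "company.dom": 2}
example_term_files = {
    "merger": ["merger.syn"],
    "acme": ["company.dom"],
    "firm": ["company.dom"],
}
example_tokens = "acme announced a merger with another firm today".split()
```

With these hypothetical settings, a window of four words covers both "acme" and "merger" at once, while a window of two words never does, which shows why the window size itself has to be tuned along with the weights.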
- Type of Query Builder
- Domain Expert: An undergraduate Computer Science student doing independent study researched and constructed all Spanish ADHOC queries. Several of this student's Mexican-American friends were consulted for information on Mexican dialog, politics, sports, culture, etc.
- Computer System Expert: The same basic system was used for Spanish ADHOC as for routing. The undergraduate student who constructed the Spanish queries modified the system to handle Spanish.
- Tools used to Build Query
- Knowledge Base Browser? :
- Other Lexical Tools? : A lexical analyzer was used to recognize specially marked Spanish infinitives and adjectives in synonym and domain lists and to expand these parts of speech into other forms.
- Method used in Query Construction
- Term Weighting? : Yes. As in our routing experiments, a weight is assigned to each synonym and domain file.
- Boolean Connectors (AND, OR, NOT)? : Yes. For the Spanish ADHOC runs, the filter uses OR logic when selecting the highest weight from the weights generated by a series of weight patterns. Negative weights in weight patterns implement a form of NOT. Only the UCFSP2 run uses multiple patterns and negative weights.
- Proximity Operators? : Yes. As in our routing experiments, a sliding window of user-specified size (number of words) is used.
- Addition of Terms not Included in Topic? : Yes, many!
- Source of Terms: Any kind of reference material. Some phrases were found by grepping single lines of text containing key words out of the document collection.

Manually Constructed Queries (Routing)
- Topic Fields Used: All.
- Average Time to Build Query (in Minutes): This is the time to sketch an ER schema for a topic (about one hour for topic descriptions like those for Topic 001 through Topic 200), plus the time to build synonym and domain files for the schema (twenty hours on average), plus a few minutes to create the INF file for the topic.
For the fifty ROUTING queries, we started drawing ER diagrams in March. Close to twenty students were building filters for the ROUTING topics. Explaining ER diagrams to each student was more difficult than anticipated, and by late April we were no longer drawing ER diagrams. The synonym and domain file concepts were still used because the students understood their purpose, and they helped the students decide what to search for in a filter.
- Type of Query Builder
- Computer System Expert: The person constructing the synonym and domain files for a topic was an undergraduate student in a Computer Science independent study course.
- Tools used to Build Query
- Knowledge Base Browser? :
- Other Lexical Tools? :
- Data Used for Building Query from
- All Training Documents: No, we used training documents from just the Vol 1 and Vol 2 CDs. We did not use Vol 3 training documents, because not all of the ROUTING topics had training documents on Vol 3. This was a mistake: the Vol 3 training documents were probably the best, and part of the Vol 3 CD (the Ziff directory) was used for the ROUTING document collection. It would have been extremely beneficial to train on the Vol 3 Ziff directory. All of our filters for ROUTING topics in the range of Topic 051 through Topic 150 performed at or below the median. These were the topics that had training documents in the Ziff directory of the Vol 3 CD, and we feel our filter performance was adversely affected because we did not use those training documents.
- Documents with Relevance Judgments: Yes.
- Other Sources: Hardcopy references (such as a dictionary, a thesaurus, or a specialized reference book) were used. During training, some documents were retrieved that had no definite relevance judgment, so these documents were read and used if the student felt they were relevant.
- Method used in Query Construction
- Term Weighting? : A weight can be assigned to each synonym file and each domain file of a filter.
- Boolean Connectors (AND, OR, NOT)? : A form of AND and NOT is used when a combination of synonym and domain files is specified. A form of OR is used when different combinations of synonym and domain files are listed.
- Proximity Operators? : The sliding window (number of words) to evaluate relevancy.
- Addition of Terms not Included in Topic? : Yes!
- Source of Terms: Any kind of reference material and viewed training text.

Searching

Search Times
- Search Times
- Run ID : UCF100 (a ROUTING run).
- Computer Time to Search (Average per Query, in CPU seconds): Since each routing query was a true filter that scanned across the entire document collection, we kept track of wall clock time.

To train filters across the Vol 1 and Vol 2 document collections, we trained with Vol 1 in a CD drive and Vol 2 copied to a hard drive. Typically, four filters were run at once for an elapsed time of eight hours. These runs were made on a Sun SPARCserver 690MP (4 processors).

We also trained filters by running them on RISC machines, which accessed a hard drive copy of Vol 1 on a 386 PC running Linux and the hard drive copy of Vol 2 on the SPARCserver. Only one filter was run per RISC machine. With only one RISC machine active, a filter training run took nine hours. Two RISC filters took 13 hours. Three RISC filters took 18 hours to finish.

For the run across the ROUTING document collection (stored in compressed form on a hard drive), five filters were run simultaneously and took 3.5 hours to finish. These runs were made on the SPARCserver. The time varied depending on other use of the SPARCserver.
For the Spanish ADHOC runs no record was kept of CPU time, but it generally took about half an hour of real time to run from one to four filters simultaneously across the Spanish text using the SPARCserver, provided other network traffic was light.
- Component Times : No component times; the filter is simply run across the document collection.

Machine Searching Methods
- Machine Searching Methods
- Boolean Matching? : Somewhat.
- Free Text Scanning? : Yes.
- Other: A window (a number of words) was moved across the document collection, and each window was evaluated with regard to the words that satisfied the insertion criteria for the Entity Relationship (ER) schema of a topic description. This could be considered Conceptual Graph Matching.

Factors in Ranking
- Factors in Ranking
- Term Frequency? : Yes.
- Other Term Weights? : Yes. Each synonym or domain file can be assigned an integer "importance" determined by optimum performance over training text. We did not have enough time to determine optimum numbers.
- Position in Document? : Yes, sliding window of words to evaluate.
- Proximity of Terms? : Yes, sliding window of words to evaluate.
- Other:
1. Number of synonym and domain files for a filter.
2. Local evaluations (in the window) and a global evaluation of the entire document are used.
3. Multiple combinations of synonym and domain files are allowed for a filter.

Machine Information
- Machine Type for TREC Experiment:
For training:
1. The Vol. 1 CD was copied to the hard drive of a PC running Linux (a public domain version of Unix) and functioning as an NFS node.
2. The Vol. 2 CD was copied to the hard drive of a SPARCserver 690MP (4 processors).
3. Students ran filters and viewed training text from 32 RISC 6000 machines across a network.
For the UCF100 ROUTING run:
1. The ROUTING document collection was placed on the hard drive of a SPARCserver 690MP (4 processors).
2. Final filter runs were made on the SPARCserver 690MP.
For UCFSP0, UCFSP1, UCFSP2 (Spanish ADHOC runs):
1.
The Spanish text was copied onto the hard drive of the SPARCserver 690MP.
2. Final runs were made on the SPARCserver 690MP.
- Was the Machine Dedicated or Shared: Shared, except for the NFS node running Linux.
- Amount of Hard Disk Storage (in MB): We had access to 1000 MB on the NFS node, and 1000 MB on the SPARCserver.
- Amount of RAM (in MB): 16 MB on each of the 32 RISC 6000 machines. 16 MB on the NFS node. 128 MB on the SPARCserver.
- Clock Rate of CPU (in MHz): 33 MHz for the NFS node. Not known for the RISC 6000 machines. Not known for the SPARCserver 690MP.

System Comparisons
- Amount of "Software Engineering" which went into the Development of the Filter System during 1994:
80 hours: Purchase and install hardware and establish network access.
160 hours: Design, code, and test the basic filter scanner.
40 hours: Design, code, and test a few utilities for training.
40 hours: Figure out the rules for drawing an "atomic" ER diagram.
ROUTING:
1000 hours: Establish synonym and domain files for the routing topics for the UCF100 run.
Spanish ADHOC:
200 hours: Develop lex program for Spanish verb expansion.
2 hours: Modify filter to handle special Spanish characters.
40 hours: Modify filter to choose between multiple patterns, rather than summing pattern weights, and to allow for negative weighting.
500 hours: Establish ER schemas and related synonym and domain files for Spanish topics (50 in all).
40 hours: Index Spanish documents and implement utility for viewing specific documents. (Documents were read when doing the UCFSP0 run for Topic SP1 through Topic SP25.)
- Given appropriate resources
- Could your system run faster? : Yes.
- By how much (estimate)? : It is a function of how many machines are available for running a filter, and how much traffic the network will tolerate. It might be possible to put a filter on each processor of a machine like the MASPAR, and in four iterations, filter the documents on a CD in about four minutes.
- Features the System is Missing that would be beneficial:
1. A human-computer dialog interface to automate the development of an ER "atomic" schema from a person with a search request.
2. Access to electronic dictionaries, thesauri, and reference material for initial filter construction from the ER schema.
3. Utility programs to help train the filters using training documents and relevancy judgments.
4. An interface for filter modification during interactive queries.

Significant Areas of System
- Brief Description of features in your system which you feel impact the system and are not answered by above questions: