Site Report for the Text REtrieval Conference by ConQuest Software Inc.

Paul E. Nelson, VP of Research & Development
ConQuest Software Inc., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland 21046
(410) 290-7150

Introduction

ConQuest Software has a commercially available text search and retrieval system* called "ConQuest" (for Concept Quest). ConQuest is primarily an advanced statistically based search system, with processing enhancements drawn from the field of Natural Language Processing (NLP). ConQuest participated in Category A of TREC, and so produced results for 50 test queries over the entire 2.3 Gigabyte database. In this category, we constructed queries and submitted results for two methods: Method 1, where queries were automatically generated from the TREC topics, and Method 2, where queries were manually constructed by the software engineers at ConQuest.

We were extremely pleased with our performance in TREC, as the Category A system with the highest 11-point averages. This performance is not indicative of our full potential, however, for the system is still relatively young. We are continuing to evaluate, test, and tune our dictionaries, ranking algorithms, and search methods.

This paper describes our background, the system architecture used for TREC, some features of ConQuest, and a discussion of the results. It is written for those with a background in computers and some exposure to artificial intelligence and text retrieval.

Background

ConQuest has been working on text retrieval since 1988. From the beginning, we meant to use Natural Language Processing (NLP) to better understand the text database and improve retrieval accuracy. This was a natural approach, since both founders of ConQuest, Edwin Addison & Paul Nelson, teach NLP at Johns Hopkins University.
During development, we concentrated on solving the two biggest problems of Natural Language Processing: 1) most NLP requires large, hand-crafted knowledge bases, and 2) traditional techniques are not robust in the face of text errors.

To solve the first problem, we relied on machine-readable dictionaries and thesauri for all knowledge data. Machine-readable dictionaries provide ample information on syntax, word variations, and inflected forms. Thesauri and similar sources provide semantic information (word and meaning relationships), which was compiled into structured networks. Both sources were judged to be useful for text retrieval. Combining the resources, however, required significant engineering effort.

The second problem, that of robust processing, required work in two areas. First, NLP development was directed towards statistical approaches and away from rule-based approaches. Statistical approaches typically use heuristics or probabilities to provide confidence factors that accumulate evidence. Unlike rules, which are typically pass/fail, statistical approaches can handle unexpected or variable input without causing total system failure. But to fully solve the problem of robust processing, we had to have good, solid software engineering. Most NLP systems are ad-hoc affairs, often thrown together at the last minute and patched up. ConQuest has concentrated on producing concrete, bullet-proof software, even if this means delaying some advanced or exotic modules. Our philosophy is that simple programs, well executed, will always out-perform complex tools poorly done.

* For additional information on ConQuest, please contact the author.

System Architecture

ConQuest uses pre-built indexes to perform text database searches at high speed. In such a system, all text to be searched must first be indexed. The indexes built in this process can then be used by the text search engine to produce results.
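The index-then-search flow described above can be sketched as a toy inverted positional index. This is only a minimal illustration of the architecture under stated assumptions; the function names and sample documents are hypothetical, and the real ConQuest indexer applies morphology and idiom processing before any term reaches the index.

```python
from collections import defaultdict

def build_index(documents):
    """Build a simple inverted positional index: term -> {doc_id: [positions]}.

    Toy sketch of the pre-built index described in the text; real systems
    normalize words first and store the index on disk.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def search(index, term):
    """Look up every document (and word position) containing a term."""
    return {doc: list(positions) for doc, positions in index.get(term.lower(), {}).items()}

docs = {"d1": "text retrieval with pre-built indexes",
        "d2": "indexes make retrieval fast"}
idx = build_index(docs)
print(search(idx, "retrieval"))  # -> {'d1': [1], 'd2': [2]}
```

Because the index is built once up front, each query reduces to cheap dictionary lookups rather than a scan of the full text database, which is the trade-off the architecture is designed around.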
Both indexing and search use a dictionary with a semantic network to perform various NLP tasks.

[Figure 1: ConQuest System Architecture — queries flow through the search engine, which shares the dictionary and indexes with the indexer, to produce results.]

There are other modules in the ConQuest system not shown in Figure 1. These include the library manager, which is responsible for system parameters, database configuration, resource allocation, and physical partitioning of the indexes; and the dictionary editor, which can be used to edit words, meanings, links, and definitions.

The Dictionary

ConQuest uses a dictionary augmented with a semantic network to perform indexing and queries. The dictionary is a list of words, where each word contains multiple meanings. Each meaning contains syntactic information (part of speech, feature values) and a dictionary definition. The semantic network contains nodes which correspond to meanings of words. These nodes are linked to other related nodes. Relationships between nodes are extracted from machine-readable dictionaries. Some example relationship types include synonym, antonym, child-of, parent-of, related-to, part-of, substance-of, contrasting, and similar-to.

ConQuest uses the dictionary for morphological analysis (see below) and idiom processing. The semantic networks are used to expand the query to include related terms. Since connections are made between meanings of words, both in the dictionary and the semantic networks, processing is much more accurate compared to a simple thesaurus.

[Figure 2: Example Structures for the Dictionary & Semantic Networks — the dictionary entry for "abacus" lists meaning 1, "(n) a slab which forms the uppermost member of the capital of a column," and meaning 2, "(n) an instrument for performing calculations by sliding counters along rods"; in the semantic network, abacus:2 is linked to counter:4, computation:1, computer:2, and calculator:3.]

Indexing

ConQuest performs several steps of normalization and information extraction during indexing to help queries run faster.
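The dictionary and semantic network structures described above can be sketched as two simple tables, using the "abacus" example from Figure 2. This is a minimal sketch, not ConQuest's actual file formats; the sense identifiers and the single `related-to` relation are illustrative.

```python
# Dictionary: word -> list of (sense_id, part_of_speech, definition).
dictionary = {
    "abacus": [
        ("abacus:1", "n", "a slab which forms the uppermost member of "
                          "the capital of a column"),
        ("abacus:2", "n", "an instrument for performing calculations by "
                          "sliding counters along rods"),
    ],
}

# Semantic network: sense_id -> list of (relation_type, related sense_id).
# Links connect meanings, not surface words, which is what makes expansion
# more precise than a flat thesaurus.
semantic_net = {
    "abacus:2": [("related-to", "counter:4"),
                 ("related-to", "computation:1"),
                 ("related-to", "computer:2"),
                 ("related-to", "calculator:3")],
}

def related_terms(sense_id):
    """Expand one word meaning to its related senses via the network."""
    return [target for _relation, target in semantic_net.get(sense_id, [])]

print(related_terms("abacus:2"))
# -> ['counter:4', 'computation:1', 'computer:2', 'calculator:3']
```

Note how expanding `abacus:2` pulls in only calculation-related senses, while `abacus:1` (the architectural sense) expands to nothing here; keying the network on senses rather than words is what keeps the two meanings from contaminating each other.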
Words are normalized and enhanced before indexing with tokenization, morphology, and idiom processing.

[Figure 3: The Indexing Process — documents from the text database flow through Parse Document, Tokenize, Morphology, and Find Idioms (which consult the ConQuest dictionary) before being written to the indexes.]

The modules used for indexing are as follows:

* Parse Document: Looks for codes in the text database to locate fields such as title, headline, authors, etc. These fields can be indexed, ignored, or stored in a special database for fast access. Parse Document takes a command file which describes the structure of the text database to be indexed, and can handle a wide variety of different text file formats.

* Tokenize: Divides a string of characters into words. This may include special processing for dates, phone numbers, floating point numbers, hyphens, etc.

* Morphology: An advanced form of stemming, which attempts to remove suffixes and perform spelling changes to reduce words to simpler forms found in the dictionary. For example, one morphology rule will take "babies," strip the "ies," add "y," and produce "baby," which is found in the dictionary. Irregular forms of words are stored directly in the dictionary and are not subject to morphological analysis. Morphology is a much more accurate form of word reduction than stemming, because the dictionary can be used to validate the transformations. Morphology will not reduce proper nouns, and will produce much more accurate reductions, especially for words ending in "e."

* Find Idioms: Idioms are phrases which have a meaning beyond that of the individual words added together. For example, "kangaroo court" has nothing whatsoever to do with kangaroos. Proper nouns, such as "United States," also have a meaning beyond the sum of their component parts. This module finds idioms in the text and indexes each idiom as a single unit. This prevents idioms such as "Dow Jones Industrial Average" from getting confused with queries on "industrial history."
Words inside of idioms can still be located individually, if desired.

* Index: The final step is to store the reduced words and collected idioms into the indexes. The index is an inverted positional word index, which is conceptually similar to the index at the back of a textbook.

The speed of indexing by ConQuest has been measured at approximately 40 megabytes per hour. This evaluation was done on a Sun IPC SPARCstation (a 14 MIPS computer) with 32 megabytes of RAM. The same speed has been achieved on a 486 IBM-PC running at a clock rate of 33 MHz with 16 megabytes of RAM.

Query

The query process is more complex than indexing, due to word expansion and ranking. Generally speaking, ConQuest attempts to refine and enhance the user's query. The result is then matched against the indexes to look for documents which contain similar patterns.

Queries are not "understood" in the traditional sense of natural language processing. ConQuest makes no attempt to deeply understand the objects in the query, their interaction, or the user's intent. Rather, ConQuest attempts to understand the meaning of each individual word and the importance of the word. It then uses the set of meanings and their related terms (retrieved from the semantic networks) as a statistical set which is matched against document information stored in the indexes.

[Figure 4: The Query Process — the query passes through Tokenize, Morphology, Find Idioms, Query Enhancement, Remove Stop Words, and Expand Meanings (consulting the ConQuest dictionary and semantic networks), after which Search & Rank matches it against the indexes to retrieve documents.]

The following is a description of the modules used for query:

* Tokenize, Morphology, Find Idioms: These modules are the same as for indexing.

* Query Enhancement: The user is given the opportunity to enhance the query for additional improvement in precision and recall. There are many options available here, but the two most important are to choose meanings and to weight query terms.
Choosing a meaning of a word restricts expansion to only those related terms which are relevant to the chosen meanings. This reduces noise in the query. When running in automatic mode, ConQuest expands all meanings of all words. Weighting query terms identifies the importance of the various words in the query. These weights are used by the search engine when ranking documents and computing document relevance factors.

* Remove Stop Words: Small function words (such as determiners, conjunctions, auxiliary verbs, and small adverbs) are removed from the query, just as they were during indexing. Removing these terms makes queries faster, and also reduces ambiguous noise in the query.

* Expand Meanings: Words in the query are expanded to include related terms.

* Search and Rank: ConQuest uses an integrated search and rank algorithm which considers the relevance rankings of documents throughout the search process. Since ranking and search are integrated, the search engine automatically produces the most relevant documents right away. This is different from past approaches, which typically retrieve all matching documents and then rank and sort them as a separate step.

The speed of queries is based on many factors, including database size, amount of expansion, size of dictionaries, etc. Many of these parameters can be modified to achieve the appropriate trade-off between accuracy and speed. For TREC, a query over a 1 Gigabyte database with 15 terms took roughly 9 seconds to retrieve 500 documents on a Sparc-IPC with 32 megabytes of RAM. Equivalent speeds have been measured on 486 IBM-PC computers running at 33 MHz with 16 megabytes of RAM.

Queries can be expanded to a very large number of terms, if desired. If the user wishes for the greatest amount of recall, a 5-word query can be expanded to 200 or 300 related terms.
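The automatic-mode expansion described above (every meaning of every word, plus related terms, up to a cap) can be sketched as follows. The sense table, network entries, and the `max_terms` cap are all illustrative assumptions, not ConQuest's actual data or API; the 200-300 figure cited in the text is what motivates the cap.

```python
# Toy sense inventory and semantic network (hypothetical data).
senses = {"bank": ["bank:1", "bank:2"]}
semantic_net = {"bank:1": [("synonym", "shore:1")],
                "bank:2": [("related-to", "finance:1")]}

def expand_query(query_terms, max_terms=300):
    """Expand each query word to all of its senses plus their related
    terms, capping the total expansion size (automatic mode: no user
    has chosen among the meanings, so every sense is included)."""
    expanded = []
    for word in query_terms:
        for sense in senses.get(word, []):
            expanded.append(sense)
            expanded.extend(target for _rel, target
                            in semantic_net.get(sense, []))
    return expanded[:max_terms]

print(expand_query(["bank"]))
# -> ['bank:1', 'shore:1', 'bank:2', 'finance:1']
```

In interactive use, a user who picked only `bank:2` would cut `shore:1` out of the expansion, which is exactly the noise-reduction effect the text attributes to choosing meanings.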
Ranking

Ranking and retrieval with ConQuest use a variety of statistics and criteria, which are flexible and can be modified to handle varying requirements. The following are some of the factors used in ranking:

* Completeness: A good document should contain at least one term or related term for each word in the original query.

* Contextual Evidence: Words are supported by their related terms. If a document contains a word and its related terms, then the word is given a higher weight because it is surrounded by supporting evidence.

* Semantic Distance: The semantic network contains information on how closely two terms are related.

* Proximity: A document is considered to be more relevant if it contains matching terms which occur close together, preferably in the same paragraph or sentence.

* Quantity: The absolute quantity of hits in the document is also included, but is not as strong a discriminator of relevance as the other factors.

ConQuest is the first truly "concept-based" search system to operate over unrestricted domains. If a document contains a word and some of its related terms, the word is more likely to be used in the correct meaning, per the "contextual evidence" factor above. In this way, ConQuest can determine word meanings at query time.

Other Features

ConQuest contains a number of other search features, used to handle specific situations and search requirements. These were not used for the Text REtrieval Conference. Some of these features are listed below:

* Wildcards: Useful for locating misspellings in the queries or in the database. The user can specify a word with a wildcard, such as "compute*," and then choose which of the matching terms from the indexes should be included in the query. Query words derived from wildcards are not otherwise expanded.

* Boolean: The traditional boolean query mode is also available, and contains all of the basic operators. These include AND, OR, NOT, WITHIN, thesaurus expansion, and nested expressions.
* Query By Example: It is possible to submit an entire document as a query, which is called "query by example." Essentially, the document is used as an example, and documents similar to it are retrieved. This function works even when the example document is very large.

* Search Within: A query can be directed to search only within a small set of documents. The set of documents is often selected from a previous query. This function is also called "recursive" or "recurrent" search. It is especially useful when a statistical query searches over the results of an earlier boolean query, or vice versa.

* Numeric and Date Ranges: ConQuest can find documents which contain numbers or dates within a specified range. Numbers are subject to the standard proximity tests in the boolean and natural language queries, just like other words.

* Fielded Searches: A search can be restricted to any particular field in a document. For example, users often wish to search only over the "authors" field, and not over the full body of the text.

* Document Categories: Documents in ConQuest can be categorized as appropriate. Users can target searches to occur only over a single category, or over multiple categories.

Evaluation Results

ConQuest had the highest overall score in TREC for Category A systems (the full 2.5 Gigabyte database) using the 11-point averages. In comparing ConQuest to other systems, we found the following two graphs to be useful.

[Figure 5: ConQuest vs. All Systems for the 39 Non-Zero Queries, Category A, Manual Mode — a per-query bar chart on a scale from -100% (worst) to 100% (best).]

This first graph (figure 5) shows the results of ConQuest for the 39 non-zero queries (the remaining 11 queries had no relevant documents in the database). A score of 100% represents the best of all Category A systems, and a score of -100% represents the worst of all Category A systems.
A score of 0% represents average performance. The results show that ConQuest was best on five of the 39 queries, worst on one query, and well above average for most. Essentially, the area above average is much greater than the area below.

The next chart (figure 6) compares ConQuest to the next two highest Category A manual systems. This chart was originally produced in the TREC proceedings.

[Figure 6: Precision at Recall for the Top 3 Scorers, Category A, Manual Mode — precision (0.0 to 1.0) plotted against recall percentage for ConQuest and the next two systems.]

This chart shows ConQuest having better performance in the high-recall region (where a large percentage of the relevant documents are retrieved). We suspect that the performance in this region is enhanced by our aggressive expansion of terms coupled with the flexible ranking algorithm. Good performance in the high-recall region is important to us, because studies have shown that high recall is the most difficult problem for text retrieval systems. Many systems achieve relatively high precision by discarding documents, but very few achieve high recall.

Conclusions / Future

The performance of ConQuest exceeded all our expectations. The system is still young, and we fully expected that much more time would be required to rework the ranking algorithms and evaluate the system. These positive results from TREC indicate that we are well on our way to superior retrieval performance.

Since TREC, ConQuest has been working on a detailed analysis of all factors which drive ranking and retrieval. This includes the dictionary, morphology, idiom processing, term expansion, and the ranking engine itself. We feel that our performance on TREC is a good start, but that it does not represent our full potential. We look forward to TREC-2 to see how much further we can progress.