The ConQuest System

Paul E. Nelson
VP of Research & Development
CONQUEST™ SOFTWARE
9705 Patuxent Woods Drive, Columbia, Maryland 21046
(410)-290-6290

Introduction

ConQuest Software has a commercially available text search and retrieval system* called "ConQuest" (for Concept Quest). ConQuest is primarily an advanced, statistically based search system, with processing enhancements drawn from the field of Natural Language Processing (NLP).

* For additional information on ConQuest, please contact the author.

ConQuest participated in Category A of TREC, and so produced results for 50 test queries over the entire 2.3-gigabyte database. In this category, we constructed queries and submitted results for two different ranking functions. These two functions tested the difference between local and global document relevancy, and are fully described later.

In TREC-2, ConQuest had a very strong showing. Our recall scores in particular improved by about 18 percentage points over the adjusted TREC-1 scores. Our precision scores were also very competitive.

The purpose of this paper is to discuss how we prepared for TREC-2: how queries were performed, what initial judgments were made and why, and how the results should be interpreted. I will then cover the tests which were performed after TREC-2, and how these tests clearly identify the areas where ConQuest could most effectively be improved.

System Architecture

For a complete discussion of the system architecture of ConQuest, see the TREC-1 conference proceedings, or call the author. The following overview is meant as a brief refresher.

ConQuest uses pre-built indexes to perform text database searches at fast speeds. In such a system, all text to be searched must first be indexed. These indexes are then used for all searching; the original document data is not required.

ConQuest uses a dictionary augmented with a semantic network for both indexing and queries. The dictionary is a list of words where each word contains multiple meanings. Each meaning contains syntactic information (part-of-speech, feature values) and a dictionary definition.

The semantic network contains nodes which correspond to meanings of words. These nodes are linked to other related nodes. Relationships between nodes are extracted from machine-readable dictionaries. Some example relationship types include synonym, antonym, child-of, parent-of, related-to, part-of, substance-of, contrasting, and similar-to.

The ConQuest dictionary was generated automatically from several commercially available Machine Readable Dictionary (MRD) sources. This gives ConQuest the most robust and thorough coverage of English available. It is this completeness of coverage that drives performance gains in recall and precision.

Since ConQuest is a commercially available product, many additional components not required for TREC-2 are also available, such as true client/server operation, graphical user interfaces, routing and dissemination, and sophisticated application program interfaces.
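The paper does not spell out how the dictionary and semantic network are represented internally. As a rough illustration of the description above (nodes for individual word meanings, typed links such as synonym or part-of, and a strength on each link), a minimal sketch might look like the following; the class names, relation labels, and toy data are hypothetical, not ConQuest's actual structures.

```python
from dataclasses import dataclass, field

@dataclass
class Meaning:
    """One node in the semantic network: a single sense of a word."""
    word: str
    sense: str                  # short gloss of this meaning
    part_of_speech: str
    # typed links to related meanings, stored as (relation, Meaning, strength)
    links: list = field(default_factory=list)

def related_terms(meaning, min_strength=0.3):
    """Collect related terms for one chosen meaning, as used for query expansion."""
    terms = {}
    for relation, other, strength in meaning.links:
        if strength >= min_strength:
            # keep the strongest evidence if a term is reachable via several links
            terms[other.word] = max(strength, terms.get(other.word, 0.0))
    return terms

# Toy example: two senses of "bank", each linked to different neighbours.
river_bank = Meaning("bank", "sloping land beside water", "noun")
money_bank = Meaning("bank", "financial institution", "noun")
shore      = Meaning("shore", "land along a body of water", "noun")
lender     = Meaning("lender", "one who lends money", "noun")

river_bank.links.append(("synonym", shore, 0.9))
money_bank.links.append(("related-to", lender, 0.7))

print(related_terms(money_bank))   # {'lender': 0.7}
```

Keeping expansions attached to individual meanings rather than to whole words is what later allows a user to restrict a query to one sense of an ambiguous word.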
Query

Generally speaking, ConQuest attempts to refine and enhance the user's query. The result is then matched against the indexes to look for documents which contain similar concepts. Queries are not "understood" in the traditional sense of natural language processing. ConQuest makes no attempt to deeply understand the objects in the query, their interactions, or the user's intent. Rather, ConQuest attempts to understand the meaning of each individual word and the importance of that word. It then uses the set of meanings and their related terms (retrieved from the semantic network) as a statistical set which is matched against document information stored in the indexes.

[Figure 1: The Query Process]

The following is a description of the modules used for query:

* Tokenize: Divides a string of characters into words.

* Morphology: An advanced form of stemming; attempts to remove suffixes and perform spelling changes to reduce words to simpler forms which are found in the dictionary. For example, one morphology rule will take "babies," strip the "ies," add "y," and produce "baby," which is found in the dictionary.

* Find Idioms: This module finds idioms in the text and indexes each idiom as a single unit. This prevents idioms such as "Dow Jones Industrial Average" from getting confused with queries on "industrial history." Words inside of idioms can still be located individually, if desired.

* Query Enhancement: The user is given the opportunity to enhance the query for additional improvement in precision and recall. There are many options available here, but the two most important are to choose meanings and to weight query terms. Choosing a meaning of a word restricts the expansion of that word to only those related terms which are relevant to the chosen meanings. This reduces noise in the query. When running in automatic mode, ConQuest expands all meanings of all words. Weighting query terms identifies the importance of the various words in the query. These weights are used by the search engine when ranking documents and computing document relevance factors.

* Remove Stop Words: Small function words, such as determiners, conjunctions, auxiliary verbs, and small adverbs, are removed from the query.

* Expand Meanings: Words in the query are expanded to include related terms.

* Search and Rank: ConQuest uses an integrated search and rank algorithm (described in the next section) which considers the relevance rankings of documents throughout the search process. Since ranking and search are integrated, the search engine automatically produces the most relevant documents right away.

Queries can be expanded to a very large number of terms, if desired. If the user wishes for the greatest amount of recall, a 5-word query can be expanded to 200 or 300 related terms.

Many other query features are also available in ConQuest, including wildcards, fuzzy spelling expansion, numeric and date range searching, Boolean search, mixed Boolean and statistical search, fielded searching (of a variety of types), and searching over document categories.
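Taken together, the modules above form a pipeline from raw query text to an expanded set of weighted search terms. The sketch below strings together simplified stand-ins for the tokenize, morphology, stop-word, and expansion steps; the rule tables, the expansion function, and the weights are invented for illustration and are not ConQuest's actual code.

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "is", "are", "was", "were"}
# toy morphology rules: (suffix to strip, replacement), e.g. "babies" -> "baby"
MORPH_RULES = [("ies", "y"), ("ing", ""), ("s", "")]

def tokenize(text):
    """Divide a string of characters into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def morphology(word, dictionary):
    """Strip suffixes until the word (or a reduced form) is found in the dictionary."""
    if word in dictionary:
        return word
    for suffix, replacement in MORPH_RULES:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in dictionary:
                return candidate
    return word

def build_query_terms(text, dictionary, expand):
    """Tokenize, normalize, drop stop words, then add related terms for each word."""
    terms = {}
    for token in tokenize(text):
        if token in STOP_WORDS:
            continue
        root = morphology(token, dictionary)
        terms[root] = max(terms.get(root, 0.0), 1.0)                  # original query word
        for related, strength in expand(root).items():
            terms[related] = max(terms.get(related, 0.0), strength)   # related term
    return terms

# Toy usage with a stand-in expansion function.
dictionary = {"baby", "automobile", "car", "vehicle", "accident"}
expand = lambda w: {"car": 0.9, "vehicle": 0.7} if w == "automobile" else {}
print(build_query_terms("the automobile accidents", dictionary, expand))
```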
Ranking Factors

Ranking and retrieval with ConQuest use a variety of statistics and criteria, which are flexible and can be modified to handle varying requirements. The following are some of the factors used in ranking:

* Completeness: A good document should contain at least one term or related term for each word in the original query.

* Contextual Evidence: Words are supported by their related terms. If a document contains a word and its related terms, then the word is given a higher weight because it is surrounded by supporting evidence.

* Semantic Distance: The semantic network contains information on how closely two terms are related.

* Proximity: A document is considered to be more relevant if it contains matching terms which occur close together, preferably in the same paragraph or sentence.

* Quantity: The absolute quantity of hits in the document is also included, but is not as strong a discriminator of relevance as the other factors.

ConQuest is the first truly "concept-based" search system to operate over unrestricted domains. Using the "contextual evidence" factor above, if a document contains a word and some of its related terms, the word is more likely to be used in the correct context. In this way, ConQuest can determine word meanings at query time.

Coarse and Fine Grain Ranking

To further improve retrieval speed, ConQuest performs the search in two phases. The first is "coarse-grain." This phase is integrated with the document search process. Documents are output from the ConQuest search engine in descending coarse-grain rank order. To compute the coarse-grain rank for a document, the statistics for the words contained in the document are combined using the coarse-grain ranking function. The inputs to this function include the semantic network strength of each word, frequency in the query, expansion terms, inverse document frequency, and query structure.

Once a document is found using coarse-grain rank, a second phase of relevancy ranking is applied, called "fine-grain" rank. This second phase uses a different ranking function which has access to more local information within the document. The inputs to this function include all of the inputs used in coarse-grain ranking, plus word location, proximity, frequency in the document, and document structure.

[Figure 2: Fine and Coarse Grain Ranking]

In general, the coarse-grain rank of a document represents global information about the document. It is a score that applies more to the document as a whole. The coarse-grain rank will be high for a document if it contains a large number of query words and related terms, ignoring the position of those terms in the document. The fine-grain rank, on the other hand, represents local information, because the proximity (physical closeness) of the terms is the strongest contributor. The fine-grain rank of a document will be high if there is a single strong reference contained in the document. As shown in Figure 2 above, the final document score is computed as a combination of the coarse- and fine-grain scores.
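A minimal sketch of such a combination, assuming a normalized global score, a proximity-window local score, and a simple weighted blend (the actual ConQuest functions, inputs, and weights are not given in this paper), might look like the following:

```python
def coarse_grain_score(doc_terms, query_weights, idf):
    """Global score in [0, 1]: weighted fraction of the query vocabulary that
    appears somewhere in the document, ignoring term positions."""
    total = sum(query_weights[t] * idf.get(t, 1.0) for t in query_weights)
    hit = sum(query_weights[t] * idf.get(t, 1.0) for t in query_weights if t in doc_terms)
    return hit / total if total else 0.0

def fine_grain_score(term_positions, query_terms, window=20):
    """Local score in [0, 1]: the best fraction of query terms found together
    within a small window of word positions."""
    query_terms = list(query_terms)
    positions = sorted((pos, term) for term in query_terms
                       for pos in term_positions.get(term, []))
    best = 0.0
    for i, (start, _) in enumerate(positions):
        seen = {term for pos, term in positions[i:] if pos - start <= window}
        best = max(best, len(seen) / len(query_terms))
    return best

def final_rank(coarse, fine, alpha=0.4):
    """Weighted blend of global (coarse-grain) and local (fine-grain) evidence."""
    return alpha * coarse + (1.0 - alpha) * fine

# Toy usage: a document containing "oil" and "spill" near each other.
doc_terms = {"oil", "spill", "tanker"}
positions = {"oil": [4, 57], "spill": [5], "tanker": [3]}
weights = {"oil": 1.0, "spill": 1.0, "cleanup": 0.5}
print(final_rank(coarse_grain_score(doc_terms, weights, idf={}),
                 fine_grain_score(positions, weights.keys())))
```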
Pre-TREC Experiments

In preparation for TREC-2, ConQuest performed numerous experiments to improve the coarse-grain ranking algorithms and data. These experiments included the following:

1. Statistical word studies (statistical regressions to predict the probability that a document containing a word is relevant)
2. Statistical word-pair studies
3. Various weighting formulae
4. Various query structuring techniques

These studies were all performed under the assumption that the coarse-grain ranking formula used for TREC was weaker than the fine-grain ranking formula. The concern was that coarse-grain ranking did not retrieve a large enough percentage of relevant documents in the initial retrieval set. It was thought that once these documents were retrieved, the fine-grain algorithms would effectively use proximity and term frequency information to sort the documents and put all of the truly relevant ones at the top of the list.

Unlike other systems, ConQuest did not have funding for these TREC studies. This put the TREC studies in direct conflict with other more pressing concerns, such as supporting customers or providing new functionality such as client/server.

As a result, the testing from these early studies proved ambiguous and unreliable. We believe that this was due to the following:

* Since time and resources were limited, tests were performed on only a small number of queries (5-10). This did not provide a large enough sample set of queries to produce reliable test results.

* ConQuest never tested the original assumption that coarse-grain was the limiting step in improving accuracy.

* The queries for this testing were taken from the TREC-1 final test queries. However, many of these queries were hastily constructed and thus added noise to the test results.

Just before the TREC-2 results were due, ConQuest decided to concentrate most of its effort on improving the tools used to generate queries. The tools and processes created are described in the next section.

Generating Queries for TREC-2

Generating queries was primarily an automatic process, based on the initial TREC-2 topic descriptions. Manual input was used primarily to remove things: words, word meanings, and expansions. This produced queries with only the terms that are relevant. If needed, a user can also set weights for query terms.

Note that all manual steps were performed for all queries before any documents were retrieved. In other words, no feedback information was used in generating the queries. This makes ConQuest fully compliant with the rules for ad-hoc queries in TREC-2.

Automatic Query Generation Steps

A special program was created to convert TREC-2 topic descriptions into ConQuest query log files. The architecture of this program is shown in Figure 3.

[Figure 3: Program to Automatically Generate Query Log Files]

The modules in the program are as follows:

* Parse Topic: Reads through the topic looking for the SGML codes that mark each section. The locations within the topic of all words in the query are preserved in the final query log files.

* Tokenize: Divides up strings into tokens.

* Morphology: Locates all words in the dictionary and reduces them to root words if possible.

* Idiom Processing: Collects idioms together as single terms, such as "United States."

* Remove Stop Words: Removes conjunctions, determiners, auxiliary verbs, prepositions, etc.

* Remove Function Words: Removes words such as "document," "relevant," and "retrieve" which are used often in TREC-2 narratives but do not help retrieval.

* Expand Word Meanings: All word meanings are expanded using the ConQuest semantic network and all expansions are added to the query.

Note that all of these steps occur automatically with no manual input. The program also generates other statistics, such as the count of each term in the query, a count of each term for each section of the query (the sections being the topic, description, narrative, concepts, and factors), and the total number of words in the query.
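A minimal sketch of this conversion step, assuming TREC-style SGML section tags and small illustrative stop-word and function-word lists (the real tag set, word lists, and query log format are not given in the paper), might look like the following:

```python
import re
from collections import Counter

# Illustrative lists: words common in TREC-2 narratives that do not help retrieval.
FUNCTION_WORDS = {"document", "documents", "relevant", "retrieve", "discuss", "find"}
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "that", "will", "be"}

def parse_topic(text):
    """Split a topic into its sections using SGML-style tags such as <desc>...</desc>.
    The tag names here are assumptions, not taken from the paper."""
    sections = {}
    for tag in ("title", "desc", "narr", "con", "fac"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.S | re.I)
        if match:
            sections[tag] = match.group(1)
    return sections

def topic_to_query_terms(text):
    """Produce per-section term counts and an overall count for one topic."""
    per_section = {}
    for tag, body in parse_topic(text).items():
        tokens = [t for t in re.findall(r"[a-z]+", body.lower())
                  if t not in STOP_WORDS and t not in FUNCTION_WORDS]
        per_section[tag] = Counter(tokens)
    total = sum(per_section.values(), Counter())
    return per_section, total

topic = "<title>Oil spills</title><desc>Find documents that discuss oil spills.</desc>"
per_section, total = topic_to_query_terms(topic)
print(total)   # Counter({'oil': 2, 'spills': 2})
```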
Manual Query Generation Steps

There were two manual steps used to generate queries:

1. Remove words, word meanings, and/or expansions
2. Set term weights (if necessary)

Fortunately, ConQuest has graphical user interfaces (GUIs) for removing words, word meanings, and expansions from the automatically generated queries. A user merely brings up the query and uses the mouse to select items to be deleted.

In TREC-2, terms were not weighted in the traditional sense, but rather were categorized into three sets:

1. Terms that embody the entire query, which would make good search terms if used by themselves
2. Terms which embody a necessary portion of the query, but not the entire concept
3. All other related terms

These categories provide simple guidelines for setting term weights, which makes it much easier to generate queries. Evaluations using the TREC-2 test topics determined the functions that map these categories to actual term weights.

To emphasize once more, no document feedback was used for these manual steps. All query adjustments were performed without executing any query. Only after all queries were generated were the final results produced.

The TREC-2 Results

ConQuest scored very well in TREC-2. In particular, our recall percentages were quite high. Our average precision scores were not as good, but still competitive.

ConQuest submitted two sets of results for TREC-2, CnQst1 and CnQst2. Both sets used the same coarse-grain algorithm, which retrieved the best 5000 documents from the database. The difference between the two results was how these 5000 documents were sorted to derive the top 1000 documents which were used for the official results.

The first set (CnQst1) used fine-grain as the only sorting algorithm. This algorithm primarily depends on local proximity information, although word statistics and query structure are also incorporated.

The second set of results (CnQst2) was a weighted average of the fine-grain and coarse-grain statistics for each document. As it turned out, this combination of local (fine-grain) and global (coarse-grain) statistics provided significantly better results. The relatively modest addition of global information improved the results more than expected. Previous experience had always indicated that fine-grain information, especially the proximity test, was the strongest contributor to document relevancy.

Some additional insights can be extracted from topic analyses presented at the TREC-2 conference. Specifically, the topics where ConQuest excelled over other systems were also those which tended to have fewer relevant documents in the database. This indicates that local proximity statistics (used by ConQuest) are more important for these queries, since most other systems in TREC-2 are heavily weighted towards global document statistics. In other words, ConQuest appears to perform better for queries where one needs to find the "needle in the haystack."

Post TREC Analysis

After TREC-2, we had the chance to clean up our initial tests, gather new statistics, and perform some additional analysis. The first step in this process was to prove the accuracy of the coarse-grain algorithm. Remember that the initial tests attempted to improve the coarse-grain algorithm. But did the coarse-grain algorithm really need improvement? One indication that coarse-grain was accurate was provided by the CnQst2 run, which performed better than expected.

To check the coarse-grain rank, we constructed graphs which more clearly show its performance. Since fine-grain can only work on the results of the coarse-grain algorithm, what is the loss in recall for coarse-grain? The following graph shows the cumulative recall percentage as documents are retrieved in coarse-grain rank order. Every time a relevant document is retrieved, the recall percentage gradually inches up towards 100%. Note: these tests were run on just the Category B data.

[Figure 4: Cumulative Recall Percentage for Query #110 (documents retrieved from coarse-grain rank vs. cumulative recall)]
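Curves like the one in Figure 4 can be reproduced directly from a ranked document list and a set of relevance judgments. The short sketch below shows the computation in general terms; it is not the code used to produce the paper's figures.

```python
def cumulative_recall(ranked_doc_ids, relevant_ids):
    """Recall after each retrieved document: the fraction of all relevant
    documents seen so far as we walk down the coarse-grain ranking."""
    relevant_ids = set(relevant_ids)
    found = 0
    curve = []
    for doc_id in ranked_doc_ids:
        if doc_id in relevant_ids:
            found += 1
        curve.append(found / len(relevant_ids))
    return curve

# Toy example: 3 relevant documents, two of them retrieved early.
print(cumulative_recall(["d7", "d2", "d9", "d4", "d1"], {"d2", "d9", "d5"}))
# [0.0, 0.333..., 0.666..., 0.666..., 0.666...]
```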
Figure 4 shows two exciting discoveries. The first is that the coarse-grain performance achieves over 95% recall. This strongly contradicts our initial fears that coarse-grain was not retrieving enough relevant documents. The second discovery is that the high recall figures are achieved quickly. This implies that ConQuest can retrieve fewer documents (greatly improving speed) and still achieve high accuracy.

To further establish these claims, we repeated the analysis on all queries in the TREC-2 topic set, then averaged the results together, as shown in the next graph:

[Figure 5: Cumulative Recall as Documents are Retrieved using Coarse-Grain Rank]

This figure is an average over all queries. The average strongly correlates with the results from query #110. This verifies the two discoveries identified above.

Some initial studies also more clearly show the difference between fine-grain and coarse-grain sorting of documents. The following figure shows both graphs superimposed:

[Figure 6: Coarse-Grain Sorting vs. Fine-Grain Sorting for TREC-2 Topic #135]

In this diagram, we see that fine-grain sorting is in fact better than coarse-grain. In other queries, the results are more mixed. Clearly, the difference is not as great as was initially assumed. This suggests that the area where ConQuest can most improve is not the coarse-grain ranking algorithm, but rather the fine-grain algorithm, or a better combination of the two.

Upon further study, we believe we now know why. When the fine-grain algorithm was developed, the programmers assumed an average query length of about 5 words. Studies of typical users indicate that their preferred query type is a simple two or three word phrase. Therefore, a fine-grain rank tuned for short queries has provided the best and most accurate system for our commercial users. In TREC-2, however, the queries are often 40-50 words long. These analyses have shown us that our initial fine-grain algorithms are not as accurate for such long queries.

Future Analysis

Our tests indicate that further study of other fine-grain algorithms is where ConQuest can most likely improve its scores for TREC-3. New ways of looking at proximity and positional information in the document will be explored and compared against the existing coarse-grain ranking results. We still feel that our existing fine-grain algorithm is best for the typical commercial user, and we are looking for ways to fully test this hypothesis. Finally, we are now much more sensitive to the effects of query size on fine-grain algorithms and are looking more closely at ways to desensitize our fine-grain algorithms, or to adapt them easily to different query lengths.
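As a purely speculative illustration of this last point, one way a proximity-based fine-grain score could be made less sensitive to query length is to scale its proximity window and its completeness expectation with the number of query terms. The sketch below is an assumption-laden thought experiment, not ConQuest's algorithm.

```python
def length_adapted_fine_grain(term_positions, query_terms, base_window=20):
    """Proximity score whose window and completeness expectation grow with
    query length, so a 50-term TREC topic is not judged by the same
    standard as a 3-word commercial query."""
    query_terms = list(query_terms)
    n = len(query_terms)
    if n == 0:
        return 0.0
    window = base_window * max(1, n // 5)   # wider window for longer queries
    expected = min(n, 5)                    # only expect a handful of terms to co-occur
    positions = sorted((pos, term) for term in query_terms
                       for pos in term_positions.get(term, []))
    best = 0.0
    for i, (start, _) in enumerate(positions):
        seen = {term for pos, term in positions[i:] if pos - start <= window}
        best = max(best, min(1.0, len(seen) / expected))
    return best

# With a long query, finding 5 of its terms in one (widened) window already
# yields a full proximity score, instead of penalizing the document for the
# query terms that could never plausibly co-occur in one passage.
```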