Site Report for the Text REtrieval Conference by ConQuest Software Inc.

Paul E. Nelson, VP of Research & Development
ConQuest Software Inc., 9700 Patuxent Woods Drive, Suite 140, Columbia, Maryland 21046
(410) 290-7150

Introduction

ConQuest Software has a commercially available text search and retrieval system* called "ConQuest" (for Concept Quest). ConQuest is primarily an advanced statistically based search system, with processing enhancements drawn from the field of Natural Language Processing (NLP). ConQuest participated in Category A of TREC, and so produced results for 50 test queries over the entire 2.3 Gigabyte database. In this category, we constructed queries and submitted results for two methods: Method 1, where queries were automatically generated from the TREC topics, and Method 2, where queries were manually constructed by the software engineers at ConQuest.

We were extremely pleased with our performance in TREC, as the Category A system with the highest 11-point averages. This performance is not indicative of our full potential, however, for the system is still relatively young. We are continuing to evaluate, test, and tune our dictionaries, ranking algorithms, and search methods.

This paper describes our background, the system architecture used for TREC, some features of ConQuest, and a discussion of the results. It is written for those with a background in computers and some exposure to artificial intelligence and text retrieval.

Background

ConQuest has been working on text retrieval since 1988. From the beginning, we meant to use Natural Language Processing (NLP) to better understand the text database and improve retrieval accuracy. This was a natural approach, since both founders of ConQuest, Edwin Addison & Paul Nelson, teach NLP at Johns Hopkins University.
During development, we concentrated on solving the two biggest problems of Natural Language Processing: 1) most NLP requires large, hand-crafted knowledge bases, and 2) traditional techniques are not robust in the face of text errors.

To solve the first problem, we relied on machine-readable dictionaries and thesauri for all knowledge data. Machine-readable dictionaries provide ample information on syntax, word variations, and inflected forms. Thesauri and similar sources provide semantic information (word and meaning relationships), which was compiled into structured networks. Both sources were judged to be useful for text retrieval. Combining the resources, however, required significant engineering effort.

The second problem, that of robust processing, required work in two areas. First, NLP development was directed towards statistical approaches and away from rule-based approaches. Statistical approaches typically use heuristics or probabilities to provide confidence factors that accumulate evidence. Unlike rules, which are typically pass/fail, statistical approaches can handle unexpected or variable input without causing total system failure. But to fully solve the problem of robust processing, we had to have good, solid software engineering. Most NLP systems are ad-hoc affairs, often thrown together at the last minute and patched up. ConQuest has concentrated on producing concrete, bullet-proof software, even if this means delaying some advanced or exotic modules. Our philosophy is that simple programs, well executed, will always out-perform complex tools poorly done.

* For additional information on ConQuest, please contact the author.

System Architecture

ConQuest uses pre-built indexes to perform text database searches at high speed. In such a system, all text to be searched must first be indexed. The indexes built in this process can then be used by the text search engine to produce results.
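The index-then-search flow described above can be sketched as a toy inverted positional index. This is only a minimal illustration of the architecture under stated assumptions; the function names and sample documents are hypothetical, and the real ConQuest indexer applies morphology and idiom processing before any term reaches the index.

```python
from collections import defaultdict

def build_index(documents):
    """Build a simple inverted positional index: term -> {doc_id: [positions]}.

    Toy sketch of the pre-built index described in the text; real systems
    normalize words first and store the index on disk.
    """
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def search(index, term):
    """Look up every document (and word position) containing a term."""
    return {doc: list(positions) for doc, positions in index.get(term.lower(), {}).items()}

docs = {"d1": "text retrieval with pre-built indexes",
        "d2": "indexes make retrieval fast"}
idx = build_index(docs)
print(search(idx, "retrieval"))  # -> {'d1': [1], 'd2': [2]}
```

Because the index is built once up front, each query reduces to cheap dictionary lookups rather than a scan of the full text database, which is the trade-off the architecture is designed around.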
Both indexing and search use a dictionary with a semantic network to perform various NLP tasks.

[Figure 1: ConQuest System Architecture — queries flow through the search engine, which shares the dictionary and indexes with the indexer, to produce results.]

There are other modules in the ConQuest system not shown in Figure 1. These include the library manager, which is responsible for system parameters, database configuration, resource allocation, and physical partitioning of the indexes; and the dictionary editor, which can be used to edit words, meanings, links, and definitions.

The Dictionary

ConQuest uses a dictionary augmented with a semantic network to perform indexing and queries. The dictionary is a list of words, where each word contains multiple meanings. Each meaning contains syntactic information (part of speech, feature values) and a dictionary definition. The semantic network contains nodes which correspond to meanings of words. These nodes are linked to other related nodes. Relationships between nodes are extracted from machine-readable dictionaries. Some example relationship types include synonym, antonym, child-of, parent-of, related-to, part-of, substance-of, contrasting, and similar-to.

ConQuest uses the dictionary for morphological analysis (see below) and idiom processing. The semantic networks are used to expand the query to include related terms. Since connections are made between meanings of words, both in the dictionary and the semantic networks, processing is much more accurate compared to a simple thesaurus.

[Figure 2: Example Structures for the Dictionary & Semantic Networks — the dictionary entry for "abacus" lists meaning 1, "(n) a slab which forms the uppermost member of the capital of a column," and meaning 2, "(n) an instrument for performing calculations by sliding counters along rods"; in the semantic network, abacus:2 is linked to counter:4, computation:1, computer:2, and calculator:3.]

Indexing

ConQuest performs several steps of normalization and information extraction during indexing to help queries run faster.
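The dictionary and semantic network structures described above can be sketched as two simple tables, using the "abacus" example from Figure 2. This is a minimal sketch, not ConQuest's actual file formats; the sense identifiers and the single `related-to` relation are illustrative.

```python
# Dictionary: word -> list of (sense_id, part_of_speech, definition).
dictionary = {
    "abacus": [
        ("abacus:1", "n", "a slab which forms the uppermost member of "
                          "the capital of a column"),
        ("abacus:2", "n", "an instrument for performing calculations by "
                          "sliding counters along rods"),
    ],
}

# Semantic network: sense_id -> list of (relation_type, related sense_id).
# Links connect meanings, not surface words, which is what makes expansion
# more precise than a flat thesaurus.
semantic_net = {
    "abacus:2": [("related-to", "counter:4"),
                 ("related-to", "computation:1"),
                 ("related-to", "computer:2"),
                 ("related-to", "calculator:3")],
}

def related_terms(sense_id):
    """Expand one word meaning to its related senses via the network."""
    return [target for _relation, target in semantic_net.get(sense_id, [])]

print(related_terms("abacus:2"))
# -> ['counter:4', 'computation:1', 'computer:2', 'calculator:3']
```

Note how expanding `abacus:2` pulls in only calculation-related senses, while `abacus:1` (the architectural sense) expands to nothing here; keying the network on senses rather than words is what keeps the two meanings from contaminating each other.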
Words are normalized and enhanced before indexing with tokenization, morphology, and idiom processing.

[Figure 3: The Indexing Process — documents from the text database flow through Parse Document, Tokenize, Morphology, and Find Idioms (which consult the ConQuest dictionary) before being written to the indexes.]

The modules used for indexing are as follows:

* Parse Document: Looks for codes in the text database to locate fields such as title, headline, authors, etc. These fields can be indexed, ignored, or stored in a special database for fast access. Parse Document takes a command file which describes the structure of the text database to be indexed, and can handle a wide variety of different text file formats.

* Tokenize: Divides a string of characters into words. This may include special processing for dates, phone numbers, floating point numbers, hyphens, etc.

* Morphology: An advanced form of stemming, which attempts to remove suffixes and perform spelling changes to reduce words to simpler forms found in the dictionary. For example, one morphology rule will take "babies," strip the "ies," add "y," and produce "baby," which is found in the dictionary. Irregular forms of words are stored directly in the dictionary and are not subject to morphological analysis. Morphology is a much more accurate form of word reduction than stemming, because the dictionary can be used to validate the transformations. Morphology will not reduce proper nouns, and will produce much more accurate reductions, especially for words ending in "e."

* Find Idioms: Idioms are phrases which have a meaning beyond that of the individual words added together. For example, "kangaroo court" has nothing whatsoever to do with kangaroos. Proper nouns, such as "United States," also have a meaning beyond the sum of their component parts. This module finds idioms in the text and indexes each idiom as a single unit. This prevents idioms such as "Dow Jones Industrial Average" from getting confused with queries on "industrial history."
Words inside of idioms can still be located individually, if desired.

* Index: The final step is to store the reduced words and collected idioms into the indexes. The index is an inverted positional word index, which is conceptually similar to the index at the back of a textbook.

The speed of indexing by ConQuest has been measured at approximately 40 megabytes per hour. This evaluation was done on a Sun IPC SPARCstation (a 14 MIPS computer) with 32 megabytes of RAM. The same speed has been achieved on a 486 IBM-PC running at a clock rate of 33 MHz with 16 megabytes of RAM.

Query

The query process is more complex than indexing, due to word expansion and ranking. Generally speaking, ConQuest attempts to refine and enhance the user's query. The result is then matched against the indexes to look for documents which contain similar patterns.

Queries are not "understood" in the traditional sense of natural language processing. ConQuest makes no attempt to deeply understand the objects in the query, their interaction, or the user's intent. Rather, ConQuest attempts to understand the meaning of each individual word and the importance of the word. It then uses the set of meanings and their related terms (retrieved from the semantic networks) as a statistical set which is matched against document information stored in the indexes.

[Figure 4: The Query Process — the query passes through Tokenize, Morphology, Find Idioms, Query Enhancement, Remove Stop Words, and Expand Meanings (consulting the ConQuest dictionary and semantic networks), after which Search & Rank matches it against the indexes to retrieve documents.]

The following is a description of the modules used for query:

* Tokenize, Morphology, Find Idioms: These modules are the same as for indexing.

* Query Enhancement: The user is given the opportunity to enhance the query for additional improvement in precision and recall. There are many options available here, but the two most important are to choose meanings and to weight query terms.
Choosing a meaning of a word restricts expansion to only those related terms which are relevant to the chosen meanings. This reduces noise in the query. When running in automatic mode, ConQuest expands all meanings of all words. Weighting query terms identifies the importance of the various words in the query. These weights are used by the search engine when ranking documents and computing document relevance factors.

* Remove Stop Words: Small function words (such as determiners, conjunctions, auxiliary verbs, and small adverbs) are removed from the query, just as they were during indexing. Removing these terms makes queries faster, and also reduces ambiguous noise in the query.

* Expand Meanings: Words in the query are expanded to include related terms.

* Search and Rank: ConQuest uses an integrated search and rank algorithm which considers the relevance rankings of documents throughout the search process. Since ranking and search are integrated, the search engine automatically produces the most relevant documents right away. This is different from past approaches, which typically retrieve all matching documents and then rank and sort them as a separate step.

The speed of queries is based on many factors, including database size, amount of expansion, size of dictionaries, etc. Many of these parameters can be modified to achieve the appropriate trade-off between accuracy and speed. For TREC, a query over a 1 Gigabyte database with 15 terms took roughly 9 seconds to retrieve 500 documents on a Sparc-IPC with 32 megabytes of RAM. Equivalent speeds have been measured on 486 IBM-PC computers running at 33 MHz with 16 megabytes of RAM.

Queries can be expanded to a very large number of terms, if desired. If the user wishes for the greatest amount of recall, a 5-word query can be expanded to 200 or 300 related terms.
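The automatic-mode expansion described above (every meaning of every word, plus related terms, up to a cap) can be sketched as follows. The sense table, network entries, and the `max_terms` cap are all illustrative assumptions, not ConQuest's actual data or API; the 200-300 figure cited in the text is what motivates the cap.

```python
# Toy sense inventory and semantic network (hypothetical data).
senses = {"bank": ["bank:1", "bank:2"]}
semantic_net = {"bank:1": [("synonym", "shore:1")],
                "bank:2": [("related-to", "finance:1")]}

def expand_query(query_terms, max_terms=300):
    """Expand each query word to all of its senses plus their related
    terms, capping the total expansion size (automatic mode: no user
    has chosen among the meanings, so every sense is included)."""
    expanded = []
    for word in query_terms:
        for sense in senses.get(word, []):
            expanded.append(sense)
            expanded.extend(target for _rel, target
                            in semantic_net.get(sense, []))
    return expanded[:max_terms]

print(expand_query(["bank"]))
# -> ['bank:1', 'shore:1', 'bank:2', 'finance:1']
```

In interactive use, a user who picked only `bank:2` would cut `shore:1` out of the expansion, which is exactly the noise-reduction effect the text attributes to choosing meanings.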
Ranking

Ranking and retrieval with ConQuest use a variety of statistics and criteria, which are flexible and can be modified to handle varying requirements. The following are some of the factors used in ranking:

* Completeness: A good document should contain at least one term or related term for each word in the original query.

* Contextual Evidence: Words are supported by their related terms. If a document contains a word and its related terms, then the word is given a higher weight because it is surrounded by supporting evidence.

* Semantic Distance: The semantic network contains information on how closely two terms are related.

* Proximity: A document is considered to be more relevant if it contains matching terms which occur close together, preferably in the same paragraph or sentence.

* Quantity: The absolute quantity of hits in the document is also included, but is not as strong a discriminator of relevance as the other factors.

ConQuest is the first truly "concept-based" search system to operate over unrestricted domains. If a document contains a word and some of its related terms, the word is more likely to be used in the correct meaning, per the "contextual evidence" factor above. In this way, ConQuest can determine word meanings at query time.

Other Features

ConQuest contains a number of other search features, used to handle specific situations and search requirements. These were not used for the Text REtrieval Conference. Some of these features are listed below:

* Wildcards: Useful for locating misspellings in the queries or in the database. The user can specify a word with a wildcard, such as "compute*," and then choose which of the matching terms from the indexes should be included in the query. Query words derived from wildcards are not otherwise expanded.

* Boolean: The traditional boolean query mode is also available, and contains all of the basic operators. These include AND, OR, NOT, WITHIN, thesaurus expansion, and nested expressions.
* Query By Example: It is possible to submit an entire document as a query, which is called "query by example." Essentially, the document is used as an example, and documents similar to it are retrieved. This function works even when the example document is very large.

* Search Within: A query can be directed to search only within a small set of documents. The set of documents is often selected from a previous query. This function is also called "recursive" or "recurrent" search. It is especially useful when a statistical query searches over the results of an earlier boolean query, or vice versa.

* Numeric and Date Ranges: ConQuest can find documents which contain numbers or dates within a specified range. Numbers are subject to the standard proximity tests in the boolean and natural language queries, just like other words.

* Fielded Searches: A search can be restricted to any particular field in a document. For example, users often wish to search only over the "authors" field, and not over the full body of the text.

* Document Categories: Documents in ConQuest can be categorized as appropriate. Users can target searches to occur only over a single category, or over multiple categories.

Evaluation Results

ConQuest had the highest overall score in TREC for Category A systems (the full 2.5 Gigabyte database) using the 11-point averages. In comparing ConQuest to other systems, we found the following two graphs to be useful.

[Figure 5: ConQuest vs. All Systems for the 39 Non-Zero Queries, Category A, Manual Mode — a per-query bar chart on a scale from -100% (worst) to 100% (best).]

This first graph (figure 5) shows the results of ConQuest for the 39 non-zero queries (the remaining 11 queries had no relevant documents in the database). A score of 100% represents the best of all Category A systems, and a score of -100% represents the worst of all Category A systems.
A score of 0% represents average performance. The results show that ConQuest was best on five of the 39 queries, worst on one query, and well above average for most. Essentially, the area above average is much greater than the area below.

The next chart (figure 6) compares ConQuest to the next two highest Category A manual systems. This chart was originally produced in the TREC proceedings.

[Figure 6: Precision at Recall for the Top 3 Scorers, Category A, Manual Mode — precision (0.0 to 1.0) plotted against recall percentage for ConQuest and the next two systems.]

This chart shows ConQuest having better performance in the high-recall region (where a large percentage of the relevant documents are retrieved). We suspect that the performance in this region is enhanced by our aggressive expansion of terms coupled with the flexible ranking algorithm. Good performance in the high-recall region is important to us, because studies have shown that high recall is the most difficult problem for text retrieval systems. Many systems achieve relatively high precision by discarding documents, but very few achieve high recall.

Conclusions / Future

The performance of ConQuest exceeded all our expectations. The system is still young, and we fully expected that much more time would be required to rework the ranking algorithms and evaluate the system. These positive results from TREC indicate that we are well on our way to superior retrieval performance.

Since TREC, ConQuest has been working on a detailed analysis of all factors which drive ranking and retrieval. This includes the dictionary, morphology, idiom processing, term expansion, and the ranking engine itself. We feel that our performance on TREC is a good start, but that it does not represent our full potential. We look forward to TREC-2 to see how much further we can progress.