The QA System

James Driscoll, Jennifer Lautenschlager, Mimi Zhao
Department of Computer Science
University of Central Florida
Orlando, Florida 32816 USA

Abstract

In the QA System, semantic information is combined with keywords to measure similarity between natural language queries and documents. A combination of keyword relevance and semantic relevance is achieved by treating keyword weights and semantic weights alike, using the vector processing model as a basis for the combination. The approach is based on (1) the database concept of semantic modeling and (2) the linguistic concept of thematic roles. Semantic information is stored in a lexicon built manually using information found in Roget's Thesaurus.

Keywords: vector processing model, semantic data model, semantic lexicon, thematic roles, entity attributes.

1. Introduction

The QA System is based on the semantic approach to text search reported in [9]. The QA System accepts natural language queries against collections of documents. The system uses keywords as document identifiers in order to sort retrieved documents based on their similarity to a query. The system also imposes a semantic data model upon the "surface level" knowledge found in text (unstructured information from a database point of view).

The intent of the QA System has been to provide convenient access to information contained in the numerous and large public information documents maintained by Public Affairs at NASA Kennedy Space Center (KSC). During a launch at KSC, about a dozen NASA employees access these printed documents to answer media questions. The planned document storage for NASA KSC Public Affairs is around 300,000 pages (approximately 900 megabytes of disk storage). Because of our environment, the performance of our system is measured by a count of the number of documents one must read in order to find an answer to a natural language question. Consequently, the traditional precision and recall measures for IR have not been used to measure the performance of the QA System. We have had success using semantics to improve the ranking of documents when searching for an "answer" to a query in a document collection of size less than 50 megabytes. However, it is important to note that our success has been demonstrated only in a real-world situation where queries are the length of a sentence and documents are either a sentence or, at most, a paragraph [8,9].

Our reasons for participating in TREC have been to (1) learn how our semantic approach fares when traditional IR measures of performance are used, and (2) test our system on larger collections of documents. In this paper, we describe our system, the experiments we performed, our results, and failure analysis.

2. Overview of the QA System Modified for TREC

The QA System has been restricted to an IBM-compatible PC platform running under the DOS 5.0 operating system and without the use of any other licensed commercial software such as a DOS extender. The DOS version of the QA System is available at nominal cost from [3]. About 2,000 hours of programming have been used to develop the current software, which includes a pleasant user interface; just as many hours have been used testing the basic keyword operation of the system. In addition, approximately 1,000 hours have been used performing experiments involving the semantic aspect of the QA System. The SQL/DS relational database system has been used to carry out some of these semantic experiments.
The QA System is implemented in C and uses B+ tree structures for the inverted files. We felt the speed of the system and its storage overhead were not efficient enough to appear reasonable for TREC, so we designed a separate system, without a pleasant user interface, which uses a hashing scheme to establish codes for strings. This was done to cut down on storage space and eliminate the use of B+ trees. Approximately 400 hours of programming and debugging effort were used to modify the system for the TREC experiments. We kept the DOS environment.

This work has been supported in part by NASA KSC Cooperative Agreement NCC 10-003 Project 2, Florida High Technology and Industry Council Grants 4940-11-28-721 and 4940-11-728, and DARPA Grant 4940-11-28-808.

For the TREC experiments, we used nine IBM PS/2 Model 95 computers. These were 50 MHz 486 computers, each with 8 megabytes of RAM and one 400 megabyte hard drive. Two of the machines had 16 megabytes of RAM and another one had two 400 megabyte hard drives. A 33 MHz 486 PC was used to distribute text to the nine IBM machines, and then collect information for merging and redistribution. Our plan was to put each of the nine Vol. 1 and Vol. 2 document collections on a separate machine, determine the document frequency for each term t_j on each machine, and then merge the results and determine the inverse document frequency for each term in the entire collection. Next, we would index and query each document collection separately, and finally merge the retrieval results.

The vector processing model is the basis for our approach. Terms used as document identifiers are keywords modified by various techniques such as stop lists, stemming, synonyms, and query reformulation. The calculation of the weighting factor (w) for a term in a document is a combination of term frequency (tf), document frequency (df), and inverse document frequency (idf). The basic definitions are as follows:

    tf_ij = number of occurrences of term t_j in document D_i
    df_j  = number of documents in a collection which contain t_j
    idf_j = log(N / df_j), where N = total number of documents
    w_ij  = tf_ij * idf_j

When the system is used to query a collection of documents with t terms, the system computes a vector Q equal to (w_q1, w_q2, ..., w_qt) representing the weights for each term in the query. The retrieval of a document with vector D_i equal to (d_i1, d_i2, ..., d_it) representing the weights of each term in the document is based on the value of a similarity measure between the query vector and the document vector. For the NIST experiments, we used the Jaccard similarity coefficient [7] to retrieve documents for the September deadline:

    sim(Q, D_i) = sum_{j=1..t} w_qj d_ij / ( sum_{j=1..t} (d_ij)^2 + sum_{j=1..t} (w_qj)^2 - sum_{j=1..t} w_qj d_ij )

3. Semantic Approach

Although the basic IR approach has shown some success in regard to natural language queries, it ignores some valuable information. For instance, consider the following query:

    How long does the payload crew go through training before a launch?

The typical IR system dismisses the following words in the query as useless or empty: "how", "does", "the", "through", "before", and "a". Some of these words contain valuable semantic information. The following list indicates some of the semantic information triggered by a few of these words:

    how long    Duration, Time
    through     Location/Space, Motion with Reference to Direction, Time
    before      Location/Space, Time

The database concept of semantic modeling and the linguistic concept of thematic roles can help in this regard.
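As an illustration of the weighting and similarity computation just described, the following minimal C sketch (illustrative only, not the QA System source; the collection size, term frequencies, and vector values are toy numbers chosen for the example) computes tf*idf term weights for a document and the Jaccard coefficient between a query vector and a document vector.

    /*
     * Illustrative sketch: tf*idf term weights and the Jaccard similarity
     * coefficient described above.  Vectors are kept as small dense arrays
     * indexed by term number purely for clarity.
     */
    #include <math.h>
    #include <stdio.h>

    #define NUM_TERMS 4   /* t: number of distinct terms (toy value) */

    /* w = tf * idf, where idf = log(N / df) */
    double term_weight(int tf, int df, int num_docs)
    {
        if (tf == 0 || df == 0)
            return 0.0;
        return (double)tf * log((double)num_docs / (double)df);
    }

    /* Jaccard coefficient between a query vector and a document vector */
    double jaccard(const double q[], const double d[], int t)
    {
        double dot = 0.0, qsq = 0.0, dsq = 0.0;
        for (int j = 0; j < t; j++) {
            dot += q[j] * d[j];
            qsq += q[j] * q[j];
            dsq += d[j] * d[j];
        }
        double denom = dsq + qsq - dot;
        return denom > 0.0 ? dot / denom : 0.0;
    }

    int main(void)
    {
        int num_docs = 1000;                      /* N: toy collection size  */
        int tf[NUM_TERMS] = { 3, 1, 0, 2 };       /* term frequencies in D_i */
        int df[NUM_TERMS] = { 10, 400, 50, 120 }; /* document frequencies    */

        double d[NUM_TERMS], q[NUM_TERMS] = { 1.0, 0.0, 0.0, 1.0 };
        for (int j = 0; j < NUM_TERMS; j++)
            d[j] = term_weight(tf[j], df[j], num_docs);

        printf("sim(Q, D_i) = %f\n", jaccard(q, d, NUM_TERMS));
        return 0;
    }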
3.1 Semantic Modeling

Semantic modeling was an object of considerable database research in the late 1970's and early 1980's. A brief overview can be found in [2]. Essentially, the semantic modeling approach identified concepts useful in talking informally about the real world. These concepts included the two notions of entities (objects in the real world) and relationships among entities (actions in the real world). Both entities and relationships have properties. The properties of entities are often called attributes. There are basic or surface level attributes for entities in the real world. Examples of surface level entity attributes are General Dimensions, Color, and Position. These properties are prevalent in natural language. For example, consider the phrase "large, black book on the table" which indicates the General Dimensions, Color, and Position of the book.

In linguistic research, the basic properties of relationships are discussed and called thematic roles. Thematic roles are also referred to in the literature as participant roles, semantic roles, and case roles. Examples of thematic roles are Beneficiary and Time. Thematic roles are prevalent in natural language; they reveal how sentence phrases and clauses are semantically related to the verbs in a sentence. For example, consider the phrase "purchase for Mary on Wednesday" which indicates who benefited from a purchase (Beneficiary) and when a purchase occurred (Time).

The goal of our approach is to detect thematic information along with attribute information contained in natural language queries and documents. When the information is present, our system uses it to help find the most relevant document. In order to use this additional information, the basic underlying concept of text relevance as presented earlier needs to be modified. The major modifications include the addition of a lexicon with thematic and attribute information, and a modified computation of the similarity coefficient. We now discuss our semantic lexicon.

3.2 The Semantic Lexicon

Our system uses a thesaurus as a source of semantic categories (thematic and attribute information). For example, Roget's Thesaurus contains a hierarchy of word classes to relate word senses [6]. For our research, we have selected several classes from this hierarchy to be used for semantic categories. We have defined thirty-six semantic categories as shown in Figure 1.

    Thematic Role Categories: Accompaniment, Amount, Beneficiary, Cause, Comparison, Condition, Conveyance, Degree, Destination, Duration, Goal, Instrument, Location/Space, Manner, Means, Purpose, Range, Result, Source, Time

    Attribute Categories: Color, External and Internal Dimensions, Form, Gender, General Dimensions, Linear Dimensions, Motion Conjoined with Force, Motion with Reference to Direction, Order, Physical Properties, Position, State, Temperature, Use, Variation

    Figure 1. Thirty-Six Semantic Categories.

Prior to TREC, there were 3,000 entries in the lexicon established by manual examination of roughly 6,000 of the most frequent words occurring in NASA KSC text. For TREC, we made 1,000 new entries by examination of 1,700 frequent words occurring in the training text and the 52 training topics. Since the 1911 edition of Roget's Thesaurus has become public domain, we also created software which automatically extracted approximately 20,000 lexicon entries. However, we did not have enough time to explore the use of these entries.
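To make the shape of such a lexicon concrete, the following minimal C sketch (illustrative only, not the QA System's actual data structures) represents one entry mapping a word to the semantic categories it triggers. The two-sense reading of "during", giving each category probability 1/2, is an assumption made only for this example; the categories themselves match those listed for "during" in Figure 2.

    /*
     * Illustrative sketch of a semantic lexicon entry: a word, the semantic
     * categories it triggers, and a probability per category derived from
     * the word's thesaurus senses.
     */
    #include <stdio.h>

    #define MAX_CATEGORIES_PER_WORD 8

    struct category_prob {
        const char *category;   /* one of the 36 semantic categories            */
        double      prob;       /* (senses triggering category) / (total senses) */
    };

    struct lexicon_entry {
        const char          *word;
        int                  num_categories;
        struct category_prob cats[MAX_CATEGORIES_PER_WORD];
    };

    /* Example entry: "during" triggers Duration and Time; assuming two
     * thesaurus senses for illustration, each receives probability 1/2. */
    static const struct lexicon_entry during_entry = {
        "during", 2,
        { { "Duration", 0.5 }, { "Time", 0.5 } }
    };

    int main(void)
    {
        for (int i = 0; i < during_entry.num_categories; i++)
            printf("%s -> %s (p = %.2f)\n", during_entry.word,
                   during_entry.cats[i].category, during_entry.cats[i].prob);
        return 0;
    }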
In order to explain the assignment of semantic categories to a given term using Roget's Thesaurus, consider the brief index quotation for the term "vapor":

    vapor  n.  fog 404.2
               fume 401
               illusion 519.1
               ... 4.3
               steam 328.10
               thing imagined 535
           v.  be bombastic 601.6
               bluster 911.3
               boast 910.6
               exhale 310.23
               talk nonsense 547.5

The eleven different meanings of the term "vapor" are given in terms of a numerical category. We have developed a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in Figure 1. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute Temperature; and "exhale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes. Since there are eleven meanings associated with "vapor," we indicate in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. This technique of calculating probabilities is being used as a simple alternative to a corpus analysis. It should be pointed out that we are still experimenting with other ways of calculating probabilities.

Figure 2 shows lexicon entries for prepositions as a sample of the lexicon used in our experiments. These entries are somewhat misleading. Most prepositions trigger too many semantic categories to be of real use. The prepositions "during" and "until" are examples of useful prepositions.

3.3 Extended Computation of the Similarity Measure

The probabilistic details of a semantic lexicon and the computation of semantic weights can be found in [9]. A detailed explanation of the manner in which we combine semantic weights and keyword weights can be found in [8]. Essentially, we treat semantic categories like indexing terms, and the probabilities introduced by a semantic lexicon mean that the frequency of a category in a document becomes an expected frequency and the presence of a category in a document becomes a probability for the category being present. This means that the document frequency for a category becomes an expected document frequency, and this enables an inverse document frequency to be calculated for a category.
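As a concrete illustration of these expected-category statistics, the following C sketch (ours, not code from [8] or [9]) computes the expected frequency of one category over a document's word occurrences and an estimate of the probability that the category is present. Treating the word occurrences as independent for the probability-present estimate, and the two extra probabilities beyond the 2/11 for "vapor", are assumptions made only for the example.

    /*
     * Illustrative sketch: given the lexicon probability of one semantic
     * category for each word occurrence in a document, compute the
     * category's expected frequency and an estimate of the probability
     * that the category is present (independence assumed).
     */
    #include <stdio.h>

    /* Expected frequency of a category: sum of per-occurrence probabilities. */
    double expected_frequency(const double p[], int n)
    {
        double ef = 0.0;
        for (int i = 0; i < n; i++)
            ef += p[i];
        return ef;
    }

    /* Probability the category occurs at least once in the document. */
    double probability_present(const double p[], int n)
    {
        double absent = 1.0;
        for (int i = 0; i < n; i++)
            absent *= (1.0 - p[i]);
        return 1.0 - absent;
    }

    int main(void)
    {
        /* Toy document: three words that trigger the attribute State;
         * "vapor" contributes 2/11 as computed in the text above, and the
         * other two probabilities are invented for illustration. */
        double p_state[] = { 2.0 / 11.0, 0.25, 0.1 };
        int n = 3;

        printf("expected frequency  = %f\n", expected_frequency(p_state, n));
        printf("probability present = %f\n", probability_present(p_state, n));
        return 0;
    }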
    after: Time(3), None(3), External and Internal Dimensions(2), Motion with Respect to Direction, Order, Location/Space
    above: Amount(2), Location/Space, None(2), Linear Dimensions(2), Order, External and Internal Dimensions, Position
    as: Condition, Comparison, Manner, None(2), Amount
    at: Condition, Location/Space, Manner, Time, Position, External and Internal Dimensions, Duration
    atop: Location/Space, Position
    before: Location/Space, Time(4), External and Internal Dimensions, Motion with Respect to Direction, Order, None(2), Position
    below: Motion with Respect to Direction, None(2), Amount, State, Linear Dimensions, Location/Space, Position
    between: Comparison, Duration, External and Internal Dimensions, Location/Space, Position, Amount
    by: Amount, Conveyance, Location/Space, Time, Position, External and Internal Dimensions, Means, Order, Motion with Respect to Direction, Duration, None
    during: Duration, Time
    except: Condition, None(2), Order, Amount(2)
    for: Duration, Goal, Purpose, Variation, Destination, Beneficiary, Amount, Range
    from: Cause, Source, Time, Motion with Respect to Direction(2), Amount, Location/Space, Position, Instrument
    in: Instrument, Location/Space, Purpose, Time, Motion with Respect to Direction(5), External and Internal Dimensions(2), Position, Condition, Goal, Means
    into: Condition, Location/Space, Time, Motion with Respect to Direction, External and Internal Dimensions, Position, None
    like: Comparison(3), Amount(2), Condition, Manner, None(7)
    of: Cause, Location/Space, Source, None, Time, Duration, Position
    on: Condition, Conveyance, Location/Space, Source, Time, Linear Dimensions(2), Motion with Respect to Direction, General Dimensions, Position, Means, External and Internal Dimensions, Purpose
    over: Duration, Location/Space, Time(2), Order, Linear Dimensions(3), Amount(4), General Dimensions, Motion with Respect to Direction, External and Internal Dimensions, Position, Degree, Condition
    per: Means, Order(2)
    through: None, Order, Goal, Linear Dimensions, Means, Time, Location/Space, Position, Motion with Respect to Direction
    to: Accompaniment, Beneficiary, Condition, Degree, Location/Space, Purpose, Result, Time, General Dimensions, Position, Motion with Respect to Direction(2), Comparison
    under: Condition, Location/Space, Position, Order, Degree, None, Amount, Linear Dimensions(3)
    until: Condition, Time
    upon: Condition, Conveyance, Source, Time, None, General Dimensions, Linear Dimensions, Means, External and Internal Dimensions, Motion with Respect to Direction, Purpose
    with: Accompaniment(2), Comparison, Manner, Result, Amount(3), Means, None, Order, Location/Space, Position, Time
    within: Range, External and Internal Dimensions(2), Time, Degree, Position, Location/Space
    without: External and Internal Dimensions, Order, Amount, None(2)

    Figure 2. Prepositions and Their Semantic Categories (Reprinted by Permission of [3]).

For TREC semantic experiments performed after the September deadline (refer to Section 5.2), we treated semantic categories like keywords and used the following normalized similarity coefficient:

    sim(Q, D_i) = ( sum_{j=1..t+s} w_qj d_ij ) / (t + s)

where s = 36 is the number of semantic categories. It should be pointed out that we are still performing experiments to determine a proper blend of keyword and semantic information.

4. System Details

The following is an overview of the system we used to perform the TREC experiments. As pointed out earlier, many hours of programming and debugging effort were used to create a system for the TREC experiments.
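A minimal C sketch of this combined scoring follows (it reflects our reading of the reconstructed coefficient above, not the authors' implementation, and all weights are toy values): the 36 semantic category weights are appended to the keyword weights as extra vector components, and the inner product is normalized by t + s.

    /*
     * Illustrative sketch of the normalized keyword-plus-semantic
     * coefficient: semantic categories are treated as s = 36 extra
     * components of the vectors, and the inner product is divided by t + s.
     */
    #include <stdio.h>

    #define NUM_KEYWORDS   5    /* t: toy number of keyword terms   */
    #define NUM_CATEGORIES 36   /* s: number of semantic categories */
    #define VEC_LEN (NUM_KEYWORDS + NUM_CATEGORIES)

    double combined_similarity(const double q[VEC_LEN], const double d[VEC_LEN])
    {
        double dot = 0.0;
        for (int j = 0; j < VEC_LEN; j++)
            dot += q[j] * d[j];
        return dot / (double)VEC_LEN;   /* divide by t + s */
    }

    int main(void)
    {
        double q[VEC_LEN] = { 0 }, d[VEC_LEN] = { 0 };

        /* Keyword components: tf*idf style weights (toy values). */
        q[0] = 1.2; d[0] = 0.9;
        q[3] = 0.7; d[3] = 1.1;

        /* Semantic components: weights built from expected category
         * frequencies and category inverse document frequencies (toy values). */
        q[NUM_KEYWORDS + 5]  = 0.3; d[NUM_KEYWORDS + 5]  = 0.4;
        q[NUM_KEYWORDS + 20] = 0.2; d[NUM_KEYWORDS + 20] = 0.1;

        printf("sim(Q, D_i) = %f\n", combined_similarity(q, d));
        return 0;
    }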
4.1 The Scanning Process

Under the QA scanning procedure, scanning a document for indexing terms is a three-step process:

A. A token is scanned.

B. The token is analyzed. It is compared to a list of 166 stopwords, and these words (along with any numbers) are later discarded. Dates are transformed into a generic format to allow for the matching of dates in a variety of different formats. Non-valid hyphenated words are separated into multiple tokens. Words that can be abbreviated (as determined by a list of 122 abbreviations) are replaced by their abbreviated form. Acronyms of multiple words are detected and replaced by their respective acronyms.

C. The remainder of the words are stemmed according to a modified version of the J. B. Lovins stemming algorithm [5]. In some cases, prefixes are also removed.

The scanning procedure for the TREC experiments is a modification of the QA scanning procedure, in which only the words found in the text fields of the documents are tokenized. In order to speed up text processing, the amount of analyzing in step B is reduced. Dates are no longer transformed, and abbreviations and acronyms are not created. The stemming algorithm and removal of prefixes in step C remain unchanged. For the TREC scanning procedure, an additional step is added. During this extra step, a hashing algorithm is used to assign each indexing term a unique 32-bit integer value, which we call a stem code.

4.2 Data Files

After a document is processed by the TREC version of the QA System, four primary data files are created. These are the document weight file, the inverted index file, the inverted data file, and the document name file.

The document weight file is a binary file containing a list of floating point numbers. These numbers are ordered sequentially by document number, and represent the summation of the squared weights of a particular document's keywords. The inverted index file is ordered by stem code. Each code has two file pointers, one pointing to the first block of data in the data file, and the second pointing to the last block of data. The inverted data file then consists of blocks of data, containing pairs of document numbers that the code is found in, and the code's frequency within that document. The blocks are linked together to form a list. The document name file consists of a list of pointers into the original text file. For Vol. 1, both the document weight file and the inverted index file were two megabytes. The inverted data file was approximately 385 megabytes, and the document name file was two megabytes.

4.3 Basic Procedure

For the TREC experiments, we did the following:

Step 1: This step involves matching every legitimate stem in the document collection with a unique integer value. This is done with a linear hashing function. A table containing this mapping, along with the number of documents each code is found in, is temporarily saved for use in Step 2.

Step 2: This step creates the four data files described above. The entire database is scanned, with the four files being created on the fly. Once this is accomplished, the table from Step 1 is no longer necessary, and is discarded.

Step 3: Relevant documents for each query are selected using the Jaccard similarity coefficient. The top 200 documents for each query are then determined.

The above three steps were followed to create the results for the September deadline. In the next section we present our "official" experiments and results, and some "unofficial" experiments and results.
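To make the stem-code and inverted-file layout concrete, here is an illustrative C sketch. The paper does not name the hash function used, so a 32-bit FNV-1a hash appears below purely as a stand-in, and the struct names are ours, not the system's.

    /*
     * Illustrative sketch: assigning a stem a 32-bit stem code, and the
     * linked-block layout described for the inverted data file, where each
     * block holds (document number, frequency) pairs.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in 32-bit string hash producing a stem code (FNV-1a). */
    uint32_t stem_code(const char *stem)
    {
        uint32_t h = 2166136261u;            /* FNV offset basis */
        for (const unsigned char *p = (const unsigned char *)stem; *p; p++) {
            h ^= *p;
            h *= 16777619u;                  /* FNV prime */
        }
        return h;
    }

    /* One posting: a document number and the stem's frequency in it. */
    struct posting {
        uint32_t doc_number;
        uint32_t frequency;
    };

    #define POSTINGS_PER_BLOCK 64

    /* Inverted data file block: postings plus a link to the next block. */
    struct inverted_block {
        struct posting         postings[POSTINGS_PER_BLOCK];
        int                    used;
        struct inverted_block *next;   /* on disk this would be a file offset */
    };

    /* Inverted index entry: a stem code with pointers to the first and
     * last blocks of its posting list. */
    struct index_entry {
        uint32_t               code;
        struct inverted_block *first;
        struct inverted_block *last;
    };

    int main(void)
    {
        printf("stem code for \"launch\" = %u\n", stem_code("launch"));
        return 0;
    }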
5. Experiments and Results

Our experiments were intended to be Category A experiments with two results submitted for each ranking task. One ranking result would be for just keywords; the other ranking result would be for keywords combined with semantics. All query construction was automatic, and the treatment of ad-hoc and routing queries was identical. We also performed experiments concerning the expected behavior of string hashing functions and the use of part-of-speech tagging to improve retrieval performance. These experiments are not reported here.

As an example of our automatically built ad-hoc and routing queries, consider Topic 004 reproduced in Figure 3. Figure 4 indicates the keyword and semantic information generated by the QA System for this topic. The first part of Figure 4 indicates the stems along with their frequencies found in the query. The second part of Figure 4 indicates the semantic categories also found in the query along with their expected frequencies and probability present.

It is important to note that the topic represented in Figure 3 and Figure 4 has generated many semantic categories and the probability present for most of them is close to, or at, 100%. This is mainly due to the length of the text involved. We discovered that, for the TREC document collection, each document generated many semantic categories with high probability present. Because we treated semantic categories like keywords, this caused semantic weights to be essentially useless.

For the TREC September deadline, we were only able to submit a routing experiment using a keywording approach. The results of this experiment, computed with the aid of Chris Buckley's SMART evaluation program [1], are shown in Figure 5. The results are not good. In Section 6, we discuss what impeded our experiments.

Further "unofficial" experiments were designed to test the use of semantics. The main goal of our experiments was to demonstrate that our original routing results could be improved through the use of semantic analysis. In order to do this, we made two modifications to our approach. The first change involved dividing the original TREC documents into paragraphs. The second change involved a semantic analysis when calculating the list of relevant documents.

Our experiments involved the use of only six routing queries (for topics 001, 002, 003, 007, 017, and 022). These topics were selected because our original results for them were poor. Through the use of semantic analysis, we hoped to significantly improve our results. Figure 6 shows the precision-recall statistics for the six "poor" queries using the retrieval results which created the statistics in Figure 5.

When analyzing our results, we computed all precision-recall tables through the use of Chris Buckley's SMART evaluation program [1]. The relevancy lists used were those produced before November 1 (the original qrels for the routing queries). We did not use the modified results that were distributed later for Query 017. This should not affect our results, though, because our experiments were aimed only at improving our precision-recall averages, and the relevancy results used were consistent from one experiment to the other.

5.1 Re-ranking of Documents

In an effort to demonstrate that semantics could affect retrieval in the TREC environment, we used the original QA System with a semantic lexicon containing TREC words as described in Section 3.2. We created a separate database for each query we considered, for a total of six databases.
Each database contained only the documents that we originally judged in the top 250 for each query. Because of this, when we computed new relevancy lists we were simply rearranging the order of the same 250 documents, not bringing in new documents.

Figure 7 reveals the precision-recall statistics when originally retrieved documents for a query are used as a document collection and re-ranked by imposing the query again. There is a 25.8% increase when comparing the 11-pt average here to the 11-pt average of the originally retrieved text (Figure 6).

To determine the ranking of a particular document with paragraph divisions, we defined the similarity coefficient of a document to be equal to the highest coefficient associated with one of its corresponding paragraphs. The paragraph divisions were automatically constructed from the original text. The precision-recall statistics for paragraphs being used as documents are shown in Figure 8. There is an 18.8% increase when comparing the 11-pt average here to the 11-pt average of the originally retrieved text (Figure 6).

    Tipster Topic Description

    Number: 004
    Domain: International Finance
    Topic: Debt Rescheduling

    Description: Document will discuss a current debt rescheduling agreement between a developing country and one or more of its creditor(s).

    Narrative: A relevant document will discuss a current debt rescheduling agreement reached, proposed, or being negotiated between a debtor developing country and one or more of its creditors, commercial and/or official. It will identify the debtor country and the creditor(s), the repayment time period requested or granted, the monetary amount requested or covered by the accord, and the interest rate, proposed or set.

    Concept(s):
    1. rescheduling agreement, accord, settlement, pact
    2. bank debt, commercial debt, foreign debt, trade debt, medium-term debt, long-term debt
    3. negotiations, debt talks
    4. creditor banks, creditor countries/governments, Paris Club
    5. debtor countries, developing countries
    6. debt package
    7. debt repayments
    8. restructuring, rescheduling existing loans
    9. lower interest-rate margin, easier terms, more lenient terms

    Factor(s): Nationality: Developing country

    Definition(s): Debt Rescheduling - agreement between creditors and debtor to provide debt relief by altering the original payment terms of an existing debt. This is most often accomplished by lengthening the original schedule for principal and interest payments, and deferring interest payments. Done most publicly by developing countries and their bankers, but often less publicly by other willing creditors and debtors, e.g., governments, banks and companies. Much in vogue in the early 1980s, the road to rescheduling for countries in crisis runs as follows: when a country borrows so much that its lenders grow nervous, the banks start lending for shorter and shorter maturities. Eventually the country, though still paying interest on its debt, is unable to make payments on the principal. The country is then forced to request a rescheduling, which means that it is able to escape its immediate repayment commitments by converting short-term loans into longer-term ones. A country wishing to reschedule its official debt tal

    Figure 3. Topic 004.

    Stems:
    tal 1 wish 1 on 1 longer 1 short 1 convers 1 commitm 1 immedi 1 escap 1 abl 1 mean 1 forc 1 un 1 pay 1 though 1 eventu 1 matur 1 shorte 2 lend 1 start 1 nerv 1 row 1 lender 1 orrow 1 follow 1 run 1 cris 1 road 1 ear 1 vogu 1 compan 1 e.g.
    1 wil 1 less 1 banker 1 public 2 defer 1 princip 2 length 1 accompl 1 orisin 2 alter 1 relief 1 prov 1 leni 1 easxxx 1 margin 1 lower 1 loan 2 exist 2 structur 1 pack 1 club 1 par 1 2 talk 1 ~ernm 1 term 6 m~4erm 1 trad 1 foreign 1 bank 4 pact 1 settl 1 rat 2 interest 5 accord 2 cover 1 amount 1 monet 1 grant 1 request 3 period 1 paf~m 7 identif 1 2 commerc 2 negoti 2 propos 2 reach 1 relev 1 credit 7 countr 13 develop 5 4 cur 3 discus 2 docum 2 schedl 10 debt 22 fin 1 intern 1

    Semantic Category                      Expected Frequency    Probability Present
    Linear Dimensions                      0.300000              0.300000
    Motion with Reference to Direction     2.000000              1.000000
    Accompaniment                          0.333333              0.333333
    Condition                              2.000000              0.937500
    External and Internal Dimensions       0.333333              0.333333
    Time                                   5.744589              1.000000
    Duration                               1.291667              0.819444
    Purpose                                2.000000              1.000000
    Color                                  2.333333              0.941472
    Position                               6.071428              0.999654
    Location/Space                         6.071428              0.999654
    Variation                              5.933333              1.0000
    Amount                                 53.412807             1.0000
    Order                                  6.833333              1.000000

    Figure 4. Automatically Generated Query for Topic 004.

Top ranked evaluation (Run number: UCFQA1, category B)
Num_queries: 19
Total number of documents over all queries: Retrieved 3800, Relevant 2056, Rel_ret 517
Recall - Precision Averages: at 0.00 0.5146, at 0.10 0.2645, at 0.20 0.1735, at 0.30 0.1006, at 0.40 0.0693, at 0.50 0.0107, at 0.60 0.0069, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.1037
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0614
Recall: at 5 docs 0.0162, at 15 docs 0.0362, at 30 docs 0.0692, at 100 docs 0.2156, at 200 docs 0.2899
Precision: at 5 docs 0.2737, at 15 docs 0.2211, at 30 docs 0.2088, at 100 docs 0.1932, at 200 docs 0.1361

Figure 5. Results for Experiment Completed by September Deadline.

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 147
Recall - Precision Averages: at 0.00 0.3486, at 0.10 0.1676, at 0.20 0.0749, at 0.30 0.0279, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0563
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0250
Recall: Exact 0.1997; at 5 docs 0.0033, at 10 docs 0.0062, at 30 docs 0.0382, at 100 docs 0.1212, at 200 docs 0.1997
Precision: Exact 0.1225; at 5 docs 0.1000, at 10 docs 0.0833, at 30 docs 0.1556, at 100 docs 0.1550, at 200 docs 0.1225

Figure 6. Results for Six Poorly Performing Queries using Experiment Completed by September Deadline.

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 152
Recall - Precision Averages: at 0.00 0.4585, at 0.10 0.2113, at 0.20 0.0775, at 0.30 0.0315, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0708
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0258
Recall: Exact 0.2109; at 5 docs 0.0072, at 10 docs 0.0128, at 30 docs 0.0469, at 100 docs 0.1374, at 200 docs 0.2109
Precision: Exact 0.1267; at 5 docs 0.2000, at 10 docs 0.1833, at 30 docs 0.1944, at 100 docs 0.1683, at 200 docs 0.1267

Figure 7. Results of Keywording with Document Divisions.
Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 157
Recall - Precision Averages: at 0.00 0.4096, at 0.10 0.2089, at 0.20 0.0832, at 0.30 0.0340, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0669
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0277
Recall: Exact 0.2167; at 5 docs 0.0121, at 10 docs 0.0167, at 30 docs 0.0462, at 100 docs 0.1392, at 200 docs 0.2167
Precision: Exact 0.1308; at 5 docs 0.2667, at 10 docs 0.1833, at 30 docs 0.1833, at 100 docs 0.1650, at 200 docs 0.1308

Figure 8. Results of Keywording with Paragraph Divisions.

The re-ranking of documents using keywords was done in order to establish a baseline for the semantic experiments reported in the next subsection. The increase in retrieval performance because of the re-ranking was a surprise. According to Sparck Jones' criteria, the increases of 25.8% and 18.8% would each be categorized as "significant" (greater than 10.0%) [4].

5.2 Semantic Experiments

Essentially, what we are trying to show in this section is that semantics can be useful if documents do not get larger than a paragraph. Figure 9 displays results when semantic similarity is combined with keyword similarity for the case that originally retrieved TREC documents are used. These results can be compared to those in Figure 7 (the same documents with a strictly keywording approach). Comparing the 11-pt average for the two methods shows a 0.7% decrease when going from the strictly keywording results to the combined semantic with keywording results. According to Sparck Jones' criteria [4], this is not a noticeable decrease. But, certainly, semantics did not help.

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 155
Recall - Precision Averages: at 0.00 0.4370, at 0.10 0.2250, at 0.20 0.0789, at 0.30 0.0329, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0703
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0263
Recall: Exact 0.2133; at 5 docs 0.0072, at 10 docs 0.0128, at 30 docs 0.0508, at 100 docs 0.1378, at 200 docs 0.2133
Precision: Exact 0.1292; at 5 docs 0.2000, at 10 docs 0.1833, at 30 docs 0.2000, at 100 docs 0.1700, at 200 docs 0.1292

Figure 9. Results of Semantics with Document Divisions.

The explanation for this is that when the size of a document is large, a greater number of semantic categories are triggered in the document. Also, the probability present for each category in a document is often very close to 100%. Consequently, almost every semantic category becomes present in every document, causing the semantic category weights to become very low and useless.

To remedy this problem, we used the database that was constructed with paragraph divisions, and computed relevancy lists using the combined semantic and keywording approach explained in Section 3.3. The results are shown in Figure 10. The statistics there can be compared to those in Figure 8 (the same documents with a strictly keywording approach). When going from a keywording approach to a combined semantic and keywording approach, the 11-pt average increased by 7.9%. According to Sparck Jones' criteria, this change would be classified as "noticeable" (greater than 5.0%) [4].

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1176, Relevant 863, Rel_ret 158
Recall - Precision Averages: at 0.00 0.4556, at 0.10 0.2231, at 0.20 0.0845, at 0.30 0.0294, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0720
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0282
Recall: Exact 0.2159; at 5 docs 0.0096, at 10 docs 0.0171, at 30 docs 0.0513, at 100 docs 0.1456, at 200 docs 0.2159
Precision: Exact 0.1332; at 5 docs 0.2333, at 10 docs 0.2000, at 30 docs 0.2167, at 100 docs 0.1767, at 200 docs 0.1317

Figure 10. Results of Semantics with Paragraph Divisions.

The two semantic experiments reported here demonstrate the main thing that we have learned. Our semantic approach to text retrieval is only useful when documents are no larger than a paragraph.
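The saturation effect described above can be illustrated with a small numerical sketch in C (the per-word trigger probability and the independence assumption are ours, chosen only for illustration, and are not taken from the paper): as a document gets longer, the probability that a given semantic category is present approaches 100%, so the category stops discriminating between documents.

    /*
     * Small numerical sketch of category saturation: probability that a
     * semantic category is present in a document of n words, assuming each
     * word independently triggers the category with probability p.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p = 0.05;   /* assumed chance that one word triggers the category */
        int lengths[] = { 25, 100, 400, 1600 };   /* words per "document" */

        for (int i = 0; i < 4; i++) {
            int n = lengths[i];
            double present = 1.0 - pow(1.0 - p, (double)n);
            printf("%5d words: probability present = %.4f\n", n, present);
        }
        return 0;
    }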
It should be noted that the semantic experiments reported here were crude. Our lexicon did not have enough TREC words, and we used a blend of keyword and semantic weights that was not the best. So we expect that better semantic results for paragraphs can be achieved (refer to Section 6).

6. Failure Analysis

In general, our participation in the TREC experiments was impeded by the following:

PC DOS Platform. This platform has a serious memory addressing restriction which results in memory page swapping, and this seriously affects the speed of processing, especially during creation of inverted files and index structures. We can solve this problem by moving to an OS/2 or UNIX platform.

Extra Semantic Processing Time. Our semantic probabilistic and statistical calculations more than double the processing time for indexing and statistical ranking of retrieved documents. Again, we can solve this problem by moving to an OS/2 or UNIX platform.

Time to Build Semantic Lexicon. We were only able to incorporate 1,000 frequently occurring words in the training text within our semantic lexicon. We did not have enough time to process the test text for the ad-hoc queries. This problem can be solved by having archival data distributed earlier. We suspect that by having more TREC words in our semantic lexicon, better results could have been achieved in Section 5.2 when paragraphs are used as the basis for retrieval.

Unknown Blend for Semantic and Keyword Weights. There are three main aspects to our blend of semantic and keyword weights within the vector processing model:

(i) The Proper Probabilities to Use for the Semantics Triggered by a Word. For example, we let the word "vapor" trigger State with 18% probability, Temperature with 9% probability, and Motion with Reference to Direction with 9% probability. We have several techniques for determining probabilities such as these.

(ii) The Scaling of Keyword Weights and Semantic Weights. For example, in a Question/Answer environment where queries are the length of a sentence and documents are either a sentence or at most a paragraph, we have been successful by forcing semantic similarity to be approximately 1/3 of keyword similarity when the two are combined in processing small document collections (less than 1000 documents). There was no scaling for the experiments reported in Section 5.2; we suspect better results could have been achieved.

(iii) Independent Semantic Weights and Keyword Weights.
A valid criticism of our research has been that the semantic contribution from a word in a document should be kept independent of the word's own similarity contribution if the word is a keyword in common with the query.

The overall problem of proper blend can be solved by spending more time using TREC test documents, test topics, and good test relevance judgments to run many retrieval experiments to establish the correct blend.

Number of Semantic Categories. Another way to solve the problem of long documents causing semantic weights to be of little value is to have more semantic categories. A large number of "semantic" categories could be obtained (for example) by using all the categories and/or subcategories found in Roget's Thesaurus, instead of the 36 semantic categories we use. This would be a deviation from database semantic modeling, but it probably should be examined.

Block-Split Tree Structured Files. The QA System used B+ tree structures for implementing inverted files, and this actually slowed the system in our DOS environment. The QA System also had severe storage overhead due to storage of character strings in the B+ trees. We have solved these two problems by implementing a separate system using a hashing function to establish codes for strings.

TREC Document Length. Semantic experiments like those reported in Section 5.2 have shown that documents larger than a paragraph cause our semantic approach to be of little value. This problem can be corrected by considering paragraphs as a basis for document retrieval.

Finally, we spent too much time on work that was never incorporated in our experiments. We originally designed an efficient method of inverting data files, but it could not be used for routing queries. Also, trying to do semantic part-of-speech tagging experiments using SQL/DS slowed us down.

References

[1] C. Buckley, SMART Evaluation Program (for TREC), Cornell SMART Group, Cornell University.

[2] C. Date, An Introduction to Database Systems, Vol. 1, Addison-Wesley, 1990.

[3] Hello Software, P.O. Box 494, Goldenrod, FL 32733.

[4] K. Sparck Jones and R. Bates, "Research on Automatic Indexing 1974-1976," Technical Report, Computer Laboratory, University of Cambridge, 1977.

[5] J. B. Lovins, "Development of a Stemming Algorithm," Mechanical Translation and Computational Linguistics, Vol. 11, No. 1-2, pp. 11-31, March and June 1968.

[6] Roget's International Thesaurus, Fourth Edition, Harper & Row, New York, 1977.

[7] G. Salton, Automatic Text Processing, Addison-Wesley, 1989.

[8] D. Voss and J. Driscoll, "Text Retrieval Using a Comprehensive Semantic Lexicon," Proceedings of the ISMM First International Conference on Information and Knowledge Management, Baltimore, Maryland, November 1992.

[9] E. Wendlandt and J. Driscoll, "Incorporating a Semantic Analysis into a Document Retrieval Strategy," Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Illinois, pp. 270-279, October 1991.