The QA System

James Driscoll, Jennifer Lautenschlager, Mimi Zhao
Department of Computer Science
University of Central Florida
Orlando, Florida 32816 USA

Abstract

In the QA System, semantic information is combined with keywords to measure similarity between natural language queries and documents. A combination of keyword relevance and semantic relevance is achieved by treating keyword weights and semantic weights alike, using the vector processing model as a basis for the combination. The approach is based on (1) the database concept of semantic modeling and (2) the linguistic concept of thematic roles. Semantic information is stored in a lexicon built manually using information found in Roget's Thesaurus.

Keywords: vector processing model, semantic data model, semantic lexicon, thematic roles, entity attributes.

1. Introduction

The QA System is based on the semantic approach to text search reported in [9]. The QA System accepts natural language queries against collections of documents. The system uses keywords as document identifiers in order to sort retrieved documents based on their similarity to a query. The system also imposes a semantic data model upon the "surface level" knowledge found in text (unstructured information from a database point of view).

The intent of the QA System has been to provide convenient access to information contained in the numerous and large public information documents maintained by Public Affairs at NASA Kennedy Space Center (KSC). During a launch at KSC, about a dozen NASA employees access these printed documents to answer media questions. The planned document storage for NASA KSC Public Affairs is around 300,000 pages (approximately 900 megabytes of disk storage). Because of our environment, the performance of our system is measured by a count of the number of documents one must read in order to find an answer to a natural language question. Consequently, the traditional precision and recall measures for IR have not been used to measure the performance of the QA System. We have had success using semantics to improve the ranking of documents when searching for an "answer" to a query in a document collection of size less than 50 megabytes. However, it is important to note that our success has been demonstrated only in a real-world situation where queries are the length of a sentence and documents are either a sentence or, at most, a paragraph [8,9].

Our reasons for participating in TREC have been to (1) learn how our semantic approach fares when traditional IR measures of performance are used, and (2) test our system on larger collections of documents. In this paper, we describe our system, the experiments we performed, our results, and failure analysis.

2. Overview of the QA System Modified for TREC

The QA System has been restricted to an IBM-compatible PC platform running under the DOS 5.0 operating system and without the use of any other licensed commercial software such as a DOS extender. The DOS version of the QA System is available at nominal cost from [3]. About 2,000 hours of programming have been used to develop the current software, which includes a pleasant user interface; just as many hours have been used testing the basic keyword operation of the system. In addition, approximately 1,000 hours have been used performing experiments involving the semantic aspect of the QA System. The SQL/DS relational database system has been used to carry out some of these semantic experiments.
The QA System is implemented in C and uses B+ tree structures for the inverted files. We felt the speed of the system and its storage overhead were not efficient enough to appear reasonable for TREC, so we designed a separate system, without a pleasant user interface, which uses a hashing scheme to establish codes for strings. This was done to cut down on storage space and eliminate the use of B+ trees. Approximately 400 hours of programming and debugging effort were used to modify the system for the TREC experiments. We kept the DOS environment.

This work has been supported in part by NASA KSC Cooperative Agreement NCC 10-003 Project 2, Florida High Technology and Industry Council Grants 4940-11-28-721 and 4940-11-728, and DARPA Grant 4940-11-28-808.

For the TREC experiments, we used nine IBM PS/2 Model 95 computers. These were 50 MHz 486 computers, each with 8 megabytes of RAM and one 400 megabyte hard drive. Two of the machines had 16 megabytes of RAM and another one had two 400 megabyte hard drives. A 33 MHz 486 PC was used to distribute text to the nine IBM machines, and then collect information for merging and redistribution. Our plan was to put each of the nine Vol. 1 and Vol. 2 document collections on a separate machine, determine the document frequency for each term t_j on each machine, and then merge the results and determine the inverse document frequency for each term in the entire collection. Next, we would index and query each document collection separately, and finally merge the retrieval results.

The vector processing model is the basis for our approach. Terms used as document identifiers are keywords modified by various techniques such as stop lists, stemming, synonyms, and query reformulation. The calculation of the weighting factor (w) for a term in a document is a combination of term frequency (tf), document frequency (df), and inverse document frequency (idf). The basic definitions are as follows:

    tf_ij = number of occurrences of term t_j in document D_i
    df_j  = number of documents in a collection which contain t_j
    idf_j = log(N / df_j), where N = total number of documents
    w_ij  = tf_ij * idf_j

When the system is used to query a collection of documents with t terms, the system computes a vector Q equal to (w_q1, w_q2, ..., w_qt) representing the weights for each term in the query. The retrieval of a document with vector D_i equal to (d_i1, d_i2, ..., d_it) representing the weights of each term in the document is based on the value of a similarity measure between the query vector and the document vector. For the NIST experiments, we used the Jaccard similarity coefficient [7] to retrieve documents for the September deadline:

    sim(Q, D_i) = sum_{j=1..t} w_qj d_ij / ( sum_{j=1..t} (d_ij)^2 + sum_{j=1..t} (w_qj)^2 - sum_{j=1..t} w_qj d_ij )

3. Semantic Approach

Although the basic IR approach has shown some success in regard to natural language queries, it ignores some valuable information. For instance, consider the following query:

    How long does the payload crew go through training before a launch?

The typical IR system dismisses the following words in the query as useless or empty: "how", "does", "the", "through", "before", and "a". Some of these words contain valuable semantic information. The following list indicates some of the semantic information triggered by a few of these words:

    how long    Duration, Time
    through     Location/Space, Motion with Reference to Direction, Time
    before      Location/Space, Time

The database concept of semantic modeling and the linguistic concept of thematic roles can help in this regard.
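As an illustration of the weighting and similarity computation just described, the following minimal C sketch (illustrative only, not the QA System source; the collection size, term frequencies, and vector values are toy numbers chosen for the example) computes tf*idf term weights for a document and the Jaccard coefficient between a query vector and a document vector.

    /*
     * Illustrative sketch: tf*idf term weights and the Jaccard similarity
     * coefficient described above.  Vectors are kept as small dense arrays
     * indexed by term number purely for clarity.
     */
    #include <math.h>
    #include <stdio.h>

    #define NUM_TERMS 4   /* t: number of distinct terms (toy value) */

    /* w = tf * idf, where idf = log(N / df) */
    double term_weight(int tf, int df, int num_docs)
    {
        if (tf == 0 || df == 0)
            return 0.0;
        return (double)tf * log((double)num_docs / (double)df);
    }

    /* Jaccard coefficient between a query vector and a document vector */
    double jaccard(const double q[], const double d[], int t)
    {
        double dot = 0.0, qsq = 0.0, dsq = 0.0;
        for (int j = 0; j < t; j++) {
            dot += q[j] * d[j];
            qsq += q[j] * q[j];
            dsq += d[j] * d[j];
        }
        double denom = dsq + qsq - dot;
        return denom > 0.0 ? dot / denom : 0.0;
    }

    int main(void)
    {
        int num_docs = 1000;                      /* N: toy collection size  */
        int tf[NUM_TERMS] = { 3, 1, 0, 2 };       /* term frequencies in D_i */
        int df[NUM_TERMS] = { 10, 400, 50, 120 }; /* document frequencies    */

        double d[NUM_TERMS], q[NUM_TERMS] = { 1.0, 0.0, 0.0, 1.0 };
        for (int j = 0; j < NUM_TERMS; j++)
            d[j] = term_weight(tf[j], df[j], num_docs);

        printf("sim(Q, D_i) = %f\n", jaccard(q, d, NUM_TERMS));
        return 0;
    }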
3.1 Semantic Modeling

Semantic modeling was an object of considerable database research in the late 1970's and early 1980's. A brief overview can be found in [2]. Essentially, the semantic modeling approach identified concepts useful in talking informally about the real world. These concepts included the two notions of entities (objects in the real world) and relationships among entities (actions in the real world). Both entities and relationships have properties. The properties of entities are often called attributes. There are basic or surface level attributes for entities in the real world. Examples of surface level entity attributes are General Dimensions, Color, and Position. These properties are prevalent in natural language. For example, consider the phrase "large, black book on the table" which indicates the General Dimensions, Color, and Position of the book.

In linguistic research, the basic properties of relationships are discussed and called thematic roles. Thematic roles are also referred to in the literature as participant roles, semantic roles, and case roles. Examples of thematic roles are Beneficiary and Time. Thematic roles are prevalent in natural language; they reveal how sentence phrases and clauses are semantically related to the verbs in a sentence. For example, consider the phrase "purchase for Mary on Wednesday" which indicates who benefited from a purchase (Beneficiary) and when a purchase occurred (Time).

The goal of our approach is to detect thematic information along with attribute information contained in natural language queries and documents. When the information is present, our system uses it to help find the most relevant document. In order to use this additional information, the basic underlying concept of text relevance as presented earlier needs to be modified. The major modifications include the addition of a lexicon with thematic and attribute information, and a modified computation of the similarity coefficient. We now discuss our semantic lexicon.

3.2 The Semantic Lexicon

Our system uses a thesaurus as a source of semantic categories (thematic and attribute information). For example, Roget's Thesaurus contains a hierarchy of word classes to relate word senses [6]. For our research, we have selected several classes from this hierarchy to be used for semantic categories. We have defined thirty-six semantic categories as shown in Figure 1.

    Thematic Role Categories: Accompaniment, Amount, Beneficiary, Cause, Comparison, Condition, Conveyance, Degree, Destination, Duration, Goal, Instrument, Location/Space, Manner, Means, Purpose, Range, Result, Source, Time

    Attribute Categories: Color, External and Internal Dimensions, Form, Gender, General Dimensions, Linear Dimensions, Motion Conjoined with Force, Motion with Reference to Direction, Order, Physical Properties, Position, State, Temperature, Use, Variation

    Figure 1. Thirty-Six Semantic Categories.

Prior to TREC, there were 3,000 entries in the lexicon established by manual examination of roughly 6,000 of the most frequent words occurring in NASA KSC text. For TREC, we made 1,000 new entries by examination of 1,700 frequent words occurring in the training text and the 52 training topics. Since the 1911 edition of Roget's Thesaurus has become public domain, we also created software which automatically extracted approximately 20,000 lexicon entries. However, we did not have enough time to explore the use of these entries.
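To make the shape of such a lexicon concrete, the following minimal C sketch (illustrative only, not the QA System's actual data structures) represents one entry mapping a word to the semantic categories it triggers. The two-sense reading of "during", giving each category probability 1/2, is an assumption made only for this example; the categories themselves match those listed for "during" in Figure 2.

    /*
     * Illustrative sketch of a semantic lexicon entry: a word, the semantic
     * categories it triggers, and a probability per category derived from
     * the word's thesaurus senses.
     */
    #include <stdio.h>

    #define MAX_CATEGORIES_PER_WORD 8

    struct category_prob {
        const char *category;   /* one of the 36 semantic categories            */
        double      prob;       /* (senses triggering category) / (total senses) */
    };

    struct lexicon_entry {
        const char          *word;
        int                  num_categories;
        struct category_prob cats[MAX_CATEGORIES_PER_WORD];
    };

    /* Example entry: "during" triggers Duration and Time; assuming two
     * thesaurus senses for illustration, each receives probability 1/2. */
    static const struct lexicon_entry during_entry = {
        "during", 2,
        { { "Duration", 0.5 }, { "Time", 0.5 } }
    };

    int main(void)
    {
        for (int i = 0; i < during_entry.num_categories; i++)
            printf("%s -> %s (p = %.2f)\n", during_entry.word,
                   during_entry.cats[i].category, during_entry.cats[i].prob);
        return 0;
    }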
In order to explain the assignment of semantic categories to a given term using Roget's Thesaurus, consider the brief index quotation for the term "vapor":

    vapor  n.  fog 404.2
               fume 401
               illusion 519.1
               ... 4.3
               steam 328.10
               thing imagined 535
           v.  be bombastic 601.6
               bluster 911.3
               boast 910.6
               exhale 310.23
               talk nonsense 547.5

The eleven different meanings of the term "vapor" are given in terms of a numerical category. We have developed a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in Figure 1. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute Temperature; and "exhale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes. Since there are eleven meanings associated with "vapor," we indicate in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. This technique of calculating probabilities is being used as a simple alternative to a corpus analysis. It should be pointed out that we are still experimenting with other ways of calculating probabilities.

Figure 2 shows lexicon entries for prepositions as a sample of the lexicon used in our experiments. These entries are somewhat misleading. Most prepositions trigger too many semantic categories to be of real use. The prepositions "during" and "until" are examples of useful prepositions.

3.3 Extended Computation of the Similarity Measure

The probabilistic details of a semantic lexicon and the computation of semantic weights can be found in [9]. A detailed explanation of the manner in which we combine semantic weights and keyword weights can be found in [8]. Essentially, we treat semantic categories like indexing terms, and the probabilities introduced by a semantic lexicon mean that the frequency of a category in a document becomes an expected frequency and the presence of a category in a document becomes a probability for the category being present. This means that the document frequency for a category becomes an expected document frequency, and this enables an inverse document frequency to be calculated for a category.
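As a concrete illustration of these expected-category statistics, the following C sketch (ours, not code from [8] or [9]) computes the expected frequency of one category over a document's word occurrences and an estimate of the probability that the category is present. Treating the word occurrences as independent for the probability-present estimate, and the two extra probabilities beyond the 2/11 for "vapor", are assumptions made only for the example.

    /*
     * Illustrative sketch: given the lexicon probability of one semantic
     * category for each word occurrence in a document, compute the
     * category's expected frequency and an estimate of the probability
     * that the category is present (independence assumed).
     */
    #include <stdio.h>

    /* Expected frequency of a category: sum of per-occurrence probabilities. */
    double expected_frequency(const double p[], int n)
    {
        double ef = 0.0;
        for (int i = 0; i < n; i++)
            ef += p[i];
        return ef;
    }

    /* Probability the category occurs at least once in the document. */
    double probability_present(const double p[], int n)
    {
        double absent = 1.0;
        for (int i = 0; i < n; i++)
            absent *= (1.0 - p[i]);
        return 1.0 - absent;
    }

    int main(void)
    {
        /* Toy document: three words that trigger the attribute State;
         * "vapor" contributes 2/11 as computed in the text above, and the
         * other two probabilities are invented for illustration. */
        double p_state[] = { 2.0 / 11.0, 0.25, 0.1 };
        int n = 3;

        printf("expected frequency  = %f\n", expected_frequency(p_state, n));
        printf("probability present = %f\n", probability_present(p_state, n));
        return 0;
    }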
    after: Time(3), None(3), External and Internal Dimensions(2), Motion with Respect to Direction, Order, Location/Space
    above: Amount(2), Location/Space, None(2), Linear Dimensions(2), Order, External and Internal Dimensions, Position
    as: Condition, Comparison, Manner, None(2), Amount
    at: Condition, Location/Space, Manner, Time, Position, External and Internal Dimensions, Duration
    atop: Location/Space, Position
    before: Location/Space, Time(4), External and Internal Dimensions, Motion with Respect to Direction, Order, None(2), Position
    below: Motion with Respect to Direction, None(2), Amount, State, Linear Dimensions, Location/Space, Position
    between: Comparison, Duration, External and Internal Dimensions, Location/Space, Position, Amount
    by: Amount, Conveyance, Location/Space, Time, Position, External and Internal Dimensions, Means, Order, Motion with Respect to Direction, Duration, None
    during: Duration, Time
    except: Condition, None(2), Order, Amount(2)
    for: Duration, Goal, Purpose, Variation, Destination, Beneficiary, Amount, Range
    from: Cause, Source, Time, Motion with Respect to Direction(2), Amount, Location/Space, Position, Instrument
    in: Instrument, Location/Space, Purpose, Time, Motion with Respect to Direction(5), External and Internal Dimensions(2), Position, Condition, Goal, Means
    into: Condition, Location/Space, Time, Motion with Respect to Direction, External and Internal Dimensions, Position, None
    like: Comparison(3), Amount(2), Condition, Manner, None(7)
    of: Cause, Location/Space, Source, None, Time, Duration, Position
    on: Condition, Conveyance, Location/Space, Source, Time, Linear Dimensions(2), Motion with Respect to Direction, General Dimensions, Position, Means, External and Internal Dimensions, Purpose
    over: Duration, Location/Space, Time(2), Order, Linear Dimensions(3), Amount(4), General Dimensions, Motion with Respect to Direction, External and Internal Dimensions, Position, Degree, Condition
    per: Means, Order(2)
    through: None, Order, Goal, Linear Dimensions, Means, Time, Location/Space, Position, Motion with Respect to Direction
    to: Accompaniment, Beneficiary, Condition, Degree, Location/Space, Purpose, Result, Time, General Dimensions, Position, Motion with Respect to Direction(2), Comparison
    under: Condition, Location/Space, Position, Order, Degree, None, Amount, Linear Dimensions(3)
    until: Condition, Time
    upon: Condition, Conveyance, Source, Time, None, General Dimensions, Linear Dimensions, Means, External and Internal Dimensions, Motion with Respect to Direction, Purpose
    with: Accompaniment(2), Comparison, Manner, Result, Amount(3), Means, None, Order, Location/Space, Position, Time
    within: Range, External and Internal Dimensions(2), Time, Degree, Position, Location/Space
    without: External and Internal Dimensions, Order, Amount, None(2)

    Figure 2. Prepositions and Their Semantic Categories (Reprinted by Permission of [3]).

For TREC semantic experiments performed after the September deadline (refer to Section 5.2), we treated semantic categories like keywords and used the following normalized similarity coefficient:

    sim(Q, D_i) = ( sum_{j=1..t+s} w_qj d_ij ) / (t + s)

where s = 36 is the number of semantic categories. It should be pointed out that we are still performing experiments to determine a proper blend of keyword and semantic information.

4. System Details

The following is an overview of the system we used to perform the TREC experiments. As pointed out earlier, many hours of programming and debugging effort were used to create a system for the TREC experiments.
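A minimal C sketch of this combined scoring follows (it reflects our reading of the reconstructed coefficient above, not the authors' implementation, and all weights are toy values): the 36 semantic category weights are appended to the keyword weights as extra vector components, and the inner product is normalized by t + s.

    /*
     * Illustrative sketch of the normalized keyword-plus-semantic
     * coefficient: semantic categories are treated as s = 36 extra
     * components of the vectors, and the inner product is divided by t + s.
     */
    #include <stdio.h>

    #define NUM_KEYWORDS   5    /* t: toy number of keyword terms   */
    #define NUM_CATEGORIES 36   /* s: number of semantic categories */
    #define VEC_LEN (NUM_KEYWORDS + NUM_CATEGORIES)

    double combined_similarity(const double q[VEC_LEN], const double d[VEC_LEN])
    {
        double dot = 0.0;
        for (int j = 0; j < VEC_LEN; j++)
            dot += q[j] * d[j];
        return dot / (double)VEC_LEN;   /* divide by t + s */
    }

    int main(void)
    {
        double q[VEC_LEN] = { 0 }, d[VEC_LEN] = { 0 };

        /* Keyword components: tf*idf style weights (toy values). */
        q[0] = 1.2; d[0] = 0.9;
        q[3] = 0.7; d[3] = 1.1;

        /* Semantic components: weights built from expected category
         * frequencies and category inverse document frequencies (toy values). */
        q[NUM_KEYWORDS + 5]  = 0.3; d[NUM_KEYWORDS + 5]  = 0.4;
        q[NUM_KEYWORDS + 20] = 0.2; d[NUM_KEYWORDS + 20] = 0.1;

        printf("sim(Q, D_i) = %f\n", combined_similarity(q, d));
        return 0;
    }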
4.1 The Scanning Process

Under the QA scanning procedure, scanning a document for indexing terms is a three-step process:

A. A token is scanned.

B. The token is analyzed. It is compared to a list of 166 stopwords, and these words (along with any numbers) are later discarded. Dates are transformed into a generic format to allow for the matching of dates in a variety of different formats. Non-valid hyphenated words are separated into multiple tokens. Words that can be abbreviated (as determined by a list of 122 abbreviations) are replaced by their abbreviated form. Acronyms of multiple words are detected and replaced by their respective acronyms.

C. The remainder of the words are stemmed according to a modified version of the J. B. Lovins stemming algorithm [5]. In some cases, prefixes are also removed.

The scanning procedure for the TREC experiments is a modification of the QA scanning procedure, in which only the words found in the text fields of the documents are tokenized. In order to speed up text processing, the amount of analyzing in step B is reduced. Dates are no longer transformed, and abbreviations and acronyms are not created. The stemming algorithm and removal of prefixes in step C remain unchanged. For the TREC scanning procedure, an additional step is added. During this extra step, a hashing algorithm is used to assign each indexing term a unique 32-bit integer value, which we call a stem code.

4.2 Data Files

After a document is processed by the TREC version of the QA System, four primary data files are created. These are the document weight file, the inverted index file, the inverted data file, and the document name file.

The document weight file is a binary file containing a list of floating point numbers. These numbers are ordered sequentially by document number, and represent the summation of the squared weights of a particular document's keywords. The inverted index file is ordered by stem code. Each code has two file pointers, one pointing to the first block of data in the data file, and the second pointing to the last block of data. The inverted data file then consists of blocks of data, containing pairs of document numbers that the code is found in, and the code's frequency within that document. The blocks are linked together to form a list. The document name file consists of a list of pointers into the original text file. For Vol. 1, both the document weight file and the inverted index file were two megabytes. The inverted data file was approximately 385 megabytes, and the document name file was two megabytes.

4.3 Basic Procedure

For the TREC experiments, we did the following:

Step 1: This step involves matching every legitimate stem in the document collection with a unique integer value. This is done with a linear hashing function. A table containing this mapping, along with the number of documents each code is found in, is temporarily saved for use in Step 2.

Step 2: This step creates the four data files described above. The entire database is scanned, with the four files being created on the fly. Once this is accomplished, the table from Step 1 is no longer necessary, and is discarded.

Step 3: Relevant documents for each query are selected using the Jaccard similarity coefficient. The top 200 documents for each query are then determined.

The above three steps were followed to create the results for the September deadline. In the next section we present our "official" experiments and results, and some "unofficial" experiments and results.
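To make the stem-code and inverted-file layout concrete, here is an illustrative C sketch. The paper does not name the hash function used, so a 32-bit FNV-1a hash appears below purely as a stand-in, and the struct names are ours, not the system's.

    /*
     * Illustrative sketch: assigning a stem a 32-bit stem code, and the
     * linked-block layout described for the inverted data file, where each
     * block holds (document number, frequency) pairs.
     */
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in 32-bit string hash producing a stem code (FNV-1a). */
    uint32_t stem_code(const char *stem)
    {
        uint32_t h = 2166136261u;            /* FNV offset basis */
        for (const unsigned char *p = (const unsigned char *)stem; *p; p++) {
            h ^= *p;
            h *= 16777619u;                  /* FNV prime */
        }
        return h;
    }

    /* One posting: a document number and the stem's frequency in it. */
    struct posting {
        uint32_t doc_number;
        uint32_t frequency;
    };

    #define POSTINGS_PER_BLOCK 64

    /* Inverted data file block: postings plus a link to the next block. */
    struct inverted_block {
        struct posting         postings[POSTINGS_PER_BLOCK];
        int                    used;
        struct inverted_block *next;   /* on disk this would be a file offset */
    };

    /* Inverted index entry: a stem code with pointers to the first and
     * last blocks of its posting list. */
    struct index_entry {
        uint32_t               code;
        struct inverted_block *first;
        struct inverted_block *last;
    };

    int main(void)
    {
        printf("stem code for \"launch\" = %u\n", stem_code("launch"));
        return 0;
    }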
5. Experiments and Results

Our experiments were intended to be Category A experiments with two results submitted for each ranking task. One ranking result would be for just keywords; the other ranking result would be for keywords combined with semantics. All query construction was automatic, and the treatment of ad-hoc and routing queries was identical. We also performed experiments concerning the expected behavior of string hashing functions and the use of part-of-speech tagging to improve retrieval performance. These experiments are not reported here.

As an example of our automatically built ad-hoc and routing queries, consider Topic 004 reproduced in Figure 3. Figure 4 indicates the keyword and semantic information generated by the QA System for this topic. The first part of Figure 4 indicates the stems along with their frequencies found in the query. The second part of Figure 4 indicates the semantic categories also found in the query along with their expected frequencies and probability present.

It is important to note that the topic represented in Figure 3 and Figure 4 has generated many semantic categories and the probability present for most of them is close to, or at, 100%. This is mainly due to the length of the text involved. We discovered that, for the TREC document collection, each document generated many semantic categories with high probability present. Because we treated semantic categories like keywords, this caused semantic weights to be essentially useless.

For the TREC September deadline, we were only able to submit a routing experiment using a keywording approach. The results of this experiment, computed with the aid of Chris Buckley's SMART evaluation program [1], are shown in Figure 5. The results are not good. In Section 6, we discuss what impeded our experiments.

Further "unofficial" experiments were designed to test the use of semantics. The main goal of our experiments was to demonstrate that our original routing results could be improved through the use of semantic analysis. In order to do this, we made two modifications to our approach. The first change involved dividing the original TREC documents into paragraphs. The second change involved a semantic analysis when calculating the list of relevant documents.

Our experiments involved the use of only six routing queries (for topics 001, 002, 003, 007, 017, and 022). These topics were selected because our original results for them were poor. Through the use of semantic analysis, we hoped to significantly improve our results. Figure 6 shows the precision-recall statistics for the six "poor" queries using the retrieval results which created the statistics in Figure 5.

When analyzing our results, we computed all precision-recall tables through the use of Chris Buckley's SMART evaluation program [1]. The relevancy lists used were those produced before November 1 (the original qrels for the routing queries). We did not use the modified results that were distributed later for Query 017. This should not affect our results, though, because our experiments were aimed only at improving our precision-recall averages, and the relevancy results used were consistent from one experiment to the other.

5.1 Re-ranking of Documents

In an effort to demonstrate that semantics could affect retrieval in the TREC environment, we used the original QA System with a semantic lexicon containing TREC words as described in Section 3.2. We created a separate database for each query we considered, for a total of six databases.
Each database contained only the documents that we originally judged in the top 250 for each query. Because of this, when we computed new relevancy lists we were simply rearranging the order of the same 250 documents, not bringing in new documents.

Figure 7 reveals the precision-recall statistics when originally retrieved documents for a query are used as a document collection and re-ranked by imposing the query again. There is a 25.8% increase when comparing the 11-pt average here to the 11-pt average of the originally retrieved text (Figure 6).

To determine the ranking of a particular document with paragraph divisions, we defined the similarity coefficient of a document to be equal to the highest coefficient associated with one of its corresponding paragraphs. The paragraph divisions were automatically constructed from the original text. The precision-recall statistics for paragraphs being used as documents are shown in Figure 8. There is an 18.8% increase when comparing the 11-pt average here to the 11-pt average of the originally retrieved text (Figure 6).

    Tipster Topic Description

    Number: 004
    Domain: International Finance
    Topic: Debt Rescheduling

    Description: Document will discuss a current debt rescheduling agreement between a developing country and one or more of its creditor(s).

    Narrative: A relevant document will discuss a current debt rescheduling agreement reached, proposed, or being negotiated between a debtor developing country and one or more of its creditors, commercial and/or official. It will identify the debtor country and the creditor(s), the repayment time period requested or granted, the monetary amount requested or covered by the accord, and the interest rate, proposed or set.

    Concept(s):
    1. rescheduling agreement, accord, settlement, pact
    2. bank debt, commercial debt, foreign debt, trade debt, medium-term debt, long-term debt
    3. negotiations, debt talks
    4. creditor banks, creditor countries/governments, Paris Club
    5. debtor countries, developing countries
    6. debt package
    7. debt repayments
    8. restructuring, rescheduling existing loans
    9. lower interest-rate margin, easier terms, more lenient terms

    Factor(s): Nationality: Developing country

    Definition(s): Debt Rescheduling - agreement between creditors and debtor to provide debt relief by altering the original payment terms of an existing debt. This is most often accomplished by lengthening the original schedule for principal and interest payments, and deferring interest payments. Done most publicly by developing countries and their bankers, but often less publicly by other willing creditors and debtors, e.g., governments, banks and companies. Much in vogue in the early 1980s, the road to rescheduling for countries in crisis runs as follows: when a country borrows so much that its lenders grow nervous, the banks start lending for shorter and shorter maturities. Eventually the country, though still paying interest on its debt, is unable to make payments on the principal. The country is then forced to request a rescheduling, which means that it is able to escape its immediate repayment commitments by converting short-term loans into longer-term ones. A country wishing to reschedule its official debt tal

    Figure 3. Topic 004.

    Stems:
    tal 1 wish 1 on 1 longer 1 short 1 convers 1 commitm 1 immedi 1 escap 1 abl 1 mean 1 forc 1 un 1 pay 1 though 1 eventu 1 matur 1 shorte 2 lend 1 start 1 nerv 1 row 1 lender 1 orrow 1 follow 1 run 1 cris 1 road 1 ear 1 vogu 1 compan 1 e.g.
    1 wil 1 less 1 banker 1 public 2 defer 1 princip 2 length 1 accompl 1 orisin 2 alter 1 relief 1 prov 1 leni 1 easxxx 1 margin 1 lower 1 loan 2 exist 2 structur 1 pack 1 club 1 par 1 2 talk 1 ~ernm 1 term 6 m~4erm 1 trad 1 foreign 1 bank 4 pact 1 settl 1 rat 2 interest 5 accord 2 cover 1 amount 1 monet 1 grant 1 request 3 period 1 paf~m 7 identif 1 2 commerc 2 negoti 2 propos 2 reach 1 relev 1 credit 7 countr 13 develop 5 4 cur 3 discus 2 docum 2 schedl 10 debt 22 fin 1 intern 1

    Semantic Category                      Expected Frequency    Probability Present
    Linear Dimensions                      0.300000              0.300000
    Motion with Reference to Direction     2.000000              1.000000
    Accompaniment                          0.333333              0.333333
    Condition                              2.000000              0.937500
    External and Internal Dimensions       0.333333              0.333333
    Time                                   5.744589              1.000000
    Duration                               1.291667              0.819444
    Purpose                                2.000000              1.000000
    Color                                  2.333333              0.941472
    Position                               6.071428              0.999654
    Location/Space                         6.071428              0.999654
    Variation                              5.933333              1.0000
    Amount                                 53.412807             1.0000
    Order                                  6.833333              1.000000

    Figure 4. Automatically Generated Query for Topic 004.

Top ranked evaluation (Run number: UCFQA1, category B)
Num_queries: 19
Total number of documents over all queries: Retrieved 3800, Relevant 2056, Rel_ret 517
Recall - Precision Averages: at 0.00 0.5146, at 0.10 0.2645, at 0.20 0.1735, at 0.30 0.1006, at 0.40 0.0693, at 0.50 0.0107, at 0.60 0.0069, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.1037
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0614
Recall: at 5 docs 0.0162, at 15 docs 0.0362, at 30 docs 0.0692, at 100 docs 0.2156, at 200 docs 0.2899
Precision: at 5 docs 0.2737, at 15 docs 0.2211, at 30 docs 0.2088, at 100 docs 0.1932, at 200 docs 0.1361

Figure 5. Results for Experiment Completed by September Deadline.

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 147
Recall - Precision Averages: at 0.00 0.3486, at 0.10 0.1676, at 0.20 0.0749, at 0.30 0.0279, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0563
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0250
Recall: Exact 0.1997; at 5 docs 0.0033, at 10 docs 0.0062, at 30 docs 0.0382, at 100 docs 0.1212, at 200 docs 0.1997
Precision: Exact 0.1225; at 5 docs 0.1000, at 10 docs 0.0833, at 30 docs 0.1556, at 100 docs 0.1550, at 200 docs 0.1225

Figure 6. Results for Six Poorly Performing Queries using Experiment Completed by September Deadline.

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 152
Recall - Precision Averages: at 0.00 0.4585, at 0.10 0.2113, at 0.20 0.0775, at 0.30 0.0315, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0708
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0258
Recall: Exact 0.2109; at 5 docs 0.0072, at 10 docs 0.0128, at 30 docs 0.0469, at 100 docs 0.1374, at 200 docs 0.2109
Precision: Exact 0.1267; at 5 docs 0.2000, at 10 docs 0.1833, at 30 docs 0.1944, at 100 docs 0.1683, at 200 docs 0.1267

Figure 7. Results of Keywording with Document Divisions.
Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 157
Recall - Precision Averages: at 0.00 0.4096, at 0.10 0.2089, at 0.20 0.0832, at 0.30 0.0340, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0669
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0277
Recall: Exact 0.2167; at 5 docs 0.0121, at 10 docs 0.0167, at 30 docs 0.0462, at 100 docs 0.1392, at 200 docs 0.2167
Precision: Exact 0.1308; at 5 docs 0.2667, at 10 docs 0.1833, at 30 docs 0.1833, at 100 docs 0.1650, at 200 docs 0.1308

Figure 8. Results of Keywording with Paragraph Divisions.

The re-ranking of documents using keywords was done in order to establish a baseline for the semantic experiments reported in the next subsection. The increase in retrieval performance because of the re-ranking was a surprise. According to Sparck Jones' criteria, the increases of 25.8% and 18.8% would each be categorized as "significant" (greater than 10.0%) [4].

5.2 Semantic Experiments

Essentially, what we are trying to show in this section is that semantics can be useful if documents do not get larger than a paragraph. Figure 9 displays results when semantic similarity is combined with keyword similarity for the case that originally retrieved TREC documents are used. These results can be compared to those in Figure 7 (the same documents with a strictly keywording approach). Comparing the 11-pt average for the two methods shows a 0.7% decrease when going from the strictly keywording results to the combined semantic with keywording results. According to Sparck Jones' criteria [4], this is not a noticeable decrease. But, certainly, semantics did not help.

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1200, Relevant 863, Rel_ret 155
Recall - Precision Averages: at 0.00 0.4370, at 0.10 0.2250, at 0.20 0.0789, at 0.30 0.0329, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0703
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0263
Recall: Exact 0.2133; at 5 docs 0.0072, at 10 docs 0.0128, at 30 docs 0.0508, at 100 docs 0.1378, at 200 docs 0.2133
Precision: Exact 0.1292; at 5 docs 0.2000, at 10 docs 0.1833, at 30 docs 0.2000, at 100 docs 0.1700, at 200 docs 0.1292

Figure 9. Results of Semantics with Document Divisions.

The explanation for this is that when the size of a document is large, a greater number of semantic categories are triggered in the document. Also, the probability present for each category in a document is often very close to 100%. Consequently, almost every semantic category becomes present in every document, causing the semantic category weights to become very low and useless.

To remedy this problem, we used the database that was constructed with paragraph divisions, and computed relevancy lists using the combined semantic and keywording approach explained in Section 3.3. The results are shown in Figure 10. The statistics there can be compared to those in Figure 8 (the same documents with a strictly keywording approach). When going from a keywording approach to a combined semantic and keywording approach, the 11-pt average increased by 7.9%. According to Sparck Jones' criteria, this change would be classified as "noticeable" (greater than 5.0%) [4].

Top ranked evaluation (Run number: 1)
Num_queries: 6
Total number of documents over all queries: Retrieved 1176, Relevant 863, Rel_ret 158
Recall - Precision Averages: at 0.00 0.4556, at 0.10 0.2231, at 0.20 0.0845, at 0.30 0.0294, at 0.40 0.0000, at 0.50 0.0000, at 0.60 0.0000, at 0.70 0.0000, at 0.80 0.0000, at 0.90 0.0000, at 1.00 0.0000
Average precision for all points, 11-pt Avg: 0.0720
Average precision for 3 intermediate points (0.20, 0.50, 0.80), 3-pt Avg: 0.0282
Recall: Exact 0.2159; at 5 docs 0.0096, at 10 docs 0.0171, at 30 docs 0.0513, at 100 docs 0.1456, at 200 docs 0.2159
Precision: Exact 0.1332; at 5 docs 0.2333, at 10 docs 0.2000, at 30 docs 0.2167, at 100 docs 0.1767, at 200 docs 0.1317

Figure 10. Results of Semantics with Paragraph Divisions.

The two semantic experiments reported here demonstrate the main thing that we have learned. Our semantic approach to text retrieval is only useful when documents are no larger than a paragraph.
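The saturation effect described above can be illustrated with a small numerical sketch in C (the per-word trigger probability and the independence assumption are ours, chosen only for illustration, and are not taken from the paper): as a document gets longer, the probability that a given semantic category is present approaches 100%, so the category stops discriminating between documents.

    /*
     * Small numerical sketch of category saturation: probability that a
     * semantic category is present in a document of n words, assuming each
     * word independently triggers the category with probability p.
     */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p = 0.05;   /* assumed chance that one word triggers the category */
        int lengths[] = { 25, 100, 400, 1600 };   /* words per "document" */

        for (int i = 0; i < 4; i++) {
            int n = lengths[i];
            double present = 1.0 - pow(1.0 - p, (double)n);
            printf("%5d words: probability present = %.4f\n", n, present);
        }
        return 0;
    }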
It should be noted that the semantic experiments reported here were crude. Our lexicon did not have enough TREC words, and we used a blend of keyword and semantic weights that was not the best. So we expect that better semantic results for paragraphs can be achieved (refer to Section 6).

6. Failure Analysis

In general, our participation in the TREC experiments was impeded by the following:

PC DOS Platform. This platform has a serious memory addressing restriction which results in memory page swapping, and this seriously affects the speed of processing, especially during creation of inverted files and index structures. We can solve this problem by moving to an OS/2 or UNIX platform.

Extra Semantic Processing Time. Our semantic probabilistic and statistical calculations more than double the processing time for indexing and statistical ranking of retrieved documents. Again, we can solve this problem by moving to an OS/2 or UNIX platform.

Time to Build Semantic Lexicon. We were only able to incorporate 1,000 frequently occurring words in the training text within our semantic lexicon. We did not have enough time to process the test text for the ad-hoc queries. This problem can be solved by having archival data distributed earlier. We suspect that by having more TREC words in our semantic lexicon, better results could have been achieved in Section 5.2 when paragraphs are used as the basis for retrieval.

Unknown Blend for Semantic and Keyword Weights. There are three main aspects to our blend of semantic and keyword weights within the vector processing model:

(i) The Proper Probabilities to Use for the Semantics Triggered by a Word. For example, we let the word "vapor" trigger State with 18% probability, Temperature with 9% probability, and Motion with Reference to Direction with 9% probability. We have several techniques for determining probabilities such as these.

(ii) The Scaling of Keyword Weights and Semantic Weights. For example, in a Question/Answer environment where queries are the length of a sentence and documents are either a sentence or at most a paragraph, we have been successful by forcing semantic similarity to be approximately 1/3 of keyword similarity when the two are combined in processing small document collections (less than 1000 documents). There was no scaling for the experiments reported in Section 5.2; we suspect better results could have been achieved.

(iii) Independent Semantic Weights and Keyword Weights.
A valid criticism of our research has been that the semantic contribution from a word in a document should be kept independent of the word's own similarity contribution if the word is a keyword in common with the query.

The overall problem of proper blend can be solved by spending more time using TREC test documents, test topics, and good test relevance judgments to run many retrieval experiments to establish the correct blend.

Number of Semantic Categories. Another way to solve the problem of long documents causing semantic weights to be of little value is to have more semantic categories. A large number of "semantic" categories could be obtained (for example) by using all the categories and/or subcategories found in Roget's Thesaurus, instead of the 36 semantic categories we use. This would be a deviation from database semantic modeling, but it probably should be examined.

Block-Split Tree Structured Files. The QA System used B+ tree structures for implementing inverted files, and this actually slowed the system in our DOS environment. The QA System also had severe storage overhead due to storage of character strings in the B+ trees. We have solved these two problems by implementing a separate system using a hashing function to establish codes for strings.

TREC Document Length. Semantic experiments like those reported in Section 5.2 have shown that documents larger than a paragraph cause our semantic approach to be of little value. This problem can be corrected by considering paragraphs as a basis for document retrieval.

Finally, we spent too much time on work that was never incorporated in our experiments. We originally designed an efficient method of inverting data files, but it could not be used for routing queries. Also, trying to do semantic part-of-speech tagging experiments using SQL/DS slowed us down.

References

[1] C. Buckley, SMART Evaluation Program (for TREC), Cornell SMART Group, Cornell University.

[2] C. Date, An Introduction to Database Systems, Vol. 1, Addison-Wesley, 1990.

[3] Hello Software, P.O. Box 494, Goldenrod, FL 32733.

[4] K. Sparck Jones and R. Bates, "Research on Automatic Indexing 1974-1976," Technical Report, Computer Laboratory, University of Cambridge, 1977.

[5] J. B. Lovins, "Development of a Stemming Algorithm," Mechanical Translation and Computational Linguistics, Vol. 11, No. 1-2, pp. 11-31, March and June 1968.

[6] Roget's International Thesaurus, Fourth Edition, Harper & Row, New York, 1977.

[7] G. Salton, Automatic Text Processing, Addison-Wesley, 1989.

[8] D. Voss and J. Driscoll, "Text Retrieval Using a Comprehensive Semantic Lexicon," Proceedings of the ISMM First International Conference on Information and Knowledge Management, Baltimore, Maryland, November 1992.

[9] E. Wendlandt and J. Driscoll, "Incorporating a Semantic Analysis into a Document Retrieval Strategy," Proceedings of the Fourteenth Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval, Chicago, Illinois, pp. 270-279, October 1991.