Incorporating Semantics Within a Connectionist Model and a Vector Processing Model

Richard Boyd, James Driscoll, Inien Syu
Department of Computer Science
University of Central Florida
Orlando, Florida 32816
(407)823-2341 FAX: (407)823-5419
e-mail: driscoll@cs.ucf.edu

Abstract

Semantic information obtained from the public domain 1911 version of Roget's Thesaurus is combined with keywords to measure similarity between natural language topics and documents. Two approaches are explored. In one approach, a combination of keyword relevance and semantic relevance is achieved by using the vector processing model for calculating similarity, but extending the use of a keyword weight by using individual weights for each of its meanings. This approach is based on the database concept of semantic modeling and the linguistic concept of thematic roles. It is applicable to both routing and archival retrieval. The second approach is especially suited for routing. It is based on an AI connectionist model. In this approach, a probabilistic inference network is modified using semantic information to achieve a competitive activation mechanism that can be used for calculating similarity.

Keywords: vector processing model, semantic data model, semantic lexicon, inference network, connectionist model.

1. Introduction

The experiments reported here use a relatively efficient method to detect the semantic representation of text. Our original method is based on semantic modeling and is described in [4,17,19]. Semantic modeling was an object of considerable database research in the late 1970's and early 1980's. A brief overview can be found in [3]. Essentially, the semantic modeling approach identified concepts useful in talking informally about the real world. These concepts included the two notions of entities (objects in the real world) and relationships among entities (actions in the real world). Both entities and relationships have properties.
The properties of entities are often called attributes. There are basic or surface level attributes for entities in the real world. Examples of surface level entity attributes are General Dimensions, Color, and Position. These properties are prevalent in natural language. For example, consider the phrase "large, black book on the table" which indicates the General Dimensions, Color, and Position of the book.

In linguistic research, the basic properties of relationships are discussed and called thematic roles. Thematic roles are also referred to in the literature as participant roles, semantic roles and case roles. Examples of thematic roles are Beneficiary and Time. Thematic roles are prevalent in natural language; they reveal how sentence phrases and clauses are semantically related to the verbs in a sentence. For example, consider the phrase "purchase for Mary on Wednesday" which indicates who benefited from a purchase (Beneficiary) and when a purchase occurred (Time).

A main goal of our research has been to detect thematic information along with attribute information contained in natural language queries and documents. In order to use this additional information, the concept of text relevance needs to be modified. In [17,19] the major modifications included the addition of a lexicon with thematic and attribute information, and a modified computation of a vector processing similarity coefficient. That research concerned a Question/Answer environment where queries were the length of a sentence and documents were either a sentence or at most a paragraph. At that time, our lexicon was based on 36 semantic categories, and in that environment, our semantic approach produced a significant improvement in retrieval performance. However, for TREC-1 [4], document and topic length presented a problem and caused our semantic approach based on 36 semantic categories to be of little value.
As reported in [4], though, breaking the TREC documents into paragraphs yielded a significant improvement.

This work has been supported in part by NASA KSC Cooperative Agreement NCC 10~3 Project 2, Florida High Technology and Industry Council Grants 4940-11-28-721 and 4940-11-28-728.

In Section 2, we describe our original semantic lexicon and an extension which uses a larger number of semantic categories. Section 3 presents an application of an AI connectionist model to the task of routing. Section 4 presents an approach different from that reported in TREC-1 [4], using our extended semantic lexicon within the vector processing model. Section 5 summarizes our research effort.

2. The Semantic Lexicon

Our semantic approach uses a thesaurus as a source of semantic categories (thematic and attribute information). For example, Roget's Thesaurus contains a hierarchy of word classes to relate word senses [14]. In TREC-1 [4] and in earlier research [17,19], we selected several classes from this hierarchy to be used for semantic categories. We defined thirty-six semantic categories as shown in Figure 1.

In order to explain the assignment of semantic categories to a given term using Roget's Thesaurus, consider the brief index quotation for the term "vapor":

    vapor
      n.  fog              404.2
          fume             401
          illusion         519.1
          spirit           4.3
          steam            328.10
          thing imagined   535.3
      v.  be bombastic     601.6
          bluster          911.3
          boast            910.6
          exhale           310.23
          talk nonsense    547.5

The eleven different meanings of the term "vapor" are given in terms of a numerical category. We developed a mapping of the numerical categories in Roget's Thesaurus to the thematic role and attribute categories given in Figure 1. In this example, "fog" and "fume" correspond to the attribute State; "steam" maps to the attribute Temperature; and "exhale" is a trigger for the attribute Motion with Reference to Direction. The remaining seven meanings associated with "vapor" do not trigger any thematic roles or attributes.
Since there are eleven meanings associated with "vapor," we indicated in the lexicon a probability of 1/11 each time a category is triggered. Hence, a probability of 2/11 is assigned to State, 1/11 to Temperature, and 1/11 to Motion with Reference to Direction. This technique of calculating probabilities is being used as a simple alternative to a corpus analysis. It should be pointed out that we are still experimenting with other ways of calculating probabilities. For example, as in [8], a probabilistic part-of-speech tagger could be used to further restrict the different meanings of a term, and existing lexical sources could be used to obtain an ordering based on frequency of use for the different meanings of a term.

    Thematic Role Categories      Attribute Categories
    TACM  Accompaniment           ACOL  Color
    TAMT  Amount                  AEID  External and Internal Dimensions
    TBNF  Beneficiary             AFRM  Form
    TCSE  Cause                   AGND  Gender
    TCND  Condition               AGDM  General Dimensions
    TCMP  Comparison              ALDM  Linear Dimensions
    TCNV  Conveyance              AMFR  Motion Conjoined with Force
    TDGR  Degree                  AGMT  Motion in General
    TDST  Destination             AMDR  Motion with Reference to Direction
    TDUR  Duration                AORD  Order
    TGOL  Goal                    APHP  Physical Properties
    TINS  Instrument              APOS  Position
    TSPL  Location/Space          ASTE  State
    TMAN  Manner                  ATMP  Temperature
    TMNS  Means                   AUSE  Use
    TPUR  Purpose                 AVAR  Variation
    TRNG  Range
    TRES  Result
    TSRC  Source
    TTIM  Time

    Figure 1. Thirty-Six Semantic Categories.

As reported in [4], the use of 36 semantic categories caused problems when dealing with TREC documents. When the size of a document is large, a greater number of the 36 semantic categories are triggered in the document. Also, when using the semantic approach described in [19], the probability present for each category in a document is often very close to one. Consequently, almost every one of the 36 semantic categories becomes present in every document. This causes semantic category weights to become very low and useless within that approach.
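The probability assignment for "vapor" and the saturation effect in long documents can be illustrated with a short sketch. The sense-to-category mapping is taken from the example in the text. The rule used here to combine term probabilities into a document-level presence value (a noisy-OR) is an assumption made purely for illustration; it is not the formula of [19], and the names `category_probabilities` and `presence` are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Sense -> category mapping for "vapor", taken from the example in the text.
# None marks a meaning that triggers no thematic role or attribute.
VAPOR_SENSES = {
    "fog": "State",
    "fume": "State",
    "illusion": None,
    "spirit": None,
    "steam": "Temperature",
    "thing imagined": None,
    "be bombastic": None,
    "bluster": None,
    "boast": None,
    "exhale": "Motion with Reference to Direction",
    "talk nonsense": None,
}


def category_probabilities(senses):
    """Assign 1/len(senses) to a category for each meaning that triggers it."""
    total = len(senses)
    counts = Counter(cat for cat in senses.values() if cat is not None)
    return {cat: Fraction(n, total) for cat, n in counts.items()}


def presence(term_probs):
    """ASSUMED noisy-OR: chance that at least one term triggers the category."""
    absent = Fraction(1)
    for p in term_probs:
        absent *= 1 - p
    return 1 - absent


probs = category_probabilities(VAPOR_SENSES)
print(probs["State"])  # 2/11

# Hypothetical documents containing n terms that each trigger State
# with probability 2/11; the presence value climbs toward one as n grows.
for n in (1, 10, 50):
    print(n, float(presence([probs["State"]] * n)))
```

Under this assumed combination rule, the presence value for a category approaches one once a few dozen triggering terms occur, which is consistent with the observation that long documents make nearly every one of the 36 categories present.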
As reported in [4], one way to solve this problem is to break TREC documents into paragraphs. Another way to solve the problem of long documents causing semantic weights to be of little value is to have more semantic categories. A large number of "semantic" categories can be obtained (for example) by using the categories and/or subcategories found in Roget's Thesaurus, instead of the 36 semantic categories we have used. This may be a deviation from database semantic modeling. In any case, it needs to be examined.

Consequently, for the experiments reported here, a semantic lexicon was created based on all the word senses found in the public domain 1911 version of Roget's Thesaurus. To provide an example, consider Topic 052 as shown in Figure 2. Figure 3 indicates the keywords and frequency information within Topic 052, along with the semantic categories obtained from our extended lexicon for those keywords. Note that stemming was not used for the processing of Topic 052; so, some keywords in Topic 052 were not located in our lexicon (e.g. sanctions).

The categories recorded in our extended semantic lexicon use the category numbers found in the 1911 version of Roget's Thesaurus. These numbers are then followed by a part-of-speech code also found in the 1911 version of Roget's Thesaurus. The number after the part-of-speech code represents a sub-category, but this number does not appear in the 1911 version of Roget's Thesaurus. That number was created based on groupings of words within the thesaurus.
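The keyword lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tokenizer, the `TOY_LEXICON` entries, and their category labels are hypothetical placeholders rather than actual entries from the 1911 thesaurus, and no stop-word removal is attempted. It shows how, without stemming, a plural such as "sanctions" misses a lexicon entry stored under the singular form.

```python
import re
from collections import Counter

# HYPOTHETICAL lexicon: keyword -> list of category labels (placeholders only,
# not real category codes from the 1911 Roget's Thesaurus).
TOY_LEXICON = {
    "sanction": ["category-A", "category-B"],
    "boycott": ["category-C"],
    "embargo": ["category-D"],
}


def keyword_frequencies(text):
    """Lowercase the text and count alphabetic tokens, with no stemming."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def lookup(frequencies, lexicon):
    """Split keywords into those found in the lexicon and those missing."""
    found = {w: lexicon[w] for w in frequencies if w in lexicon}
    missing = sorted(w for w in frequencies if w not in lexicon)
    return found, missing


topic = "opposition to sanctions; academic boycott, arms embargo, sanctions"
freqs = keyword_frequencies(topic)
found, missing = lookup(freqs, TOY_LEXICON)

print(freqs["sanctions"])      # 2
print("sanctions" in missing)  # True: only the singular form is in the toy lexicon
print(sorted(found))           # ['boycott', 'embargo']
```

Adding a stemming step before the lookup would map "sanctions" to "sanction" and recover its categories, at the cost of occasionally conflating unrelated word forms.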
<top>
Tipster Topic Description
Number: 052
Domain: International Economics
Topic: South African Sanctions
Description: Document discusses sanctions against South Africa
Narrative: A relevant document will discuss any aspect of South African sanctions, such as: sanctions declared/proposed by a country against the South African government in response to its apartheid policy, or in response to pressure by an individual, organization or another country; international sanctions against Pretoria imposed by the United Nations; the effects of sanctions against S. Africa; opposition to sanctions; or, compliance with sanctions by a company. The document will identify the sanctions instituted or being considered, e.g., corporate disinvestment, trade ban, academic boycott, arms embargo.
Concept(s):
1. sanctions, international sanctions, economic sanctions
2. corporate exodus, corporate disinvestment, stock divestiture, ban on new investment, trade ban, import ban on South African diamonds, U.N. arms embargo, curtailment of defense contracts, cutoff of nonmilitary goods, academic boycott, reduction of cultural ties
3. apartheid, white domination, racism
4. antiapartheid, black majority rule
5. Pretoria
<fac>
Factor(s):
Nationality: South Africa
</fac>