DR-LINK's Linguistic-Conceptual Approach to Document Detection¹

Elizabeth D. Liddy and Sung H. Myaeng
School of Information Studies
Syracuse University
Syracuse, New York 13244-4100
liddy@mailbox.syr.edu; shmyaeng@mailbox.syr.edu

¹ Ken McVearry, Woojin Paik, Ming Li, Edmund Yu, and Chris Khoo contributed to the design, data analysis and implementation of the system for TREC-1.

1. Overview

Our approach to the difficult problem of selecting only those documents which satisfy a user's specified information need is to pay parallel attention to two very important aspects of the task. Firstly, there are many documents which have no likely possibility of being relevant to either a standing query or a query newly put to the system. These documents should be filtered from further consideration at an early stage in the system's processing if the system's later processing is computationally expensive, and if their presence introduces unnecessary ambiguity while their removal produces more accurate results. This focusing process should continue at subsequent stages, using additional linguistic features of the query and documents in order to further refine the flow of documents. Secondly, there is a continuum of levels of linguistic-conceptual processing which can produce enrichments of the original text in order to explicitly represent documents at more conceptual levels for more accurate matching to queries.

Our approach also recognizes that, as reflected by the topic statements, the retrieval task in TREC requires capabilities beyond what has been required in the past of traditional IR systems. Topic statements describe not only 'aboutness' but also more detailed information such as relationships among entities, characteristics of participants in an event, and temporality. We believe that richer representations of documents and topic statements are essential to meet the extended retrieval requirements of such complex information needs and to reduce the ambiguities resulting from keyword-based retrieval. To produce this enriched representation, the system uses lexical, syntactic, semantic, and discourse-level linguistic processing techniques to distill from documents and topic statements all the rich layers of knowledge incorporated in their deceptively simple textual surface, and it produces a final document representation which has been shaped by all these levels of linguistic processing.

To achieve the goals stated above, we have developed a system whose architecture is modular in design, with five separate processing modules which continuously refine the flow of documents both in terms of pure numbers and in terms of continual semantic enrichments (see Figure 1).

[Fig. 1: DR-LINK system architecture. The topic statement and millions of incoming documents flow through the Subject Field Coder, Text Structurer, Relation-Concept Detector, Conceptual Graph Generator, and Conceptual Graph Matcher, producing ranked documents.]

Briefly previewed, the processing of the five modules is as follows: The Subject Field Coder uses semantic word knowledge to produce a summary-level topical vector representation of a document's contents that is matched to a vector representation of a topic statement in order to select for further processing only those documents which have real potential of being relevant. This subset of documents is then passed to the Text Structurer,
which sub-divides a text into its discourse-level segments in order to focus later matching on the appropriate discourse component in response to particular types of information need. All of the structured texts, with the appropriate components highlighted, are passed to the Relation-Concept Detector, whose purpose is to raise the level at which we do matching from a keyword or key-phrase level to a more conceptual level, first by expanding terms in the topic statement to all terms which have been shown to be 'substitutable' for them, and then by extracting semantic relations between concepts from both documents and topic statements. This component produces concept-relation-concept triples which are passed to the Conceptual Graph Generator, which converts these triples into the CG formalism (Sowa, 1984). The resultant CGs are passed to the Conceptual Graph Matcher, which measures the degree to which a particular topic statement CG and candidate document CGs share a common structure, and ranks the documents accordingly.

The five modules in DR-LINK have well-specified interfaces, making it possible for some of the modules to be re-combined, when appropriate, in a different order for a more advantageous flow of processing. For example, the Subject Field Coder can produce vectors for any size unit of text (e.g. a sentence, a paragraph, a discourse-level text-type component, or the full document). Therefore, the Subject Field Coder can create an SFC representation for the full document before the text has been decomposed into its constituent discourse-level components by the Text Structurer, or the Subject Field Coder can be run on the document after the Text Structurer has recognized the discourse-level components of a text, and can therefore produce separate vectors for the differing types of information (e.g. current event, past event, opinion, potential future event) contained in the various discourse-level components (e.g. Main Event, History, Evaluation, Expectation) within a newspaper text. This permits the SFC vector of a particular topic statement to be matched to the representation of just that component whose content is most likely to be appropriate.

Another vital aspect of our approach, evidenced in the various semantic enrichments (e.g. Subject Field Codes, discourse components, concept-relation-concept triples, Conceptual Graphs) added to the basic text, is the real attention paid to representation at a deeper than surface level. That is, DR-LINK deals with lexical entities using more conceptually-based syntactic groupings. For example, complex nominals are processed as meaningful multi-word constituents because the combination of individual terms in a complex nominal conveys quite a different meaning than if the individual constituents were interpreted separately. In addition, verbs are represented in case frames, so that the other lexical entities in the sentence which perform particular semantic roles in conjunction with the verb are represented according to these semantic roles. Also, the very rich semantic data (e.g. location, purpose, nationality) conveyed in the formulaic, appositional phrases typically accompanying proper nouns is represented in such a way that the semantic relations implicitly conveyed in the appositions are explicitly available for more refined matching and for the creation of CGs which contain this relational information.
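To make this concrete, the following minimal sketch (in Python, purely illustrative) contrasts a flat keyword view of a hypothetical appositional phrase with concept-relation-concept triples that state the implicit relations directly; the example phrase, the Triple type, and the relation labels are assumptions for exposition, not DR-LINK's internal format.

# Hypothetical illustration only: keywords vs explicit concept-relation-concept
# triples for an appositional phrase. The phrase, the Triple type, and the
# relation labels are assumptions, not DR-LINK's internal representation.
from typing import NamedTuple

class Triple(NamedTuple):
    concept1: str
    relation: str
    concept2: str

phrase = "Acme Corp., a Boston-based maker of machine tools,"

# Surface keyword view: the terms stand independent of one another.
keywords = {"acme", "corp", "boston", "based", "maker", "machine", "tools"}

# Enriched view: the apposition's implicit semantics made explicit, with the
# complex nominal "machine tools" kept as a single multi-word concept.
triples = [
    Triple("Acme Corp.", "LOCATION", "Boston"),
    Triple("Acme Corp.", "PRODUCT/SERVICE", "maker of machine_tools"),
]
print(triples)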
In each of these three examples (complex nominals, verb case frames, and proper noun appositions), relations are important in that they contextually bind concepts which otherwise would be treated as if they were independent of each other. To accomplish this task, given a text database, DR-LINK extracts important relations by relying on relation-revealing formulae (RRF): patterns of linguistic (lexical, syntactic, and semantic) clues by which particular relations are detected.

2. Detailed System Description

Since our system is modular in design, with well-defined boundaries between the various modules, the system description will also be organized according to these same divisions. The documents and the topic statements are analyzed in basically similar ways by the system's modules, with a few exceptions which will be detailed below within that module's description.

2.a. Pre-Processing

We have chosen to perform rather substantive pre-processing of the raw text that is received from DARPA because much of our later processing is dependent on clean, well-demarcated text. For example, we identify sub-headlines and embedded figures, as well as identifying and restoring correct sentence boundaries. Also, we identify multiple stories within a single document so that each separate story can have its own SFC vector produced which will be representative of just that one story, and so that accurate text structuring can be accomplished at the individual story level.

The text is then processed by the POST part-of-speech tagger (Meteer et al., 1991), loaned to us by BBN, which stochastically attaches a part-of-speech tag to individual words. The part-of-speech tagged text is then fed into a bracketer, a deterministic finite state automaton that adds several different types of brackets for linguistic constituents (e.g. noun phrases, prepositional phrases, clauses, etc.) essential for several tasks in our system.

2.b. Subject Field Coder

The Subject Field Coder (SFCer) produces a summary-level semantic representation of a text's contents that is useable either for ranking a large set of incoming documents for their broad subject appropriateness to a standing query, or for dividing a database into clusters of documents on the same topic. One important benefit of the SFC representation is that it implicitly handles both the synonymy and polysemy problems which have plagued the use of NLP in IR systems, because this representation is one level above the actual words in a text. For example, Figure 2 presents a short WSJ article and a humanly readable version of the normalized SFC vector which serves as the document's semantic summary representation.

    A U.S. magistrate in Florida ordered Carlos Lehder Rivas, described as among the world's leading cocaine traffickers, held without bond on 11 drug-smuggling counts. Lehder, who was captured last week in Colombia and immediately extradited to the U.S., pleaded innocent to the charges in federal court in Jacksonville.

    LAW                .2667        [illegible]    .1333
    BUSINESS           .1333        ECONOMICS      .0667
    DRUGS              .1333        MILITARY       .0667
    POLITICAL SCIENCE  .1333        OCCUPATIONS    .0667

Fig. 2: Sample WSJ document and its SFC representation

The SFCer uses the Subject Codes from Longman's Dictionary of Contemporary English (LDOCE) to produce this semantic representation of a text's contents. The machine-readable tape of the 1987 edition of LDOCE contains 35,899 headwords and 53,838 senses, for an average of 1.499 senses per headword, plus several fields of information not visible in the hard-copy version which are extremely useful in natural language processing tasks, e.g.
the Subject Codes. The Subject Codes are based on a classification scheme of 124 major fields and 250 sub-fields. Subject Codes are manually assigned to words in LDOCE by the Longman lexicographers.

There is a potential problem, however, with the Subject Code assignments, which becomes obvious when an attempt is made to use them computationally. Namely, a particular word may function as more than one part of speech, each word may also have more than one sense, and each of these entries and/or senses may be assigned different Subject Codes. This is a slight variant of the standard disambiguation problem, which has shown itself to be nearly intractable for most NLP applications, but which needed to be successfully handled if DR-LINK was to produce correct semantic SFC vectors.

We based our computational approach to successful disambiguation on a study of the current psycholinguistic research literature, from which we concluded that there is no single theory that can account for all the experimental results on human lexical disambiguation. We interpret these results as suggesting that there are three potential sources of influence on the human disambiguation process:

Local context - the sentence containing the ambiguous word restricts the interpretation of ambiguous words

Domain knowledge - the recognition that a text is concerned with a particular domain activates only the senses appropriate to that domain

Frequency data - the frequency of each sense's general usage affects its accessibility

We have computationally approximated these three knowledge sources in our disambiguator. We consider the 'uniquely assigned' and 'high-frequency' SFCs of words within a single sentence as providing the local context which suggests the correct SFC for an ambiguous word. The SFC correlation matrix, which was generated by processing a corpus of 977 Wall Street Journal (WSJ) articles containing 442,059 words, equates to the domain knowledge (WSJ topics) that is called upon for disambiguation if the local context does not resolve the ambiguity. And finally, the ordering of SFCs in LDOCE replicates the frequency-of-use criterion. We implement the computational disambiguation process by moving in stages from the more local level to the most global type of disambiguation, using these sources of information to guide the disambiguation process. The work is unique in that it successfully combines large-scale statistical evidence with the more commonly espoused local heuristics.

We tested our SFC disambiguation procedures on a sample of twelve randomly selected WSJ articles containing 166 sentences consisting of 1638 words which had SFCs in LDOCE. The system implementation of the disambiguation procedures was run and a single SFC was selected for each word. These SFCs were compared to the sense selections made by an independent judge who was instructed to read the sentences and the definitions of the senses of each word and then to select that sense of the word which was most correct. The disambiguation implementation selected the correct SFC 89% of the time.

Operationally, the SFCer tags each word in a document with the appropriate, disambiguated SFC. The within-document SFCs are then summed and normalized to produce a vector of the SFCs representing that document. Topic statements are likewise represented as SFC vectors.
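A minimal sketch of this construction, and of the routing comparison described next, follows; it assumes a dictionary-based vector and cosine similarity (the similarity measure is not specified in this paper), and the code lists and counts are illustrative only, loosely echoing Figure 2, rather than the actual SFCer implementation.

# Minimal sketch, not the actual SFCer: build a normalized SFC vector from
# already-disambiguated codes and compare it to a topic statement vector.
# Cosine similarity is an assumption; the paper does not name the measure.
from collections import Counter
from math import sqrt
from typing import Dict, List

def sfc_vector(disambiguated_codes: List[str]) -> Dict[str, float]:
    """Sum the SFCs assigned to a text's words and normalize to sum to 1."""
    counts = Counter(disambiguated_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def similarity(doc: Dict[str, float], topic: Dict[str, float]) -> float:
    """Cosine similarity between two SFC vectors."""
    dot = sum(doc.get(code, 0.0) * weight for code, weight in topic.items())
    norm = sqrt(sum(v * v for v in doc.values())) * sqrt(sum(v * v for v in topic.values()))
    return dot / norm if norm else 0.0

# Illustrative counts only, loosely echoing the codes of Figure 2.
doc_vec = sfc_vector(["LAW"] * 4 + ["DRUGS"] * 2 + ["BUSINESS"] * 2 +
                     ["POLITICAL SCIENCE"] * 2)
topic_vec = sfc_vector(["LAW", "DRUGS", "DRUGS"])
print(similarity(doc_vec, topic_vec))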
In the routing situation, each topic statement SFC vector is compared to the incoming document SFC vectors and the documents are then ranked according to their similarity to the topic statement SFC vector. Either a predetermined or an adjustable criterion can be used to select those documents whose SFC vectors exhibit the required degree of similarity to the topic statement SFC vector. This set is then passed to later system components for more refined representation and matching. For use with retrospective or ad hoc queries, the SFC vectors are clustered using Ward's agglomerative clustering algorithm (Ward, 1963) to form classes in the document database. For retrieval, queries are likewise represented as SFC vectors and then matched to the prototype SFC vector of each cluster in the database. Clusters whose prototype SFC vectors exhibit a predetermined criterion of similarity to the query SFC vector are passed on to other system components for more computationally expensive representation and matching (Liddy, Paik, & Woelfel, 1992).

2.c. Text Structurer

The purpose of the Text Structuring module in DR-LINK is to delineate the discourse-level organization of each document's contents so that those document components where the type of information suggested by the topic statement is most likely to be found can be selected for higher weighting. For example, in newspaper texts, opinions will be found in EVALUATION components, basic facts of the news story will be found in MAIN EVENT components, and predictions will be found in EXPECTATION components. The Text Structurer produces an enriched representation of each document by decomposing it into these smaller, conceptually labelled components.

In parallel, the Topic Statement Processor evaluates each topic statement to determine whether there is an indication that a particular component in the documents should be more highly weighted when matched to the topic statement representation. For example, topic statement indicator terms such as predict, anticipate, or proposed reveal that the time frame of the event being searched for must be in the future in order for a document to be relevant. Therefore, documents in which this event is reported in a piece of text which has been marked by the Text Structurer as being either EXPECTATION or MAIN, FUTURE would be ranked more highly than those in which this event is reported in a different component (a minimal illustrative sketch of such a mapping follows the list of schema components below).

Operationally, DR-LINK evaluates each sentence in the input text, comparing it to the known characteristics of the prototypical sentence of each component of the text-type model, and then assigns a component label to the sentence. For the newspaper text-type model, we took as a starting point the hierarchical newspaper text model proposed by van Dijk (1988). With this as a preliminary model, several iterations of coding of a sample of 149 randomly chosen Wall Street Journal articles from 1987-1988 resulted in a revised News Schema which organized van Dijk's terminal node categories according to a more temporally oriented perspective. The News Schema components account for all the text in the sample of articles. The components are: CIRCUMSTANCE, CONSEQUENCE, CREDENTIALS, DEFINITION, ERROR, EVALUATION, EXPECTATION, HISTORY, LEAD, MAIN EVENT, NO COMMENT, PREVIOUS EVENT, REFERENCES, and VERBAL REACTION.
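As a hedged illustration of the indicator-term mapping mentioned above, the following sketch lists the schema's component labels and maps a few future-oriented indicator terms to the components to be up-weighted; the indicator set is limited to the examples given in the text, and the function is an assumption for exposition, not the actual Topic Statement Processor.

# Illustrative only: a toy mapping from topic-statement indicator terms to the
# News Schema components to up-weight; the indicator list is limited to the
# examples in the text, and this is not the actual Topic Statement Processor.
NEWS_SCHEMA_COMPONENTS = [
    "CIRCUMSTANCE", "CONSEQUENCE", "CREDENTIALS", "DEFINITION", "ERROR",
    "EVALUATION", "EXPECTATION", "HISTORY", "LEAD", "MAIN EVENT",
    "NO COMMENT", "PREVIOUS EVENT", "REFERENCES", "VERBAL REACTION",
]

FUTURE_INDICATORS = {"predict", "anticipate", "proposed",
                     "projection", "forecast", "future"}
PREFERRED_FOR_FUTURE = ["EXPECTATION", "MAIN, FUTURE"]

def preferred_components(topic_statement: str) -> list:
    """Components to weight more highly if the topic statement signals a
    future time frame; an empty list means no component preference."""
    words = {w.strip(".,()").lower() for w in topic_statement.split()}
    if words & FUTURE_INDICATORS:
        return PREFERRED_FOR_FUTURE
    return []

print(preferred_components(
    "To be relevant, a document must include a projection of the value "
    "of an economic indicator for the coming year."))
# -> ['EXPECTATION', 'MAIN, FUTURE']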
The process of manually coding the sample also served to suggest to us that, during our intellectual decomposition of texts, we were in fact relying on six different types of linguistic information to make our decisions. The data from the sample set which could be used to provide the raw data for these evidence sources was then analyzed statistically and translated into computationally recognizable text characteristics to be used by the Text Structurer to assign a component label to each sentence. Briefly defined, the six sources of evidence used in the Text Structurer are:

Likelihood of Component Occurring - The unit of analysis for the first source of evidence is the sentence, and the evidence is based on the observed frequency of each component in our coded sample set.

Order of Components - This source of evidence relies on the tendency of components to occur in a particular relative order, determined by calculating across the coded files of the sample documents, looking not at the content of the individual documents but at the component labels. The results are contained in two 19 by 19 matrices, one for the probability of which component follows a given component and one for the probability of which component precedes a given component.

Lexical Clues - The third source of evidence is a set of one, two and three word phrases for each component. The set of lexical clues for each component was chosen based on observed frequencies and distributions. We were looking for words with sufficient occurrences, statistically skewed observed frequency of occurrence in a particular component, and semantic indication of the role or purpose of each component.

Syntactic Sources - We make use of two types of syntactic evidence: 1) typical sentence length, as measured in average number of words per sentence for each component; 2) individual part-of-speech distribution, based on the output of the part-of-speech tagging of each document using POST, a part-of-speech tagger loaned to us by BBN (Meteer et al., 1991). This evidence helps to recognize those components which, because of their nature, tend to have a disproportionate percentage of words of a particular part of speech.

Tense Distribution - Some components, as might be expected from their names alone, tend to contain verbs of a particular tense more than verbs of other tenses. For example, DEFINITION sentences seldom contain the past tense, whereas the predominant tense in HISTORY and PREVIOUS EVENT sentences is the past tense, based on POST tags.

Continuation Clues - The sixth and final source of evidence is based on the conjunctive relations suggested in Halliday and Hasan's cohesion theory (1976). The continuation clues are lexical clues which occur in a sentence-initial position and which were observed in our coded sample data to predictably indicate either that the current sentence continues the same component as the prior sentence or that there is a change in the component.

These evidence sources for instantiating a discourse-level model of the newspaper text type have been incorporated in the Text Structurer, which evaluates each sentence of an input newspaper article against the six evidence sources for the purpose of assigning a text-level label to each sentence.
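A deliberately simplified sketch of this per-sentence labeling follows, in which each evidence source returns a support score between 0 and 1 per component and the scores are simply summed; the toy sources and the plain summation stand in for the six real evidence sources and for the Dempster-Shafer combination described next, and are assumptions rather than the system's implementation.

# Simplified sketch of combining evidence sources to label a sentence. Each
# source maps a sentence to {component: support in [0, 1]}; supports are summed
# here, whereas the actual Text Structurer combines them with the
# Dempster-Shafer theory of evidence (see below). The sources are placeholders.
from typing import Callable, Dict, List

COMPONENTS = ["LEAD", "MAIN EVENT", "HISTORY", "EXPECTATION", "EVALUATION"]  # subset

EvidenceSource = Callable[[str], Dict[str, float]]

def label_sentence(sentence: str, sources: List[EvidenceSource]) -> str:
    """Combine per-component support from all evidence sources and pick the
    component with the greatest total support (a stand-in for the
    Dempster-Shafer combination used in the real system)."""
    totals = {c: 0.0 for c in COMPONENTS}
    for source in sources:
        for component, support in source(sentence).items():
            totals[component] += support
    return max(totals, key=totals.get)

# Toy evidence sources standing in for the six described above.
def tense_distribution(sentence: str) -> Dict[str, float]:
    past_heavy = " was " in sentence or " had " in sentence
    return {"HISTORY": 0.8 if past_heavy else 0.2,
            "EXPECTATION": 0.1 if past_heavy else 0.5}

def lexical_clues(sentence: str) -> Dict[str, float]:
    return {"EXPECTATION": 0.9} if "expect" in sentence.lower() else {}

print(label_sentence("Analysts expect the rate to fall further.",
                     [tense_distribution, lexical_clues]))   # -> EXPECTATION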
The implementation uses the Dempster-Shafer Theory of Evidence Combination (Shafer, 1976) to coordinate information from the very complex matrices of statistical values for the various evidence sources, which were generated from the intellectual analysis of the sample of 149 WSJ articles (Liddy, Paik, McVearry & Yu, in press). Operationally within DR-LINK, each document is processed a sentence at a time, and each source of evidence assigns a number between 0 and 1 to indicate the degree of support that evidence source provides to the belief that a sentence is of a particular news-text component. Then a simple support function for each component is computed, and the component with the greatest support is selected as the correct tag for that sentence.

To convey how the Text Structurer output is used by DR-LINK, Figure 3 presents a Topic Statement which is highlighted to show its implicit request for future-oriented prediction information. This need for a specific type of information is recognized by the Topic Statement Processor, which maps this need for future-oriented information to EXPECTATION or MAIN, FUTURE components in documents. The Text Structure Matcher then searches for documents with these components, as exemplified by the document in Figure 4, which shows a relevant WSJ article, as structured by the DR-LINK component, in which the required information occurs in sentences which have been correctly tagged as EXPECTATION.

    Tipster Topic Description
    Number: 008
    Topic: Economic Projections
    Description: Document will contain quantitative projections of the future value of some economic indicator for countries other than the U.S.
    Narrative: To be relevant, a document must include a projection of the value of an economic indicator (e.g., gross national product (GNP), stock market, rate of inflation, balance of payments, currency exchange rate, stock market value, per capita income, etc.) for the coming year, for a country other than the United States.
    Concepts:
    1. inflation, stagflation, indicators, index
    2. signs, projection, forecast, future
    3. rise, shift, fall, growth, drop, expansion, slowdown, recovery
    4. %, billions
    5. Not U.S.
    Nationality: Not U.S.
    Time: Future

Fig. 3: Sample Topic Statement

    HEADLINE   West Germany's Central Bank Cuts Key Rate: Little Impact Is Expected On Currency Markets Or Nation's Economy
    LEAD       The West German central bank, as expected, cut the interest rate it charges on securities-repurchase agreements with banks, a move that appears likely to fall short of its goal of supporting the dollar by widening the gap between German and U.S. interest rates.
    MAIN       The Bundesbank yesterday called for interest-rate tenders on 28-day securities-repurchase agreements at a minimum rate of 3.5%, well below the 3.8% rate it had been asking since January.
    EXPECT.    Market experts estimate that the actual rate to emerge from bidding today is likely to be closer to 3.6% or 3.65%.
    EXPECT.    But they expect the Bundesbank to use later tenders to continue to push the rate lower, possibly to 3.3%.
    DEFIN.     The repurchase facility is the central bank's principal open-market means for refinancing the banking system with funds backed by securities.
    EXPECT.    Yesterday's Bundesbank action signals broad declines in capital-market interest rates, and could lead to lower credit rates charged by commercial banks.
    MAIN       The West German move complements a recent lowering of short-term interest rates in Japan and a tightening of credit by the U.S. Federal Reserve Board.
    MAIN       However, foreign-exchange traders said the Bundesbank's action didn't have any immediate effect on the currency market and that small interest-rate movements are unlikely to significantly outweigh other factors weakening the dollar, including the U.S. trade deficit.

Fig. 4: Wall Street Journal article relevant to Topic Statement 8

2.d. Relation-Concept Detector

The main function of the RCD module is to extract relations that connect concepts that otherwise would be treated as isolated and independent. For example, a relation REASON can be extracted from phrases like 'because of', 'as a result of', and 'due to', which form lexical RRF, in order to connect the two constituents occurring before and after the phrases. It should be noted that the RRF we are developing are domain-independent, since they are based on fairly universal linguistic clues rather than on a domain model. There are several types of RRF, derived from a variety of linguistic constructs including verb-oriented thematic roles, complex nominals, proper noun appositions, nominalized verbs, adverbs, prepositional phrases, and some other ad hoc patterns revealing relations that appear in the literature (cf. Somers, 1987) and are expected to be suitable for IR purposes based on our preliminary analysis of the topic statements. We intend to conduct an extensive study of the usefulness of individual relations as part of our failure analysis.

With the different types of RRF stored in a knowledge base, we go through multiple stages of partial linguistic analysis, as opposed to holistic syntactic processing followed by a semantic interpreter, to extract relations and eventually generate CGs. With the tagged, bracketed, and structured text as input, various sub-modules in the RCD selectively detect implicit relations, as well as the concepts being connected, by focusing on occurrences of patterns of interest found in the knowledge base and by bypassing portions of text irrelevant to the relation extraction task. The output of the RCD component is a set of concept-relation-concept triples, where the concepts are often derived from content-bearing words and the relations from non-content words, or indirectly from the linguistic structure, by consulting the knowledge base.
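As a hedged illustration of a lexical RRF of the kind just described, the toy pattern below looks for the cue phrases mentioned above between two already-bracketed constituents and emits a REASON triple connecting them; the bracket notation, the pattern, and the function are assumptions for exposition, not the format of DR-LINK's RRF knowledge base.

# Illustrative only: a toy lexical RRF for the REASON relation, not DR-LINK's
# actual knowledge-base format. It assumes the constituents on either side of
# the cue phrase have already been bracketed by the earlier processing stages.
import re
from typing import List, Tuple

REASON_CUES = ["because of", "as a result of", "due to"]
REASON_PATTERN = re.compile(
    r"\[(?P<left>[^\]]+)\]\s*(?:" + "|".join(REASON_CUES) + r")\s*\[(?P<right>[^\]]+)\]",
    re.IGNORECASE,
)

def reason_triples(bracketed_sentence: str) -> List[Tuple[str, str, str]]:
    """Emit [left] -> (REASON) -> [right] triples for each cue occurrence."""
    return [(m.group("left"), "REASON", m.group("right"))
            for m in REASON_PATTERN.finditer(bracketed_sentence)]

print(reason_triples("[the plant closure] because of [falling demand]"))
# [('the plant closure', 'REASON', 'falling demand')]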
As one example, the Proper Noun (PN) apposition category of RRF helps categorize the many occurrences of PNs in text and determine the semantic relations between a PN and the apposition which either precedes or follows it. In the following sentence fragment, the apposition RRF recognizes and categorizes the LOCATION relation and the PRODUCT relation of the PN General Development:

    "... General Development, a Miami-based developer of planned communities, ..."

    General Development -> (LOCATION) -> Miami
    General Development -> (PRODUCT/SERVICE) -> developer of planned communities

As another example of processing in the RCD module, consider the Case Frame Handler. Given a sentence fragment which has been processed by the tagger and bracketer:

    Venezuela and its creditor banks agreed yesterday to restructure $3 billion of its foreign debt

the verb 'agree' triggers a set of case frames in the knowledge base; with the additional information provided by the preposition 'to' followed by another verb, the correct case frame (i.e. the RRF for case relations) is chosen:

    ((A X S) ((AT ? to+V) (CT ? that)))
where the first element of each list indicates the relation (e.g. A for AGENT, AT for ACTIVITY, and CT for CONTENT), the second specifies the LDOCE semantic restriction (e.g. X for 'abstract or human'), and the last shows the syntactic or lexical clue that must be satisfied in order for the case frame to be instantiated. In this example, S stands for a subject, to+V for an infinitive, and 'that' for a relative clause. The two sub-lists, (AT ? to+V) and (CT ? that), indicate that only one of them is triggered at a given time. Eventually the following set of triples is generated:

    [agree] -> (A) -> [country: Venezuela]
    [agree] -> (A) -> [creditor_bank]
    [agree] -> (AT) -> [restructure]

where the concept node [country: Venezuela] is formed by applying rules from the Proper Noun knowledge base and processor.

2.e. Conceptual Graph Generator

Given the triples generated in the RCD module by applying various types of RRF, the next step is to merge them to form a complete CG for a sentence, a paragraph, or an SRF component. The resulting CGs consist of not only concepts and relations but also instantiations or referents of concepts, which are usually derived from proper nouns such as company names but can also be another CG derived from a nested clause, thereby allowing for nested CGs. Since the different types of RRF in the knowledge base are developed and applied in the RCD module in a more or less independent manner, a form of conflict resolution is necessary in this component.

Given the set of concept-relation-concept triples shown above and another set derived from the verb 'restructure' with the frame ((A X S) (P W O)):

    [restructure] -> (A) -> [country: Venezuela]
    [restructure] -> (A) -> [creditor_bank]
    [restructure] -> (P) -> [debt]

where A and P are the 'agent' and 'patient' relations, and

    [debt] -> (ME) -> [money]
    [debt] -> (CH) -> [foreign]

are produced by a special handler within the RCD, where ME and CH stand for 'measure' and 'characteristics', the CG generator produces the following CG for the sentence:

    [agree] -
        (A)  -> [country: *1 Venezuela]
        (A)  -> [creditor_bank: *2]
        (AT) -> [restructure] -
            (A) -> [country: *1 Venezuela]
            (A) -> [creditor_bank: *2]
            (P) -> [debt] -
                (ME) -> [money]
                (CH) -> [foreign]

where *1 and *2 indicate that the nodes with the same number represent the same concept (a minimal illustrative sketch of this merging step is given below).

While this process of extracting relations and constructing CGs is applied both to documents and to topic statements, the latter require additional processing to capture the unique features of information needs often found in topic statements. Accordingly, CGs generated from topic statements have such additional features as importance weights on concept and relation nodes and ways of indicating whether an instantiation of a concept must exist in relevant documents. This specialized processing, in comparison with document processing, is accomplished by treating topic statements as a sub-language and building a model for them. For example, some information on weights is revealed by phrases like "... optional ..." and "... must exist ...", whereas the need for an instantiation of a concept is indicated by phrases like "Identification of the company must be included".
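A minimal sketch of the merging step illustrated above follows, assuming a simple adjacency-list graph in which identical concept labels are unified into one node (playing the role of the *1 and *2 coreference markers); real CG generation also handles nesting, referents, and conflict resolution among RRF outputs, none of which is modeled here.

# Minimal sketch of merging concept-relation-concept triples into one graph;
# identical concept labels are unified into a single node, standing in for the
# *1/*2 coreference links above. Not the actual Conceptual Graph Generator.
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]   # (concept, relation, concept)

def merge_triples(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Return {concept: [(relation, concept), ...]} with shared concepts unified."""
    graph = defaultdict(list)
    for c1, rel, c2 in triples:
        graph[c1].append((rel, c2))
    return dict(graph)

triples = [
    ("agree", "A", "country: Venezuela"), ("agree", "A", "creditor_bank"),
    ("agree", "AT", "restructure"),
    ("restructure", "A", "country: Venezuela"), ("restructure", "A", "creditor_bank"),
    ("restructure", "P", "debt"),
    ("debt", "ME", "money"), ("debt", "CH", "foreign"),
]
for concept, edges in merge_triples(triples).items():
    print(concept, "->", edges)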
While CG theory provides a framework in which IR entities can be represented adequately, much of the representation task involves intellectual analysis of topic statements and documents so that we capture and store concepts and relations that are ontologically adequate for IR. It is essential, for example, to choose, organize, and classify a restricted set of relations in such a way that they facilitate matching and inferencing between two CGs representing a document and a topic statement. The efficacy of the relations we have chosen will be determined with full experiments and failure analyses.

2.f. Conceptual Graph Matcher

The main function of the CG matching component is to determine the degree to which two CGs share a common structure and to score each document with respect to the topic statement. This is accomplished by employing techniques necessary to model plausible inferences with CGs (Myaeng & Khoo, 1992). In order to allow for approximate matching between concept nodes or relation nodes, we have developed a matrix that represents similarities among the relations used in the CG representation, as well as among some concepts. Our goal is to enhance both precision and recall. By exploiting the structure of the CGs and the nature of the relations, we attempt to meet the specific information needs expressed in topic statements. By allowing for partial matching (e.g. between 'debt' and 'bank debt') and inexact matching (e.g. between 'debt' and 'loan', and between 'CO-AGENT' and 'AGENT') at the node level, we can increase recall.

For CG matching, we first developed and implemented a base algorithm that is flexible enough to allow for various types of partial matching between two CGs, and ran experiments to test its practicality (Myaeng & Lopez-Lopez, 1992). While the general subgraph isomorphism problem is known to be computationally intractable, matching CGs containing conceptual information (i.e. labels on nodes) appears to be practical. With improved understanding of the nature of RRF and the exact features of the representation of documents and topic statements, we have developed and implemented heuristics for scoring documents for their relevancy with respect to a given topic statement, and added them to the base algorithm. With the rich representation of topic statements and documents, the expanded matching component is currently capable of discriminating documents based on the rather unusual information needs specified in a topic statement, such as a certain status of an event or the need to contain a specific entity (e.g. a company name) satisfying a certain condition. It also facilitates the process of reliably identifying "hot spots" in retrieved documents.

An example of a match between a document CG and a topic statement CG is shown in Figure 5, where the entire CG represents the example sentence discussed above and the topic statement sentence: "...a current debt rescheduling agreement... between a debtor developing country and one or more of its creditors, commercial and/or official. It will identify the debtor country and the creditor(s), the payment time period..., and the interest rate, ..." The dark area and the individual nodes with gray shading represent the matched parts: a connected sub-CG that is the main contributor to the final score, and some single-node matches, respectively. It should be noted that information on the relative importance of the text components delineated by the Text Structurer with respect to the topic statement is incorporated in the scoring heuristics.

An independent module under development, which will have a direct bearing on the final results of matching and which will be added at a later stage to determine its efficacy, is the RIT (Roget's International Thesaurus) Coder.
The goal is to simulate the use of a type hierarchy in CG theory by replacing lexical terms in the concept nodes of our representation with RIT semicolon-group numbers, each of which represents a set of semantically similar words and phrases in the RIT hierarchy. Our approach is to use the words surrounding the target word to be replaced with an RIT code as context words by which we try to disambiguate the sense of the target word and find its exact location in the RIT. While this approach is intended to increase both recall and precision at the same time, through implicit expansion of terms and sense disambiguation, we will have to assess the sensitivity of the overall retrieval results to incomplete disambiguation.

3. Testing and Results

Although the DR-LINK entry in TREC was not tested in the same manner as the other systems, the status of the system and the testing which was done were consistent with the milestones that had been established in consultation with our TIPSTER contractor. DR-LINK was not tested as a full system because it is based on several theories that had never before been implemented in an information retrieval system, and none of the system's components was even at the design stage of development when the project began. As a result, our system is not yet fully implemented. The results that follow are for the three modules (Subject Field Coder, Text Structurer, Conceptual Graph Matcher) that have been implemented to date. The full system will be tested at the eighteen-month meeting of TIPSTER.

[Fig. 5: Example of CG matching result]

3.a. Subject Field Coder Testing

The Subject Field Coder can be evaluated in its filtering function in the following way: The SFC vectors of documents in a database are compared to a topic statement vector and, for each topic statement, ranked according to similarity. The question then is, how far down this ranked list would the system need to proceed in order to include all the relevant documents in the set of documents that was passed on to the next system module? This testing procedure has been run using the WSJ and Ziff collections on Disk 1 and Topic Statements 1 to 50 provided by TREC. In the Wall Street Journal database, results showed that, on average across the 50 topic statements, all the relevant documents were ranked in the top 28% of the list. Therefore, on average, 72% of the WSJ database did not need to be further processed by the later modules in the system. In the Ziff database, on average across the 50 topic statements, all the relevant documents were ranked in the top 66% of the database, a much poorer result. Therefore, on average, only 34% of the Ziff database need not be further processed. When Topic Statements 51 to 100 were searched on the WSJ database, all the relevant documents were ranked in the top 32% of the list.

Error analysis of the Ziff results has revealed the cause of the poorer performance on that database. Namely, for several of the queries, documents were judged relevant only if they contained the particular proper noun mentioned in the topic statement (e.g. OS/2, Mitsubishi, IBM's SAA standards, etc.). Given these types of topic statements, the most appropriate search approach for a system would be keyword matching, which is not at all the type of matching that is done using the SFC representation. The SFCs represent a document at a higher level of abstraction, not at the keyword level.
That is, documents which discuss a particular computer will have a strong weighting on the Data Processing slot of the SFC vector, but no means for matching on a particular computer name. Therefore, the error analysis showed that the SFC performance was hampered by its inability to match at the level of a specific company name or product. Fortunately, we have at hand the means to improve the results, as we are in the process of incorporating a second level of document ranking using Proper Noun processing algorithms. Results using this extended representation will be available at the eighteen-month TIPSTER meeting.

3.b. Text Structurer Testing

The Text Structurer was tested using five of the six evidence sources, as we have not yet implemented the algorithms for incorporating evidence from the Continuation Clues. We tested the Text Structurer on a set of 116 WSJ documents consisting of several thousand sentences. This first test resulted in 72% of the sentences being correctly identified. An additional hand simulation of one small heuristic adjustment was tested and improved the system's performance to 74% of the sentences being correctly identified. A second run, on a smaller sample of sentences, resulted in 80% correct identification of components. Ongoing efforts at improving the quality of the evidence sources used by the Text Structurer, plus the incorporation of the Continuation Clue evidence, promise to improve these results significantly. It should be remembered, as well, that automatic discourse-level structuring of documents has not been reported elsewhere in the literature, so this very new type of text processing is in its infancy and likely has much room for future improvement.

3.c. Conceptual Graph Matcher Testing

As a first prototype, the CG Matcher was implemented with a set of scoring heuristics whose theoretical basis is the Dempster-Shafer theory of evidence. The score for a document is computed progressively from node-level evidence through multiple stages to generate a score for text units of different granularities. In order to test the feasibility of the prototype module, we have run it with manually generated CGs for twenty documents and five topic statements for which we have relevance judgments. While it is premature to draw any conclusions on the efficacy of the matching algorithm and the heuristics, mainly due to the size of the data set and the stage of development, the results were encouraging and have provided us with much insight into how the scoring heuristics need to be tuned.

4. Conclusions

This paper on the DR-LINK system should be considered a report on a work in progress, since we did not have a fully developed system at the time of the TREC testing. However, we do believe that the three system components which were tested perform quite respectably, given their innovativeness. Continued development and feedback from the TREC results will provide much more refined versions of these system modules. In addition, two system modules remain to be developed, and the full system, which is quite synergistic in its approach to achieving its goals, remains to be integrated and tested as a whole. The Relation-Concept Detector and Conceptual Graph Generator modules are being implemented in tandem and, when completed, will make the DR-LINK system fully operational. Full system testing will be conducted for the eighteen-month TIPSTER evaluation.
In the interim, our goal in this paper has been to describe the five unique modules which comprise DR-LINK and which, in combination, promise to provide a full system with the filtering power needed to make later processing more accurate and the depth of linguistic processing required to provide real conceptual-level matching and retrieval.

References

Halliday, M. A. K. & Hasan, R. (1976). Cohesion in English. London: Longman.

Liddy, E. D. & Paik, W. (1992). Statistically-guided word sense disambiguation. In Proceedings of the AAAI Fall Symposium Series: Probabilistic Approaches to Natural Language. Menlo Park, CA: AAAI.

Liddy, E. D., Paik, W., McVearry, K. & Yu, E. (In press). Automatic discourse-level structuring of newspaper texts: Empirical testing of a model.

Liddy, E. D., Paik, W. & Woelfel, J. (1992). Use of subject field codes from a machine-readable dictionary for automatic classification of documents. Proceedings of the 3rd ASIS Classification Research Workshop.

Meteer, M., Schwartz, R. & Weischedel, R. (1991). POST: Using probabilities in language processing. Proceedings of the Twelfth International Joint Conference on Artificial Intelligence. Sydney, Australia.

Myaeng, S. H. (1992). Using conceptual graphs for information retrieval: A framework for representation and flexible inferencing. Proceedings of the Symposium on Document Analysis and Information Retrieval, Las Vegas, March 16-18.

Myaeng, S. H. & Khoo, C. (1992). On uncertainty handling in plausible reasoning with conceptual graphs. Proceedings of the 7th Workshop on Conceptual Graphs, Las Cruces, NM, July 1992.

Myaeng, S. H. & Lopez-Lopez, A. (1992). Conceptual graph matching: A flexible algorithm and experiments. Journal of Experimental and Theoretical Artificial Intelligence, 4, 107-126.

Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton, NJ: Princeton University Press.

Somers, H. L. (1987). Valency and Case in Computational Linguistics. Edinburgh: Edinburgh University Press.

Sowa, J. (1984). Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley.

van Dijk, T. (1988). News Analysis: Case Studies of International and National News in the Press. Hillsdale, NJ: Lawrence Erlbaum Associates.

Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.