DR-LINK's Linguistic-Conceptual Approach to Document Detection¹

Elizabeth D. Liddy and Sung H. Myaeng
School of Information Studies
Syracuse University
Syracuse, New York 13244-4100
liddy@mailbox.syr.edu; shmyaeng@mailbox.syr.edu

¹ Ken McVearry, Woojin Paik, Ming Li, Edmund Yu, and Chris Khoo contributed to the design, data analysis and implementation of the system for TREC-1.

1. Overview

Our approach to the difficult problem of selecting only those documents which satisfy a user's specified information need is to pay parallel attention to two very important aspects of the task. Firstly, there are many documents which have no likely possibility of being relevant to either a standing query or a query newly put to the system. These documents should be filtered from further consideration at an early stage in the system's processing if the system's later processing is computationally expensive, and if their presence introduces unnecessary ambiguity while their removal produces more accurate results. This focusing process should continue at subsequent stages, using additional linguistic features of the query and documents in order to further refine the flow of documents. Secondly, there is a continuum of levels of linguistic-conceptual processing which can produce enrichments of the original text in order to explicitly represent documents at more conceptual levels for more accurate matching to queries.

Our approach also recognizes that, as reflected by the topic statements, the retrieval task in TREC requires capabilities beyond what has been required in the past of traditional IR systems. Topic statements describe not only 'aboutness' but also more detailed information such as relationships among entities, characteristics of participants in an event, and temporality. We believe that richer representations of documents and topic statements are essential to meet the extended retrieval requirements of such complex information needs and to reduce the ambiguities resulting from keyword-based retrieval. To produce this enriched representation, the system uses lexical, syntactic, semantic, and discourse-level linguistic processing techniques to distill from documents and topic statements all the rich layers of knowledge incorporated in their deceptively simple textual surface, and it produces a final document representation which has been shaped by all these levels of linguistic processing.

To achieve the goals stated above, we have developed a system whose architecture is modular in design, with five separate processing modules which continuously refine the flow of documents both in terms of pure numbers and in terms of continual semantic enrichments (see Figure 1).

[Fig. 1: DR-LINK system architecture. The topic statement and millions of incoming documents flow through the Subject Field Coder, Text Structurer, Relation-Concept Detector, Conceptual Graph Generator, and Conceptual Graph Matcher, producing ranked documents.]

Briefly previewed, the processing of the five modules is as follows: The Subject Field Coder uses semantic word knowledge to produce a summary-level topical vector representation of a document's contents that is matched to a vector representation of a topic statement in order to select for further processing only those documents which have real potential of being relevant. This subset of documents is then passed to the Text Structurer,
which sub-divides a text into its discourse-level segments in order to focus later matching on the appropriate discourse component in response to particular types of information need. All of the structured texts, with the appropriate components highlighted, are passed to the Relation-Concept Detector, whose purpose is to raise the level at which we do matching from a keyword or key-phrase level to a more conceptual level, first by expanding terms in the topic statement to all terms which have been shown to be 'substitutable' for them, and then by extracting semantic relations between concepts from both documents and topic statements. This component produces concept-relation-concept triples which are passed to the Conceptual Graph Generator, which converts these triples into the CG formalism (Sowa, 1984). The resultant CGs are passed to the Conceptual Graph Matcher, which measures the degree to which a particular topic statement CG and candidate document CGs share a common structure, and ranks the documents accordingly.

The five modules in DR-LINK have well-specified interfaces, making it possible for some of the modules to be re-combined, when appropriate, in a different order for a more advantageous flow of processing. For example, the Subject Field Coder can produce vectors for any size unit of text (e.g. a sentence, a paragraph, a discourse-level text-type component, or the full document). Therefore, the Subject Field Coder can create an SFC representation for the full document before the text has been decomposed into its constituent discourse-level components by the Text Structurer, or the Subject Field Coder can be run on the document after the Text Structurer has recognized the discourse-level components of a text, and can therefore produce separate vectors for the differing types of information (e.g. current event, past event, opinion, potential future event) contained in the various discourse-level components (e.g. Main Event, History, Evaluation, Expectation) within a newspaper text. This permits the SFC vector of a particular topic statement to be matched to the representation of just that component whose content is most likely to be appropriate.

Another vital aspect of our approach, evidenced in the various semantic enrichments (e.g. Subject Field Codes, discourse components, concept-relation-concept triples, Conceptual Graphs) added to the basic text, is the real attention paid to representation at a deeper than surface level. That is, DR-LINK deals with lexical entities using more conceptually-based syntactic groupings. For example, complex nominals are processed as meaningful multi-word constituents because the combination of individual terms in a complex nominal conveys quite a different meaning than if the individual constituents were interpreted separately. In addition, verbs are represented in case frames, so that the other lexical entities in the sentence which perform particular semantic roles in conjunction with the verb are represented according to these semantic roles. Also, the very rich semantic data (e.g. location, purpose, nationality) conveyed in the formulaic, appositional phrases typically accompanying proper nouns is represented in such a way that the semantic relations implicitly conveyed in the appositions are explicitly available for more refined matching and for the creation of CGs which contain this relational information.
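To make this concrete, the following minimal sketch (in Python, purely illustrative) contrasts a flat keyword view of a hypothetical appositional phrase with concept-relation-concept triples that state the implicit relations directly; the example phrase, the Triple type, and the relation labels are assumptions for exposition, not DR-LINK's internal format.

# Hypothetical illustration only: keywords vs explicit concept-relation-concept
# triples for an appositional phrase. The phrase, the Triple type, and the
# relation labels are assumptions, not DR-LINK's internal representation.
from typing import NamedTuple

class Triple(NamedTuple):
    concept1: str
    relation: str
    concept2: str

phrase = "Acme Corp., a Boston-based maker of machine tools,"

# Surface keyword view: the terms stand independent of one another.
keywords = {"acme", "corp", "boston", "based", "maker", "machine", "tools"}

# Enriched view: the apposition's implicit semantics made explicit, with the
# complex nominal "machine tools" kept as a single multi-word concept.
triples = [
    Triple("Acme Corp.", "LOCATION", "Boston"),
    Triple("Acme Corp.", "PRODUCT/SERVICE", "maker of machine_tools"),
]
print(triples)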
In each of these three examples (complex nominals, verb case frames, and proper noun appositions), relations are important in that they contextually bind concepts which otherwise would be treated as if they were independent of each other. To accomplish this task, given a text database, DR-LINK extracts important relations by relying on relation-revealing formulae (RRF): patterns of linguistic (lexical, syntactic, and semantic) clues by which particular relations are detected.

2. Detailed System Description

Since our system is modular in design, with well-defined boundaries between the various modules, the system description will also be organized according to these same divisions. The documents and the topic statements are analyzed in basically similar ways by the system's modules, with a few exceptions which will be detailed below within that module's description.

2.a. Pre-Processing

We have chosen to perform rather substantive pre-processing of the raw text that is received from DARPA because much of our later processing is dependent on clean, well-demarcated text. For example, we identify sub-headlines and embedded figures, as well as identifying and restoring correct sentence boundaries. Also, we identify multiple stories within a single document so that each separate story can have its own SFC vector produced which will be representative of just that one story, and so that accurate text structuring can be accomplished at the individual story level.

The text is then processed by the POST part-of-speech tagger (Meteer et al., 1991), loaned to us by BBN, which stochastically attaches a part-of-speech tag to individual words. The part-of-speech tagged text is then fed into a bracketer, a deterministic finite state automaton that adds several different types of brackets for linguistic constituents (e.g. noun phrases, prepositional phrases, clauses, etc.) essential for several tasks in our system.

2.b. Subject Field Coder

The Subject Field Coder (SFCer) produces a summary-level semantic representation of a text's contents that is useable either for ranking a large set of incoming documents for their broad subject appropriateness to a standing query, or for dividing a database into clusters of documents on the same topic. One important benefit of the SFC representation is that it implicitly handles both the synonymy and polysemy problems which have plagued the use of NLP in IR systems, because this representation is one level above the actual words in a text. For example, Figure 2 presents a short WSJ article and a humanly readable version of the normalized SFC vector which serves as the document's semantic summary representation.

    A U.S. magistrate in Florida ordered Carlos Lehder Rivas, described as among the world's leading cocaine traffickers, held without bond on 11 drug-smuggling counts. Lehder, who was captured last week in Colombia and immediately extradited to the U.S., pleaded innocent to the charges in federal court in Jacksonville.

    LAW                .2667        [illegible]    .1333
    BUSINESS           .1333        ECONOMICS      .0667
    DRUGS              .1333        MILITARY       .0667
    POLITICAL SCIENCE  .1333        OCCUPATIONS    .0667

Fig. 2: Sample WSJ document and its SFC representation

The SFCer uses the Subject Codes from Longman's Dictionary of Contemporary English (LDOCE) to produce this semantic representation of a text's contents. The machine-readable tape of the 1987 edition of LDOCE contains 35,899 headwords and 53,838 senses, for an average of 1.499 senses per headword, plus several fields of information not visible in the hard-copy version which are extremely useful in natural language processing tasks, e.g.
the Subject Codes. The Subject Codes are based on a classification scheme of 124 major fields and 250 sub-fields. Subject Codes are manually assigned to words in LDOCE by the Longman lexicographers.

There is a potential problem, however, with the Subject Code assignments, which becomes obvious when an attempt is made to use them computationally. Namely, a particular word may function as more than one part of speech, each word may also have more than one sense, and each of these entries and/or senses may be assigned different Subject Codes. This is a slight variant of the standard disambiguation problem, which has shown itself to be nearly intractable for most NLP applications, but which needed to be successfully handled if DR-LINK was to produce correct semantic SFC vectors.

We based our computational approach to successful disambiguation on a study of the current psycholinguistic research literature, from which we concluded that there is no single theory that can account for all the experimental results on human lexical disambiguation. We interpret these results as suggesting that there are three potential sources of influence on the human disambiguation process:

Local context - the sentence containing the ambiguous word restricts the interpretation of ambiguous words

Domain knowledge - the recognition that a text is concerned with a particular domain activates only the senses appropriate to that domain

Frequency data - the frequency of each sense's general usage affects its accessibility

We have computationally approximated these three knowledge sources in our disambiguator. We consider the 'uniquely assigned' and 'high-frequency' SFCs of words within a single sentence as providing the local context which suggests the correct SFC for an ambiguous word. The SFC correlation matrix, which was generated by processing a corpus of 977 Wall Street Journal (WSJ) articles containing 442,059 words, equates to the domain knowledge (WSJ topics) that is called upon for disambiguation if the local context does not resolve the ambiguity. And finally, the ordering of SFCs in LDOCE replicates the frequency-of-use criterion. We implement the computational disambiguation process by moving in stages from the more local level to the most global type of disambiguation, using these sources of information to guide the disambiguation process. The work is unique in that it successfully combines large-scale statistical evidence with the more commonly espoused local heuristics.

We tested our SFC disambiguation procedures on a sample of twelve randomly selected WSJ articles containing 166 sentences consisting of 1638 words which had SFCs in LDOCE. The system implementation of the disambiguation procedures was run and a single SFC was selected for each word. These SFCs were compared to the sense selections made by an independent judge who was instructed to read the sentences and the definitions of the senses of each word and then to select that sense of the word which was most correct. The disambiguation implementation selected the correct SFC 89% of the time.

Operationally, the SFCer tags each word in a document with the appropriate, disambiguated SFC. The within-document SFCs are then summed and normalized to produce a vector of the SFCs representing that document. Topic statements are likewise represented as SFC vectors.
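A minimal sketch of this construction, and of the routing comparison described next, follows; it assumes a dictionary-based vector and cosine similarity (the similarity measure is not specified in this paper), and the code lists and counts are illustrative only, loosely echoing Figure 2, rather than the actual SFCer implementation.

# Minimal sketch, not the actual SFCer: build a normalized SFC vector from
# already-disambiguated codes and compare it to a topic statement vector.
# Cosine similarity is an assumption; the paper does not name the measure.
from collections import Counter
from math import sqrt
from typing import Dict, List

def sfc_vector(disambiguated_codes: List[str]) -> Dict[str, float]:
    """Sum the SFCs assigned to a text's words and normalize to sum to 1."""
    counts = Counter(disambiguated_codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}

def similarity(doc: Dict[str, float], topic: Dict[str, float]) -> float:
    """Cosine similarity between two SFC vectors."""
    dot = sum(doc.get(code, 0.0) * weight for code, weight in topic.items())
    norm = sqrt(sum(v * v for v in doc.values())) * sqrt(sum(v * v for v in topic.values()))
    return dot / norm if norm else 0.0

# Illustrative counts only, loosely echoing the codes of Figure 2.
doc_vec = sfc_vector(["LAW"] * 4 + ["DRUGS"] * 2 + ["BUSINESS"] * 2 +
                     ["POLITICAL SCIENCE"] * 2)
topic_vec = sfc_vector(["LAW", "DRUGS", "DRUGS"])
print(similarity(doc_vec, topic_vec))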
In the routing situation, each topic statement SFC vector is compared to the incoming document SFC vectors and the documents are then ranked according to their similarity to the topic statement SFC vector. Either a predetermined or an adjustable criterion can be used to select those documents whose SFC vectors exhibit the required degree of similarity to the topic statement SFC vector. This set is then passed to later system components for more refined representation and matching. For use with retrospective or ad hoc queries, the SFC vectors are clustered using Ward's agglomerative clustering algorithm (Ward, 1963) to form classes in the document database. For retrieval, queries are likewise represented as SFC vectors and then matched to the prototype SFC vector of each cluster in the database. Clusters whose prototype SFC vectors exhibit a predetermined criterion of similarity to the query SFC vector are passed on to other system components for more computationally expensive representation and matching (Liddy, Paik, & Woelfel, 1992).

2.c. Text Structurer

The purpose of the Text Structuring module in DR-LINK is to delineate the discourse-level organization of each document's contents so that those document components where the type of information suggested by the topic statement is most likely to be found can be selected for higher weighting. For example, in newspaper texts, opinions will be found in EVALUATION components, basic facts of the news story will be found in MAIN EVENT components, and predictions will be found in EXPECTATION components. The Text Structurer produces an enriched representation of each document by decomposing it into these smaller, conceptually labelled components.

In parallel, the Topic Statement Processor evaluates each topic statement to determine whether there is an indication that a particular component in the documents should be more highly weighted when matched to the topic statement representation. For example, topic statement indicator terms such as predict, anticipate, or proposed reveal that the time frame of the event being searched for must be in the future in order for a document to be relevant. Therefore, documents in which this event is reported in a piece of text which has been marked by the Text Structurer as being either EXPECTATION or MAIN, FUTURE would be ranked more highly than those in which this event is reported in a different component (a minimal illustrative sketch of such a mapping follows the list of schema components below).

Operationally, DR-LINK evaluates each sentence in the input text, comparing it to the known characteristics of the prototypical sentence of each component of the text-type model, and then assigns a component label to the sentence. For the newspaper text-type model, we took as a starting point the hierarchical newspaper text model proposed by van Dijk (1988). With this as a preliminary model, several iterations of coding of a sample of 149 randomly chosen Wall Street Journal articles from 1987-1988 resulted in a revised News Schema which organized van Dijk's terminal node categories according to a more temporally oriented perspective. The News Schema components account for all the text in the sample of articles. The components are: CIRCUMSTANCE, CONSEQUENCE, CREDENTIALS, DEFINITION, ERROR, EVALUATION, EXPECTATION, HISTORY, LEAD, MAIN EVENT, NO COMMENT, PREVIOUS EVENT, REFERENCES, and VERBAL REACTION.
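As a hedged illustration of the indicator-term mapping mentioned above, the following sketch lists the schema's component labels and maps a few future-oriented indicator terms to the components to be up-weighted; the indicator set is limited to the examples given in the text, and the function is an assumption for exposition, not the actual Topic Statement Processor.

# Illustrative only: a toy mapping from topic-statement indicator terms to the
# News Schema components to up-weight; the indicator list is limited to the
# examples in the text, and this is not the actual Topic Statement Processor.
NEWS_SCHEMA_COMPONENTS = [
    "CIRCUMSTANCE", "CONSEQUENCE", "CREDENTIALS", "DEFINITION", "ERROR",
    "EVALUATION", "EXPECTATION", "HISTORY", "LEAD", "MAIN EVENT",
    "NO COMMENT", "PREVIOUS EVENT", "REFERENCES", "VERBAL REACTION",
]

FUTURE_INDICATORS = {"predict", "anticipate", "proposed",
                     "projection", "forecast", "future"}
PREFERRED_FOR_FUTURE = ["EXPECTATION", "MAIN, FUTURE"]

def preferred_components(topic_statement: str) -> list:
    """Components to weight more highly if the topic statement signals a
    future time frame; an empty list means no component preference."""
    words = {w.strip(".,()").lower() for w in topic_statement.split()}
    if words & FUTURE_INDICATORS:
        return PREFERRED_FOR_FUTURE
    return []

print(preferred_components(
    "To be relevant, a document must include a projection of the value "
    "of an economic indicator for the coming year."))
# -> ['EXPECTATION', 'MAIN, FUTURE']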
The process of manually coding the sample also served to suggest to us that, during our intellectual decomposition of texts, we were in fact relying on six different types of linguistic information to make our decisions. The data from the sample set which could be used to provide the raw data for these evidence sources was then analyzed statistically and translated into computationally recognizable text characteristics to be used by the Text Structurer to assign a component label to each sentence. Briefly defined, the six sources of evidence used in the Text Structurer are:

Likelihood of Component Occurring - The unit of analysis for the first source of evidence is the sentence, and the evidence is based on the observed frequency of each component in our coded sample set.

Order of Components - This source of evidence relies on the tendency of components to occur in a particular relative order, determined by calculating across the coded files of the sample documents, looking not at the content of the individual documents but at the component labels. The results are contained in two 19 by 19 matrices, one for the probability of which component follows a given component and one for the probability of which component precedes a given component.

Lexical Clues - The third source of evidence is a set of one, two and three word phrases for each component. The set of lexical clues for each component was chosen based on observed frequencies and distributions. We were looking for words with sufficient occurrences, statistically skewed observed frequency of occurrence in a particular component, and semantic indication of the role or purpose of each component.

Syntactic Sources - We make use of two types of syntactic evidence: 1) typical sentence length, as measured in average number of words per sentence for each component; 2) individual part-of-speech distribution, based on the output of the part-of-speech tagging of each document using POST, a part-of-speech tagger loaned to us by BBN (Meteer et al., 1991). This evidence helps to recognize those components which, because of their nature, tend to have a disproportionate percentage of words of a particular part of speech.

Tense Distribution - Some components, as might be expected from their names alone, tend to contain verbs of a particular tense more than verbs of other tenses. For example, DEFINITION sentences seldom contain the past tense, whereas the predominant tense in HISTORY and PREVIOUS EVENT sentences is the past tense, based on POST tags.

Continuation Clues - The sixth and final source of evidence is based on the conjunctive relations suggested in Halliday and Hasan's cohesion theory (1976). The continuation clues are lexical clues which occur in a sentence-initial position and which were observed in our coded sample data to predictably indicate either that the current sentence continues the same component as the prior sentence or that there is a change in the component.

These evidence sources for instantiating a discourse-level model of the newspaper text type have been incorporated in the Text Structurer, which evaluates each sentence of an input newspaper article against the six evidence sources for the purpose of assigning a text-level label to each sentence.
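A deliberately simplified sketch of this per-sentence labeling follows, in which each evidence source returns a support score between 0 and 1 per component and the scores are simply summed; the toy sources and the plain summation stand in for the six real evidence sources and for the Dempster-Shafer combination described next, and are assumptions rather than the system's implementation.

# Simplified sketch of combining evidence sources to label a sentence. Each
# source maps a sentence to {component: support in [0, 1]}; supports are summed
# here, whereas the actual Text Structurer combines them with the
# Dempster-Shafer theory of evidence (see below). The sources are placeholders.
from typing import Callable, Dict, List

COMPONENTS = ["LEAD", "MAIN EVENT", "HISTORY", "EXPECTATION", "EVALUATION"]  # subset

EvidenceSource = Callable[[str], Dict[str, float]]

def label_sentence(sentence: str, sources: List[EvidenceSource]) -> str:
    """Combine per-component support from all evidence sources and pick the
    component with the greatest total support (a stand-in for the
    Dempster-Shafer combination used in the real system)."""
    totals = {c: 0.0 for c in COMPONENTS}
    for source in sources:
        for component, support in source(sentence).items():
            totals[component] += support
    return max(totals, key=totals.get)

# Toy evidence sources standing in for the six described above.
def tense_distribution(sentence: str) -> Dict[str, float]:
    past_heavy = " was " in sentence or " had " in sentence
    return {"HISTORY": 0.8 if past_heavy else 0.2,
            "EXPECTATION": 0.1 if past_heavy else 0.5}

def lexical_clues(sentence: str) -> Dict[str, float]:
    return {"EXPECTATION": 0.9} if "expect" in sentence.lower() else {}

print(label_sentence("Analysts expect the rate to fall further.",
                     [tense_distribution, lexical_clues]))   # -> EXPECTATION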
The implementation uses the Dempster-Shafer Theory of Evidence Combination (Shafer, 1976) to coordinate information from the very complex matrices of statistical values for the various evidence sources, which were generated from the intellectual analysis of the sample of 149 WSJ articles (Liddy, Paik, McVearry & Yu, in press). Operationally within DR-LINK, each document is processed a sentence at a time, and each source of evidence assigns a number between 0 and 1 to indicate the degree of support that evidence source provides to the belief that a sentence is of a particular news-text component. Then a simple support function for each component is computed, and the component with the greatest support is selected as the correct tag for that sentence.

To convey how the Text Structurer output is used by DR-LINK, Figure 3 presents a Topic Statement which is highlighted to show its implicit request for future-oriented prediction information. This need for a specific type of information is recognized by the Topic Statement Processor, which maps this need for future-oriented information to EXPECTATION or MAIN, FUTURE components in documents. The Text Structure Matcher then searches for documents with these components, as exemplified by the document in Figure 4, which shows a relevant WSJ article, as structured by the DR-LINK component, in which the required information occurs in sentences which have been correctly tagged as EXPECTATION.

    Tipster Topic Description
    Number: 008
    Topic: Economic Projections
    Description: Document will contain quantitative projections of the future value of some economic indicator for countries other than the U.S.
    Narrative: To be relevant, a document must include a projection of the value of an economic indicator (e.g., gross national product (GNP), stock market, rate of inflation, balance of payments, currency exchange rate, stock market value, per capita income, etc.) for the coming year, for a country other than the United States.
    Concepts:
    1. inflation, stagflation, indicators, index
    2. signs, projection, forecast, future
    3. rise, shift, fall, growth, drop, expansion, slowdown, recovery
    4. %, billions
    5. Not U.S.
    Nationality: Not U.S.
    Time: Future

Fig. 3: Sample Topic Statement

    HEADLINE   West Germany's Central Bank Cuts Key Rate: Little Impact Is Expected On Currency Markets Or Nation's Economy
    LEAD       The West German central bank, as expected, cut the interest rate it charges on securities-repurchase agreements with banks, a move that appears likely to fall short of its goal of supporting the dollar by widening the gap between German and U.S. interest rates.
    MAIN       The Bundesbank yesterday called for interest-rate tenders on 28-day securities-repurchase agreements at a minimum rate of 3.5%, well below the 3.8% rate it had been asking since January.
    EXPECT.    Market experts estimate that the actual rate to emerge from bidding today is likely to be closer to 3.6% or 3.65%.
    EXPECT.    But they expect the Bundesbank to use later tenders to continue to push the rate lower, possibly to 3.3%.
    DEFIN.     The repurchase facility is the central bank's principal open-market means for refinancing the banking system with funds backed by securities.
    EXPECT.    Yesterday's Bundesbank action signals broad declines in capital-market interest rates, and could lead to lower credit rates charged by commercial banks.
    MAIN       The West German move complements a recent lowering of short-term interest rates in Japan and a tightening of credit by the U.S. Federal Reserve Board.
    MAIN       However, foreign-exchange traders said the Bundesbank's action didn't have any immediate effect on the currency market and that small interest-rate movements are unlikely to significantly outweigh other factors weakening the dollar, including the U.S. trade deficit.

Fig. 4: Wall Street Journal article relevant to Topic Statement 8

2.d. Relation-Concept Detector

The main function of the RCD module is to extract relations that connect concepts that otherwise would be treated as isolated and independent. For example, a relation REASON can be extracted from phrases like 'because of', 'as a result of', and 'due to', which form lexical RRF, in order to connect the two constituents occurring before and after the phrases. It should be noted that the RRF we are developing are domain-independent, since they are based on fairly universal linguistic clues rather than on a domain model. There are several types of RRF, derived from a variety of linguistic constructs including verb-oriented thematic roles, complex nominals, proper noun appositions, nominalized verbs, adverbs, prepositional phrases, and some other ad hoc patterns revealing relations that appear in the literature (cf. Somers, 1987) and are expected to be suitable for IR purposes based on our preliminary analysis of the topic statements. We intend to conduct an extensive study of the usefulness of individual relations as part of our failure analysis.

With the different types of RRF stored in a knowledge base, we go through multiple stages of partial linguistic analysis, as opposed to holistic syntactic processing followed by a semantic interpreter, to extract relations and eventually generate CGs. With the tagged, bracketed, and structured text as input, various sub-modules in the RCD selectively detect implicit relations, as well as the concepts being connected, by focusing on occurrences of patterns of interest found in the knowledge base and by bypassing portions of text irrelevant to the relation extraction task. The output of the RCD component is a set of concept-relation-concept triples, where the concepts are often derived from content-bearing words and the relations from non-content words, or indirectly from the linguistic structure, by consulting the knowledge base.
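As a hedged illustration of a lexical RRF of the kind just described, the toy pattern below looks for the cue phrases mentioned above between two already-bracketed constituents and emits a REASON triple connecting them; the bracket notation, the pattern, and the function are assumptions for exposition, not the format of DR-LINK's RRF knowledge base.

# Illustrative only: a toy lexical RRF for the REASON relation, not DR-LINK's
# actual knowledge-base format. It assumes the constituents on either side of
# the cue phrase have already been bracketed by the earlier processing stages.
import re
from typing import List, Tuple

REASON_CUES = ["because of", "as a result of", "due to"]
REASON_PATTERN = re.compile(
    r"\[(?P<left>[^\]]+)\]\s*(?:" + "|".join(REASON_CUES) + r")\s*\[(?P<right>[^\]]+)\]",
    re.IGNORECASE,
)

def reason_triples(bracketed_sentence: str) -> List[Tuple[str, str, str]]:
    """Emit [left] -> (REASON) -> [right] triples for each cue occurrence."""
    return [(m.group("left"), "REASON", m.group("right"))
            for m in REASON_PATTERN.finditer(bracketed_sentence)]

print(reason_triples("[the plant closure] because of [falling demand]"))
# [('the plant closure', 'REASON', 'falling demand')]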
As one example, the Proper Noun (PN) apposition category of RRF helps categorize the many occurrences of PNs in text and determine the semantic relations between a PN and the apposition which either precedes or follows it. In the following sentence fragment, the apposition RRF recognizes and categorizes the LOCATION relation and the PRODUCT relation of the PN General Development:

    "... General Development, a Miami-based developer of planned communities, ..."

    General Development -> (LOCATION) -> Miami
    General Development -> (PRODUCT/SERVICE) -> developer of planned communities

As another example of processing in the RCD module, consider the Case Frame Handler. Given a sentence fragment which has been processed by the tagger and bracketer:

    Venezuela and its creditor banks agreed yesterday to restructure $3 billion of its foreign debt

the verb 'agree' triggers a set of case frames in the knowledge base; with the additional information provided by the preposition 'to' followed by another verb, the correct case frame (i.e. the RRF for case relations) is chosen:

    ((A X S) ((AT ? to+V) (CT ? that)))
where the first element of each list indicates the relation (e.g. A for AGENT, AT for ACTIVITY, and CT for CONTENT), the second specifies the LDOCE semantic restriction (e.g. X for 'abstract or human'), and the last shows the syntactic or lexical clue that must be satisfied in order for the case frame to be instantiated. In this example, S stands for a subject, to+V for an infinitive, and 'that' for a relative clause. The two sub-lists, (AT ? to+V) and (CT ? that), indicate that only one of them is triggered at a given time. Eventually the following set of triples is generated:

    [agree] -> (A) -> [country: Venezuela]
    [agree] -> (A) -> [creditor_bank]
    [agree] -> (AT) -> [restructure]

where the concept node [country: Venezuela] is formed by applying rules from the Proper Noun knowledge base and processor.

2.e. Conceptual Graph Generator

Given the triples generated in the RCD module by applying various types of RRF, the next step is to merge them to form a complete CG for a sentence, a paragraph, or an SRF component. The resulting CGs consist of not only concepts and relations but also instantiations or referents of concepts, which are usually derived from proper nouns such as company names but can also be another CG derived from a nested clause, thereby allowing for nested CGs. Since the different types of RRF in the knowledge base are developed and applied in the RCD module in a more or less independent manner, a form of conflict resolution is necessary in this component.

Given the set of concept-relation-concept triples shown above and another set derived from the verb 'restructure' with the frame ((A X S) (P W O)):

    [restructure] -> (A) -> [country: Venezuela]
    [restructure] -> (A) -> [creditor_bank]
    [restructure] -> (P) -> [debt]

where A and P are the 'agent' and 'patient' relations, and

    [debt] -> (ME) -> [money]
    [debt] -> (CH) -> [foreign]

are produced by a special handler within the RCD, where ME and CH stand for 'measure' and 'characteristics', the CG generator produces the following CG for the sentence:

    [agree] -
        (A)  -> [country: *1 Venezuela]
        (A)  -> [creditor_bank: *2]
        (AT) -> [restructure] -
            (A) -> [country: *1 Venezuela]
            (A) -> [creditor_bank: *2]
            (P) -> [debt] -
                (ME) -> [money]
                (CH) -> [foreign]

where *1 and *2 indicate that the nodes with the same number represent the same concept (a minimal illustrative sketch of this merging step is given below).

While this process of extracting relations and constructing CGs is applied both to documents and to topic statements, the latter require additional processing to capture the unique features of information needs often found in topic statements. Accordingly, CGs generated from topic statements have such additional features as importance weights on concept and relation nodes and ways of indicating whether an instantiation of a concept must exist in relevant documents. This specialized processing, in comparison with document processing, is accomplished by treating topic statements as a sub-language and building a model for them. For example, some information on weights is revealed by phrases like "... optional ..." and "... must exist ...", whereas the need for an instantiation of a concept is indicated by phrases like "Identification of the company must be included".
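A minimal sketch of the merging step illustrated above follows, assuming a simple adjacency-list graph in which identical concept labels are unified into one node (playing the role of the *1 and *2 coreference markers); real CG generation also handles nesting, referents, and conflict resolution among RRF outputs, none of which is modeled here.

# Minimal sketch of merging concept-relation-concept triples into one graph;
# identical concept labels are unified into a single node, standing in for the
# *1/*2 coreference links above. Not the actual Conceptual Graph Generator.
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]   # (concept, relation, concept)

def merge_triples(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Return {concept: [(relation, concept), ...]} with shared concepts unified."""
    graph = defaultdict(list)
    for c1, rel, c2 in triples:
        graph[c1].append((rel, c2))
    return dict(graph)

triples = [
    ("agree", "A", "country: Venezuela"), ("agree", "A", "creditor_bank"),
    ("agree", "AT", "restructure"),
    ("restructure", "A", "country: Venezuela"), ("restructure", "A", "creditor_bank"),
    ("restructure", "P", "debt"),
    ("debt", "ME", "money"), ("debt", "CH", "foreign"),
]
for concept, edges in merge_triples(triples).items():
    print(concept, "->", edges)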
While CG theory provides a framework in which IR entities can be represented adequately, much of the representation task involves intellectual analysis of topic statements and documents so that we capture and store concepts and relations that are ontologically adequate for IR. It is essential, for example, to choose, organize, and classify a restricted set of relations in such a way that they facilitate matching and inferencing between two CGs representing a document and a topic statement. The efficacy of the relations we have chosen will be determined with full experiments and failure analyses.

2.f. Conceptual Graph Matcher

The main function of the CG matching component is to determine the degree to which two CGs share a common structure and to score each document with respect to the topic statement. This is accomplished by employing techniques necessary to model plausible inferences with CGs (Myaeng & Khoo, 1992). In order to allow for approximate matching between concept nodes or relation nodes, we have developed a matrix that represents similarities among the relations used in the CG representation, as well as among some concepts. Our goal is to enhance both precision and recall. By exploiting the structure of the CGs and the nature of the relations, we attempt to meet the specific information needs expressed in topic statements. By allowing for partial matching (e.g. between 'debt' and 'bank debt') and inexact matching (e.g. between 'debt' and 'loan', and between 'CO-AGENT' and 'AGENT') at the node level, we can increase recall.

For CG matching, we first developed and implemented a base algorithm that is flexible enough to allow for various types of partial matching between two CGs, and ran experiments to test its practicality (Myaeng & Lopez-Lopez, 1992). While the general subgraph isomorphism problem is known to be computationally intractable, matching CGs containing conceptual information (i.e. labels on nodes) appears to be practical. With improved understanding of the nature of RRF and the exact features of the representation of documents and topic statements, we have developed and implemented heuristics for scoring documents for their relevancy with respect to a given topic statement, and added them to the base algorithm. With the rich representation of topic statements and documents, the expanded matching component is currently capable of discriminating documents based on the rather unusual information needs specified in a topic statement, such as a certain status of an event or the need to contain a specific entity (e.g. a company name) satisfying a certain condition. It also facilitates the process of reliably identifying "hot spots" in retrieved documents.

An example of a match between a document CG and a topic statement CG is shown in Figure 5, where the entire CG represents the example sentence discussed above and the topic statement sentence: "...a current debt rescheduling agreement... between a debtor developing country and one or more of its creditors, commercial and/or official. It will identify the debtor country and the creditor(s), the payment time period..., and the interest rate, ..." The dark area and the individual nodes with gray shading represent the matched parts: a connected sub-CG that is the main contributor to the final score, and some single-node matches, respectively. It should be noted that information on the relative importance of the text components delineated by the Text Structurer with respect to the topic statement is incorporated in the scoring heuristics.

An independent module under development, which will have a direct bearing on the final results of matching and which will be added at a later stage to determine its efficacy, is the RIT (Roget's International Thesaurus) Coder.
The goal is to simulate the use of a type hierarchy in CG theory by replacing lexical terms in the concept nodes of our representation with RIT semicolon-group numbers, each of which represents a set of semantically similar words and phrases in the RIT hierarchy. Our approach is to use the words surrounding the target word to be replaced with an RIT code as context words by which we try to disambiguate the sense of the target word and find its exact location in the RIT. While this approach is intended to increase both recall and precision at the same time, through implicit expansion of terms and sense disambiguation, we will have to assess the sensitivity of the overall retrieval results to incomplete disambiguation.

3. Testing and Results

Although the DR-LINK entry in TREC was not tested in the same manner as the other systems, the status of the system and the testing which was done were consistent with the milestones that had been established in consultation with our TIPSTER contractor. DR-LINK was not tested as a full system because it is based on several theories that had never before been implemented in an information retrieval system, and none of the system's components was even at the design stage of development when the project began. As a result, our system is not yet fully implemented. The results that follow are for the three modules (Subject Field Coder, Text Structurer, Conceptual Graph Matcher) that have been implemented to date. The full system will be tested at the eighteen-month meeting of TIPSTER.

[Fig. 5: Example of CG matching result]

3.a. Subject Field Coder Testing

The Subject Field Coder can be evaluated in its filtering function in the following way: The SFC vectors of documents in a database are compared to a topic statement vector and, for each topic statement, ranked according to similarity. The question then is, how far down this ranked list would the system need to proceed in order to include all the relevant documents in the set of documents that was passed on to the next system module? This testing procedure has been run using the WSJ and Ziff collections on Disk 1 and Topic Statements 1 to 50 provided by TREC. In the Wall Street Journal database, results showed that, on average across the 50 topic statements, all the relevant documents were ranked in the top 28% of the list. Therefore, on average, 72% of the WSJ database did not need to be further processed by the later modules in the system. In the Ziff database, on average across the 50 topic statements, all the relevant documents were ranked in the top 66% of the database, a much poorer result. Therefore, on average, only 34% of the Ziff database need not be further processed. When Topic Statements 51 to 100 were searched on the WSJ database, all the relevant documents were ranked in the top 32% of the list.

Error analysis of the Ziff results has revealed the cause of the poorer performance on that database. Namely, for several of the queries, documents were judged relevant only if they contained the particular proper noun mentioned in the topic statement (e.g. OS/2, Mitsubishi, IBM's SAA standards, etc.). Given these types of topic statements, the most appropriate search approach for a system would be keyword matching, which is not at all the type of matching that is done using the SFC representation. The SFCs represent a document at a higher level of abstraction, not at the keyword level.
That is, documents which discuss a particular computer will have a strong weighting on the Data Processing slot of the SFC vector, but no means for matching on a particular computer name. Therefore, the error analysis showed that the SFC performance was hampered by its inability to match at the level of a specific company name or product. Fortunately, we have at hand the means to improve the results, as we are in the process of incorporating a second level of document ranking using Proper Noun processing algorithms. Results using this extended representation will be available at the eighteen-month TIPSTER meeting.

3.b. Text Structurer Testing

The Text Structurer was tested using five of the six evidence sources, as we have not yet implemented the algorithms for incorporating evidence from the Continuation Clues. We tested the Text Structurer on a set of 116 WSJ documents consisting of several thousand sentences. This first test resulted in 72% of the sentences being correctly identified. An additional hand simulation of one small heuristic adjustment was tested and improved the system's performance to 74% of the sentences being correctly identified. A second run, on a smaller sample of sentences, resulted in 80% correct identification of components. Ongoing efforts at improving the quality of the evidence sources used by the Text Structurer, plus the incorporation of the Continuation Clue evidence, promise to improve these results significantly. It should be remembered, as well, that automatic discourse-level structuring of documents has not been reported elsewhere in the literature, so this very new type of text processing is in its infancy and likely has much room for future improvement.

3.c. Conceptual Graph Matcher Testing

As a first prototype, the CG Matcher was implemented with a set of scoring heuristics whose theoretical basis is the Dempster-Shafer theory of evidence. The score for a document is computed progressively from node-level evidence through multiple stages to generate a score for text units of different granularities. In order to test the feasibility of the prototype module, we have run it with manually generated CGs for twenty documents and five topic statements for which we have relevance judgments. While it is premature to draw any conclusions on the efficacy of the matching algorithm and the heuristics, mainly due to the size of the data set and the stage of development, the results were encouraging and have provided us with much insight into how the scoring heuristics need to be tuned.

4. Conclusions

This paper on the DR-LINK system should be considered a report on a work in progress, since we did not have a fully developed system at the time of the TREC testing. However, we do believe that the three system components which were tested perform quite respectably, given their innovativeness. Continued development and feedback from the TREC results will provide much more refined versions of these system modules. In addition, two system modules remain to be developed, and the full system, which is quite synergistic in its approach to achieving its goals, remains to be integrated and tested as a whole. The Relation-Concept Detector and Conceptual Graph Generator modules are being implemented in tandem and, when completed, will make the DR-LINK system fully operational. Full system testing will be conducted for the eighteen-month TIPSTER evaluation.
In the interim, our goal in this paper has been to describe the five unique modules which comprise DR-LINK and which, in combination, promise to provide a full system with the filtering power needed to make later processing more accurate and the depth of linguistic processing required to provide real conceptual-level matching and retrieval.

References

Halliday, M. A. K. & Hasan, R. (1976). Cohesion in English. London: Longman.

Liddy, E. D. & Paik, W. (1992). Statistically-guided word sense disambiguation. In Proceedings of the AAAI Fall Symposium Series: Probabilistic Approaches to Natural Language. Menlo Park, CA: AAAI.

Liddy, E. D., Paik, W., McVearry, K. & Yu, E. (In press). Automatic discourse-level structuring of newspaper texts: Empirical testing of a model.

Liddy, E. D., Paik, W. & Woelfel, J. (1992). Use of subject field codes from a machine-readable dictionary for automatic classification of documents. Proceedings of the 3rd ASIS Classification Research Workshop.

Meteer, M., Schwartz, R. & Weischedel, R. (1991). POST: Using probabilities in language processing. Proceedings of the Twelfth International Joint Conference on Artificial Intelligence. Sydney, Australia.

Myaeng, S. H. (1992). Using conceptual graphs for information retrieval: A framework for representation and flexible inferencing. Proceedings of the Symposium on Document Analysis and Information Retrieval, Las Vegas, March 16-18.

Myaeng, S. H. & Khoo, C. (1992). On uncertainty handling in plausible reasoning with conceptual graphs. Proceedings of the 7th Workshop on Conceptual Graphs, Las Cruces, NM, July 1992.

Myaeng, S. H. & Lopez-Lopez, A. (1992). Conceptual graph matching: A flexible algorithm and experiments. Journal of Experimental and Theoretical Artificial Intelligence, 4, 107-126.

Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton, NJ: Princeton University Press.

Somers, H. L. (1987). Valency and Case in Computational Linguistics. Edinburgh: Edinburgh University Press.

Sowa, J. (1984). Conceptual Structures: Information Processing in Mind and Machine. Reading, MA: Addison-Wesley.

van Dijk, T. (1988). News Analysis: Case Studies of International and National News in the Press. Hillsdale, NJ: Lawrence Erlbaum Associates.

Ward, J. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58, 236-244.