DR-LINK: A System Update for TREC-2

Elizabeth D. Liddy
Sung H. Myaeng
School of Information Studies
Syracuse University
Syracuse, New York 13244-4100
liddy@mailbox.syr.edu; shmyaeng@mailbox.syr.edu

1. Overview of DR-LINK's Approach

The theoretical goal underlying the DR-LINK System is to represent and match documents and queries at the various linguistic levels at which human language conveys meaning. Accordingly, we have developed a modular system which processes and represents text at the lexical, syntactic, semantic, and discourse levels of language. In concert, these levels of processing permit DR-LINK to achieve a level of intelligent retrieval beyond more traditional approaches. In addition, the rich annotations to text produced by DR-LINK are replete with much of the semantics necessary for document extraction. The system was planned and developed in a modular fashion, and functional modularity has been achieved, while a full integration of these multiple levels of linguistic processing is within reach.

As currently configured, DR-LINK performs a staged processing of documents, with each module adding a meaningful annotation to the text. For matching, a Topic Statement undergoes analogous processing to determine its relevancy requirements for documents at each stage. Among the many benefits of staged processing are: improvements and changes can be easily made within any module; the contribution of the various stages can be empirically tested by simply turning them on or off; modules can be re-ordered (as was done within the last six months) in order to utilize document annotations in various ways; and individual modules can be incorporated in other evolving systems.

The purpose of each of the processing modules will be briefly introduced here (also see Figure 1) in the order in which the system is currently run, with fuller explanations provided in the section below: 1) the Text Structurer labels clauses or sentences with a text-component tag which provides a means for responding to the discourse-level Topic Statement requirements of time, source, intentionality, and state of completion; 2) the Subject Field Coder provides a subject-based, summary-level vector representation of the content of each text; 3) the Proper Noun Interpreter and 4) the Complex Nominal Phraser provide precise levels of content representation in the form of concepts and relations, as well as controlled expansion of group nouns and content-bearing nominal phrases; 5) the Relation-Concept Detector produces concept-relation-concept triples with a range of semantic relations expressed via various syntactic classes, e.g. verbs, nominalized verbs, complex nominals, and proper nouns; 6) the Conceptual Graph Generator combines the triples to form a CG and adds Roget's International Thesaurus (RIT) codes to concept nodes; and 7) the Conceptual Graph Matcher determines the degree of overlap between a query graph and the graphs of those documents which surpass a statistically predetermined criterion of likelihood of relevance, based on ranking by the integrated processing of the first four system modules.

2. Detailed System Description

In the following system description, emphasis is placed on work accomplished within the last year, plus a basic overview description of each module. The more rudimentary processing details of each module, plus fuller description of earlier development, are available in the TREC-1 Proceedings (Harman, 1993).
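Before turning to the individual modules, the following sketch (in Python; illustrative, not DR-LINK source code) shows the staged architecture just described: each module is an independent stage that adds its own annotation layer to a shared document record, so stages can be reordered or switched off for empirical testing. The module internals shown are placeholders.

# Illustrative sketch of staged, modular processing; module bodies are stubs.
from typing import Callable, Dict, List

Document = Dict[str, object]  # raw text plus accumulated annotations

def text_structurer(doc: Document) -> Document:
    doc["structure"] = []  # one component tag per clause, e.g. "MAIN EVENT"
    return doc

def subject_field_coder(doc: Document) -> Document:
    doc["sfc_vector"] = []  # summary-level subject vector
    return doc

# ... Proper Noun Interpreter, Complex Nominal Phraser, etc. would follow.

PIPELINE: List[Callable[[Document], Document]] = [
    text_structurer,
    subject_field_coder,
]

def process(doc: Document, stages=PIPELINE) -> Document:
    # Stages can be reordered or omitted independently, which is how the
    # contribution of each module can be tested by turning it on or off.
    for stage in stages:
        doc = stage(doc)
    return doc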
2.A. Text Structurer

Since human interpretation of text is influenced by expectations regarding the text to be read, discourse-level analysis is required for a system to approximate the same level of meaningful representation and matching. DR-LINK's Text Structurer is based on discourse linguistic theory, which suggests that texts of a particular type have a predictable text-level structure which is used by both producers and readers of that text-type as an indication of how and where certain information endemic to that text-type will be conveyed.

[Fig. 1: DR-LINK System. The original flowchart shows documents and topics flowing through the Preprocessor; Text Structurer (TS); Proper Noun (PN) Interpreter; Complex Nominal (CN) Phraser; Subject Field Coder (SFC); SFC Matcher; TS+PN+CN Inverted File Creator; NLP Query Constructor; TS+PN+CN Matcher; Relation Concept Detector; Conceptual Graph (CG) Generator; Topic CG Processor; Integrated Matcher; Recall Predictor; and CG Matcher, with each matching stage producing a ranked list.]

We have implemented a Text Structurer for the newspaper text-type, which produces an annotated version of a news article in which each clause or sentence is tagged for the specific slot it instantiates in the news-text model, an extension of van Dijk's earlier model (1988). The structural annotations are used to respond more precisely to information needs expressed in Topic Statements, where some aspects of relevancy can only be met by understanding a Topic Statement's discourse requirements. For example, Topic Statement 75 states that:

Document will identify an instance in which automation has clearly paid off or conversely, has failed.

which contains the implicit discourse requirement that relevant instances should occur in the CONSEQUENCE component of a news article. DR-LINK extracts this requirement from the Topic Statement and will only assign a similarity value for the discourse level of relevance to those documents in which the sought information occurs in a CONSEQUENCE component.

The current news-text model consists of thirty-eight recognizable components of information observed in a large sample of training texts (e.g. MAIN EVENT, VERBAL REACTION, EVALUATION, FUTURE CONSEQUENCE, PREVIOUS EVENT). The Text Structurer assigns these component labels to document clauses or sentences on the basis of lexical clues learned from text, which now comprise a special lexicon. We considered expanding the lexicon via available machine-readable lexical resources, but our analysis of these resources suggested that they do not capture the particularities of lexical usage in the sublanguage of newspaper reporting.

The Text Structurer has recently been improved to assign structural tags at the clause level, a refinement which has corrected most of the anomalies that were observed in earlier testings of the Text Structurer. For example, given the new clause-level structuring, the following sentence is correctly interpreted as containing both future-oriented information in the LEAD-FUTURE segment and some nested information regarding a past situation in the LEAD-HISTORY segment:

<LEAD-FUTURE> South Korea's trade surplus, <LEAD-HISTORY> which more than doubled in 1987 to $6.55 billion, </LEAD-HISTORY> is expected to narrow this year to above $4 billion.

We have recently implemented new matching techniques which more fully realize the Text Structurer's potential contribution to the system's performance.
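The following is a minimal sketch of the clue-based tagging idea. The clue lexicon and the fallback component are invented for illustration; DR-LINK's actual clues are learned from a training sample of news text and applied clause by clause.

# Hypothetical clue lexicon mapping lexical clues to news-text components.
CLUE_LEXICON = {
    "is expected to": "LEAD-FUTURE",
    "doubled in": "LEAD-HISTORY",
    "said": "VERBAL REACTION",
    "yesterday": "PREVIOUS EVENT",
}

def tag_clause(clause: str, default: str = "MAIN EVENT") -> str:
    # Assign the component whose clue appears in the clause; fall back to
    # a default component when no clue fires.
    lowered = clause.lower()
    for clue, component in CLUE_LEXICON.items():
        if clue in lowered:
            return component
    return default

print(tag_clause("which more than doubled in 1987 to $6.55 billion"))
# -> LEAD-HISTORY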
These new matching techniques were one outcome of a study which greatly increased our understanding of how text structure requirements in Topic Statements should be used for matching documents to Topic Statements. Analysis of relevant and non-relevant documents retrieved for a test sample of Topic Statements indicated that most of the errors in the Text Structurer's matching were not serious errors, but only slight mismatches in terms of the conceptual definitions of some of the text model's components. This suggested that our model was overly specific for the task of responding to discourse aspects of information requirements, and that matching Text Structure needs from a Topic Statement to structured documents called for a more generalized model. That is, Topic Statement text-structure requirements are not expressed at the same level of specificity at which Text Structure components are recognizable in documents. Given this, we reduced the matching complexity via a function that maps the thirty-eight news-text components to seven meta-components. These are: LEAD-MAIN, HISTORY, FUTURE, CONSEQUENCE, EVALUATION, ONGOING, and OTHERS. The new approach allows the system to continue to impose the finer-level, 38-component structure on the newspaper articles themselves with excellent precision, but maps this fuller set of text components to the seven meta-components at the matching stage, as the Topic Statements' text structure requirements are coded at the meta-component level. Unofficial experimental results indicate that this new scheme has significantly increased the Text Structurer's contribution to an improved level of precision in the retrieval of relevant documents.

While the Text Structurer module processes documents as described above, the analysis of Topic Statements for their Text Structure requirements is done by the Natural Language Query Constructor (QC), which also analyzes the proper noun and complex nominal requirements of Topic Statements. The QC, as well as the matching and ranking of documents using these sources of linguistic information, is described below.

2.B. Subject Field Coder

The Subject Field Coder (SFC), as reported at TREC-1, has been producing consistently reliable semantic vectors to represent both documents and Topic Statements using the semantic codes assigned to word senses in a machine-readable dictionary. Details regarding this process are reported in Liddy et al. (1993). Our more recent efforts on this module have focused on multiple ways to exploit SFC-based similarity values between document and query vectors. One implementation is the use of the ranked vector-similarity values for predicting a cut-off criterion of potentially relevant documents when the module is used as an initial filter. This replaces the earlier practice used in the eighteen-month TIPSTER testing, where documents were ranked by their SFC-vector similarity to a query SFC-vector and the top two thousand documents were passed to the CG Matcher, since CG matching is too computationally expensive to handle all documents in the collection. To report the SFC's performance at that time, we reported how far down the ranked list of documents the system would need to process documents in order to get all the judged relevant documents. Although the results were highly promising (all relevant documents were, on average, in the top 37% of the ranked list based on SFC similarity values), this figure varies considerably for individual Topic Statements.

Therefore, we needed to devise a method for predicting a priori, for individual Topic Statements, the cut-off criterion for any desired level of recall. We first developed a method that could successfully predict a cut-off criterion based on just SFC similarity values. We then extended the algorithm to incorporate the similarity values produced when proper noun, complex nominal, and text structure requirements are considered as well, to produce an integrated ranking based on these varied sources of linguistic information.

The SFC-based cut-off criterion uses a multiple regression formula which was developed on the odd-numbered Topic Statements from 1 to 50 and a training corpus of Wall Street Journal articles. The regression formula takes into account the distribution of similarity values for documents in response to a particular query by incorporating the mean and standard deviation of the similarity value distribution, the similarity of the top-ranked document, and the desired recall level. The cut-off criterion was tested on the held-out twenty-five Topic Statements. The averaged results, when a user is striving for 100% recall, showed that only 39.65% of the 173,255 documents would need to be processed further. And this document set, in fact, contained 92% of the judged-relevant documents.

The advantage of the cut-off criterion is its sensitivity to the varied distributions of SFC similarity values for individual Topic Statements, which appears to reflect how "appropriate" a Topic Statement is for a particular database. For many queries, a relatively small portion of the database, when ranked by similarity to the Topic Statement, will need to be further processed. For example, for Topic Statement forty-two, when the goal is 100% recall, the regression formula predicts a cut-off criterion similarity value which requires that only 13% of the ranked output be further processed, and the available relevance judgments show that this pool of documents contains 99% of the documents judged relevant for that query.
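A hedged sketch of such a cut-off predictor follows. The regression coefficients b0 through b4 are hypothetical placeholders; the predictors mirror those named above (the distribution's mean and standard deviation, the top-ranked similarity, and the desired recall level).

# Sketch of an SFC cut-off predictor; coefficients are invented, not the
# fitted values from the odd-numbered Topic Statements 1-50.
from statistics import mean, stdev

def predict_cutoff(sims, desired_recall,
                   b0=0.0, b1=1.0, b2=1.0, b3=0.5, b4=-0.1):
    # sims: SFC similarity values of all documents for one Topic Statement.
    top = max(sims)
    return (b0 + b1 * mean(sims) + b2 * stdev(sims)
            + b3 * top + b4 * desired_recall)

def filter_documents(ranked, desired_recall):
    # ranked: list of (doc_id, similarity), sorted by similarity, descending.
    cutoff = predict_cutoff([s for _, s in ranked], desired_recall)
    return [(d, s) for d, s in ranked if s >= cutoff]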
2.C. V-8 Matching

Given the complete modularity of the first four modules in the system, for the twenty-four month TIPSTER testing we reordered two modules so that Text Structuring is done prior to Subject Field Coding. This allowed us to implement and test a new version of matching which combines in a unique way the Text Structurer and the Subject Field Coder. We refer to this version as the V-8 model, since eight SFC vectors are produced for each document, one for each of the seven meta-categories, plus one for all of the categories combined. The V-8 model, therefore, provides multiple SFC vectors for each document, thereby representing the distribution of SFCs over the various meta-text components that occur in a news-text document. This means, in the V-8 matching, that if certain content areas of the Topic Statement are required to occur in a document in one meta-text component, e.g. CONSEQUENCE, and other content is required to occur in another meta-text component, e.g. FUTURE, this proportional division can be matched against the V-8 vectors produced for each document at a fairly abstract, subject level. For the TIPSTER twenty-four month evaluation, we have experimented with several formulas for combining the similarity values of the multiple SFC vectors produced for each document, including both a Dempster-Shafer combination and a straight averaging.
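To make the combination step concrete, here is a hedged sketch of combining a document's eight SFC-vector similarities. Straight averaging is shown directly; for Dempster-Shafer we show only the special case of simple support functions that all back the same hypothesis (relevance), where Dempster's rule reduces to 1 - prod(1 - s_i). The combination formulas actually tested in DR-LINK may differ.

# Two ways of combining the 8 per-meta-component similarities of one document.
from math import prod

def average_combine(sims):
    # sims: similarity of the topic vector to each of the 8 SFC vectors
    # (seven meta-components plus one for all categories combined).
    return sum(sims) / len(sims)

def dempster_combine(sims):
    # Treat each similarity as the mass a simple support function assigns
    # to relevance; Dempster's rule for this special case.
    return 1 - prod(1 - s for s in sims)

sims = [0.12, 0.40, 0.05, 0.00, 0.22, 0.08, 0.00, 0.31]
print(average_combine(sims), dempster_combine(sims))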
Although official results are not yet available, our internal test results indicate that the combination of Text Structuring and Subject Field Coding produces an improved ranking of documents, especially when using the Dempster-Shafer method.

2.D. Proper Noun Interpreter

Our earlier work with the SFCoder suggested that the most important factor in improving the performance of this upstream ranking module would be to integrate the general subject-level representation provided by SFCodes with a level of text representation that enabled more refined discrimination. Analysis of earlier test results suggested that proper noun (PN) matching that incorporated both particular proper nouns (e.g. Argentina, FAA) as well as `category' level proper nouns (e.g. third-world country, government agency) would improve precision performance. The Proper Noun Interpreter (Paik et al., 1993) that we developed provides: a canonical representation of each proper noun; a classification of each proper noun into one of thirty-seven categories; and a means for expanding group nouns into their constituent members (e.g. all the countries comprising the Third World). Recent work on our proper noun algorithms, context-based rules, and knowledge bases has improved the module's ability to recognize and categorize proper nouns to 93% correct categorization using 37 categories, as tested on a sample set of 545 proper nouns from newspaper text. The improved performance has a double impact on the system's retrieval performance, as proper nouns contribute both to the downstream relation-concept representation used in CG matching as well as to the upstream proper noun, complex nominal, and text structure ranking of documents in relation to individual queries. Details of processing Topic Statements for their PN requirements, and the use of this similarity value in document ranking, are described in the later section on the Query Constructor.

2.E. Complex Nominal Phraser

A new level of natural language processing has been incorporated in the DR-LINK System with the implementation of the Complex Nominal (CN) Phraser. The motivation behind this addition was our recognition that, either in addition to proper nouns or in their absence, most of the substantive content requirements of Topic Statements are expressed in complex nominals (i.e. noun + noun or non-predicating adjective + noun pairs, e.g. "debt reduction", "government assistance", "health hazards"). Complex nominals provide a linguistic means for precise conceptual matching, as do proper nouns. However, the conceptual content of complex nominals can be expressed in synonymous phrases, in a different way than can the conceptual content of proper nouns, which are more particularized. Therefore, for complex nominals, a controlled expansion step was incorporated in the CN matching process in order to accomplish the desired goals of improved recall as well as improved precision.

For input to the CN Phraser, the complex nominals in Topic Statements are recognizable as adjacent noun pairs or non-predicating adjective + noun pairs in the output of the part-of-speech tagger. Having recognized all CNs, the substitutable phrases for each complex nominal are found by computationally determining the overlap of synonymous terms suggested by RIT and by statistical corpus analysis. These processes serve to identify all second order associations between each complex nominal constituent and terms in the database. Second order associations exist between terms that are used interchangeably in certain contexts. The premise here is that if, for example, terms a and b are both frequently premodified by the same set of terms in a corpus, it is highly likely that terms a and b are substitutable for each other within these phrases. The use of both corpus and RIT information appears to limit the over-generation that frequently results from automatic term expansion.
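A minimal sketch of this premise follows, assuming a precomputed map from each noun to the set of terms that premodify it in the corpus. The Jaccard overlap and the threshold are invented for illustration; the real process intersects this corpus evidence with synonym candidates suggested by RIT.

# Second-order association between two terms via shared premodifiers.
def second_order_similar(a, b, premodifiers, threshold=0.5):
    pa, pb = premodifiers.get(a, set()), premodifiers.get(b, set())
    if not pa or not pb:
        return False
    overlap = len(pa & pb) / len(pa | pb)  # Jaccard overlap of premodifiers
    return overlap >= threshold

premods = {"reduction": {"debt", "deficit", "cost"},
           "cut":       {"debt", "cost", "tax"}}
print(second_order_similar("reduction", "cut", premods, threshold=0.3))
# -> True: "reduction" and "cut" share premodifiers, so they may be
#    substitutable within phrases such as "debt reduction" / "debt cut".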
Ongoing experiments on this new addition to the system will help us further refine the process and will be reported more extensively in the near future.

The terms that exhibit second order associations are compiled into equivalence classes. These equivalence classes provide substitutable synonymous phrases for Topic Statement complex nominals and are used by the matching algorithms in the same manner as the original complex nominals. The complex nominals and their substitutes are first used in the upstream matching of Topic Statements to documents as one contributing factor to the integrated similarity value, to be further explained in the section on the Query Constructor. In addition, each complex nominal and its assigned relation provides a CRC to the RCD module for use in the final round of matching. For that module, semantic relations between the constituent nouns of each complex nominal are assigned manually, using an ontology of forty-three relations. Some example complex nominals plus their relations are:

[press] <- (SOURCE) <- [commentaries]
[growth] -> (MEASURE) -> [rate]
[electronic] <- (MEANS) <- [theft]
[campaign] <- (USED-FOR) <- [finances]

The development of the complex nominal CRC knowledge base was an intellectual effort for the twenty-four month testing, but our current task is the full automation of the semantic relation assignment. Although difficult, our experience with the intellectual process has encouraged us to pursue appropriate NLP-based machine-learning techniques which will enable the system to automatically recognize and code semantic relations in complex nominals.

In CG matching, the existence of both case-frame relations and complex nominal relations makes it possible for the system to detect conceptual similarity even when it is expressed in different grammatical structures, such as a verb + arguments in a Topic Statement and a complex nominal in a document, e.g.:

"reduce the debt" = [reduc*] -> (OBJECT) -> [debt]
"debt reduction" = [debt] <- (OBJECT) <- [reduc*]

To achieve the fullest exploitation of relational information despite grammatical realization, a further step was necessary in order to match CRCs produced by verb-based analysis against CRCs produced by complex nominal analysis. This required the determination of the degree of relation-similarity across the two relation sets. There are approximately sixty relations used in case frames, while there are approximately forty relations used in complex nominals. A relation-similarity table was constructed that assigns a degree of similarity between twenty-eight pairs across the two grammatically-distinguished sets, and a degree of similarity between pairs within the same set. The relation-similarity table is used in the final CG matching to allow concepts that are linked by a relation in a document that is different from the relation that links the same two concepts in the Topic Statement to still be awarded some degree of similarity. The quality and appropriateness of the similarity table will be determined by the results of the twenty-four month testing, which will also provide empirical evidence of the Complex Nominal Phraser's impact on performance.
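The sketch below illustrates how such a table can be consulted at match time. The particular relation pairs and similarity values are invented for illustration, not entries from the actual table.

# Symmetric relation-similarity lookup with invented example entries.
REL_SIM = {
    frozenset(["OBJECT", "PATIENT"]): 0.8,
    frozenset(["MEANS", "INSTRUMENT"]): 0.7,
    frozenset(["AGENT", "EXPERIENCER"]): 0.6,
}

def relation_similarity(r1, r2):
    if r1 == r2:
        return 1.0
    return REL_SIM.get(frozenset([r1, r2]), 0.0)

# A query triple [reduc*] -> (OBJECT) -> [debt] can still earn partial
# credit against a document triple [reduc*] -> (PATIENT) -> [debt]:
print(relation_similarity("OBJECT", "PATIENT"))  # -> 0.8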
Sample runs have indicated that the inclusion of complex nominals has a strongly positive impact on our results in both of its incorporations in the system.

2.F. Natural Language Query Constructor

We have implemented a Natural Language Query Constructor (QC) for DR-LINK which takes as input a Topic Statement which has been pre-processed by straightforward techniques, such as part-of-speech tagging, as well as SGML-tagging of the meta-language which reflects the typical request-presentation language used in Topic Statements (e.g. "A relevant document will ..." or "To be relevant ..."). The QC produces a query which reflects the appropriate logical combinations of the text structure, proper noun, and complex nominal requirements of a Topic Statement. The basis of the QC is a sublanguage grammar which is a generalization over the regularities exhibited in the Topic, Description, and Narrative fields of the one hundred fifty TIPSTER Topic Statements. It should be noted that the sublanguage grammar, with minor modifications, is capable of handling non-TIPSTER queries, so its generalized utility is promising. Earlier work (Liddy et al., 1991) demonstrated that the sublanguage approach is an effective and efficient approach to natural language processing tasks within a particular text-type, here Topic Statements.

For the twenty-four month runs, the QC sublanguage grammar detects the required logical combination of text structure components, proper nouns, and complex nominals. These are the specific entities which we consider to be particularly revealing indicators of relevant documents. In most cases, matching on these classes produces high-precision ranked results, although there are some instances in which single common nouns may also be needed. After analyzing the twenty-four month results, we will determine whether to expand the range of linguistic types which can be used to instantiate the variables in the QC's logical assertions.

The QC sublanguage grammar relies on function words (e.g. conjunctions, prepositions, relative pronouns), meta-level phrases (e.g. "such as", "examples of", "as well as"), and punctuation (e.g. commas, semi-colons) to recognize and extract the relevancy requirements of Topic Statements. These linguistic features serve as clues to the `organizing' structure of a Topic Statement and present each Topic Statement's unique thematic content in a recognizable frame. The QC sublanguage interprets a Topic Statement via pattern-action rules which are used to reduce each sentence in a Topic Statement to a first order logic assertion, reflecting the boolean-like requirements of Topic Statements, including NOT'd assertions. In addition, definite noun phrase anaphors are recognized and resolved by sublanguage grammar processing rules.
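A much-simplified sketch of the meta-language stripping and assertion building follows. The regular expressions and the flat AND-assertion stand in for the sublanguage grammar's pattern-action rules, which also handle disjunction, negation, and anaphora.

# Toy reduction of a Topic Statement sentence to a boolean-like assertion.
import re

META_PATTERNS = [r"^a relevant document will\s+", r"^to be relevant,?\s+"]

def strip_meta(sentence: str) -> str:
    s = sentence.lower().strip()
    for pat in META_PATTERNS:
        s = re.sub(pat, "", s)
    return s

def to_assertion(sentence: str):
    # Conjoin the comma- and "and"-separated requirements of one sentence.
    body = strip_meta(sentence)
    terms = [t.strip() for t in re.split(r",| and ", body) if t.strip()]
    return {"op": "AND", "terms": terms}

print(to_assertion("A relevant document will identify insider trading "
                   "and stock-market practices"))
# -> {'op': 'AND', 'terms': ['identify insider trading',
#                            'stock-market practices']}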
2.G. Integrated Matcher

Each logical assertion produced by the QC for a Topic Statement is evaluated against the entries in the document inverted file, and a weight is assigned to each segment of text (either a clause or a sentence) which has any similarity. The weighting scheme we are currently using evolved from iterative testing. Each segment of text is indexed in the inverted file with a text structure component label and will be assigned a weight if it contains any proper nouns or complex nominals that match the Topic Statement's requirements. The following weights are assigned:

proper noun = 1.00
complex nominal = 1.00
proper noun category = 0.50

This means, for example, that if, in response to the following requirement from a Topic Statement:

A relevant document will provide data on Japanese laws, regulations, and/or practices which help the foreigner understand how Japan controls, or does not control, stock-market practices which could be labeled as insider trading.

a document text-segment contains `Japanese law', and `stock-market practice' (or one of its synonymous phrases), and `insider trading' (or one of its synonymous phrases), that segment is assigned a preliminary value of 3.00. Depending on which field in the Topic Statement the assertion came from, and on whether the document text-segment matches the Topic Statement's Text Structure requirement, the preliminary value will be multiplied by one of the following coefficients:

Topic field and required Text Structure component = 1.00
Desc, Narr, or Concept field and required Text Structure component = 0.75
Topic field and non-required Text Structure component = 0.50
Desc, Narr, or Concept field and non-required Text Structure component = 0.25

So if `Japanese law' and `stock-market practice' and `insider trading' were conceptual requirements from a Topic field assertion that also required them to occur in an EVALUATION or LEAD-MAIN text component, and they occurred in a document text segment which has been tagged by the Text Structurer as EVALUATION, the value of 3 would be multiplied by 1; whereas if that assertion came from the Description field of the Topic Statement and the three required phrases occurred in a document text segment labelled CONSEQUENCE by the Text Structurer, the value of 3 would be multiplied by .25.

Since the QC interprets each sentence in the Topic, Description, Narrative, and Concept fields of a Topic Statement, multiple, sometimes overlapping, sometimes repetitive assertions are produced for a single Topic Statement. In the current implementation, each of these Topic Statement assertions is compared to the inverted document file, and the highest similarity value for a single assertion in the document is used as that document's integrated similarity value for that Topic Statement.

The similarity value which results from the QC module matching is combined with the SFC similarity value of the document, and an integrated similarity score for each document is produced. This similarity value can be used in several ways. Firstly, the two similarity values can be used to provide a full ranking of all the documents which takes into account the lexical, semantic, and discourse sources of linguistic information in both documents and queries. Secondly, it can serve as input to a filter which uses a more complex version of the original cut-off criterion to determine how many documents should be further processed by the system's final modules.

For the Integrated Matcher to produce a combined ranking, each document's similarity value for a given Topic Statement can be thought of as being composed of two elements. One element is the SFC similarity value, and the other is the similarity value that represents the combined proper noun, complex nominal, and text structure similarities. Additionally, the system will have computed the regression formula, the mean, and the standard deviation of the distribution of the SFC similarity values for the individual Topic Statement. Using these statistical values, the system produces the cut-off criterion value.
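Before turning to the segmented ranking, here is a sketch of the segment weighting scheme described at the start of this section, using the term weights and field/structure coefficients listed above; inverted-file lookup and synonymous-phrase matching are abstracted away.

# Segment scoring from matched term types, originating field, and whether
# the segment's text-structure tag satisfies the assertion's requirement.
TERM_WEIGHTS = {"proper_noun": 1.00, "complex_nominal": 1.00,
                "pn_category": 0.50}

COEFFICIENT = {("topic", True): 1.00, ("other", True): 0.75,
               ("topic", False): 0.50, ("other", False): 0.25}

def score_segment(matches, field, structure_matches):
    # matches: term types matched in this segment, e.g.
    #   ["complex_nominal", "complex_nominal", "proper_noun"].
    # field: "topic" for the Topic field, "other" for Desc/Narr/Concept.
    # structure_matches: True if the segment's tag meets the required
    #   meta-component.
    preliminary = sum(TERM_WEIGHTS[m] for m in matches)
    return preliminary * COEFFICIENT[(field, structure_matches)]

# Three complex nominals matched from a Desc-field assertion, in a segment
# whose structure tag does not satisfy the requirement: 3.00 * 0.25 = 0.75
print(score_segment(["complex_nominal"] * 3, "other", False))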
Since we know from the eighteen-month results that 74% of the relevant documents had what we refer to as a k-value (then a PN value; now PN, CN, TS values) and the remaining 26% of the relevant documents had no k-value, we use this information to predict what proportion of the predicted relevant documents should come from which segment of the ranked documents for full recall. The combined ranking can be envisioned as consisting of four segments, as shown in Figure 2:

Group 1: docs having a k-value & an SFC value above the cut-off
----- cut-off criterion SFC similarity value
Group 2: docs having a k-value & an SFC value below the cut-off

Group 3: docs having no k-value & an SFC value above the cut-off
----- cut-off criterion SFC similarity value
Group 4: docs having no k-value & an SFC value below the cut-off

Fig. 2: Schematic of Segmented Ranks from SFC & Integrated Ranking (k-value)

Four groups are required to reflect the two-way distinction mentioned above. The first distinction is between those groups which have a k-value, and which should contain 74% of the relevant documents, and those documents without a k-value, which should contribute 26% of the relevant documents. The second distinction is between those documents whose SFC similarity value is above the predicted cut-off criterion and those whose SFC similarity value is not. When a cut-off criterion is the desired application, the system produces the ranked list in response to a desired recall level by concatenating the documents above the appropriate cut-off for that level of recall from Group 1, then the documents above the appropriate cut-off for that level of recall from Group 3. However, since our test results show that there is a potential 8% error in the predicted cut-off criterion for 100% recall, we use extrapolation to add the appropriate proportion of the top-ranked documents from Group 2 to Group 1, before concatenating documents from Group 3. These same values are used to produce the best end-to-end ranking of all the documents using the various segments. Document ranks are produced by the Integrated Matcher, and the cut-off criterion is used either by an individual user who requires a certain recall level for a particular information need or, as in the twenty-four month TIPSTER test situation, by the system to determine how many documents from the Integrated Matcher ranking will be passed on to the final modules for further processing.
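A sketch of the segmented ranking follows. The extrapolation step is reduced here to taking a fixed extra fraction of Group 2, which is an assumption rather than the exact procedure; Group 4 is omitted because documents with no k-value and an SFC value below the cut-off are not passed on when a cut-off is applied.

# Concatenate the segment groups for a cut-off application.
def segmented_ranking(docs, cutoff, extra_fraction=0.08):
    # docs: list of (doc_id, sfc_sim, has_k_value), pre-ranked by the
    # integrated similarity within each group.
    g1 = [d for d in docs if d[2] and d[1] >= cutoff]
    g2 = [d for d in docs if d[2] and d[1] < cutoff]
    g3 = [d for d in docs if not d[2] and d[1] >= cutoff]
    # Compensate for the potential error in the predicted cut-off by
    # promoting a proportion of the top-ranked Group 2 documents.
    top_of_g2 = g2[:int(len(g2) * extra_fraction)]
    return g1 + top_of_g2 + g3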
2.H. Topic Statement Processing for Conceptual Graph Generation

The processing of topic statements for CG generation does not make use of the output of the Natural Language Query Constructor; instead, the current system first applies the same RCD and CG generator modules to produce topic statement (TS) CGs. Several TS-specific processing requirements have been identified, some of which have been implemented as post-processing routines while others are under development.

- Elimination of concept and relation nodes corresponding to contentless meta-phrases (e.g. "Relevant document must identify ..."). If both of the concept nodes in a concept-relation-concept triple belong to a meta-phrase, the CRC is ignored. When only one of them is a meta-phrase concept, the triple is not removed blindly unless the other concept occurs in another triple.

- Handling of negated parts of topic statements. The weights are adjusted in such a way that an occurrence of the negated concept in a document will contribute to the negative evidence that the document will be relevant. In effect, the two weights for the concept are switched.

- Automatic assignment of weights to concept and relation nodes. There are several factors we consider: the conventional way of determining the importance of terms using inverse document frequency (IDF) and total frequency; the location of terms occurring in topic statements; the part-of-speech information for each term; and indications in the topic statement sublanguage (e.g. "the document MUST contain ..."). Although we have implemented a program that tags individual words with the degree of importance based on the sublanguage patterns, for the evaluation we assigned concept weights based on IDF values of terms in the collection, due to time constraints.

- Merging common concepts appearing in different sections of topic statements. Although it is not safe in general to assume that two concepts sharing the same concept name actually refer to the same concept instantiation and to merge them blindly, we have observed that this problem does not arise in the topic statements. In fact, we believe that it is desirable to merge CG fragments using common concept nodes. This is an important process that eliminates undesirable effects on scoring. Without this, a document containing a concept occurring repeatedly in the Topic, Description, and Narrative fields would be ranked unnecessarily high (or low if it is negated) because each occurrence of the concept would make an independent contribution to the overall score.

Since an integrated automatic topic processing module was not available, the mechanical aspects of the process were hand-simulated, with some parts done automatically and others done manually.

2.I. Relation Concept Detector (RCD)

The output of the Complex Nominal Phraser and the Proper Noun Interpreter modules described above provides concept-relation-concept triples directly to the Relation-Concept Detector (RCD) module. In addition, the following RCD handlers are operative. One of the more distinct aspects of the DR-LINK system is its capability of extracting and using relations in the final CG representation of documents and topic statements. This module provides building blocks for the CG representation by generating concept-relation-concept triples based on the domain-independent knowledge bases we have been constructing with machine-readable resources and corpus statistics. In this module, there are several handlers that are activated selectively depending on the input sentence.

2.I.1. Case Frame (CF) Handler

The main function of the CF Handler is to generate concept-relation-concept triples where one of the concepts typically comes from a verb. It identifies a verb in a sentence and connects it to other constituents surrounding the verb. Since the relations included in our representation (about 50 are currently used) originate from theories of linguistic case roles (Somers, 1987; Cook, 1989) and are all semantic in nature, this module consults the knowledge base containing 13,786 case frames we have constructed, each of which prescribes a pattern involving a verb and the corresponding concept-relation-concept triples. For example, the following is a set of case frames for different senses of `decline':

(decline 1 ((PATIENT subject ? obligatory)))
(decline 2 ((AGENT subject human obligatory)
            (PATIENT object ? optional)))
(decline 3 ((AGENT subject human obligatory)
            (ACTIVITY infinitive ? obligatory)
            (link infinitive subject AGENT)))

AGENT, PATIENT, and ACTIVITY are the relations that connect the verb to other constituents.
The second components (e.g. subject) prescribe the syntactic categories of the constituents, and the third components (e.g. human) the semantic restrictions that the subject and object should satisfy. The last components (e.g. obligatory) indicate whether the constituent must exist in a sentence in order for the particular case frame to be instantiated. The last line of the third case frame instructs the CF Handler to link the subject to the infinitive verb with the AGENT relation. This kind of linking instruction allows the CF Handler to produce triples containing non-verbal constituents.

The CF Handler selects the best case frame by attempting to instantiate each case frame and determining which one is satisfied most by the sentence at hand. This can be seen as a sense disambiguation process using both syntactic and semantic information. The semantic restriction information contained in the case frames was obtained from LDOCE, and when a sentence is processed, the CF Handler also consults LDOCE to get semantic restriction information for the individual constituents surrounding the verb and compares it with the restrictions in the case frames of the verb, as a way to determine which case frame is likely to be the correct one. Given the following sentence fragment:

the chairman declined to elaborate on the disclosure ...

the CF Handler chooses the third case frame and produces:

[decline] -> (AGENT) -> [chairman]
[decline] -> (ACTIVITY) -> [elaborate]
[elaborate] -> (AGENT) -> [chairman]

In the current implementation, the input text to the CF Handler is first tagged with part-of-speech information and bracketed for constituent boundaries. BBN's POST tagger (Meteer et al., 1991) has been used to attach a part-of-speech tag to individual words. The constituent boundary bracketer we developed then marks boundaries of grammatical constituents such as infinitives, noun phrases, prepositional phrases, clauses, etc. At the time of writing, the case frame knowledge base contains 13,786 case frames, of which 13,444 are for all the verb entries (5,206) in LDOCE, and the rest are for 342 verbs that appear in the Wall Street Journal collection but are not in LDOCE as a headword. While we have constructed case frames for most of the phrasal verbs in LDOCE, the capability of processing phrasal verbs has not been implemented in the current CF Handler.
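To make the instantiation process concrete, here is a minimal sketch over the three `decline' frames shown above. Constituent extraction and the LDOCE-based semantic-restriction check are reduced to dictionary lookups, which is an illustrative simplification.

# Score each case frame against extracted constituents; pick the best.
FRAMES = [
    {"sense": 1, "slots": [("PATIENT", "subject", None, True)]},
    {"sense": 2, "slots": [("AGENT", "subject", "human", True),
                           ("PATIENT", "object", None, False)]},
    {"sense": 3, "slots": [("AGENT", "subject", "human", True),
                           ("ACTIVITY", "infinitive", None, True)]},
]

def satisfied(frame, constituents, semantics):
    # constituents: syntactic role -> word; semantics: word -> type.
    score = 0
    for rel, syn, restriction, obligatory in frame["slots"]:
        word = constituents.get(syn)
        if word is None:
            if obligatory:
                return -1  # missing obligatory constituent: frame fails
            continue
        if restriction is None or semantics.get(word) == restriction:
            score += 1
    return score

def best_frame(constituents, semantics):
    return max(FRAMES, key=lambda f: satisfied(f, constituents, semantics))

# "the chairman declined to elaborate": subject + infinitive, human subject.
consts = {"subject": "chairman", "infinitive": "elaborate"}
print(best_frame(consts, {"chairman": "human"})["sense"])  # -> 3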
2.I.2. Nominalized Verb (NV) Handler

The Nominalized Verb Handler has been implemented for the DR-LINK system we ran for the TIPSTER twenty-four month evaluation. Its main function is to consult the NV case frames to identify an NV in a sentence and create concept-relation-concept triples based on the rule. At the same time, it converts the NV into its verb form. In this way, we can allow a match between a CG fragment generated from a phrase containing a verb and another fragment generated from a noun phrase containing the corresponding nominalized verb. For example, the NV Handler converts the sentence fragment

the company's investigation of the incident ...

into

[investigate] -> (AGENT) -> [company]
[investigate] -> (PATIENT) -> [incident]

This process is much more than a sophisticated way of performing stemming, in that we canonicalize concept-relation-concept triples rather than just concept nodes. For NV processing, 15,053 case frames have been generated for 1,593 nominalized verbs. Most of the case frames for NVs were automatically generated from the corresponding verb case frames. This process was also facilitated by identifying potential NVs from LDOCE. No explicit testing of the impact of NVs on information retrieval has been done yet, although we have convinced ourselves with anecdotal evidence that they improve retrieval performance. More semantic processing of nominalized verbs in determining the relations to the surrounding constituents is on the future research agenda, as is a more rigorous study of the impact of NVs on information retrieval.

2.I.3. Noun Phrase (NP) and Prepositional Phrase (PP) Handler

The noun phrases that are not handled by the Complex Nominal Handler or by the Nominalized Verb Handler are analyzed so that the head noun is connected to the concepts outside the noun phrase (e.g. a verb concept in the CF Handler). In addition, this module identifies individual concepts corresponding to adjectives and other nouns in a compound noun and connects them with CHARACTERISTIC, ATTRIBUTE, or LINK relations. LINK is the most generic relation in our system. Once noun phrases are handled this way, this module handles prepositional phrases by connecting the head noun concept of the noun phrase to the preceding constituent (e.g. a verb or a noun). The preposition attachment problem is a difficult one, and the current implementation takes a simple-minded approach with general relations such as LINK, which can match with many other semantically more specific relations. Our preliminary analysis indicates that this approach correctly handles about 75% of the prepositional phrase cases in the Wall Street Journal collection. More accurate and finer-level processing will be done with more semantically oriented rules that check the semantic restrictions and use more specific relations. The role of this handler will be diminished when we process phrasal verbs as part of the CF Handler, for which we have constructed case frames.

2.I.4. Ad-hoc Handler

This module looks for lexical patterns not covered by any of the other special handlers discussed above. Its processing is also driven by its own knowledge base of patterns to infer relations between concepts. For example, the sentence fragment

bought the item for the purpose of satisfying ...

contains the pattern "... for the (ADJ) purpose of [NP]" in the knowledge base, and hence results in the triple

[buy] -> (GOAL) -> [satisfy]

The knowledge base contains a small number of simple patterns involving BE verbs and more than 350 pattern rules for phrasal patterns across phrase boundaries, by which important relations are extracted. The pattern rules specify certain lexical patterns, the order of occurrences of words belonging to certain part-of-speech categories, and the concept-relation-concept triples to be generated. These patterns require a processing capability no more powerful than a finite state automaton. Due to time constraints, however, the current Ad-hoc Handler has not been generalized to process all the patterns, and about 30% of the patterns in the knowledge base are recognized and handled correctly.
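The following sketch expresses one such pattern rule as a regular expression, in keeping with the finite-state characterization above. The single rule is modeled on the "purpose of" example (the lemmatization of `bought' to [buy] is omitted here); the real knowledge base holds over 350 rules in its own rule format.

# One illustrative ad-hoc pattern rule; gerund complement becomes a concept.
import re

PATTERN_RULES = [
    # verb ... "for the (ADJ) purpose of" gerund  =>  (GOAL) relation
    (re.compile(r"(\w+)\s+.*?\bfor the (?:\w+ )?purpose of (\w+)ing"),
     "GOAL"),
]

def adhoc_triples(fragment: str):
    triples = []
    for pattern, relation in PATTERN_RULES:
        m = pattern.search(fragment)
        if m:
            triples.append((m.group(1), relation, m.group(2)))
    return triples

print(adhoc_triples("bought the item for the purpose of satisfying"))
# -> [('bought', 'GOAL', 'satisfy')]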
2.J. Conceptual Graph (CG) Generator

After the individual RCD handlers have generated concept-relation-concept triples for a document, the CG Generator merges them to form a set of conceptual graphs, each corresponding to a clause in most cases. Since more than one handler can generate different triples for the same concept pairs (e.g. a prepositional phrase handled by both the CF Handler and the NP/PP Handler), based on independently constructed rules and on independent processes, a form of conflict resolution is necessary. In the current implementation, we simply order the execution of the different handlers based on the general quality of their rules and the resulting triples, so that more reliable handlers have higher precedence.

The concept nodes in the resulting CGs can contain not only general concept names but also some instantiations (referents) of the concepts. Such a concept can be derived either from a proper noun such as a company name or from a subordinate clause. In the latter case, the instantiation is a CG itself, producing a CG like:

[country: {US}] <- (SOURCE-OF-INFO) <- [C#: [[pact] <- (PATIENT) <- ...]]

In the current implementation, concepts with the same instantiation are merged across sentences to form a larger CG, but concepts with the same label but without any referents are treated as separate concepts across sentences and are not merged. A pronoun resolution method is being implemented to merge a pronoun with its antecedent as a way to increase the connectivity of CGs and hence increase the usefulness of relation nodes.

As a way to make our current representation more "conceptual", we have implemented a module that adds RIT (Roget's International Thesaurus) codes to individual concept nodes, so that the label on a node is not a word but a position in the RIT hierarchy. The lowest-level position beyond individual lexical items in the RIT hierarchy is called a semicolon group, consisting of several terms within the delimiter of semicolons, which represents a concept. The mapping from a word (called the target) in text to a position in RIT requires sense disambiguation, and our approach is to use the words surrounding the target word as the context within which the sense of the target word is determined and one or more RIT codes are selected. The algorithm selects a minimal number (i.e. one or more) of RIT codes, not just the best one, for target words, since we feel that some of the sense distinctions made in RIT are unnecessarily subtle, and it is unlikely that any attempt to make such fine distinctions would be successful and hence contribute to information retrieval.

We have produced RIT-coded documents and topic statements for the San Jose Mercury collection and the routing queries. All the concept nodes derived from nouns now have RIT codes selected using the surrounding text as the context. Those concept nodes derived from verbs also have RIT codes, but assigned in a different way. Instead of using the surrounding text as the context and trying to disambiguate senses (we concluded that this method is not reliable for verbs), we first assign RIT codes to each sense of the LDOCE verb entries using the same method; in this case the context becomes the definition text in LDOCE. Once we select the right case frame via the Case Frame Handler while text is processed, the RIT codes attached to that case frame are automatically assigned to the target verb.
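A hedged sketch of context-based code selection for nouns follows. The RIT codes and semicolon-group members shown are fabricated placeholders, and ties for the best context overlap are all kept, echoing the minimal-set policy described above.

# Select RIT codes for a target noun by overlap with context words.
def select_rit_codes(target, context, semicolon_groups):
    # semicolon_groups: RIT code -> set of words in that semicolon group.
    candidates = {code: len(words & context)
                  for code, words in semicolon_groups.items()
                  if target in words}
    if not candidates:
        return []
    best = max(candidates.values())
    # Keep every code tied for the best score: a minimal set, not one code.
    return [c for c, s in candidates.items() if s == best]

groups = {"770.1": {"bank", "shore", "river"},     # invented codes/groups
          "836.4": {"bank", "deposit", "money"}}
print(select_rit_codes("bank", {"money", "loan"}, groups))  # -> ['836.4']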
2.K. Conceptual Graph (CG) Matcher

The main function of the CG Matcher is to determine the relevance of each document against a topic statement CG and produce a ranked list of documents as the third and final output of the system. Using the techniques necessary to model plausible inferences with CGs (Myaeng and Khoo, 1992), this module computes the degree to which the topic statement CG is covered by the CGs in the document (see Myaeng and Liddy (1993) and Myaeng & Lopez-Lopez (1992) for details). While the most obvious strength of the CG approach is its ability to enhance precision by exploiting the structure of the CGs and the semantics of relations in document and topic statement CGs, and by attempting to meet the specific semantic constraints of topic statements, we also attempt to increase recall by allowing flexibility in node-level matching. Concept labels can be matched partially (e.g. between `Bill Clinton' and `Clinton'), and both relation and concept labels can be matched inexactly (e.g. between `aid' and `loan' or between `AGENT' and `EXPERIENCER'). For both inexact and partial matches, we determine the degree of matching and apply a multiplication factor less than 1 to the resulting score. For inexact matching cases, we have used a relation similarity table that determines the degree of similarity between pairs of relations. Although this type of matching slows down the matching process, we feel that it is necessary until we have a more accurate way of determining the conceptual relations and a way to represent text at a truly conceptual level (e.g. our attempt to use RIT codes). More importantly, the similarity table reflects our ontology of relations and allows for matching between relations produced by different RCD handlers, whose operations in turn are heavily dependent on the domain-independent knowledge bases.

We have done a series of internal matching experiments to evaluate various strategies in CG matching/scoring and document representation, with the goal of selecting the best one for the final TIPSTER twenty-four month runs. The first question we had was how to "normalize" the score assigned to a document under the current scoring scheme. As described above, the scoring algorithm is query-oriented in the sense that the score reflects to what extent the query CG is covered by the document CG. While this approach is theoretically justifiable, one potential drawback is that a document containing the entire query CG is not ranked higher than one that contains fragments of the query CG scattered throughout the document, as long as they cover the same query CG. That is, the "connectivity" or "coherence" of the matching document CG is not fully taken into account. With the intuitive notion that the number of matching CG fragments in a document would be inversely proportional to "connectivity", we have been experimenting with various normalization factors that are a function of the number of matching CG fragments. At the time of writing, our experimental data show that we obtain the best result when we consider 12 sentential CGs as a unit (called a "paragraph") and use the number of units containing one or more matching CG fragments in the normalization function. Among all the functions we have tried, the best normalization factor we have found experimentally so far is 1.05^(1-x), where x is the number of text units that contain one or more matching CG fragments. When this is combined with the maximum of the scores assigned to the individual "paragraphs" as follows:

S * 1.05^(1-x) + 0.4 * M

where S is the unnormalized score and M the maximum "paragraph" score, we obtained the best results. Since we determined the constants incrementally, it is entirely possible that a different combination of the constants could give better results. It is relatively clear from these experiments that the first or the second term alone is always inferior to the combination. The number of sentential CGs per "paragraph", 12, also seems quite stable.
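The sketch below applies the formula above directly, counting x over the 12-sentence units and taking M as the best unit score; how the per-unit scores are themselves computed is abstracted away.

# Final document score: S * 1.05^(1-x) + 0.4 * M, per the formula above.
def document_score(unnormalized, unit_scores):
    # unit_scores: CG-match score of each 12-sentence "paragraph" unit;
    # x counts units containing at least one matching CG fragment.
    x = sum(1 for s in unit_scores if s > 0)
    m = max(unit_scores) if unit_scores else 0.0
    return unnormalized * 1.05 ** (1 - x) + 0.4 * m

# A document whose matches cluster in one unit outranks one with the same
# coverage scattered over three units:
print(document_score(3.0, [3.0, 0.0, 0.0]))  # x=1: 3.0 * 1.0 + 1.2 = 4.2
print(document_score(3.0, [1.0, 1.0, 1.0]))  # x=3: 3.0 * 1.05**-2 + 0.4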
We have produced TIPSTER runs using the RIT-coded documents and topic statements. The current matching program attempts to match on RIT codes only when the concept names (words) do not match. Because of this conservative approach, the RIT codes do not block a match between two different polysemous words and thus do not have any direct impact on the word ambiguity problems in IR. With the disambiguation process employed when RIT codes are chosen for a noun or verb, however, the net effect is analogous to term expansion with sense disambiguation. It should be noted that since RIT codes are used for both document and query concepts, this amounts to sense-disambiguated term expansion on both queries and documents.

While the original motivation was to represent documents and topic statements at a more conceptual level using RIT codes, we are also testing the effectiveness of RIT-based term expansion in IR environments. Using the scheme we have developed for term clustering using contextual information in the corpus (Myaeng & Li, 1992), we have three methods to evaluate: RIT-based expansion, term-cluster-based expansion, and a combination of the two, so that we can eliminate the problem of using a general-purpose thesaurus and the errors made by the term-clustering method. For the TIPSTER evaluation, we have submitted two sets of four runs: one with RIT codes and the other without them. Each set consists of three runs for different scoring schemes and a fourth for the combination of the three runs, which appears to produce the best result in our internal experiments.

3. Test Runs

The DR-LINK group elected to put their efforts into continued work for the twenty-four month TIPSTER testing, and as a result we lost our opportunity to have TREC-compatible results to discuss at this time. Although our twenty-four month TIPSTER runs have been submitted, many of our top-ranked documents were not amongst those submitted by TREC participants, so it is virtually impossible to make even unofficial reports on our system's performance. We trust that in the near future there will be some comparable groups and/or runs to measure ourselves against after the results from both TIPSTER and TREC-2 are available.

4. Summary

As the above descriptions should convey, we have made a great deal of progress in the development and integration of the DR-LINK System since TREC-1. Unfortunately, the absence of quantified performance results limits how convincing our claims can be. However, we are pleased to have demonstrated that a system implementation of our original notion of integrating multiple levels of linguistic processing, so that retrieval can be conducted at a conceptual rather than word-based level, is nearly achieved. Many rich research and implementation ideas remain to be explored in all of the DR-LINK modules, particularly those which have only been in existence for a few months.

The research reported herein was funded by ARPA's TIPSTER Program. We are grateful to Longman for access to the machine-readable tape of LDOCE and to BBN for the loan of their POST tagger. Without the untiring efforts of the following individuals, neither the DR-LINK System nor our very promising results would have been possible: Margot Clark, Bob Del Zoppo, Saleh Elmohammed, Chris Khoo, Tracey Lemon, Ming Li, Mary McKenna, Ken McVearry, Carin Obad, Woojin Paik, Ching-cheng Shih, Joe Woelfel, Edmund Yu, Ahsan Zhia.
References

Cook, W. (1989). Case Grammar Theory. Washington, D.C.: Georgetown University Press.

Harman, D. (Ed.) (1993). The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology.

Liddy, E.D., Jorgensen, C.L., Sibert, E. & Yu, E.S. (1991). Sublanguage grammar in natural language processing. Proceedings of RIAO '91. Barcelona.

Liddy, E.D., Paik, W. & Yu, E.S. (1993). Document filtering using semantic information from a machine readable dictionary. Proceedings of the ACL Workshop on Very Large Corpora.

Meteer, M., Schwartz, R. & Weischedel, R. (1991). POST: Using probabilities in language processing. Proceedings of the 12th International Joint Conference on Artificial Intelligence. Sydney, Australia.

Myaeng, S.H. & Khoo, C. (1992). On uncertainty handling in plausible reasoning with conceptual graphs. Proceedings of the 7th Conceptual Graphs Workshop. Las Cruces, NM.

Myaeng, S.H. & Li, M. (1992). Building a database-specific concept hierarchy for information retrieval by acquiring lexical semantics from a corpus. Proceedings of the First International Conference on Information and Knowledge Management.

Myaeng, S.H. & Liddy, E.D. (1993). Information retrieval with semantic representation of texts. Proceedings of the Symposium on Document Analysis and Information Retrieval. Las Vegas, NV.

Myaeng, S.H. & Lopez-Lopez, A. (1992). Conceptual graph matching: A flexible algorithm and experiments. Journal of Experimental and Theoretical Artificial Intelligence, Vol. 4, 107-126.

Paik, W., Liddy, E.D., Yu, E.S. & McKenna, M. (1993). Categorizing and standardizing proper nouns for efficient information retrieval. Proceedings of the Workshop on Acquisition of Lexical Knowledge from Text. Association for Computational Linguistics. pp. 154-160.

Somers, H. L. (1987). Valency and Case in Computational Linguistics. Edinburgh: Edinburgh University Press.

van Dijk, T. A. (1988). News as Discourse. Hillsdale, NJ: Lawrence Erlbaum Associates.