Bayesian Inference with Node Aggregation for Information Retrieval

Brendan Del Favero and Robert Fung
Institute for Decision Systems Research
350 Cambridge Avenue, Suite 380, Palo Alto, CA 94306
idsr@netcom.com

1 Introduction

Information retrieval can be viewed as an evidential reasoning problem. Given a representation of a document (e.g., the presence or absence of selected words and phrases) and a representation of an information need (e.g., topics of interest), the problem of information retrieval is to infer the degree to which the document matches the information need. Since probability theory is the classical choice for automating evidential reasoning, probabilistic approaches to information retrieval are natural and have had a long history, starting in the 1960s (Maron & Kuhns, 1960).

In this paper we describe research that adapts and applies Bayesian networks, a new technology for probabilistic representation and inference, to information retrieval. The technology has substantial advantages over older technologies, including an intuitive representation and a set of efficient inference algorithms. We discuss the Bayesian network technology and probabilistic information retrieval in Section 2 of this paper.

Our research is directed at developing a probabilistic information retrieval architecture that:

* is oriented towards assisting users who have stable information needs in routing (i.e., sorting through) large amounts of time-sensitive material,
* gives users an intuitive language with which to specify their information needs,
* requires modest computational resources (i.e., memory and CPU speed), and
* can integrate relevance feedback and training data with users' judgements to incrementally improve retrieval performance.

Towards these goals, we have developed a system that allows a user to specify: multiple topics of interest (i.e., information needs), qualitative and quantitative relationships between the topics, document features that relate to the topics, and quantitative relationships between these features and the topics. The system runs on a Macintosh II computer and can use training data to estimate any of the quantitative values in the system. We discuss the particular methods we developed and used in our system in Section 3.

We participated in the exploratory group (Category B) of the 1993 Text Retrieval Conference (TREC-2), sponsored by the National Institute of Standards and Technology (NIST). As a participant in the exploratory group, we were tasked with working with a subset of the TREC-2 training and test data. Our training data consisted of Wall Street Journal (WSJ) articles and our test data consisted of San Jose Mercury News (SJMN) articles. We chose a subset of 10 topics out of the 50 TREC-2 routing topics to best illustrate the methods and concepts we developed. The choice of the 10 topics was reported to the TREC coordinators prior to our training runs and, of course, prior to our receipt of the test data. We generated routing queries for each of the 10 chosen topics, trained against the WSJ training set to improve our queries, and tested these queries against the SJMN articles in the test data set.

Our system, including the document handling, feature extraction, inference, and reporting capabilities, was developed entirely within the duration of the TREC-2 project (January 1993 to June 1993). Our TREC-2 effort consisted of the two authors. We describe the experimental set-up in Section 4 and the results of our test run in Section 5.
We are very encouraged by the test results and have many ideas for future research, which we discuss in Section 6.

2 Background

In this section we describe the Bayesian network technology and outline previous efforts in probabilistic information retrieval.

2.1 Bayesian Networks

While probability theory provides a suitable theoretical foundation for evidential reasoning, a technology based on probability theory that is computationally tractable and that includes an effective methodology for acquiring the needed probabilistic information has been lacking. Recent developments in Bayesian networks have provided these features. As the name suggests, the technology is based on a network representation of probabilistic information (Howard & Matheson, 1981; Pearl, 1988).

A Bayesian network represents beliefs and knowledge about a particular class of situations. The use of Bayesian networks is similar to that of expert system technologies. Given a Bayesian network (i.e., a knowledge base) for a class of situations and evidence (i.e., facts) about a particular situation of that class, conclusions can be drawn about that situation. The technology has been used in a wide variety of situations, including medical diagnosis, military situation assessment, and machine vision.

A Bayesian network is a directed acyclic graph in which each node represents a random variable (i.e., a set of mutually exclusive and collectively exhaustive propositions). Each set of arcs into a node represents a probabilistic dependence between the node and its predecessors (the nodes at the other end of those arcs). The primary technical innovation of Bayesian networks is the representation, through the network's structure, of conditional independence relations between the variables in the network. This innovation not only provides an intuitive representation in which to acquire probabilistic information, but also renders inference tractable for a large number of real-world situations. Inferences can be drawn from a Bayesian network with a wide variety of algorithms. There are exact algorithms (Lauritzen & Spiegelhalter, 1988; Shachter, 1986; Shachter, D'Ambrosio, & Del Favero, 1990) as well as approximate algorithms (Fung & Chang, 1989). Inference algorithms compute a probability distribution over some combination of variables in the network given all the evidence represented in the network.

Since the introduction of the Bayesian network technology, several efforts have been made to apply it to information retrieval (Fung, Crawford, Appelbaum, & Tong, 1990; Turtle & Croft, 1990). The results are promising. However, several recent innovations have been made in the Bayesian network technology that have not yet been applied to information retrieval. The innovations involve the representation of conditional independence relations that are finer than those currently represented at the network level. One of these innovations is similarity networks (Heckerman, 1991), developed by David Heckerman at Stanford University. Another innovation for representing the relationship between variables, the "co-occurrence diagram," was developed on this project.

2.3 Probabilistic Information Retrieval

Most probabilistic information retrieval techniques can be illustrated graphically through the use of Bayesian networks.
In a simple formulation, there are n topics of interest (t1, ..., tn) and m identifiable document features (f1, ..., fm). The information retrieval problem is to compute the posterior probability p(ti | f1, ..., fm) for each topic, given a quantification (i.e., the joint probability p(t1, ..., tn)) of the frequency with which the topics appear in the corpus and a quantification (i.e., the conditional probability p(f1, ..., fm | t1, ..., tn)) of the relationship between the "presence" of a topic in a document and the "presence" of features.

Figure 2.1 is a Bayesian network representing a retrieval model with one topic of interest and three features. The node t represents the events "the document is relevant to topic t" and "the document is not relevant to topic t." The nodes fi represent the events "the feature fi is present in the document" and "the feature fi is not present (is absent) in the document." The prior probabilities of the events t and not-t are stored in the t node. The conditional probabilities p(fi | t) are stored in each of the feature nodes.

Figure 2.1: The two-level Bayesian network model of information retrieval

Because of the lack of arcs between the feature nodes, this diagram embodies the assumption, used in many probabilistic systems, that the features are conditionally independent of each other given the topic. Using the Bayesian network form of Bayes' rule, called arc reversal (Shachter, 1986), the posterior probability of the event t can be computed. The inversion formulas are straightforward and computationally feasible (they are described in the Appendix). Figure 2.2 shows the network in Figure 2.1 with the arcs reversed. It represents a model in which knowing whether one or more of the features are present provides information (i.e., the posterior distribution on node t) about whether the topic is relevant to the document.

Figure 2.2: The information retrieval model after Bayesian inversion

To address multiple topics within this framework, a model similar to the one represented by Figure 2.1 would be generated for each topic, and inference would be carried out on each topic separately. However, these multiple models fail to represent possible relationships between the topics. As a consequence, the acquisition of consistent feature-topic probabilities and the interpretation and comparison of multiple topic probabilities are problematic. For example, it would be impossible using this framework to compute the probability that a document was relevant to two selected topics, or to compute the probability that a document was relevant to at least one of two selected topics.

Bayesian networks can be used to explicitly represent the relationships between topics. For example, consider Figure 2.3, in which the two topics t1 and t2 are related.

Figure 2.3: Information retrieval model with two related topics

With multiple topics, the Bayesian inference problem becomes more complicated than before. To compute the posterior probability of each topic, all the other topics must be removed through marginalization, in addition to reversing the topic-feature arcs as before. In addition, any joint distribution over the topics can be computed. To perform these computations, a general inference algorithm is needed. However, there is a way to simplify a multiple-topic network so that it looks like a single-topic network (Figure 2.1).
The topics can be combined into a compound topic (node S in Figure 2.4) (Chang & Fung, 1989), whose range is all possible present-or-absent combinations of t1 and t2. An advantage of this representation is that the same simple computational formulas can be used as before. A disadvantage of this representation is that the intermediate query, S, must contain (in the worst case) 2^n states, representing all possible present-or-absent combinations of its n parent topics. However, this worst case is rarely seen, because the relationships between topics are typically sparse. Building this intermediate query is one of the innovations of our research and is described in Section 3.2.

Figure 2.4: Multiple-topic query represented as a single compound-topic query

3 Retrieval Architecture

This section describes the information retrieval architecture we have developed. The inputs to the system are the descriptions of topics of interest and a set of documents to test. The output of the system is a list of documents ranked by their degree of relevance to each topic. The system computes the degree of relevance as the probability that a document is relevant to a topic, for every document-topic pair. To help in this task, the system is given a training set of documents to which relevance judgements have been attached.

There are several components to our retrieval system. The feature extraction component, described in Section 3.1, translates a document from its raw text form to a simpler internal representation that the system can use. The query construction component, described in Section 3.2, translates the description of a set of topics into an internal representation. The document scoring component, described in Section 3.3, uses Bayesian inference to calculate a measure of a document's relevance to a topic, given the internal representations of both document and topic.

3.1 Feature Selection and Extraction

Any observable characteristic of the text of a document that can be clearly defined (e.g., as either present or absent) may be regarded as a feature. Our system looks for two types of features in the text of a document: single words and pairs of adjacent single words. If a feature appears at least once in a document, it is counted as "present." Otherwise, it is counted as "absent." The internal representation of a document is therefore a binary vector, with each element indicating the presence or absence of the corresponding feature in the document. The system removes many common suffixes from a word using Paice's stemming rules (Paice, 1977). This means, for example, that if either of the words "walking" or "walks" is present in a document, the system considers the root word "walk" to be present in the document.

The system must be given a list of features for which to look. This target list was constructed in three steps. First, a list of candidate features was generated from the descriptions of the 50 TREC-2 routing topics. The text of the descriptions was broken up into individual words, from which the suffixes (if any) were removed. Duplicate words and common words (stop words) were removed. A number of two-word features (such as phrases and proper names) were identified by hand. This procedure created a list of about 1400 candidate features. Second, the system extracted relative frequency information for each of these features from the training documents. Then, for each topic (the 10 chosen topics plus an additional topic representing "none of the above"), the system sorted the candidate features in descending order according to the F4 formula of Robertson and Sparck Jones (Robertson & Sparck Jones, 1976), which we used as a measure of the ability of a feature to characterize a topic. Third, the top 30 features for each topic (and the top 60 features for "none of the above") were combined into a single list. After removing duplicates, this yielded the final list of 229 features. We tried numerous feature selection strategies and settled on this one as the most satisfactory.
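As an illustration of this document representation, the following minimal sketch (in Python; the actual system was written in C) turns raw text into the binary feature vector described above. The suffix stripping is a crude stand-in for Paice's stemming rules, and the feature list shown is hypothetical.

    # Minimal sketch of the document representation described in Section 3.1.
    # The suffix stripping below is a placeholder for Paice's stemming rules,
    # and the target feature list is hypothetical.
    import re

    SUFFIXES = ("ings", "ing", "ed", "es", "s")  # illustrative only

    def stem(word):
        """Strip one common suffix, if any (stand-in for Paice's rules)."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def document_to_vector(text, target_features):
        """Return a binary vector: 1 if the feature appears at least once."""
        words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
        present = set(words)                                        # single words
        present.update(" ".join(p) for p in zip(words, words[1:]))  # adjacent pairs
        return [1 if f in present else 0 for f in target_features]

    # Hypothetical usage:
    features = ["walk", "fiber optic", "oil"]
    print(document_to_vector("Walking the fiber optic line", features))  # [1, 1, 0]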
3.2 Query Representation and Construction

In preparation for the inference step, the 10 single-topic queries are combined into one multiple-topic query. As mentioned in Section 2.3, this aggregation can in the worst case require 2^10 different states in the multiple-topic query. In this section, we describe how to reduce the number of states required by considering the relationships between the topics. Once the states of the query have been structured, we must assign numerical values to the Bayesian model. We describe how to generate estimates of the prior probabilities of each of these states in Section 3.2.3. We describe in Section 3.2.4 how to define, for each feature and state, the conditional probability that the feature appears in a document given that the document is relevant to that state.

3.2.1 Pairwise Relationships Between Topics

There are six possible pairwise relationships between any two topics t1 and t2:

1. t1 and t2 are mutually exclusive if there is no document relevant to both topics.
2. t1 is a subset of t2 if all documents relevant to t1 are also relevant to t2.
3. t2 is a subset of t1 if all documents relevant to t2 are also relevant to t1.
4. t1 and t2 are equivalent if t1 and t2 are subsets of each other (they satisfy Relations 2 and 3).
5. t1 and t2 are dependent if knowing that a document is relevant to t1 gives some information about whether the document is relevant to t2.
6. t1 and t2 are independent if knowing that a document is relevant to t1 gives no information about whether the document is relevant to t2.

Each of Relations 1-5 is a type of dependence between topics. In a belief network, topics satisfying Relations 1-5 would be connected by an arc. To ensure that two topics satisfy only one of the relations, Relation 5 (dependence) is defined as any type of dependence different from those of Relations 1-4. Relation 6, independence, is represented in a belief network by the absence of an arc between the topics. The distinction between dependence (Relation 5) and independence (Relation 6) is useful in calculating the probabilities of combinations of the topics, as described in Section 3.2.3.

Relations 1-4 can be identified by testing whether the defining conditions are satisfied in the training set. If the topics satisfy none of the first four relations, then a chi-square test can distinguish between Relations 5 and 6. If there are too few documents with relevance judgements for both topics to make reliable conclusions (our cutoff was 13 documents), the system makes an assessment and then prompts the user to verify it manually.
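The following sketch shows one way the relationship tests described above could be implemented from relevance judgements. It is a hypothetical illustration, not the system's actual code: the hand-computed 2x2 chi-square statistic stands in for whatever test the system used to separate Relations 5 and 6, and only the 13-document cutoff is taken from the text.

    # Hypothetical sketch of the pairwise-relationship test of Section 3.2.1.
    # `docs` is a list of per-document sets of relevant topics, restricted to
    # documents that have relevance judgements for both t1 and t2.

    def classify_pair(docs, t1, t2, min_docs=13, chi2_threshold=3.84):
        both    = sum(1 for d in docs if t1 in d and t2 in d)
        only1   = sum(1 for d in docs if t1 in d and t2 not in d)
        only2   = sum(1 for d in docs if t2 in d and t1 not in d)
        neither = len(docs) - both - only1 - only2

        if len(docs) < min_docs:
            return "ask user"               # too little data: manual verification
        if both == 0:
            return "mutually exclusive"     # Relation 1
        if only1 == 0 and only2 == 0:
            return "equivalent"             # Relation 4
        if only1 == 0:
            return "t1 subset of t2"        # Relation 2
        if only2 == 0:
            return "t2 subset of t1"        # Relation 3

        # Relations 5 vs. 6: chi-square test of independence on the 2x2 table.
        n = len(docs)
        r1, r2 = both + only1, only2 + neither   # row totals (t1 relevant / not)
        c1, c2 = both + only2, only1 + neither   # column totals (t2 relevant / not)
        chi2 = 0.0
        for observed, row, col in [(both, r1, c1), (only1, r1, c2),
                                   (only2, r2, c1), (neither, r2, c2)]:
            expected = row * col / n
            chi2 += (observed - expected) ** 2 / expected
        return "dependent" if chi2 > chi2_threshold else "independent"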
To fully explore the pairwise relationships between topics, we selected ten of the fifty TREC-2 routing topics to use in our models and tests. The topics are listed in Table 3.1.

Topic Number   Topic Description
57             Financial Health of MCI
61             Israeli Role in the Iran-Contra Affair
74             Conflicting Policies of the US Government
85             Official Corruption in any Government
88             Crude Oil Price Trends
89             Downstream Investing by OPEC Members
90             Data on Proven Reserves of Oil and Gas
97             Fiber Optic Applications
98             Fiber Optic Equipment Manufacturers
99             Iran-Contra Affair

Table 3.1: List of the 10 topics we selected

The ten topics were selected to provide an interesting mixture of relationships among the topics. Before looking at the relevance information in the training set, we assessed these relationships manually, as shown in Table 3.2. The table is symmetrical about the main diagonal.

Table 3.2: Pairwise relationships between the 10 TREC topics, assessed manually before looking at the training data (e = equivalent, s = subset, sd = subset or dependent, d = dependent, i = independent, empty cell = mutually exclusive)

Table 3.3 shows the topic relationships that were generated automatically from the relevance judgements on the training documents. Manual verification of the system's assessment was required in about half of the cases (marked with an asterisk). The table is symmetrical about the main diagonal.

Table 3.3: Pairwise relationships determined from the data (* = manual intervention required)

The system's assessments match the manual assessments quite well. The system found even more mutually exclusive pairs than we had intuitively thought. One surprise is the relationship between topic 61 ("Israeli Role in the Iran-Contra Affair") and topic 99 ("Iran-Contra Affair"). One might assume that 61 is a subset of 99, but there are indeed documents in the training set that are relevant to 61 but not to 99. Thus, the relation between 61 and 99 is dependence rather than subset.

The information in Table 3.3 can be expressed as a directed graph, as shown in Figure 3.1. Each topic is a node. Two topics are connected by an arc if there is at least one document relevant to both; there is no arc between mutually exclusive topics. The arcs marked "i" connect independent topics, the unmarked arcs connect topics that are dependent, and a directed arc points to a subset from its superset. Figure 3.1 is not a belief network. It may be considered a "co-occurrence diagram," since topics that are relevant together (that co-occur) in the collection are connected.

Figure 3.1: Relationship diagram among the 10 TREC topics

3.2.2 Generating States from the Topics

The states of the multiple-topic query can be generated automatically by enumerating all possible "relevant or not relevant" combinations of the topics, subject to two rules:

1. Two mutually exclusive topics cannot both be relevant within the same state.
2. A topic that is a subset of another cannot be relevant within a state unless its superset topic is also relevant.

By construction, there is always one state, U, to which a document is relevant if it is relevant to none of the topics. For the ten topics, these rules reduce the number of states drastically, from the theoretical maximum of 2^10 = 1024 states to the actual number of 28 states.
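A minimal sketch of this state enumeration, with hypothetical inputs standing in for the relationships of Table 3.3:

    # Sketch of the state enumeration described in Section 3.2.2.  The lists
    # `mutually_exclusive` and `subset_of` are hypothetical inputs; the real
    # system derived them from the pairwise relationships in Table 3.3.
    from itertools import product

    def enumerate_states(topics, mutually_exclusive, subset_of):
        """Yield each admissible state as the frozenset of topics relevant in it."""
        for bits in product([False, True], repeat=len(topics)):
            relevant = {t for t, b in zip(topics, bits) if b}
            # Rule 1: two mutually exclusive topics cannot both be relevant.
            if any(a in relevant and b in relevant for a, b in mutually_exclusive):
                continue
            # Rule 2: a subset topic cannot be relevant unless its superset is.
            if any(sub in relevant and sup not in relevant
                   for sub, sup in subset_of):
                continue
            yield frozenset(relevant)   # the empty set plays the role of state U

    # Hypothetical example with three topics:
    states = list(enumerate_states(
        ["85", "61", "99"],
        mutually_exclusive=[],            # no exclusions among these three
        subset_of=[("61", "85")],         # e.g., suppose 61 is a subset of 85
    ))
    print(len(states))                    # 6 of the 8 combinations survive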
A state is identified by listing the topics to which a document is relevant (or not relevant) if it is relevant to that state. For instance, documents relevant to the state (61 -74 85 -99) are relevant to topics 61 and 85 and are not relevant to topics 74 and 99. The list of states appears in the first column of Table 3.4.

3.2.3 Generating State Priors

In theory, the prior probabilities of the states can be calculated from their relative frequencies in the training set. However, since there are very few documents in the TREC-2 training set that were evaluated for three or more topics, a different way to estimate the priors was required. One estimate is provided by factoring the state priors into products of the priors of smaller compound topics. This can be accomplished without manual intervention. For instance, for the independent topics 88, 89, and 90, only three numbers are needed (the prior probabilities of the three topics in the training set) to compute the probabilities of all seven states containing the three topics. For the state (61 -74 85 -99), two numbers are needed: the prior probability of the compound topic (-74 85 -99) and the probability that topic 61 is relevant given that topic 85 is relevant. The priors obtained by this method are shown in the second column of Table 3.4. They are expressed as inverse frequencies: the number given is the number of weeks (on average) between articles relevant to a state, assuming 1000 documents per week.

State               Average weeks between articles
                    Assessed automatically   Assessed manually
-88 -89  90         < 1                      3
-88  89 -90         < 1                      3
-88  89  90         20                       7
 88 -89 -90         < 1                      2
 88 -89  90         20                       4
 88  89 -90         40                       4
 88  89  90         4000                     9
 57  97  98         10                       8
 57  97 -98         < 1                      5
 57 -97  98         < 1                      6
 57 -97 -98         < 1                      3
-57  97  98         1                        6
-57  97 -98         < 1                      4
-57 -97  98         < 1                      3
 61  74  85  99     33                       20
-61  74  85  99     14                       3
 61  74  85 -99     < 1                      50
-61  74  85 -99     < 1                      3
 61 -74  85  99     < 1                      20
-61 -74  85  99     < 1                      5
 61 -74  85 -99     5                        100
-61 -74  85 -99     2                        2
 74 -85  99         < 1                      3
 74 -85 -99         < 1                      2
-74 -85  99         < 1                      3
 57  74             < 1                      20
 85  98             50
 U

Table 3.4: The states and their prior probabilities, expressed as frequencies (assuming 1000 articles per week), generated by two assessment methods

However, these priors proved unsatisfactory and required manual override, for three reasons. First, the training set is a very biased sample of the WSJ. The training documents were selected by the retrieval systems of TREC-1 to be intentionally relevant to at least one of the topics. Thus, the prior derived from the training set tends to be a gross overestimate of the true prior. Second, the time period of the training set (1987 to 1992) is different from that of the test set (1991 only). The frequency of the topics relative to each other changes over time. For instance, there are many fewer articles on the Iran-Contra affair (relative to the other topics) in 1991 than there were in 1987-1990. We adjusted the priors to match the relative frequencies expected in the test set. Third, the priors are widely divergent from the intuition of any user who reads a newspaper. The numbers suggest, for instance, that about 1 out of every 100 articles in the WSJ (and, by analogy, in the SJMN) is relevant to topic 57, "MCI." Any reader of the WSJ or the SJMN knows that this estimate is much too high; the relative frequency is probably closer to 1 in 500 or 1 in 1000 than to 1 in 100. Thus, the prior probabilities were assessed manually.
It is assumed that the user has some knowledge of the test domain (in this case, articles in the SJMN) and can, with some thought, assess the relative frequencies of the various states as a part of specifying the query for the particular routing request. For each state, we ask the user for the average number of weeks between the publication of articles relevant to that state. This number is presented in the third column of Table 3.4. It can be converted to a prior probability by combining it with the assumption that there are 1000 documents per week; for example, an average of one relevant article every 5 weeks corresponds to a prior probability of 1/5000 = 0.0002. The prior probability of U, p(U), is calculated as one minus the sum of the priors of all the other states, to ensure that the probabilities of all the states together sum to one.

3.2.4 Feature Conditional Probabilities

The inference algorithm requires, for each feature and for each state, the conditional probability of the feature given the state. These probabilities cannot be obtained directly from the relative frequencies in the training set, because there are few documents that are relevant to more than one topic at a time. We approximate these probabilities by using a structure called a noisy-or gate. The noisy-or gate combines the effects of two or more factors, each of which may contribute to the presence of a feature. It is a model of disjunctive interaction, as described in (Pearl, 1988). It has been used in medical decision research to calculate the probability of a particular symptom being present, given diseases that cause the symptom (Heckerman, 1989).

In the context of information retrieval, a feature may be present due to any of the topics that are relevant in the state. For each state-feature pair, we build a noisy-or model. The contributing factors are the topics that are relevant within the state. The effect is the feature's presence or absence in the document. For example, consider a feature f and the state (57 -97 98). The feature may be present due to topic 57 or topic 98. It cannot be present due to topic 97, because that topic is not relevant within the state. Let E1 be the event that the feature is present due to topic 57, and let E2 be the event that the feature is present due to topic 98. Table 3.5 lists all the possible cases of the two uncertain events. Figure 3.4 shows the belief network structure of the noisy-or model; the node with the double wall is a deterministic logical-or gate.

Feature present       Feature present       Feature present
due to 57 (E1)        due to 98 (E2)        at all (E1 OR E2)
Yes                   Yes                   Yes
Yes                   No                    Yes
No                    Yes                   Yes
No                    No                    No

Table 3.5: Possible cases for a noisy-or node

Figure 3.4: Belief network corresponding to a noisy-or gate model

The only case in Table 3.5 in which the feature is absent is the fourth case. Thus, the conditional probability that the feature is absent, given this state, is the probability of that fourth case. The probability that the feature is present is one minus the probability that the feature is absent.
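A minimal sketch of this noisy-or combination, assuming per-topic "cause" probabilities p(feature present due to topic t) are available (presumably derived from the single-topic training frequencies, which the text does not spell out):

    # Sketch of the noisy-or combination described in Section 3.2.4.
    # `p_due_to` maps each topic to the probability that the topic, when
    # relevant, causes the feature to appear; these numbers are hypothetical.

    def p_feature_given_state(relevant_topics, p_due_to):
        """P(feature present | state), treating the per-topic events as independent."""
        p_absent = 1.0
        for t in relevant_topics:
            p_absent *= 1.0 - p_due_to.get(t, 0.0)   # feature not caused by topic t
        return 1.0 - p_absent

    # Example for a feature and the state (57 -97 98): only topics 57 and 98
    # can cause the feature.  The numbers are made up for illustration.
    print(p_feature_given_state(["57", "98"], {"57": 0.30, "98": 0.10}))  # 0.37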
3.3 Document Scoring

The Bayesian inversion described in Section 2.3 yields, for each document, the posterior probability that the document is relevant to each state. We calculate the posterior probability that the document is relevant to each topic by summing the posterior probabilities of all of the states in which the topic appears. For example, the posterior probability for topic 57 is the sum of the posterior probabilities of the five states in which topic 57 is relevant (refer to Table 3.4): (57 97 98), (57 97 -98), (57 -97 98), (57 -97 -98), and (57 74).

The final list of documents for each topic contains the top 1000 documents, ranked in descending order according to the posterior probability that they are relevant to that topic.

4 Experimental Environment

4.1 Hardware and Software Environment

The system was developed by one programmer over a period of six months. It was written in C using the Symantec Think C 5.0 compiler under Macintosh System 7.0.1. The experiments were run on two machines, depending on availability: a Macintosh IIfx and a Centris 650. Both machines had 8 MB of memory and a 500 MB hard disk.

4.2 Summary of System Data Structures

Table 4.1 lists the data structures used by the system. Each is listed with the sections of this paper in which it is described and with an indicator of whether it was provided by NIST (N), assessed manually by us (M), or generated automatically by the system (A).

Data Structure                                        Sections        Source
Descriptions of Routing Topics                        3.1             N
Training Documents                                    3.1             N
Candidate Feature List                                3.1             A/M
Final Feature List                                    3.1, 4.3.2      A
Relevance Judgements on the Training Documents        3.2.1           N
Topic Relationships                                   3.2.1           A/M
State Prior Probabilities                             3.2.3, 4.3.2    M
Feature Conditional Probabilities                     3.2.4           A
Test Documents                                        3.3             N
Document Posterior Probabilities                      3.3             A
Final List of Retrieved Documents                     3.3             A

Table 4.1: Data structures used by the system

The candidate feature list could be defined entirely automatically by sufficiently intelligent procedures, such as phrase and proper-name identification. The state prior probabilities are the only data items that must be assessed manually. The data files needed by the system, other than those provided by NIST, occupy about 100 KB on disk.

4.3 Training Phase

4.3.1 Building the Data Structures

We are a Category B participant in TREC-2. As such, we used a subset (just the WSJ articles) of the full training collection. Of these, we used only the roughly 29,000 articles for which some relevance judgement is available. In addition, because we considered only ten of the TREC-2 routing topics, we used only the 5941 documents that have a relevance judgement on at least one of the ten topics.

The inputs to a training run are the candidate feature list, the list of topic relationships, the training documents, and the training relevance judgements. The outputs of a training run are the final list of features and the feature conditional probabilities. The time to complete a training run is about 1.5 hours.

4.3.2 Defining the Query

The design of the TREC-2 experiment required that, before we received the test data, we submit the definitions of our system and of our particular query to NIST. The query definition includes the final feature list, the list of topic relationships, and the state prior probabilities. All of the other data is determined automatically, as shown in Table 4.1. We ran about two dozen sample experiments that applied the system to the training documents to gauge its performance with different query definitions. In these tests, we varied only two of the data inputs: the state prior probabilities and the final feature list. The list of topic relationships remained constant throughout these tests. It took much longer than expected to finish programming the system. Thus, all of the experiments to define the query were completed in the two weeks immediately preceding the final submission of our query definition to NIST.
4.4 Test Phase

As a Category B participant in TREC-2, we ran our routing experiment on just the SJMN articles. The inputs to the test runs were the final list of features, the feature conditional probabilities, the list of topic relationships, and the test documents. The output of a test run is the final list of retrieved documents. A test run takes about 5.5 hours. This includes decompressing the 150 MB of test data, one file at a time, as it is needed.

5 Results

The TREC-2 designation for our system is "idsra2." Figure 5.1 shows the precision-recall curves for the ten topics we considered, excluding topic 88, for which no TREC-2 participant found any relevant documents. Table 5.1 shows several measures of performance for our results, along with the best, median, and worst values for those measures among all TREC-2 participants. We again exclude topic 88.

Figure 5.1: Precision vs. recall for system idsra2

Topic Number   Relevant Retrieved at 100
               idsra2   Best   Median   Worst
57             17       18     17       0
61             19       19     9        0
74             6        16     6        0
85             33       54     33       1
89             2        3      2        0
90             1        1      0        0
97             25       28     18       1
98             24       24     17       0
99             60       60     52       0

Table 5.1a: Relevant documents in the top 100 retrieved, for idsra2 and for all systems

Topic Number   Relevant Retrieved at 1000
               idsra2   Best   Median   Worst
57             18       19     18       0
61             24       25     24       0
74             11       31     11       1
85             88       115    88       2
89             2        4      2        0
90             1        3      0        0
97             27       32     27       1
98             26       29     26       0
99             70       70     66       0

Table 5.1b: Relevant documents in the top 1000 retrieved, for idsra2 and for all systems

Topic Number   Average Precision
               idsra2   Best    Median   Worst
57             0.387    0.460   0.374    0.000
61             0.464    0.464   0.083    0.000
74             0.008    0.074   0.008    0.000
85             0.174    0.353   0.174    0.000
89             0.081    0.259   0.077    0.000
90             0.025    0.025   0.000    0.000
97             0.383    0.383   0.202    0.002
98             0.282    0.427   0.334    0.000
99             0.700    0.700   0.509    0.000

Table 5.1c: Average precision (as defined for TREC-2), for idsra2 and for all systems

6 Conclusions and Future Directions

We believe that we have made significant progress toward developing an information retrieval architecture that:

* is oriented towards assisting users with stable information needs in routing large amounts of time-sensitive material,
* gives users an intuitive language with which to specify their information needs,
* requires modest computational resources, and
* can integrate relevance feedback and training data with users' judgements to incrementally improve retrieval performance.

We are encouraged by the test results. We have not had very much time to analyze the results, but we intend to try to understand why we did very well on some topics and not so well on others. Very preliminary analysis suggests that the features for the topics on which we did well (e.g., 61 and 99) were much more informative than the features for the topics on which we did very poorly (e.g., 74).

We have many ideas for future research. These ideas fall into three basic categories: probabilistic representation, user interface, and inference methods. The most important improvements we would like to make are in the category of probabilistic representation of the topic and the document. One research goal is to develop a way to intuitively represent relationships between features. Also, we would like to explore more sophisticated feature extractors that recognize phrases, synonyms, and features derived from natural language processing. We believe that achievement of these goals could lead to significant improvements in performance.
In conjunction with these representational improvements, we would like to design a complete user interface that would allow users to make the needed probabilistic judgements easily and intuitively. We would also like to develop an explanation facility that could describe why one document was preferred over another. There are new exact and inexact algorithms that could handle these representational modifications; we would like to experiment with these algorithms to see if they are suitable. We would also like to implement a relevance feedback mechanism based on the Bayesian concept of equivalent sample size.

Finally, while the information retrieval problem has been viewed primarily (and rightly so) as an evidential reasoning problem, we take the position that a decision-theoretic perspective is more accurate, since information retrieval is a decision process. We believe that this perspective can provide additional insight and eventually improve upon the probabilistic approaches that have been developed.

Acknowledgment

We would like to thank Dr. Richard Tong for his assistance and guidance.

Appendix: Bayesian Inference

We calculate the posterior probability that a document is relevant to a state using Bayes' rule and an assumption of conditional independence between the features. The Bayesian inversion formula is:

p(s | document) = p(s | F) = p(F | s) p(s) / p(F)

where s is the state and F is a binary feature vector, with Fi = 1 if feature i is present (the event fi+) and Fi = 0 if feature i is absent (the event fi-). The set of all features that are present is denoted F+; all other features are absent and are in the set denoted F-.

The first term in the numerator is expanded under the assumption that the features are conditionally independent given the state:

p(F | s) = Π over i in F+ of p(fi+ | s)  ×  Π over i in F- of p(fi- | s)

The second term in the numerator is the state prior, p(s), which can be estimated as described in Section 3.2. The denominator is a normalization constant obtained by summing the numerator over all the states:

p(F) = Σ over all s of p(F | s) p(s)

p(F) is the probability that one would observe the particular set of features F.
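A minimal sketch (Python) of this inversion, together with the topic scoring of Section 3.3; the variable names and the toy numbers are hypothetical:

    # Sketch of the Appendix computation and the topic scoring of Section 3.3.
    # `p_feature_given_state[s][i]` plays the role of p(fi+ | s) (e.g., from the
    # noisy-or models), `prior[s]` is p(s), and `doc` is the binary feature vector.

    def state_posteriors(doc, prior, p_feature_given_state):
        """Return p(s | F) for every state s, via Bayes' rule with independent features."""
        unnormalized = {}
        for s, p_s in prior.items():
            likelihood = 1.0
            for i, present in enumerate(doc):
                p = p_feature_given_state[s][i]
                likelihood *= p if present else (1.0 - p)
            unnormalized[s] = likelihood * p_s
        norm = sum(unnormalized.values())          # this is p(F)
        return {s: v / norm for s, v in unnormalized.items()}

    def topic_posterior(topic, posteriors):
        """Sum the posteriors of every state in which the topic is relevant."""
        return sum(p for s, p in posteriors.items() if topic in s)

    # Toy example with two states over one topic ("57") and two features:
    prior = {("57",): 0.01, (): 0.99}
    p_f = {("57",): [0.6, 0.3], (): [0.05, 0.02]}
    post = state_posteriors([1, 0], prior, p_f)
    print(round(topic_posterior("57", post), 3))   # about 0.08 for this toy model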
References

Chang, K. C., & Fung, R. M. (1989). Node aggregation for distributed inference in Bayesian networks. In Proceedings of IJCAI-89. Detroit, MI: Morgan Kaufmann.

Fung, R. M., & Chang, K. C. (1989). Weighing and integrating evidence for stochastic simulation in Bayesian networks. In M. Henrion, R. D. Shachter, L. N. Kanal, & J. F. Lemmer (Eds.), Uncertainty in Artificial Intelligence 5. Elsevier Science Publishers B.V. (North-Holland).

Fung, R. M., Crawford, S. L., Appelbaum, L., & Tong, R. M. (1990). A probabilistic concept-based architecture for information retrieval. In Proceedings of the ACM International Conference on Information Retrieval, Brussels, Belgium.

Heckerman, D. E. (1989). A tractable algorithm for diagnosing multiple diseases. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence (pp. 162-173). Detroit, MI.

Heckerman, D. E. (1991). Probabilistic Similarity Networks. The MIT Press.

Howard, R. A., & Matheson, J. E. (1981). Influence diagrams. In R. A. Howard & J. E. Matheson (Eds.), Readings on the Principles and Applications of Decision Analysis (pp. 721-762). Menlo Park, CA: Strategic Decisions Group.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50, 157-224.

Maron, M. E., & Kuhns, J. L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7, 216-244.

Paice, C. D. (1977). Information Retrieval and the Computer. London.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.

Robertson, S. E., & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science (May-June), 129-146.

Shachter, R. D. (1986). Evaluating influence diagrams. Operations Research, 34, 871-882.

Shachter, R. D., D'Ambrosio, B., & Del Favero, B. (1990). Symbolic probabilistic inference in belief networks. In Proceedings of AAAI-90 (pp. 126-131). Boston, MA: Morgan Kaufmann.

Turtle, H., & Croft, W. B. (1990). Inference networks for document retrieval. In J. L. Vidick (Ed.), Proceedings of ACM SIGIR (pp. 1-24). Brussels, Belgium.