Bayesian Inference with Node Aggregation for Information Retrieval

Brendan Del Favero and Robert Fung
Institute for Decision Systems Research
350 Cambridge Avenue, Suite 380, Palo Alto, CA 94306
idsr@netcom.com

1 Introduction

Information retrieval can be viewed as an evidential reasoning problem. Given a representation of a document (e.g., the presence or absence of selected words and phrases) and a representation of an information need (e.g., topics of interest), the problem of information retrieval is to infer the degree to which the document matches the information need. Since probability theory is the classical choice for automating evidential reasoning, probabilistic approaches to information retrieval are natural and have had a long history, starting in the 1960s (Maron & Kuhns, 1960).

In this paper we describe research that adapts and applies Bayesian networks, a new technology for probabilistic representation and inference, to information retrieval. The technology has substantial advantages over older technologies, including an intuitive representation and a set of efficient inference algorithms. We discuss the Bayesian network technology and probabilistic information retrieval in Section 2 of this paper.

Our research is directed at developing a probabilistic information retrieval architecture that:

* is oriented towards assisting users who have stable information needs in routing (i.e., sorting through) large amounts of time-sensitive material,
* gives users an intuitive language with which to specify their information needs,
* requires modest computational resources (i.e., memory and CPU speed), and
* can integrate relevance feedback and training data with users' judgements to incrementally improve retrieval performance.

Towards these goals, we have developed a system that allows a user to specify: multiple topics of interest (i.e., information needs), qualitative and quantitative relationships between the topics, document features that relate to the topics, and quantitative relationships between these features and the topics. The system runs on a Macintosh II computer and can use training data to estimate any of the quantitative values in the system. We discuss the particular methods we developed and used in our system in Section 3.

We participated in the exploratory group (Category B) of the 1993 Text Retrieval Conference (TREC-2), sponsored by the National Institute of Standards and Technology (NIST). As a participant in the exploratory group, we were tasked with working with a subset of the TREC-2 training and test data. Our training data consisted of Wall Street Journal (WSJ) articles and our test data consisted of San Jose Mercury News (SJMN) articles. We chose a subset of 10 topics out of the 50 TREC-2 routing topics to best illustrate the methods and concepts we developed. The choice of the 10 topics was reported to the TREC coordinators prior to our training runs and, of course, prior to our receipt of the test data. We generated routing queries for each of the 10 chosen topics, trained against the WSJ training set to improve our queries, and tested these queries against the SJMN articles in the test data set.

Our system, including the document handling, feature extraction, inference, and reporting capabilities, was developed entirely within the duration of the TREC-2 project (January 1993 to June 1993). Our TREC-2 effort consisted of the two authors. We describe the experimental set-up in Section 4 and the results of our test run in Section 5.
We are very encouraged by the test results and have many ideas for future research, which we discuss in Section 6.

2 Background

In this section we describe the Bayesian network technology and outline previous efforts in probabilistic information retrieval.

2.1 Bayesian Networks

While probability theory provides a suitable theoretical foundation for evidential reasoning, a technology based on probability theory that is computationally tractable and that includes an effective methodology for acquiring the needed probabilistic information has been lacking. Recent developments in Bayesian networks have provided these features. As the name suggests, the technology is based on a network representation of probabilistic information (Howard & Matheson, 1981; Pearl, 1988).

A Bayesian network represents beliefs and knowledge about a particular class of situations. The use of Bayesian networks is similar to that of expert system technologies. Given a Bayesian network (i.e., a knowledge base) for a class of situations and evidence (i.e., facts) about a particular situation of that class, conclusions can be drawn about that situation. The technology has been used in a wide variety of situations, including medical diagnosis, military situation assessment, and machine vision.

A Bayesian network is a directed acyclic graph in which each node represents a random variable (i.e., a set of mutually exclusive and collectively exhaustive propositions). Each set of arcs into a node represents a probabilistic dependence between the node and its predecessors (the nodes at the other end of those arcs). The primary technical innovation of Bayesian networks is the representation, through the network's structure, of conditional independence relations between the variables in the network. This innovation not only provides an intuitive representation in which to acquire probabilistic information, but also renders inference tractable for a large number of real-world situations. Inferences can be drawn from a Bayesian network with a wide variety of algorithms. There are exact algorithms (Lauritzen & Spiegelhalter, 1988; Shachter, 1986; Shachter, D'Ambrosio, & Del Favero, 1990) as well as approximate algorithms (Fung & Chang, 1989). Inference algorithms compute a probability distribution over some combination of variables in the network given all the evidence represented in the network.

Since the introduction of the Bayesian network technology, several efforts have been made to apply it to information retrieval (Fung, Crawford, Appelbaum, & Tong, 1990; Turtle & Croft, 1990). The results are promising. However, several recent innovations have been made in the Bayesian network technology that have not yet been applied to information retrieval. The innovations involve the representation of conditional independence relations that are finer than those currently represented at the network level. One of these innovations is similarity networks (Heckerman, 1991), developed by David Heckerman at Stanford University. Another innovation for representing the relationship between variables, the "co-occurrence diagram," was developed on this project.

2.3 Probabilistic Information Retrieval

Most probabilistic information retrieval techniques can be illustrated graphically through the use of Bayesian networks.
In a simple formulation, there are n topics of interest (t1, ..., tn) and m identifiable document features (f1, ..., fm). The information retrieval problem is to compute the posterior probability p(ti | f1, ..., fm) for each topic, given a quantification (i.e., the joint probability p(t1, ..., tn)) of the frequency with which the topics appear in the corpus and a quantification (i.e., the conditional probability p(f1, ..., fm | t1, ..., tn)) of the relationship between the "presence" of a topic in a document and the "presence" of features.

Figure 2.1 is a Bayesian network representing a retrieval model with one topic of interest and three features. The node t represents the events "the document is relevant to topic t" and "the document is not relevant to topic t." The nodes fi represent the events "the feature fi is present in the document" and "the feature fi is not present (is absent) in the document." The prior probabilities of the events t and not-t are stored in the t node. The conditional probabilities p(fi | t) are stored in each of the feature nodes.

Figure 2.1: The two-level Bayesian network model of information retrieval

Because of the lack of arcs between the feature nodes, this diagram embodies the assumption, used in many probabilistic systems, that the features are conditionally independent of each other given the topic. Using the Bayesian network form of Bayes' rule, called arc reversal (Shachter, 1986), the posterior probability of the event t can be computed. The inversion formulas are straightforward and computationally feasible (they are described in the Appendix). Figure 2.2 shows the network in Figure 2.1 with the arcs reversed. It represents a model in which knowing whether one or more of the features are present provides information (i.e., the posterior distribution on node t) about whether the topic is relevant to the document.

Figure 2.2: The information retrieval model after Bayesian inversion

To address multiple topics within this framework, a model similar to the one represented by Figure 2.1 would be generated for each topic, and inference would be carried out on each topic separately. However, these multiple models fail to represent possible relationships between the topics. As a consequence, the acquisition of consistent feature-topic probabilities and the interpretation and comparison of multiple topic probabilities are problematic. For example, it would be impossible using this framework to compute the probability that a document was relevant to two selected topics, or to compute the probability that a document was relevant to at least one of two selected topics.

Bayesian networks can be used to explicitly represent the relationships between topics. For example, consider Figure 2.3, in which the two topics t1 and t2 are related.

Figure 2.3: Information retrieval model with two related topics

With multiple topics, the Bayesian inference problem becomes more complicated than before. To compute the posterior probability of each topic, all the other topics must be removed through marginalization, in addition to reversing the topic-feature arcs as before. In addition, any joint distribution over the topics can be computed. To perform these computations, a general inference algorithm is needed. However, there is a way to simplify a multiple-topic network so that it looks like a single-topic network (Figure 2.1).
The topics can be combined into a compound topic (node S in Figure 2.4) (Chang & Fung, 1989), whose range is all possible present-or-absent combinations of t1 and t2. An advantage of this representation is that the same simple computational formulas can be used as before. A disadvantage of this representation is that the intermediate query, S, must contain (in the worst case) 2^n states, representing all possible present-or-absent combinations of its n parent topics. However, this worst case is rarely seen, because the relationships between topics are typically sparse. Building this intermediate query is one of the innovations of our research and is described in Section 3.2.

Figure 2.4: Multiple-topic query represented as a single compound-topic query

3 Retrieval Architecture

This section describes the information retrieval architecture we have developed. The inputs to the system are the descriptions of topics of interest and a set of documents to test. The output of the system is a list of documents ranked by their degree of relevance to each topic. The system computes the degree of relevance as the probability that a document is relevant to a topic, for every document-topic pair. To help in this task, the system is given a training set of documents to which relevance judgements have been attached.

There are several components to our retrieval system. The feature extraction component, described in Section 3.1, translates a document from its raw text form to a simpler internal representation that the system can use. The query construction component, described in Section 3.2, translates the description of a set of topics into an internal representation. The document scoring component, described in Section 3.3, uses Bayesian inference to calculate a measure of a document's relevance to a topic, given the internal representations of both document and topic.

3.1 Feature Selection and Extraction

Any observable characteristic of the text of a document that can be clearly defined (e.g., as either present or absent) may be regarded as a feature. Our system looks for two types of features in the text of a document: single words and pairs of adjacent single words. If a feature appears at least once in a document, it is counted as "present." Otherwise, it is counted as "absent." The internal representation of a document is therefore a binary vector, with each element indicating the presence or absence of the corresponding feature in the document. The system removes many common suffixes from a word using Paice's stemming rules (Paice, 1977). This means, for example, that if either of the words "walking" or "walks" is present in a document, the system considers the root word "walk" to be present in the document.

The system must be given a list of features for which to look. This target list was constructed in three steps. First, a list of candidate features was generated from the descriptions of the 50 TREC-2 routing topics. The text of the descriptions was broken up into individual words, from which the suffixes (if any) were removed. Duplicate words and common words (stop words) were removed. A number of two-word features (such as phrases and proper names) were identified by hand. This procedure created a list of about 1400 candidate features. Second, the system extracted relative frequency information for each of these features from the training documents. Then, for each topic (the 10 chosen topics plus an additional topic representing "none of the above"), the system sorted the candidate features in descending order according to the F4 formula of Robertson and Sparck Jones (Robertson & Sparck Jones, 1976), which we used as a measure of the ability of a feature to characterize a topic. Third, the top 30 features for each topic (and the top 60 features for "none of the above") were combined into a single list. After removing duplicates, this yielded the final list of 229 features. We tried numerous feature selection strategies and settled on this one as the most satisfactory.
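As an illustration of this document representation, the following minimal sketch (in Python; the actual system was written in C) turns raw text into the binary feature vector described above. The suffix stripping is a crude stand-in for Paice's stemming rules, and the feature list shown is hypothetical.

    # Minimal sketch of the document representation described in Section 3.1.
    # The suffix stripping below is a placeholder for Paice's stemming rules,
    # and the target feature list is hypothetical.
    import re

    SUFFIXES = ("ings", "ing", "ed", "es", "s")  # illustrative only

    def stem(word):
        """Strip one common suffix, if any (stand-in for Paice's rules)."""
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    def document_to_vector(text, target_features):
        """Return a binary vector: 1 if the feature appears at least once."""
        words = [stem(w) for w in re.findall(r"[a-z]+", text.lower())]
        present = set(words)                                        # single words
        present.update(" ".join(p) for p in zip(words, words[1:]))  # adjacent pairs
        return [1 if f in present else 0 for f in target_features]

    # Hypothetical usage:
    features = ["walk", "fiber optic", "oil"]
    print(document_to_vector("Walking the fiber optic line", features))  # [1, 1, 0]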
3.2 Query Representation and Construction

In preparation for the inference step, the 10 single-topic queries are combined into one multiple-topic query. As mentioned in Section 2.3, this aggregation can in the worst case require 2^10 different states in the multiple-topic query. In this section, we describe how to reduce the number of states required by considering the relationships between the topics. Once the states of the query have been structured, we must assign numerical values to the Bayesian model. We describe how to generate estimates of the prior probabilities of each of these states in Section 3.2.3. We describe in Section 3.2.4 how to define, for each feature and state, the conditional probability that the feature appears in a document given that the document is relevant to that state.

3.2.1 Pairwise Relationships Between Topics

There are six possible pairwise relationships between any two topics t1 and t2:

1. t1 and t2 are mutually exclusive if there is no document relevant to both topics.
2. t1 is a subset of t2 if all documents relevant to t1 are also relevant to t2.
3. t2 is a subset of t1 if all documents relevant to t2 are also relevant to t1.
4. t1 and t2 are equivalent if t1 and t2 are subsets of each other (they satisfy Relations 2 and 3).
5. t1 and t2 are dependent if knowing that a document is relevant to t1 gives some information about whether the document is relevant to t2.
6. t1 and t2 are independent if knowing that a document is relevant to t1 gives no information about whether the document is relevant to t2.

Each of Relations 1-5 is a type of dependence between topics. In a belief network, topics satisfying Relations 1-5 would be connected by an arc. To ensure that two topics satisfy only one of the relations, Relation 5 (dependence) is defined as any type of dependence different from those of Relations 1-4. Relation 6, independence, is represented in a belief network by the absence of an arc between the topics. The distinction between dependence (Relation 5) and independence (Relation 6) is useful in calculating the probabilities of combinations of the topics, as described in Section 3.2.3.

Relations 1-4 can be identified by testing whether the defining conditions are satisfied in the training set. If the topics satisfy none of the first four relations, then a chi-square test can distinguish between Relations 5 and 6. If there are too few documents with relevance judgements for both topics to make reliable conclusions (our cutoff was 13 documents), the system makes an assessment and then prompts the user to verify it manually.
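The following sketch shows one way the relationship tests described above could be implemented from relevance judgements. It is a hypothetical illustration, not the system's actual code: the hand-computed 2x2 chi-square statistic stands in for whatever test the system used to separate Relations 5 and 6, and only the 13-document cutoff is taken from the text.

    # Hypothetical sketch of the pairwise-relationship test of Section 3.2.1.
    # `docs` is a list of per-document sets of relevant topics, restricted to
    # documents that have relevance judgements for both t1 and t2.

    def classify_pair(docs, t1, t2, min_docs=13, chi2_threshold=3.84):
        both    = sum(1 for d in docs if t1 in d and t2 in d)
        only1   = sum(1 for d in docs if t1 in d and t2 not in d)
        only2   = sum(1 for d in docs if t2 in d and t1 not in d)
        neither = len(docs) - both - only1 - only2

        if len(docs) < min_docs:
            return "ask user"               # too little data: manual verification
        if both == 0:
            return "mutually exclusive"     # Relation 1
        if only1 == 0 and only2 == 0:
            return "equivalent"             # Relation 4
        if only1 == 0:
            return "t1 subset of t2"        # Relation 2
        if only2 == 0:
            return "t2 subset of t1"        # Relation 3

        # Relations 5 vs. 6: chi-square test of independence on the 2x2 table.
        n = len(docs)
        r1, r2 = both + only1, only2 + neither   # row totals (t1 relevant / not)
        c1, c2 = both + only2, only1 + neither   # column totals (t2 relevant / not)
        chi2 = 0.0
        for observed, row, col in [(both, r1, c1), (only1, r1, c2),
                                   (only2, r2, c1), (neither, r2, c2)]:
            expected = row * col / n
            chi2 += (observed - expected) ** 2 / expected
        return "dependent" if chi2 > chi2_threshold else "independent"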
To fully explore the pairwise relationships between topics, we selected ten of the fifty TREC-2 routing topics to use in our models and tests. The topics are listed in Table 3.1.

Topic Number   Topic Description
57             Financial Health of MCI
61             Israeli Role in the Iran-Contra Affair
74             Conflicting Policies of the US Government
85             Official Corruption in any Government
88             Crude Oil Price Trends
89             Downstream Investing by OPEC Members
90             Data on Proven Reserves of Oil and Gas
97             Fiber Optic Applications
98             Fiber Optic Equipment Manufacturers
99             Iran-Contra Affair

Table 3.1: List of the 10 topics we selected

The ten topics were selected to provide an interesting mixture of relationships among the topics. Before looking at the relevance information in the training set, we assessed these relationships manually, as shown in Table 3.2. The table is symmetrical about the main diagonal.

Table 3.2: Pairwise relationships between the 10 TREC topics, assessed manually before looking at the training data (e = equivalent, s = subset, sd = subset or dependent, d = dependent, i = independent, empty cell = mutually exclusive)

Table 3.3 shows the topic relationships that were generated automatically from the relevance judgements on the training documents. Manual verification of the system's assessment was required in about half of the cases (marked with an asterisk). The table is symmetrical about the main diagonal.

Table 3.3: Pairwise relationships determined from the data (* = manual intervention required)

The system's assessments match the manual assessments quite well. The system found even more mutually exclusive pairs than we had intuitively thought. One surprise is the relationship between topic 61 ("Israeli Role in the Iran-Contra Affair") and topic 99 ("Iran-Contra Affair"). One might assume that 61 is a subset of 99, but there are indeed documents in the training set that are relevant to 61 but not to 99. Thus, the relation between 61 and 99 is dependence rather than subset.

The information in Table 3.3 can be expressed as a directed graph, as shown in Figure 3.1. Each topic is a node. Two topics are connected by an arc if there is at least one document relevant to both; there is no arc between mutually exclusive topics. The arcs marked "i" connect independent topics, the unmarked arcs connect topics that are dependent, and a directed arc points to a subset from its superset. Figure 3.1 is not a belief network. It may be considered a "co-occurrence diagram," since topics that are relevant together (that co-occur) in the collection are connected.

Figure 3.1: Relationship diagram among the 10 TREC topics

3.2.2 Generating States from the Topics

The states of the multiple-topic query can be generated automatically by enumerating all possible "relevant or not relevant" combinations of the topics, subject to two rules:

1. Two mutually exclusive topics cannot both be relevant within the same state.
2. A topic that is a subset of another cannot be relevant within a state unless its superset topic is also relevant.

By construction, there is always one state, U, to which a document is relevant if it is relevant to none of the topics. For the ten topics, these rules reduce the number of states drastically, from the theoretical maximum of 2^10 = 1024 states to the actual number of 28 states.
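A minimal sketch of this state enumeration, with hypothetical inputs standing in for the relationships of Table 3.3:

    # Sketch of the state enumeration described in Section 3.2.2.  The lists
    # `mutually_exclusive` and `subset_of` are hypothetical inputs; the real
    # system derived them from the pairwise relationships in Table 3.3.
    from itertools import product

    def enumerate_states(topics, mutually_exclusive, subset_of):
        """Yield each admissible state as the frozenset of topics relevant in it."""
        for bits in product([False, True], repeat=len(topics)):
            relevant = {t for t, b in zip(topics, bits) if b}
            # Rule 1: two mutually exclusive topics cannot both be relevant.
            if any(a in relevant and b in relevant for a, b in mutually_exclusive):
                continue
            # Rule 2: a subset topic cannot be relevant unless its superset is.
            if any(sub in relevant and sup not in relevant
                   for sub, sup in subset_of):
                continue
            yield frozenset(relevant)   # the empty set plays the role of state U

    # Hypothetical example with three topics:
    states = list(enumerate_states(
        ["85", "61", "99"],
        mutually_exclusive=[],            # no exclusions among these three
        subset_of=[("61", "85")],         # e.g., suppose 61 is a subset of 85
    ))
    print(len(states))                    # 6 of the 8 combinations survive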
A state is identified by listing the topics to which a document is relevant (or not relevant) if it is relevant to that state. For instance, documents relevant to the state (61 -74 85 -99) are relevant to topics 61 and 85 and are not relevant to topics 74 and 99. The list of states appears in the first column of Table 3.4.

3.2.3 Generating State Priors

In theory, the prior probabilities of the states can be calculated from their relative frequencies in the training set. However, since there are very few documents in the TREC-2 training set that were evaluated for three or more topics, a different way to estimate the priors was required. One estimate is provided by factoring the state priors into products of the priors of smaller compound topics. This can be accomplished without manual intervention. For instance, for the independent topics 88, 89, and 90, only three numbers are needed (the prior probabilities of the three topics in the training set) to compute the probabilities of all seven states containing the three topics. For the state (61 -74 85 -99), two numbers are needed: the prior probability of the compound topic (-74 85 -99) and the probability that topic 61 is relevant given that topic 85 is relevant. The priors obtained by this method are shown in the second column of Table 3.4. They are expressed as inverse frequencies: the number given is the number of weeks (on average) between articles relevant to a state, assuming 1000 documents per week.

State               Average weeks between articles
                    Assessed automatically   Assessed manually
-88 -89  90         < 1                      3
-88  89 -90         < 1                      3
-88  89  90         20                       7
 88 -89 -90         < 1                      2
 88 -89  90         20                       4
 88  89 -90         40                       4
 88  89  90         4000                     9
 57  97  98         10                       8
 57  97 -98         < 1                      5
 57 -97  98         < 1                      6
 57 -97 -98         < 1                      3
-57  97  98         1                        6
-57  97 -98         < 1                      4
-57 -97  98         < 1                      3
 61  74  85  99     33                       20
-61  74  85  99     14                       3
 61  74  85 -99     < 1                      50
-61  74  85 -99     < 1                      3
 61 -74  85  99     < 1                      20
-61 -74  85  99     < 1                      5
 61 -74  85 -99     5                        100
-61 -74  85 -99     2                        2
 74 -85  99         < 1                      3
 74 -85 -99         < 1                      2
-74 -85  99         < 1                      3
 57  74             < 1                      20
 85  98             50
 U

Table 3.4: The states and their prior probabilities, expressed as frequencies (assuming 1000 articles per week), generated by two assessment methods

However, these priors proved unsatisfactory and required manual override, for three reasons. First, the training set is a very biased sample of the WSJ. The training documents were selected by the retrieval systems of TREC-1 to be intentionally relevant to at least one of the topics. Thus, the prior derived from the training set tends to be a gross overestimate of the true prior. Second, the time period of the training set (1987 to 1992) is different from that of the test set (1991 only). The frequency of the topics relative to each other changes over time. For instance, there are many fewer articles on the Iran-Contra affair (relative to the other topics) in 1991 than there were in 1987-1990. We adjusted the priors to match the relative frequencies expected in the test set. Third, the priors are widely divergent from the intuition of any user who reads a newspaper. The numbers suggest, for instance, that about 1 out of every 100 articles in the WSJ (and, by analogy, in the SJMN) is relevant to topic 57, "MCI." Any reader of the WSJ or the SJMN knows that this estimate is much too high; the relative frequency is probably closer to 1 in 500 or 1 in 1000 than to 1 in 100. Thus, the prior probabilities were assessed manually.
It is assumed that the user has some knowledge of the test domain (in this case, articles in the SJMN) and can, with some thought, assess the relative frequencies of the various states as a part of specifying the query for the particular routing request. For each state, we ask the user for the average number of weeks between the publication of articles relevant to that state. This number is presented in the third column of Table 3.4. It can be converted to a prior probability by combining it with the assumption that there are 1000 documents per week; for example, an average of one relevant article every 5 weeks corresponds to a prior probability of 1/5000 = 0.0002. The prior probability of U, p(U), is calculated as one minus the sum of the priors of all the other states, to ensure that the probabilities of all the states together sum to one.

3.2.4 Feature Conditional Probabilities

The inference algorithm requires, for each feature and for each state, the conditional probability of the feature given the state. These probabilities cannot be obtained directly from the relative frequencies in the training set, because there are few documents that are relevant to more than one topic at a time. We approximate these probabilities by using a structure called a noisy-or gate. The noisy-or gate combines the effects of two or more factors, each of which may contribute to the presence of a feature. It is a model of disjunctive interaction, as described in (Pearl, 1988). It has been used in medical decision research to calculate the probability of a particular symptom being present, given diseases that cause the symptom (Heckerman, 1989).

In the context of information retrieval, a feature may be present due to any of the topics that are relevant in the state. For each state-feature pair, we build a noisy-or model. The contributing factors are the topics that are relevant within the state. The effect is the feature's presence or absence in the document. For example, consider a feature f and the state (57 -97 98). The feature may be present due to topic 57 or topic 98. It cannot be present due to topic 97, because that topic is not relevant within the state. Let E1 be the event that the feature is present due to topic 57, and let E2 be the event that the feature is present due to topic 98. Table 3.5 lists all the possible cases of the two uncertain events. Figure 3.4 shows the belief network structure of the noisy-or model; the node with the double wall is a deterministic logical-or gate.

Feature present       Feature present       Feature present
due to 57 (E1)        due to 98 (E2)        at all (E1 OR E2)
Yes                   Yes                   Yes
Yes                   No                    Yes
No                    Yes                   Yes
No                    No                    No

Table 3.5: Possible cases for a noisy-or node

Figure 3.4: Belief network corresponding to a noisy-or gate model

The only case in Table 3.5 in which the feature is absent is the fourth case. Thus, the conditional probability that the feature is absent, given this state, is the probability of that fourth case. The probability that the feature is present is one minus the probability that the feature is absent.
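A minimal sketch of this noisy-or combination, assuming per-topic "cause" probabilities p(feature present due to topic t) are available (presumably derived from the single-topic training frequencies, which the text does not spell out):

    # Sketch of the noisy-or combination described in Section 3.2.4.
    # `p_due_to` maps each topic to the probability that the topic, when
    # relevant, causes the feature to appear; these numbers are hypothetical.

    def p_feature_given_state(relevant_topics, p_due_to):
        """P(feature present | state), treating the per-topic events as independent."""
        p_absent = 1.0
        for t in relevant_topics:
            p_absent *= 1.0 - p_due_to.get(t, 0.0)   # feature not caused by topic t
        return 1.0 - p_absent

    # Example for a feature and the state (57 -97 98): only topics 57 and 98
    # can cause the feature.  The numbers are made up for illustration.
    print(p_feature_given_state(["57", "98"], {"57": 0.30, "98": 0.10}))  # 0.37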
3.3 Document Scoring

The Bayesian inversion described in Section 2.3 yields, for each document, the posterior probability that the document is relevant to each state. We calculate the posterior probability that the document is relevant to each topic by summing the posterior probabilities of all of the states in which the topic appears. For example, the posterior probability for topic 57 is the sum of the posterior probabilities of the five states in which topic 57 is relevant (refer to Table 3.4): (57 97 98), (57 97 -98), (57 -97 98), (57 -97 -98), and (57 74).

The final list of documents for each topic contains the top 1000 documents, ranked in descending order according to the posterior probability that they are relevant to that topic.

4 Experimental Environment

4.1 Hardware and Software Environment

The system was developed by one programmer over a period of six months. It was written in C using the Symantec Think C 5.0 compiler under Macintosh System 7.0.1. The experiments were run on two machines, depending on availability: a Macintosh IIfx and a Centris 650. Both machines had 8 MB of memory and a 500 MB hard disk.

4.2 Summary of System Data Structures

Table 4.1 lists the data structures used by the system. Each is listed with the sections of this paper in which it is described and with an indicator of whether it was provided by NIST (N), assessed manually by us (M), or generated automatically by the system (A).

Data Structure                                        Sections        Source
Descriptions of Routing Topics                        3.1             N
Training Documents                                    3.1             N
Candidate Feature List                                3.1             A/M
Final Feature List                                    3.1, 4.3.2      A
Relevance Judgements on the Training Documents        3.2.1           N
Topic Relationships                                   3.2.1           A/M
State Prior Probabilities                             3.2.3, 4.3.2    M
Feature Conditional Probabilities                     3.2.4           A
Test Documents                                        3.3             N
Document Posterior Probabilities                      3.3             A
Final List of Retrieved Documents                     3.3             A

Table 4.1: Data structures used by the system

The candidate feature list could be defined entirely automatically by sufficiently intelligent procedures, such as phrase and proper-name identification. The state prior probabilities are the only data items that must be assessed manually. The data files needed by the system, other than those provided by NIST, occupy about 100 KB on disk.

4.3 Training Phase

4.3.1 Building the Data Structures

We are a Category B participant in TREC-2. As such, we used a subset (just the WSJ articles) of the full training collection. Of these, we used only the roughly 29,000 articles for which some relevance judgement is available. In addition, because we considered only ten of the TREC-2 routing topics, we used only the 5941 documents that have a relevance judgement on at least one of the ten topics.

The inputs to a training run are the candidate feature list, the list of topic relationships, the training documents, and the training relevance judgements. The outputs of a training run are the final list of features and the feature conditional probabilities. The time to complete a training run is about 1.5 hours.

4.3.2 Defining the Query

The design of the TREC-2 experiment required that, before we received the test data, we submit the definitions of our system and of our particular query to NIST. The query definition includes the final feature list, the list of topic relationships, and the state prior probabilities. All of the other data is determined automatically, as shown in Table 4.1. We ran about two dozen sample experiments that applied the system to the training documents to gauge its performance with different query definitions. In these tests, we varied only two of the data inputs: the state prior probabilities and the final feature list. The list of topic relationships remained constant throughout these tests. It took much longer than expected to finish programming the system. Thus, all of the experiments to define the query were completed in the two weeks immediately preceding the final submission of our query definition to NIST.
4.4 Test Phase

As a Category B participant in TREC-2, we ran our routing experiment on just the SJMN articles. The inputs to the test runs were the final list of features, the feature conditional probabilities, the list of topic relationships, and the test documents. The output of a test run is the final list of retrieved documents. A test run takes about 5.5 hours. This includes decompressing the 150 MB of test data, one file at a time, as it is needed.

5 Results

The TREC-2 designation for our system is "idsra2." Figure 5.1 shows the precision-recall curves for the ten topics we considered, excluding topic 88, for which no TREC-2 participant found any relevant documents. Table 5.1 shows several measures of performance for our results, along with the best, median, and worst values for those measures among all TREC-2 participants. We again exclude topic 88.

Figure 5.1: Precision vs. recall for system idsra2

Topic Number   Relevant Retrieved at 100
               idsra2   Best   Median   Worst
57             17       18     17       0
61             19       19     9        0
74             6        16     6        0
85             33       54     33       1
89             2        3      2        0
90             1        1      0        0
97             25       28     18       1
98             24       24     17       0
99             60       60     52       0

Table 5.1a: Relevant documents in the top 100 retrieved, for idsra2 and for all systems

Topic Number   Relevant Retrieved at 1000
               idsra2   Best   Median   Worst
57             18       19     18       0
61             24       25     24       0
74             11       31     11       1
85             88       115    88       2
89             2        4      2        0
90             1        3      0        0
97             27       32     27       1
98             26       29     26       0
99             70       70     66       0

Table 5.1b: Relevant documents in the top 1000 retrieved, for idsra2 and for all systems

Topic Number   Average Precision
               idsra2   Best    Median   Worst
57             0.387    0.460   0.374    0.000
61             0.464    0.464   0.083    0.000
74             0.008    0.074   0.008    0.000
85             0.174    0.353   0.174    0.000
89             0.081    0.259   0.077    0.000
90             0.025    0.025   0.000    0.000
97             0.383    0.383   0.202    0.002
98             0.282    0.427   0.334    0.000
99             0.700    0.700   0.509    0.000

Table 5.1c: Average precision (as defined for TREC-2), for idsra2 and for all systems

6 Conclusions and Future Directions

We believe that we have made significant progress toward developing an information retrieval architecture that:

* is oriented towards assisting users with stable information needs in routing large amounts of time-sensitive material,
* gives users an intuitive language with which to specify their information needs,
* requires modest computational resources, and
* can integrate relevance feedback and training data with users' judgements to incrementally improve retrieval performance.

We are encouraged by the test results. We have not had very much time to analyze the results, but we intend to try to understand why we did very well on some topics and not so well on others. Very preliminary analysis suggests that the features for the topics on which we did well (e.g., 61 and 99) were much more informative than the features for the topics on which we did very poorly (e.g., 74).

We have many ideas for future research. These ideas fall into three basic categories: probabilistic representation, user interface, and inference methods. The most important improvements we would like to make are in the category of probabilistic representation of the topic and the document. One research goal is to develop a way to intuitively represent relationships between features. Also, we would like to explore more sophisticated feature extractors that recognize phrases, synonyms, and features derived from natural language processing. We believe that achievement of these goals could lead to significant improvements in performance.
In conjunction with these representational improvements, we would like to design a complete user interface that would allow users to make the needed probabilistic judgements easily and intuitively. We would also like to develop an explanation facility that could describe why one document was preferred over another. There are new exact and inexact algorithms that could handle these representational modifications; we would like to experiment with these algorithms to see if they are suitable. We would also like to implement a relevance feedback mechanism based on the Bayesian concept of equivalent sample size.

Finally, while the information retrieval problem has been viewed primarily (and rightly so) as an evidential reasoning problem, we take the position that a decision-theoretic perspective is more accurate, since information retrieval is a decision process. We believe that this perspective can provide additional insight and eventually improve upon the probabilistic approaches that have been developed.

Acknowledgment

We would like to thank Dr. Richard Tong for his assistance and guidance.

Appendix: Bayesian Inference

We calculate the posterior probability that a document is relevant to a state using Bayes' rule and an assumption of conditional independence between the features. The Bayesian inversion formula is:

p(s | document) = p(s | F) = p(F | s) p(s) / p(F)

where s is the state and F is a binary feature vector, with Fi = 1 if feature i is present (the event fi+) and Fi = 0 if feature i is absent (the event fi-). The set of all features that are present is denoted F+; all other features are absent and are in the set denoted F-.

The first term in the numerator is expanded under the assumption that the features are conditionally independent given the state:

p(F | s) = Π over i in F+ of p(fi+ | s)  ×  Π over i in F- of p(fi- | s)

The second term in the numerator is the state prior, p(s), which can be estimated as described in Section 3.2. The denominator is a normalization constant obtained by summing the numerator over all the states:

p(F) = Σ over all s of p(F | s) p(s)

p(F) is the probability that one would observe the particular set of features F.
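A minimal sketch (Python) of this inversion, together with the topic scoring of Section 3.3; the variable names and the toy numbers are hypothetical:

    # Sketch of the Appendix computation and the topic scoring of Section 3.3.
    # `p_feature_given_state[s][i]` plays the role of p(fi+ | s) (e.g., from the
    # noisy-or models), `prior[s]` is p(s), and `doc` is the binary feature vector.

    def state_posteriors(doc, prior, p_feature_given_state):
        """Return p(s | F) for every state s, via Bayes' rule with independent features."""
        unnormalized = {}
        for s, p_s in prior.items():
            likelihood = 1.0
            for i, present in enumerate(doc):
                p = p_feature_given_state[s][i]
                likelihood *= p if present else (1.0 - p)
            unnormalized[s] = likelihood * p_s
        norm = sum(unnormalized.values())          # this is p(F)
        return {s: v / norm for s, v in unnormalized.items()}

    def topic_posterior(topic, posteriors):
        """Sum the posteriors of every state in which the topic is relevant."""
        return sum(p for s, p in posteriors.items() if topic in s)

    # Toy example with two states over one topic ("57") and two features:
    prior = {("57",): 0.01, (): 0.99}
    p_f = {("57",): [0.6, 0.3], (): [0.05, 0.02]}
    post = state_posteriors([1, 0], prior, p_f)
    print(round(topic_posterior("57", post), 3))   # about 0.08 for this toy model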
References

Chang, K. C., & Fung, R. M. (1989). Node aggregation for distributed inference in Bayesian networks. In Proceedings of IJCAI-89. Detroit, MI: Morgan Kaufmann.

Fung, R. M., & Chang, K. C. (1989). Weighing and integrating evidence for stochastic simulation in Bayesian networks. In M. Henrion, R. D. Shachter, L. N. Kanal, & J. F. Lemmer (Eds.), Uncertainty in Artificial Intelligence 5. Elsevier Science Publishers B.V. (North-Holland).

Fung, R. M., Crawford, S. L., Appelbaum, L., & Tong, R. M. (1990). A probabilistic concept-based architecture for information retrieval. In Proceedings of the ACM International Conference on Information Retrieval, Brussels, Belgium.

Heckerman, D. E. (1989). A tractable algorithm for diagnosing multiple diseases. In Proceedings of the Fifth Workshop on Uncertainty in Artificial Intelligence (pp. 162-173). Detroit, MI.

Heckerman, D. E. (1991). Probabilistic Similarity Networks. The MIT Press.

Howard, R. A., & Matheson, J. E. (1981). Influence diagrams. In R. A. Howard & J. E. Matheson (Eds.), Readings on the Principles and Applications of Decision Analysis (pp. 721-762). Menlo Park, CA: Strategic Decisions Group.

Lauritzen, S. L., & Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B, 50, 157-224.

Maron, M. E., & Kuhns, J. L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7, 216-244.

Paice, C. D. (1977). Information Retrieval and the Computer. London.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.

Robertson, S. E., & Sparck Jones, K. (1976). Relevance weighting of search terms. Journal of the American Society for Information Science (May-June), 129-146.

Shachter, R. D. (1986). Evaluating influence diagrams. Operations Research, 34, 871-882.

Shachter, R. D., D'Ambrosio, B., & Del Favero, B. (1990). Symbolic probabilistic inference in belief networks. In Proceedings of AAAI-90 (pp. 126-131). Boston, MA: Morgan Kaufmann.

Turtle, H., & Croft, W. B. (1990). Inference networks for document retrieval. In J. L. Vidick (Ed.), Proceedings of ACM SIGIR (pp. 1-24). Brussels, Belgium.