Machine Learning for Knowledge-Based Document Routing
(A Report on the TREC-2 Experiment)

Richard M. Tong, Lee A. Appelbaum
Advanced Decision Systems (a division of Booz-Allen & Hamilton, Inc.)
1500 Plymouth Street, Mountain View, CA 94043

1 Introduction

This paper describes the experiments performed by Advanced Decision Systems as part of the Second Text Retrieval Conference (TREC-2).1 The overall system we have developed for TREC-2 demonstrates how we can combine statistically-oriented machine learning techniques with a commercially available knowledge-based information retrieval system. As in TREC-1, the tool we use for the fully automatic construction of routing queries is based on the Classification and Regression Trees (CART) algorithm.2 However, in a departure from TREC-1, we have expanded our definition of what constitutes a document "feature" within the CART algorithm, and have also explored how the CART output can be used as the basis of topic definitions that can be interpreted by the TOPIC® retrieval system developed by Verity, Inc. of Mountain View, CA.

Section 2 of the paper reviews the CART algorithm itself and the data structures it produces. Section 3 describes the two basic algorithms we have devised for converting CART output into TOPIC-readable files. Section 4 contains a description of our experimental procedure and an analysis of the official results, as well as data from a series of auxiliary experiments. Section 5 contains some general comments on overall performance. We conclude in Section 6 with a brief discussion of possible future research directions.

1. Requests for further information about the TREC-2 experiments should be directed to the authors at the address above, or electronically to either rtong@ads.com or lee@ads.com.
2. A comprehensive discussion of the CART algorithm can be found in [1]. Details of previous work on the use of CART for information retrieval are presented in [2] and in [3].

2 The CART Algorithm

CART has been shown to be useful when one has access to datasets describing known classes of observations, and wishes to obtain rules for classifying future observations of unknown class, which is exactly the situation in the document routing problem. CART is particularly attractive when the dataset is "messy" (i.e., noisy and with many missing values) and thus unsuitable for more traditional classification techniques. In addition, and particularly important for the document routing application, if it is important to be able to specify both the misclassification costs and the prior probabilities of class membership, then CART has a direct way of incorporating such information into the tree-building process. Finally, CART can generate auxiliary information, such as the expected misclassification rate for the classifier as a whole and for each terminal node in the tree, that is useful for the document routing problem.

2.1 CART Processing Flow

Figure 1 shows how the CART algorithm is used to construct the optimal classification tree based on the training data provided to it. The diagram shows the four sub-processes used to generate the optimal tree, T*. The "raw" training data (i.e., the original texts of the articles), together with the class specifications (i.e., the training data relevance judgments) and the feature specifications (i.e., the words defined to be features), are input to the Feature Extraction Module.
The output is a set of vectors that record the class membership and the features contained in the training data. These vectors, together with class priors and the cost function (both optional), are input to the Tree Growing Module, which then constructs the maximal tree Tmax that characterizes the training data. Since this tree overfits the data, the next step is to construct a series of nested sub-trees by pruning Tmax to the root. The sequence of sub-trees (Tmax > T1 > ... > Tn) is input to the Tree Selection Module, which performs a cross-validation analysis on the sequence and selects the tree with the lowest cross-validation error.3 This is T*.

[Figure 1: CART Processing Flow. Raw training data, class specifications and feature specifications feed the Feature Extraction Module, producing the CART training vectors; these, with the class priors and cost function, feed the Tree Growing Module (largest tree Tmax); pruning produces the nested sub-trees Tmax > T1 > T2 > ... > Tn; and the Tree Selection Module outputs the optimal tree T*.]

In our TREC-2 system, we take T* and convert it into a TOPIC outline file as described in Section 3. That is, rather than use CART itself to perform the routing on the unseen documents (as we did in TREC-1), we use the CART trees as skeletons for TOPIC concepts, and then have TOPIC do the routing. An advantage of this is that we can make use of TOPIC's extensively optimized text database capabilities, thus allowing us to easily generate the output files needed for the official scoring program.

2.2 Data Structures Generated by CART

To illustrate the processing that CART performs, we will use an example taken from the TREC-2 corpus. An example query is as follows:

Tipster Topic Description
Number: 097
Domain: Science and Technology
Topic: Fiber Optics Applications
<desc> Description: Document must identify instances of fiber optics technology actually in use.
<narr> Narrative: To be relevant, a document must describe actual operational situations in which fiber optics are being employed, or will be employed. A document describing future fiber optics use will be relevant only if contracts have been signed concerning the future application.
<con> Concept(s):
1. fiber optic, light
2. telephone, LAN, television
<fac> Factor(s)
<def> Definition(s):
1. Fiber optics refers to technology in which information is passed via laser light transmitted through glass or plastic fibers.
</top>

This is a very comprehensive statement of information need and provides a rich set of features that we can use for CART.4 Our basic procedure is to extract from the information need statement all the unique content words and then stem them, which gives the following list:5

ACT APPL CONCERN CONTRACT DESCRIB DOCU EMPLOY FIB FUT GLASS INFORM LAN LAS LIGHT OPER OPT PAS PLAST REF RELEV SIGN SITU TECHNOLOG TELEPHON TELEV TRANSMIT VIA

3. The algorithm actually minimizes with respect to both the cross-validation error and the tree complexity, so that if two trees have statistically indistinguishable error rates, the smaller of the two is selected as optimal.
4. In general, determining what set of features to use for CART is a matter of experience and judgement. For the TREC-2 corpus we have made use of the information need statements, but other approaches, such as using all the unique words in the training set, are equally valid. In fact, it is this freedom of choice of features that gives CART a great deal of its flexibility.
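As a purely illustrative sketch of this processing flow (not the system used in our experiments), the fragment below uses scikit-learn's CART-style DecisionTreeClassifier as a stand-in for CART. The toy stop list, the crude suffix-stripping stemmer, and the five-fold cross-validation are placeholder assumptions; they are not the stop list of [4], the stemmer of [5], or the original tool.

    # Sketch of the Section 2.1 flow: feature extraction -> tree growing ->
    # pruning -> tree selection, using scikit-learn as a stand-in for CART.
    import re
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    STOP_WORDS = {"a", "an", "and", "are", "be", "been", "in", "is", "it",
                  "must", "of", "or", "the", "to", "which", "will"}

    def crude_stem(word):
        # Placeholder stemmer: strips a few common suffixes only.
        for suffix in ("ations", "ation", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def extract_features(topic_text, stem=True):
        # Unique content words from the information need statement.
        words = set(re.findall(r"[a-z]+", topic_text.lower())) - STOP_WORDS
        return sorted({crude_stem(w) for w in words} if stem else words)

    def to_vector(document_text, features):
        # Binary presence/absence vector, one position per feature.
        tokens = {crude_stem(w) for w in re.findall(r"[a-z]+", document_text.lower())}
        return [1 if f in tokens else 0 for f in features]

    def grow_and_select(docs, relevance, features):
        X = np.array([to_vector(d, features) for d in docs])
        y = np.array(relevance)          # 1 = relevant, 0 = non-relevant
        # Nested sub-trees via cost-complexity pruning (Tmax down to the root).
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
        best_tree, best_score = None, -1.0
        for alpha in path.ccp_alphas:
            tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
            score = cross_val_score(tree, X, y, cv=5).mean()   # cross-validation analysis
            if score >= best_score:      # ties go to the smaller (more pruned) tree
                best_tree, best_score = tree.fit(X, y), score
        return best_tree                 # plays the role of T* in Figure 1

The selected tree plays the role of T* in Figure 1; the weighting of class priors and misclassification costs supported by CART is omitted from this sketch.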
Given this list of features we can generate the CART training vectors by testing all the documents in the training set for the presence or absence of each feature.6 The resulting vectors, together with the ground truth classification, are the input to CART. The output from CART contains information about the sequence of nested decision trees and a specification of the optimal tree.

For our example, the maximal tree is shown in Figure 2. This tree is to be read from left to right. Thus the first test is on the presence of the stem CONTRACT. If the stem is present (i.e., if the test CONTRACT<=0.5 fails) then the tree branches downwards (this is leftward in CART jargon); if the stem is not present (i.e., the test succeeds) then the tree branches upwards (rightward).

class 0 (k=2,Rt=0.230318)
CONCERN<=0.50 (k=2,ac=0,Rt=0.251256)
class 1 (k=2,Rt=0.019586)
TRANSMIT<=0.50 (k=2,ac=0,Rt=0.020938)
class 0 (k=2,Rt=0.000000)
GLASS<=0.50 (k=5,ac=0,Rt=0.261725)
class 1 (k=2,Rt=0.007834)
TRANSMIT<=0.50 (k=2,ac=0,Rt=0.010469)
class 0 (k=2,Rt=0.000000)
CONCERN<=0.50 (k=2,ac=0,Rt=0.010469)
class 0 (k=2,Rt=0.000000)
VIA<=0.50 (k=5,ac=0,Rt=0.303601)
class 1 (k=4,Rt=0.003917)
TRANSMIT<=0.50 (k=5,ac=1,Rt=0.007834)
class 0 (k=4,Rt=0.000000)
SIGN<=0.50 (k=5,ac=0,Rt=0.345477)
class 1 (k=5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=3,Rt=0.000000)
GLASS<=0.50 (k=5,ac=0,Rt=0.031407)
class 0 (k=3,Rt=0.010469)
TRANSMIT<=0.50 (k=3,ac=0,Rt=0.031407)
class 0 (k=3,Rt=0.010469)
VIA<=0.50 (k=3,ac=1,Rt=0.019586)
class 1 (k=3,Rt=0.003917)
CONTRACT<=0.50 (k=7,ac=1,Rt=0.497487)
class 1 (k=4,Rt=0.019586)
SIGN<=0.50 (k=4,ac=1,Rt=0.023503)
class 0 (k=4,Rt=0.000000)
LIGHT<=0.50 (k=6,ac=1,Rt=0.027421)
class 0 (k=4,Rt=0.000000)

Figure 2: Maximal Tree

Associated with each decision node is information that tells:
* what level in the tree the test occurs (the k value): in all cases k ranges from k=1, which corresponds to the maximal tree, to k=kmax, which corresponds to the null tree, where kmax depends on the actual tree grown by CART;
* the class to be assigned to this node if the tree were pruned here (the ac value): in the present case ac=1 for the class that corresponds to a relevant document and ac=0 for the class that corresponds to a non-relevant document; and
* the error rate of the node (the Rt value): this indicates the resubstitution error rate for the specific node.

The terminal nodes in the tree have similar information associated with them. Notice, though, that terminal nodes by definition specify to which class the node corresponds.

CART also generates the following table of error estimates that are used in selecting the optimal tree. Here k is the k-value, |T| is the size of the tree (measured in terms of the number of terminals), R(T) is the resubstitution rate for the overall tree, and Rcv(T) is the cross-validation rate for the overall tree. Recall that CART minimizes with respect to both size and cross-validation rate, so for our example the optimal tree is the one at k=4.

5. We only use the <narr>, <con> and <def> fields, and the stop word list is taken from [4]. The stemming algorithm is taken from [5].
6. In our TREC-1 experiments we also made use of the frequency of occurrence of the features in the documents. Since our ultimate goal here is to generate TOPIC trees we restrict ourselves to just testing for the presence or absence of the feature; this allows us to perform a straightforward conversion between CART and TOPIC.
 k   |T|   R(T)     Rcv(T)
 1   16    0.3100   0.3037
 2   11    0.3140   0.2866
 3    8    0.3206   0.2353
*4    5    0.3323   0.2296
 5    2    0.4043   0.3891
 6    1    0.4975   0.6285

So, pruning the maximal tree so that all remaining nodes have k>4 gives us the optimal tree shown in Figure 3.

class 0 (k=5,Rt=0.261725)
VIA<=0.50 (k=5,ac=0,Rt=0.303601)
class 1 (k=5,Rt=0.007834)
SIGN<=0.50 (k=5,ac=0,Rt=0.345477)
class 1 (k=5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=5,Rt=0.031407)
CONTRACT<=0.50 (k=7,ac=1,Rt=0.497487)
class 1 (k=6,Rt=0.027421)

Figure 3: Optimal Tree Using Stemmed Features

Notice that by pruning away the deeper nodes in the tree we are left with a tree that tests on just five of the original 26 stems, and has three paths that lead to class-1 nodes (i.e., a decision that the document is relevant).

The unstemmed version of our procedure is identical except that we do no stemming of the unique words extracted from the information need statement. For the example, this produces the following list of features:

ACTUAL APPLICATION CONCERNING CONTRACTS DESCRIBE DESCRIBING DOCUMENT EMPLOYED FIBER FIBERS FUTURE GLASS INFORMATION LAN LASER LIGHT OPERATIONAL OPTIC OPTICS PASSED PLASTIC REFERS RELEVANT SIGNED SITUATIONS TECHNOLOGY TELEPHONE TELEVISION TRANSMITTED VIA

and the following error table:

 k   |T|   R(T)     Rcv(T)
 1   36    0.0932   0.3168
 2   33    0.0972   0.2763
 3   20    0.1313   0.2935
 4   10    0.1761   0.2758
 5    7    0.1971   0.2815
*6    6    0.2050   0.2929
 7    5    0.2154   0.3101
 8    4    0.2273   0.3101
 9    2    0.2681   0.3328
10    1    0.4975   0.6264

so that the optimal tree is as shown in Figure 4.

class 0 (k=10,Rt=0.062814)
TELEPHONE<=0.50 (k=11,ac=1,Rt=0.497487)
class 1 (k=8,Rt=0.142139)
TRANSMITTED<=0.50 (k=9,ac=1,Rt=0.153984)
class 0 (k=8,Rt=0.000000)
LIGHT<=0.50 (k=10,ac=1,Rt=0.205312)
class 0 (k=7,Rt=0.000000)
TRANSMITTED<=0.50 (k=9,ac=0,Rt=0.010469)
class 1 (k=7,Rt=0.000000)
ACTUAL<=0.50 (k=9,ac=0,Rt=0.031407)
class 1 (k=9,Rt=0.000000)

Figure 4: Optimal Tree Using Unstemmed Features

3 Transforming CART Trees into TOPIC Outline Specifications

The trees produced by CART are not directly usable by TOPIC because they represent the information we need (i.e., the decision function and the associated decision variables) in a form that is incompatible with the TOPIC knowledge representation. The main problem, then, is to define a transformation from the CART representation to the TOPIC representation that at least preserves the decision information and perhaps augments it so that we get improved routing performance.

In this section we explore two possible strategies for constructing these transformations. The first is a strict re-coding of the information in the CART tree. The second generalizes the intent of the CART tree but adds no new information. Both of these techniques are "automatic," in the sense that, once various parameters have been chosen, the algorithms work without human intervention.

3.1 First Canonical Form

The first canonical form completely preserves the decision function generated by CART and involves a simple mapping of the CART tree into TOPIC outline file format.7

To do this, we begin by observing that each path in the CART tree from the root to a class-1 leaf node constitutes a conjunction of tests of decision variables. Since we have constrained the tests to be of the form "Is word X present or not?" we can easily model this conjunction in TOPIC using AND and NOT operators. Since there will generally be multiple paths to class-1 leaf nodes, the algorithm combines the separate paths in the TOPIC definition using the OR operator.
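As a rough sketch of this path-to-conjunction mapping (again using a fitted scikit-learn tree as a stand-in for the CART output, and with schematic weights rather than the 1-Rt and 1-Rcv values described below), the conversion can be outlined as follows; the outline syntax shown is only an approximation of the TOPIC format.

    # Sketch of the first canonical form: collect every root-to-leaf path that
    # ends in class 1 and render the paths as a TOPIC-style outline.
    def class1_paths(tree, feature_names):
        t = tree.tree_
        paths, stack = [], [(0, [])]                 # (node id, tests so far)
        while stack:
            node, tests = stack.pop()
            if t.children_left[node] == -1:          # leaf node
                if t.value[node][0].argmax() == 1:   # class-1 (relevant) leaf
                    paths.append(tests)
                continue
            word = feature_names[t.feature[node]]
            # left branch: "word <= 0.5" succeeds, i.e. the word is absent
            stack.append((t.children_left[node], tests + [(word, False)]))
            # right branch: the test fails, i.e. the word is present
            stack.append((t.children_right[node], tests + [(word, True)]))
        return paths

    def to_topic_outline(topic_id, paths, topic_weight=0.77, path_weight=0.99):
        # topic_weight and path_weight are placeholders; the real weights are
        # derived from 1 - Rcv (overall) and 1 - Rt (per path) respectively.
        lines = [f"Topic_{topic_id} <Or>",
                 f"* {topic_weight:.2f} TopicStyle_{topic_id}_1 <Or>"]
        for i, tests in enumerate(paths, 1):
            terms = " ".join(f"'{w}'" if present else f"<Not> '{w}'"
                             for w, present in tests)
            lines.append(f"** {path_weight:.2f} TopicPath_{topic_id}_1-{i} <And> {terms}")
        return "\n".join(lines)

Applied to the tree returned by grow_and_select in the earlier sketch, together with the corresponding feature list, this yields one <And> line per class-1 path, combined under a single <Or>.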
To illustrate, the optimal tree for our example contains the following conjunctive descriptions of class-1 leaf nodes:

1. CONTRACT
2. SIGN and not LIGHT and not CONTRACT
3. VIA and not SIGN and not LIGHT and not CONTRACT

which leads directly to a TOPIC outline of the form:

Topic_097 <Or>
* 0.77 TopicStyle_097_1 <Or>
** 0.99 TopicPath_097_1-1 <And> 'CONTRACT' 'LIGHT' 'SIGN' 'VIA'
** 1.00 TopicPath_097_1-2 <And> 'CONTRACT' 'LIGHT' 'SIGN'
** 0.97 TopicPath_097_1-3 <And> 'CONTRACT'

Here we use a notation for topic names that ensures uniqueness. Notice that in this model we use the resubstitution rates (actually 1-Rt) to define a weight for the individual conjuncts and the overall cross-validation rate (actually 1-Rcv) as the weight for the disjuncts. Note also that since TOPIC only uses weights defined to two decimal places we have rounded the weights derived from the CART tree.

7. TOPIC uses a representation for concepts that can be recorded in so-called outline format. A collection of such specifications is used by TOPIC to build the concept trees used for retrieval.

3.2 Second Canonical Form

The second canonical form makes use of the decision variables chosen by CART but not the actual decision function. This model is based on two observations:
* first, that the set of variables used by CART is, when taken as a whole, indicative of the general topic of the information need statement, and
* second, that every variable used in the tree is on the path to at least one class-0 node and at least one class-1 node. Thus, from an information retrieval perspective, all the decision variables have some contribution to make to an assessment of the relevance of a document.

So rather than use the specific decision function constructed by CART, we can replace it with one that can be thought of as "the more features present in a document the better." This is modelled straightforwardly in TOPIC using the ACCRUE operator. To illustrate, the maximal tree for our example contains the following set of decision variables:

CONCERN CONTRACT GLASS LIGHT SIGN TRANSMIT VIA

which leads directly to a TOPIC outline of the form:

Topic_097 <Or>
* 0.70 TopicStyle_097_2 <Accrue>
** 0.25 'CONCERN'
** 0.75 'CONTRACT'
** 0.75 'GLASS'
** 0.75 'LIGHT'
** 0.75 'SIGN'
** 0.75 'TRANSMIT'
** 0.75 'VIA'

The weighting scheme we have used gives higher values to variables (features) that are in the optimal tree, intermediate values to variables on the fringe of the optimal tree (there are none of these in the current example), and lower values to features outside the optimal tree.8 At this point the specific values chosen represent our "best guess" at a weighting scheme; further experimentation will undoubtedly reveal a better strategy. As in the first canonical form, the overall weight for the TOPIC tree is based on the cross-validation rate for the maximal tree.

4 The TREC-2 Experiments

For TREC-2 we again focused only on the document routing problem. Since our technique requires training data it does not easily lend itself to the ad hoc retrieval problem, so rather than "force-fit" it we chose to generate four sets of results for the routing queries (Topics 51-100). Each set of results was generated totally automatically. The result sets are labelled ads1, ads2, ads3, and ads4, and the table below shows to which combinations of features and TOPIC models they correspond.
Table 1: Results Identification

Result Set   Word Features   TOPIC Model
ads1         stemmed         model-1
ads2         unstemmed       model-1
ads3         stemmed         model-2
ads4         unstemmed       model-2

4.1 The Experimental Procedure

The experimental procedure for TREC-2 consists of five basic steps. We briefly describe each of these:

* First, we generated the CART training data from the information need statements and the ground truth files (i.e., the qrels) provided by NIST. This produced two feature sets for each topic (corresponding to the stemmed and unstemmed versions of the features), and two sets of training vectors labelled with the ground truth information.9 Since CART is a statistically-oriented classifier, we decided to minimize the "noise" in the training sets by using only the Wall Street Journal articles identified in the qrel files. Further, for all but Topics 80 and 81, we used just the Wall Street Journal articles on Disk 2.
* Second, we grew the CART trees from this training data. Since we had two sets of training data for each topic, we grew two trees for each topic.
* Third, we used the algorithms described in Section 3 to convert the CART trees into a TOPIC-readable form. This produced four TOPIC definitions for each of the information need statements. Table 1 above shows the various combinations.
* Fourth, we ran the TOPIC definitions against the indexed unseen data.10 Again, to minimize noise effects, we used only the Associated Press articles on Disk 3 to generate our official results.
* Fifth, we sorted and merged the results generated by TOPIC and converted them into the TREC format for scoring by NIST.

8. A variable is in the optimal tree if its k-value is greater than k*; it is on the fringe if k=k*; and it is outside the optimal tree if k<k*. Note that in general the individual features appear at multiple locations in the tree. Our strategy is to remove duplicates by retaining the instance with the highest k-value.
9. The feature specification and extraction procedure we used is identical to that used in TREC-1 and is described in detail in the TREC-1 proceedings. The only differences are the addition of a stemmed version of the features and the fact that we do not make use of the feature count information.
10. We are grateful to Verity Inc. for allowing us to have access to their computer systems and databases.

4.2 Discussion of Official Results

The official results for ads1 and ads2, together with the unofficial results for ads3 and ads4, are shown in Table 2. Although we generated four sets of results, resource constraints at NIST resulted in only ads1 and ads2 being officially scored. References in the remainder of the paper to scores associated with ads3 and ads4 are to the unofficial scores generated by us using the TREC-2 scoring program and the published qrels for the routing topics.

Table 2: TREC-2 Results (AP Only)

Run ID   No. Retr.   No. Rel.   Rel. Ret.   Avg. Prec.   Exact Prec.
ads1     40,423      5,677        822       0.0195       0.0390
ads2     33,034      5,677      1,468       0.0821       0.1092
ads3     49,006      5,677      1,182       0.0168       0.0374
ads4     50,000      5,677      1,847       0.0630       0.0868

The first observation is that the trees built using exact words as features (i.e., results ads2 and ads4) had higher precision than those built using word stems. We believe that the explanation for this is that by using stemmed versions of the features we added a significant amount of "noise" to the sample space. That is, given the relatively small size of the training sets, using stems tends to reduce the discriminating power of any given feature with respect to the training sets.
This manifests itself indirectly in two ways. First, the optimal trees built with stems are generally smaller than those built using exact words; and, second, optimal trees built with stems have higher cross-validation error rates than those built using exact words.

The second observation is that the TOPIC trees built using model-2 (i.e., results ads3 and ads4) had better recall than those built using model-1. The explanation for this is straightforward. Since the model-2 trees essentially use all the features extracted from the information need statement in a generalized disjunction, they provide much broader coverage than the model-1 trees, which often use just one or two of the features.

The third observation is, of course, that these are not strong results, since on average all the models performed in the low end of the scores reported by NIST in the summary routing table. This is somewhat disappointing since our results in TREC-1 led us to believe that we might be able to do significantly better.11 Notwithstanding the fact that we did not explore some of the ideas discussed in the TREC-1 paper (e.g., the use of concepts rather than words as features, and the use of surrogate split information), we are now inclined to the view that the output from tools like CART is best used as the basis for manually constructed routing topics. To begin to explore this idea, we performed a number of auxiliary tests that we report in the following section.

11. We do note, however, that for a number of topics we did rather well in comparison with other systems (i.e., Topics 51 and 75), and that in absolute terms we produced a number of trees that had greater than 30% R-precision (i.e., for Topics 52, 58, 78 and 93).

4.3 Auxiliary Experiments

To explore the idea of using the CART output as the "skeleton" for a manually constructed routing query, we selected two model-2 trees to determine whether a minimal set of "edits" could significantly improve their performance. We selected Topics 52 and 54 since they represented one topic for which the automatically generated tree did well (Topic 52) and one for which the automatically generated tree did poorly (Topic 54). The scores for Topic 52 (for the AP corpus only) are shown below:

Queryid (Num): 52
Total number of documents over all queries
  Retrieved: 1000
  Relevant:   345
  Rel_ret:    328
Interpolated Recall - Precision Averages:
  at 0.00   0.6667
  at 0.10   0.4684
  at 0.20   0.4032
  at 0.30   0.4032
  at 0.40   0.4032
  at 0.50   0.4032
  at 0.60   0.4013
  at 0.70   0.4013
  at 0.80   0.4013
  at 0.90   0.4003
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.3828
Precision:
  At    5 docs: 0.4000
  At   10 docs: 0.3000
  At   15 docs: 0.4667
  At   20 docs: 0.6000
  At   30 docs: 0.6333
  At  100 docs: 0.4100
  At  200 docs: 0.3550
  At  500 docs: 0.3820
  At 1000 docs: 0.3280
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.3739

Here we see that this topic tree does well: recall is excellent and precision is sustained even at high recall levels.
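For reference, a minimal sketch of two of the summary measures quoted in these listings, non-interpolated average precision and R-precision, is given below; this illustrates the definitions only and is not the NIST scoring program used to produce the official figures.

    # Illustrative definitions of two reported measures over a ranked result list.
    def average_precision(ranked_doc_ids, relevant_ids):
        relevant_ids = set(relevant_ids)
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank     # precision at each relevant document
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

    def r_precision(ranked_doc_ids, relevant_ids):
        # Precision after R documents, where R is the number of relevant documents.
        relevant_ids = set(relevant_ids)
        r = len(relevant_ids)
        return sum(1 for d in ranked_doc_ids[:r] if d in relevant_ids) / r if r else 0.0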
In contrast, the topic tree for Topic 54 produces the following results:

Queryid (Num): 54
Total number of documents over all queries
  Retrieved: 1000
  Relevant:    65
  Rel_ret:     64
Interpolated Recall - Precision Averages:
  at 0.00   0.1323
  at 0.10   0.1323
  at 0.20   0.1323
  at 0.30   0.1323
  at 0.40   0.1323
  at 0.50   0.1323
  at 0.60   0.1323
  at 0.70   0.1102
  at 0.80   0.1102
  at 0.90   0.1097
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.0907
Precision:
  At    5 docs: 0.0000
  At   10 docs: 0.0000
  At   15 docs: 0.0000
  At   20 docs: 0.0000
  At   30 docs: 0.0000
  At  100 docs: 0.0500
  At  200 docs: 0.0600
  At  500 docs: 0.1080
  At 1000 docs: 0.0640
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.0308

Thus although recall is very good, precision is completely unsatisfactory. Our conjecture is that the automatically constructed model-2 trees, while generally giving good recall, give poor precision because they contain many extraneous features, or features that should be combined. To illustrate this, we considered the model-2 tree for Topic 52 as a starting point for a manually constructed tree. The initial model-2 tree is:

Topic_52 <Or>
* 0.86 Topic_Style_52_2 <Accrue>
** 0.75 "AFRICA"
** 0.25 "AFRICAN"
** 0.75 "APARTHEID"
** 0.25 "ARMS"
** 0.25 "BAN"
** 0.25 "BLACK"
** 0.25 "COMPANY"
** 0.25 "COMPLIANCE"
** 0.25 "CONTRACTS"
** 0.25 "CORPORATE"
** 0.25 "DISCUSS"
** 0.25 "DOCUMENT"
** 0.50 "DOMINATION"
** 0.25 "GOVERNMENT"
** 0.25 "INTERNATIONAL"
** 0.25 "INVESTMENT"
** 0.25 "ORGANIZATION"
** 0.25 "PRESSURE"
** 0.75 "PRETORIA"
** 0.25 "REDUCTION"
** 0.25 "RESPONSE"
** 0.75 "SANCTIONS"
** 0.75 "SOUTH"
** 0.25 "TIES"
** 0.25 "TRADE"
** 0.25 "UNITED"

Obviously there are a number of features here that are basically "noise" (for example the words "COMPANY" and "RESPONSE"), and other words that are clearly elements of a larger phrase (for example the words "SOUTH" and "AFRICA"). Notice that, in general, words with lower scores are always candidates for elimination.

The result of this pruning exercise was the following revised definition for Topic 52:

Topic_52 <Or>
* 0.86 Topic_Style_52_2 <Accrue>
** 0.50 S_Africa <Accrue>
*** 0.50 "SOUTH AFRICA"
*** 0.50 "PRETORIA"
** 0.50 "SANCTIONS"
** 0.20 Topic_52_Support <Accrue>
*** 0.50 "APARTHEID"
*** 0.50 <Near> "BAN" "TRADE"
*** 0.50 <Near> "BAN" "INVESTMENT"

So although we have added no new features, we have combined "SOUTH" and "AFRICA" and used this together with "PRETORIA" to define a concept called S_Africa. We have also used "APARTHEID", and "BAN" with "TRADE" and "INVESTMENT", to define another concept called Topic_52_Support. Finally, we adjusted the weights to give more prominence to S_Africa than to Topic_52_Support.
The results for this modified topic description are:

Queryid (Num): 52
Total number of documents over all queries
  Retrieved: 1000
  Relevant:   345
  Rel_ret:    312
Interpolated Recall - Precision Averages:
  at 0.00   1.0000
  at 0.10   1.0000
  at 0.20   0.9780
  at 0.30   0.9766
  at 0.40   0.9603
  at 0.50   0.9067
  at 0.60   0.8620
  at 0.70   0.8620
  at 0.80   0.8620
  at 0.90   0.7422
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.8305
Precision:
  At    5 docs: 1.0000
  At   10 docs: 1.0000
  At   15 docs: 1.0000
  At   20 docs: 1.0000
  At   30 docs: 1.0000
  At  100 docs: 0.9700
  At  200 docs: 0.8900
  At  500 docs: 0.6220
  At 1000 docs: 0.3120
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.8493

So although recall decreased slightly (we now retrieve 312 rather than 328 of the 345 relevant documents), precision is improved by nearly 50 percentage points. Obviously, this is a significant improvement, and it was achieved with minimal manual input. The time required to make these changes was only of the order of 10 minutes.

We repeated this exercise with Topic 54, the initial model-2 tree for which is:

Topic_54 <Or>
* 0.93 Topic_Path_54_2 <Accrue>
** 0.75 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 "ATLAS"
** 0.75 "COMMERCIAL"
** 0.75 "CONTRACT"
** 0.75 "DELTA"
** 0.75 "DOCUMENT"
** 0.75 "DOUGLAS"
** 0.75 "DYNAMICS"
** 0.75 "INDUSTRY"
** 0.75 "LAUNCH"
** 0.75 "MARTIN"
** 0.75 "MENTION"
** 0.75 "PAYLOAD"
** 0.75 "PRELIMINARY"
** 0.75 "RELEVANT"
** 0.75 "RESERVATION"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SERVICES"
** 0.75 "SPACE"
** 0.75 "TENTATIVE"
** 0.50 "TITAN"

Using the same kinds of procedures (i.e., removing extraneous words and combining words into phrases) we constructed the following modified tree:

Topic_54 <Or>
* 0.93 Topic_Path_54_2 <Accrue>
** 0.10 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 Atlas_Rocket <Sentence> "ATLAS" "ROCKET"
** 0.75 Commercial_Satellite <Sentence> "COMMERCIAL" "SATELLITE"
** 0.75 "DELTA II"
** 0.75 "MCDONNELL DOUGLAS"
** 0.75 "GENERAL DYNAMICS"
** 0.10 "LAUNCH"
** 0.75 "MARTIN MARIETTA"
** 0.75 "PAYLOAD"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SPACE"
** 0.50 "TITAN"
** 0.75 "CONTRACT"
** 0.75 "LAUNCH SERVICE"

Notice that here we actually added words that were part of obvious proper names (i.e., the "GENERAL" of "GENERAL DYNAMICS", the "MARIETTA" of "MARTIN MARIETTA", and the "MCDONNELL" of "MCDONNELL DOUGLAS"), but otherwise nothing was added. We also adjusted the weights on "AGREEMENT" and "LAUNCH" to de-emphasize their importance. The results of running this modified query are:

Queryid (Num): 54
Total number of documents over all queries
  Retrieved: 1000
  Relevant:    65
  Rel_ret:     65
Interpolated Recall - Precision Averages:
  at 0.00   0.5800
  at 0.10   0.5800
  at 0.20   0.5800
  at 0.30   0.5800
  at 0.40   0.5800
  at 0.50   0.4592
  at 0.60   0.4592
  at 0.70   0.4324
  at 0.80   0.3355
  at 0.90   0.1474
  at 1.00   0.0657
Average precision (non-interpolated) over all rel docs: 0.3889
Precision:
  At    5 docs: 0.2000
  At   10 docs: 0.1000
  At   15 docs: 0.3333
  At   20 docs: 0.4000
  At   30 docs: 0.4667
  At  100 docs: 0.4500
  At  200 docs: 0.2700
  At  500 docs: 0.1260
  At 1000 docs: 0.0650
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.4615

So, again for very little manual input, we achieved a significant improvement in precision performance, and this time at no cost to recall. It is interesting to compare our results with Verity's scores for these two topics.
To do this we re-scored Verity's TOPIC2 results on the AP corpus alone.12 For Topic 52, Verity's results were:

Queryid (Num): 52
Total number of documents over all queries
  Retrieved: 1000
  Relevant:   345
  Rel_ret:    317
Interpolated Recall - Precision Averages:
  at 0.00   1.0000
  at 0.10   0.9833
  at 0.20   0.9342
  at 0.30   0.8607
  at 0.40   0.8314
  at 0.50   0.7425
  at 0.60   0.7125
  at 0.70   0.6704
  at 0.80   0.6161
  at 0.90   0.3952
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.7159
Precision:
  At    5 docs: 1.0000
  At   10 docs: 1.0000
  At   15 docs: 1.0000
  At   20 docs: 1.0000
  At   30 docs: 1.0000
  At  100 docs: 0.9000
  At  200 docs: 0.7900
  At  500 docs: 0.5820
  At 1000 docs: 0.3170
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.6812

Here we see better recall (317 of the 345 relevant documents retrieved) but with slightly lower precision. The TOPIC2 tree for this topic is much more complex than the one we developed, which explains the better recall. Notice, however, that both trees gave perfect precision for the first 30 documents.

For Topic 54, Verity's TOPIC2 results were:

Queryid (Num): 54
Total number of documents over all queries
  Retrieved: 1000
  Relevant:    65
  Rel_ret:     65
Interpolated Recall - Precision Averages:
  at 0.00   1.0000
  at 0.10   0.9130
  at 0.20   0.9130
  at 0.30   0.9130
  at 0.40   0.9000
  at 0.50   0.7609
  at 0.60   0.5942
  at 0.70   0.5679
  at 0.80   0.5049
  at 0.90   0.2027
  at 1.00   0.0927
Average precision (non-interpolated) over all rel docs: 0.6838
Precision:
  At    5 docs: 1.0000
  At   10 docs: 0.8000
  At   15 docs: 0.8667
  At   20 docs: 0.9000
  At   30 docs: 0.9000
  At  100 docs: 0.5100
  At  200 docs: 0.2900
  At  500 docs: 0.1280
  At 1000 docs: 0.0650
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.5846

This shows the same recall performance (i.e., all 65 relevant documents were retrieved) but substantially better precision performance. Through the first 30 documents TOPIC2 gave excellent results, whereas our modified model-2 result was only half as good. Again, however, the TOPIC2 tree is much more complex, and required more effort to develop.13

12. We are grateful to Verity for allowing us to examine their TREC-2 results in detail.
13. We do not have precise figures for the amount of effort needed to build the Verity TOPIC2 trees, but in general each topic required several hours of effort.

Overall, we are impressed by the improved performance we were able to achieve with minimal manual effort. These auxiliary experiments provide at least suggestive evidence of the value of automatic generation of initial trees. The extent to which this is consistently achievable will require further investigation, and we hope to report on it in TREC-3.

5 Commentary

The official results of our TREC-2 experiments demonstrate that automatic construction of routing queries from training documents is indeed feasible. The queries produced are in fact binary classification trees that are optimal with respect to size (measured in terms of the number of terminals in the tree) and the estimated error rate of the tree. Unfortunately, however, these trees generally appear to have poor performance. In a few cases the trees were comparable with the results from other sites, but they mostly seemed unsatisfactory. We have not been able to identify any specific conditions under which we could expect the CART trees to perform well. We believe, nevertheless, that further experiments to assess the utility of a variety of extensions to the basic CART technique would still be of interest.
In particular, the use of "low-level" concepts as features, surrogate split information, larger training sets, and features drawn from the complete corpus rather than the information need statements are all likely to assist in the construction of more effective trees.

The main focus of our TREC-2 effort has been to explore the idea that the CART trees can form the basis of a semi-automated approach to building knowledge-based descriptions of routing topics. To that end, we developed two techniques for converting CART output into a form that can be used by the TOPIC retrieval engine from Verity, Inc. In the first technique we perform a "lossless" transformation of the CART tree into a TOPIC tree. In the second we generalize the CART tree, although we add no new features.

As the examples contained in the paper show, the TOPIC trees generated in the first canonical form are "skeletal." This is just as we would expect. Since CART is a parsimonious classifier, rather than a broad-based information retrieval tool, it produces decision trees that are minimally complex with respect to the expected misclassification rate.

From an information retrieval perspective, the second canonical form seems more like the trees we might have built by hand. It uses a range of features and gives them weights based on their ability to discriminate among the training data. In effect, we have used CART to indicate which of the features are of use in defining the topic, and have then generalized the CART decision function using our (external) knowledge of the information retrieval problem.

Using these automatically constructed TOPIC trees as a starting point, we conducted a limited series of tests to assess the impact of performing minimal "editing" of these trees. For the two topics selected, we were able to produce significant performance gains with edits that added no new information (at least at the level of the features used) and that took of the order of only a few minutes to implement.

We are encouraged by these results, and, while generalizing them will require a more carefully controlled series of experiments, we are now of the opinion that the most effective role for machine learning techniques in information retrieval is as a tool for producing candidate descriptions of information need. These candidates can then be reviewed by end-users, who can easily make obvious corrections and modifications. We intend to explore this idea in more detail and report on our results in TREC-3.

6 Future Research

There are a number of directions in which we might develop the basic research ideas presented in this paper. We briefly consider some of them here.

We currently use just two classes (relevant and not relevant), but nothing in CART prevents it from working with multiple classes. For the document routing problem, there is a case to be made for adding a third class: unknown relevance. Adding such a class might allow us to make use of larger training sets without the costs associated with developing large ground truths.

One way of extending the skeleton TOPIC trees produced by our tool is to make use of external lexical resources. For example, we might investigate the use of WordNet as a way to expand each of the classification features into a set of related words.
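A minimal sketch of that expansion idea, assuming NLTK's WordNet interface as the lexical resource (the paper does not commit to any particular implementation), might look as follows.

    # Sketch of WordNet-based feature expansion using NLTK
    # (requires nltk.download('wordnet') to have been run once).
    from nltk.corpus import wordnet

    def expand_feature(word, max_terms=10):
        # Collect synonyms (lemma names) from all synsets of the word.
        related = set()
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                related.add(lemma.name().replace("_", " ").upper())
        related.discard(word.upper())
        return sorted(related)[:max_terms]

    # e.g. expand_feature("contract") returns a list of related terms that could be
    # added (with low weights) under the corresponding feature in the TOPIC outline;
    # the exact terms depend on the WordNet version installed.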
Similarly, we might investigate the use of TOPIC's own lexical resources (e.g., the thesaurus and Soundex tools) by replacing the unstemmed words in the topic outline files with the appropriate TOPIC operator (e.g., <SOUNDEX> word, or <THESAURUS> word).

Although we have used CART as the module for building the initial classification tree, we might be interested in exploring other tree-building tools that have been used in the machine learning community: for example, the C4.5 algorithm by Quinlan [6], or various algorithms based on Bayesian methods such as Minimum Message Length models and decision graphs [7]. All of these tools generate decision trees from training data but offer different mathematical philosophies to justify their approaches.

Finally, as in all machine learning problems, the initial choice of features over which to learn is extremely critical to the overall success of the process. An investigation of various extended feature definition tools (e.g., recognizing key phrases and proper names), as well as an exploration of the impact of making different assumptions from TOPIC about how the lexical tokens in the texts are to be treated, would almost certainly yield important insights.

References

1. Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and regression trees. Pacific Grove, CA: Wadsworth & Brooks; 1984.
2. Crawford, S. L.; Fung, R. M.; Appelbaum, L. A.; Tong, R. M. Classification trees for information retrieval. In: Proceedings of the eighth international workshop on machine learning (ML91). San Mateo, CA: Morgan Kaufmann; 1991.
3. Tong, R. M.; Winkler, A.; Gage, P. Classification trees for document routing. In: Harman, D. K., ed. The first text retrieval conference (TREC-1). NIST Special Publication SP-500-207, Gaithersburg, MD; March 1993.
4. Fox, C. A stop list for general text. SIGIR Forum, 24(1-2); Fall/Winter 1989/90.
5. Paice, C. D. Another stemmer. SIGIR Forum, 24(3):56-61; Fall 1990.
6. Quinlan, J. R. C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann; 1992.
7. Oliver, J. J.; Dowe, D. L.; Wallace, C. S. Inferring decision graphs using the minimum message length principle. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence; 1992.