CLASSIFICATION TREES FOR DOCUMENT ROUTING
A Report on the TREC Experiment

Richard Tong, Adam Winkler, Pamela Gage
Advanced Decision Systems (a division of Booz-Allen & Hamilton)
1500 Plymouth Street, Mountain View, CA 94043

1. Introduction

In this paper we describe an approach to document routing on the TREC corpus that employs a technique for the automatic construction of classification trees. The approach makes use of the Classification and Regression Trees (CART) algorithm that has seen application in various areas of machine learning [1,2]. Our initial work with this algorithm [3] has demonstrated that probabilistic structures can be automatically acquired from a training set of documents with respect to a single target concept, or a set of related concepts. These structures can then be applied to individual documents to derive a posterior probability that the document is about a particular target concept. In addition, CART provides the user with a mechanism for explicitly controlling the precision and recall performance trade-offs.

In the CART approach, the task is to provide a direct assessment of the probability that a given concept is in a document given all possible combinations of evidence. Classification trees are used as the probabilistic structure for representing this direct assessment. When used to perform the document routing task, the CART algorithm takes as input a collection of example documents (the training set), each of which consists of a known class assignment and a vector of features (e.g., whether or not the document is about the topic of interest, together with a list of the feature words found in the document). In the general case, the feature values may be boolean, integer, real or nominal and, in addition, may be both noisy and incomplete. The output from CART is a binary decision tree that can be used to classify subsequent documents for which the class is not known. When the training set is noisy, trees of maximal size are invariably too large, in that a smaller tree will perform better in terms of classification accuracy. This is the standard statistical "overfitting" problem. CART addresses the overfitting problem via tree pruning and error estimation algorithms that locate the "right sized" tree, ensuring a parsimonious, accurate classifier.

The remainder of the paper is divided into a number of sections. In Section 2 we briefly describe the processing steps performed by our system, first to generate the optimal classification tree and then to use the tree for classification. In Section 3 we illustrate how the TREC data is processed. In Section 4 we discuss our "official" TREC results. Finally, in Section 5 we offer some observations on the overall value of using CART as the basis of a document routing system.

2. The CART Algorithm

CART has been shown to be useful when one has access to datasets describing known classes of observations, and wishes to obtain rules for classifying future observations of unknown class, exactly as in the document routing problem. CART is particularly attractive when the dataset is "messy" (i.e., is noisy and has many missing values) and thus unsuitable for use with more traditional classification techniques. In addition, and particularly important for the document routing application, if it is important to be able to specify both the misclassification costs and the prior probabilities of class membership, then CART has a direct way of incorporating such information into the tree building process.
Finally, CART can generate auxiliary information, such as the expected misclassification rate for the classifier as a whole and for each terminal node in the tree, that is useful for the document routing problem.

Figure 1 shows how the CART algorithm is used first to construct the optimal classification tree and then to generate a classification decision.

[Figure 1: CART Processing Flow]

The upper part of the diagram shows the four sub-processes used to generate the optimal tree, T*. The "raw" training data (i.e., the original texts of the articles), together with the class specifications (i.e., the training data relevance judgments) and the feature specifications (i.e., the words defined to be features), are input to the Feature Extraction Module. The output is a set of vectors that record the class membership and the features contained in the training data. These vectors, together with class priors and the cost function (these are optional), are input to the Tree Growing Module, which then constructs the maximal tree (Tmax) that characterizes the training data. Since this tree overfits the data, the next step is to construct a series of nested sub-trees by pruning Tmax back to the root. The sequence of sub-trees (Tmax > T1 > T2 > ... > Tn) is input to the Tree Selection Module, which performs a cross-validation analysis on the sequence and selects the tree with the lowest cross-validation error. (The algorithm actually minimizes with respect to both the cross-validation error and the tree complexity, so that if two trees have statistically indistinguishable error rates, the smaller of the two trees is selected as optimal.) This is T*.

The lower part of the diagram shows how T* is used to classify documents of unknown class. As in tree building, the "raw" test data (i.e., the unprocessed texts of the test documents) are passed, together with the feature specifications, to the Feature Extraction Module, which in turn generates a feature vector for each text. These vectors are passed to the Classifier Module, which uses the optimal tree to make the classification decision (i.e., whether the document is relevant or not).
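To make the processing flow concrete, the short sketch below reproduces the grow/prune/select sequence using scikit-learn's CART-style decision trees and minimal cost-complexity pruning. This is an illustration only, not the off-the-shelf C implementation used for the official runs; the feature matrix X, the 0/1 relevance labels y, and the class_weight setting (a rough stand-in for the class priors and misclassification cost function) are assumed inputs.

# Illustrative sketch only: scikit-learn's CART-style trees stand in for the
# C implementation described in the text.  X is an (n_docs x n_features)
# array of feature counts and y the 0/1 relevance labels, both assumed to
# come from the Feature Extraction Module.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def select_optimal_tree(X, y, class_weight={0: 1, 1: 100}):
    # class_weight roughly plays the role of the class priors and the
    # misclassification cost function mentioned above.
    base = DecisionTreeClassifier(class_weight=class_weight, random_state=0)

    # Minimal cost-complexity pruning yields the nested sub-tree sequence
    # Tmax > T1 > ... > Tn, indexed by the complexity parameter alpha
    # (alpha = 0 corresponds to the maximal tree Tmax).
    path = base.cost_complexity_pruning_path(X, y)

    # Cross-validate each sub-tree and keep the one with the lowest error;
    # on ties the larger alpha, i.e. the smaller tree, wins.
    best_alpha, best_error = 0.0, np.inf
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(class_weight=class_weight,
                                      ccp_alpha=alpha, random_state=0)
        error = 1.0 - cross_val_score(tree, X, y, cv=5).mean()
        if error <= best_error:
            best_alpha, best_error = alpha, error

    # Refit the selected sub-tree T* on all of the training data.
    return DecisionTreeClassifier(class_weight=class_weight,
                                  ccp_alpha=best_alpha,
                                  random_state=0).fit(X, y)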
3. Working with the TREC Corpus

Since the application of the CART algorithm to document routing is a relatively unexplored area of research (see [3] for some of our preliminary results with a small test corpus), we chose to participate in TREC as a Category B group. That is, we worked only with the Wall Street Journal articles and used only the first 25 query topics. In addition, since CART is intended to be used with training data, it most readily lends itself to the document routing problem. Thus we did not address the ad hoc retrieval problem in our experiments.

Our primary goal in working with the TREC data was to develop a totally automatic approach to generating document classification trees, working only with the information need statements and the training data provided. That is, we wanted to test the hypothesis that CART would be a valuable tool in situations where the user can formulate an information need in some detail and can provide examples of documents that are relevant and examples of those that are not. Such a scenario is common in organizations that monitor large volumes of real-time electronic documents.

To illustrate how we used the information need statements, consider the following topic description:

Tipster Topic Description
Number: 001
Domain: International Economics
Topic: Antitrust Cases Pending
<desc> Description:
Document discusses a pending antitrust case.
<narr> Narrative:
To be relevant, a document will discuss a pending antitrust case and will identify the alleged violation as well as the government entity investigating the case. Identification of the industry and the companies involved is optional. The antitrust investigation must be a result of a complaint, NOT as part of a routine review.
<con> Concept(s):
1. antitrust suit, antitrust objections, antitrust investigation, antitrust dispute
2. monopoly, bid-rigging, illegal restraint of trade, insider trading, price-fixing
3. acquisition, merger, takeover, buyout
4. Federal Trade Commission (FTC), Interstate Commerce Commission (ICC), Justice Department, U.S. Securities and Exchange Commission (SEC), Japanese Fair Trade Commission
5. NOT antitrust immunity
<fac> Factor(s)
<def> Definition(s):
Antitrust - Laws to protect trade and commerce from unlawful restraints and monopolies or unfair business practices.
Acquisition - The taking over by one company of a controlling interest in another, also called a takeover. The action may be friendly or unfriendly.
Merger - The acquisition by one corporation of the stock of another. The acquiring corporation then retires the other's stock and dissolves that corporation. Therefore, only one corporation retains its identity in a merger.
</top>

Since this is a very comprehensive description it contains many topic-specific words. Recognizing this, our approach is to:

* extract the <desc>, <narr>, <con>, and <def> fields from the topic specification,
* concatenate them, removing the SGML-style tags and the field labels (i.e., the Description:, Narrative:, Concept(s):, and Definition(s): strings), and
* map all the words into lower case, remove duplicates and stop words (we used a stop word list published by Fox [4] that contains 421 words) from the resulting description, and use the remaining list of words as the set of features.

Note that we do no further processing of the topic text. We do no phrasal analysis, no stemming, no term expansion and, in particular, we do nothing with the NOT-clauses. Applying this procedure to the topic above results in the following list of feature words:

acquisition, antitrust, commerce, commission, department, exchange, ftc, fair, federal, icc, identification, interstate, japanese, justice, laws, merger, sec, securities, trade, acquiring, action, alleged, bid, business, buyout, called, companies, company, complaint, controlling, corporation, discuss, dispute, dissolves, document, entity, fixing, friendly, government, identify, identity, illegal, immunity, industry, insider, investigating, investigation, involved, merger, monopolies, monopoly, objections, optional, pending, practices, price, protect, relevant, restraint, restraints, result, retains, retires, review, rigging, routine, stock, suit, takeover, taking, trading, unfair, unfriendly, unlawful

This feature specification is used as input to the Feature Extraction Module (see the upper part of Figure 1), along with the original training data and the relevance judgements associated with the topic.
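The procedure just described is simple enough to sketch in a few lines. The code below is an illustration of the same steps rather than the code used for the official runs; the regular expressions and the small stop-word list are stand-ins (the actual runs used the 421-word Fox stop list [4]).

# Illustrative sketch of the feature-extraction procedure described above.
# The field tags, label strings and stop-word list are stand-ins.
import re

STOP_WORDS = {"a", "an", "and", "are", "as", "be", "by", "for", "from", "in",
              "is", "it", "not", "of", "or", "that", "the", "this", "to",
              "will", "with"}

FIELDS = ("desc", "narr", "con", "def")
LABELS = ("Description:", "Narrative:", "Concept(s):", "Definition(s):")

def topic_features(topic_text):
    # Extract the <desc>, <narr>, <con> and <def> fields and concatenate them.
    parts = []
    for field in FIELDS:
        m = re.search(r"<%s>(.*?)(?=<|$)" % field, topic_text, re.S | re.I)
        if m:
            parts.append(m.group(1))
    text = " ".join(parts)

    # Remove the field labels, map to lower case and split into words.
    for label in LABELS:
        text = text.replace(label, " ")
    words = re.findall(r"[a-z]+", text.lower())

    # Remove duplicates and stop words; the survivors are the feature set.
    return sorted(set(words) - STOP_WORDS)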
A key characteristic of the training data provided for TREC is that there were very few relevance judgments for each topic. Thus we were only able to generate a limited number of training vectors to be used by the CART algorithm. For each document in the training set for which there is a relevance judgement, we determine the number of occurrences of each feature (using only the <text> fields). Thus for each topic we get a set of training vectors that looks like those shown below (augmented here, for presentation purposes, with the topic and document ids):

001 Q0 WSJ870320-0188 (0 (0,1,0,5,0,0,3,...,6))
001 Q0 WSJ870320-0069 (0 (2,0,9,1,0,4,1,...,0))
001 Q0 WSJ870424-0084 (1 (1,3,0,5,2,0,0,...,2))

In each vector the first element indicates whether the document is relevant to the topic (0 means no, 1 means yes) and the remaining elements are the counts of the occurrences of the features in the document. Thus document WSJ870320-0188 is not relevant to Topic 1 and contains zero occurrences of the first feature, one occurrence of the second feature, and so on.

This training data is input to the Tree Growing Module (see Figure 1) along with the cost function and class priors. (In our "official" TREC experiments we set the priors to be P(rel)=0.01 and P(non-rel)=0.99, and the ratio of the cost of misclassifying a relevant document to the cost of misclassifying a non-relevant document to be 100:1. That is, we assumed that there are relatively few relevant documents in the collection and that we want to emphasize recall significantly over precision.) CART then grows the largest tree, prunes this into a set of nested sub-trees and uses cross-validation to suggest the optimal tree. For the example above, the optimal tree is:

class 0 (0.000)
company<=0.50
class 1 (0.981)
federal<=9.00
class 0 (0.000)
takeover<=1.50
class 0 (0.000)
securities<=1.50
class 0 (0.000)
company<=4.50
class 0 (0.000)
fair<=0.50
class 0 (0.000)

This is the binary classification tree against which new documents can be tested. The tree is read from left to right: branches upwards correspond to "yes" answers to the test at a node and branches downwards to "no" answers (so in this listing a node's "yes" sub-tree is printed above it and its "no" sub-tree below it). Class 0 terminals correspond to the "not relevant" decision; Class 1 terminals correspond to the "relevant" decision. The numbers in parentheses are estimates of the probability that the document is relevant (these are generated by CART from re-substitution error estimates and are, therefore, overly optimistic).

Thus in this example, if a document has fewer than 0.5 occurrences of the word company (i.e., the word does not appear) then the document is not relevant. On the other hand, if the document has more than 0.5 occurrences (i.e., the word does appear) then additional features are tested to reach a classification decision. Note that a test often involves multiple occurrences of the feature (e.g., for the word federal). Also note that this tree has only one Class 1 terminal, and that the pattern of features that indicates a relevant document can be written as a logical expression of feature counts. Thus we have:

company>0 & fair & company<5 & securities<2 & takeover<2 & federal<10

as a description of a relevant document.

Other overall system features of note are:

* The exact word is used as the basis of the feature, although in our implementation for TREC the actual test is case insensitive and is a substring match, so the feature fair would match Fair or FAIR or fairs as well as fairground. We do not, though, perform any stemming per se.
* No additional data structures are needed by CART. That is, we did not use any external data (e.g., thesauri or knowledge-bases) to help with tree construction.
* We spent approximately one person-month of effort in developing the system infrastructure needed to handle the TREC data and to produce the official results. We used an "off-the-shelf" version of CART (written in C).
* The data were stored in compressed form and only uncompressed as needed. This was a satisfactory strategy for tree construction, since the training sets involved relatively few documents. However, the time overhead was unacceptable when we came to do the actual test classifications.
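To make the representation concrete, the following short sketch shows one way to produce the count vectors described above; the case-insensitive substring matching follows the implementation note in the list above, and the function names are illustrative rather than part of CART.

# Illustrative sketch of how the training and test vectors are produced.
# Matching is case-insensitive substring matching, as described in the notes
# above (so the feature "fair" also counts "Fair", "FAIR" and "fairground");
# doc_text is the concatenated <text> fields of one article.
def feature_vector(doc_text, features):
    lower = doc_text.lower()
    return [lower.count(f) for f in features]   # occurrence count per feature

def training_vector(doc_text, features, is_relevant):
    # CART input: a known class label (1 = relevant, 0 = not relevant)
    # followed by the feature counts.
    return (1 if is_relevant else 0, feature_vector(doc_text, features))

A new document is then classified by running its count vector down the optimal tree, answering each threshold test in turn.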
4. Official Results and Performance Analysis

ADS submitted two sets of results for the routing queries. In the first set (denoted adsba1) we used the classification trees generated from exactly the information provided by the training data. In the second set (denoted adsba2) we used trees generated from augmented training data. To generate this additional data we randomly selected an additional block of 50 Wall Street Journal articles from the training corpus and then one of us made the relevance judgements with respect to the 25 topics. This data was then added to the original collection of relevance judgements to provide a larger training set from which the second set of trees was grown. As noted above, we used a set of priors that reflects the low density of relevant documents, together with a cost function that encourages recall over precision. We also performed a number of auxiliary tests to help with our interpretation of the official results. These are all described in the following sections.

4.1 The Baseline Experiment

The baseline experiment (adsba1) was designed to explore how well our approach could do with absolutely no manual intervention and with the minimum of training data. So for this experiment we used just those documents in the training set for which there were relevance judgments.

Table 1 shows the performance on the baseline experiment together with the performance from the other Category B systems. We have chosen to show only the number of relevant-retrieved documents at the 200 document cut-off point, since we believe that this gives a more accurate picture of the ability of the system to perform document routing than do the precision and recall numbers (we return to this point in the final section of the paper).

Table 1: Performance on Baseline Experiment (Rel-Ret @ 200)

Topic#   #Rel   adsba1   Max   Median   Min
  1       131        2    67       32     2
  2       172       15    33       21     9
  3       304        3   130       48     3
  4        20        1    18        7     1
  5        55        4    45       29     4
  6        68        4    46       20     4
  7        92        1    46       30     1
  8       133        4    37       16     2
  9       157       13    41       29    13
 10       149       94   109       88    46
 11        61       15    55       26     9
 12        82       14    56       15     3
 13        93        7    93       26     7
 14       156       38    73       52    23
 15       515       29    74       49    23
 16        58        2    44       17     2
 17        69       23    53       23     9
 18        95       38    49       38    14
 19       664       74   147       99    56
 20       274      111   179      121    56
 21        16       12    16       14     0
 22       106       28    79       40     8
 23        30        2    27        7     2
 24       253       37    96       41    29
 25        13        1    12        9     1
Overall performance of the baseline case is mixed, and our analysis does not reveal any obvious correlation between performance and factors such as:

* the number of features extracted from the information need statements (the maximum number was 161, the minimum was 17, with the median being 39),
* the complexity of the topics (some are straightforward, such as Topic 13 "Mitsubishi Heavy Industries Ltd.", whereas others involve complex conditionals, such as Topic 1 "Antitrust Cases Pending"),
* the number of training examples (Table 2 shows the size of the training sets; notice that for adsba1 the set size was approximately 30 for each topic, with approximately 10 relevant instances per topic), and
* the size of the optimal tree (Table 2 also shows the size of the optimal tree generated for each topic; this varied from a tree with only one terminal node to a tree with 10 terminal nodes, with the median being 2 terminal nodes).

In fact, given the paucity of information actually used to generate the classification trees, perhaps the most surprising aspect of these results is how well some of the trees perform. There were many instances in which the adsba1 trees generated the minimum response. However, there were several results around the median score and even one significantly above the median.

As an example of a tree that performed moderately, consider the optimal tree for Topic 22 "Counternarcotics." This had only one decision node, a split on the word coca (recall that our feature extraction algorithm matches coca against any word containing coca as a substring, so coca, cocaine and Coca-Cola would all match). The actual tree is:

class 0 (0.050)
coca<=0.50
class 1 (0.862)

That is, the classification is based on the presence or absence of the word coca. If it is present, then the document is marked as "relevant" and the estimate of the probability of it being relevant is 0.862; if it is not present, then the document is marked as "non-relevant" but in fact still has a small probability (0.050) of being relevant. As we see from the table, this tree identifies 28 out of a possible 106 relevant documents by the 200 document cut-off point.

A tree that performed poorly is the one for Topic 9 "Candidate Sightings." Here the optimal tree had three decision nodes:

class 0 (0.000)
camp<=0.50
class 0 (0.000)
ran<=1.50
class 1 (0.948)
sen<=9.00
class 0 (0.000)

Thus a relevant document is one which contains the word camp, contains the word sen (an abbreviation for Senator) nine times or less, and has at least two occurrences of the word ran. This tree identified only 13 of 157 relevant documents and was the worst performing of the Category B systems.

On the other hand, the tree for Topic 10 "AIDS Treatments" performed very well. It also had only one decision node, but correctly identified 94 of 149 relevant documents:

class 0 (0.126)
azt<=0.50
class 1 (1.000)

Thus relevant documents are those that contain the word azt (the name of a drug for treating AIDS patients).

4.2 The Effect of Additional Training Data

Our second set of official results was designed to investigate the sensitivity of system performance to the size of the training sets.
To provide additional training samples we randomly selected a block of 50 Wall Street Journal articles (the block starting with WSJ870311-0102 and ending with WSJ870324-0001) for which we generated relevance judgements for the first 25 topics. In practice this gave us some additional relevant articles, but mostly contributed to the non-relevant examples. Table 2 shows the effect of adding the additional documents; notice that some articles in the new set were already included in the original training data.

Table 2: Size of Training Sets and Optimal Trees

          Size of Optimal Tree        Size of Training Set
                                      adsba1          adsba2
Topic#    adsba1    adsba2        Total    Rel    Total    Rel
  1          7         7            34       7      83      7 (a)
  2          8        14            31       7      81      8
  3          4         5            30       9      80      9
  4          3         5            29      12      79     12
  5          3         3            30      18      79     18 (a)
  6          3         2            28      15      78     15
  7          2         3            28      10      78     10
  8          1        19            32       7      82      8
  9          4         4            23       4      73      4
 10          2         4            20      12      69     12 (b)
 11          2         2            21      12      71     12
 12          8        11            35       8      85      8
 13          2         2            12       8      62      8
 14         10        13            30       5      80      6
 15          3        13            27       7      77     10
 16          2         3            36       3      86      3
 17          2         2            30      12      80     12
 18          2         3            18      14      68     14
 19          2         3            30      12      80     15
 20          2         3            30      14      80     14
 21          2         2            33      12      82     12 (b)
 22          2         2            28      10      78     10
 23          2         2            29      10      79     10
 24          2         2            48      10      98     10
 25          2         2            39       6      89      6

(a) Some relevant documents in the augmented training set were already in the original training set.
(b) Some non-relevant documents in the augmented training set were already in the original training set.

The new training data did not have a significant impact on the size of the optimal tree, except in those cases where there were additional relevant documents, that is, for Topics 6, 8, 14, 15 and 19. The changes here were quite dramatic. For example, in the case of Topic 8 "Economic Projections" the addition of just one more relevant article changed the optimal tree from one with only one terminal node to one with 19! This suggests, of course, that for this topic the training data do not provide a very representative sample of texts.

Of more interest, however, is whether these additional training data had any effect on the overall performance of the system. Table 3 shows the official results for adsba2. The results for adsba1 are retained for comparison, as are the results for the other Category B systems.

Table 3: Performance with Additional Training Data (Rel-Ret @ 200)

Topic#   #Rel   adsba1   adsba2   Max   Median   Min
  1       131        2       25    67       32     2
  2       172       15        9    33       21     9
  3       304        3       15   130       48     3
  4        20        1        1    18        7     1
  5        55        4       29    45       29     4
  6        68        4       10    46       20     4
  7        92        1       22    46       30     1
  8       133        4        2    37       16     2
  9       157       13       25    41       29    13
 10       149       94       69   109       88    46
 11        61       15        9    55       26     9
 12        82       14        3    56       15     3
 13        93        7       25    93       26     7
 14       156       38       23    73       52    23
 15       515       29       49    74       49    23
 16        58        2       17    44       17     2
 17        69       23       53    53       23     9
 18        95       38       14    49       38    14
 19       664       74       56   147       99    56
 20       274      111       63   179      121    56
 21        16       12        0    16       14     0
 22       106       28        8    79       40     8
 23        30        2        2    27        7     2
 24       253       37       29    96       41    29
 25        13        1        1    12        9     1

As for adsba1, the results are mixed. For 10 topics performance improved; for 13 topics performance got worse; and for 2 topics performance was unchanged. We also do not see any strong correlation between change in performance and change in the number of positive instances in the training set, or change in the optimal tree size.
There is some indication that while the extra positive instances tended to produce significant changes in optimal tree size, these new trees also tend to have poorer performance, suggesting, as we should expect, that the tree construction process is highly sensitive to local changes in these small training sets. Nevertheless there are some interesting results.

For example, the tree for Topic 22 "Counternarcotics" becomes:

class 0 (0.000)
drug<=0.50
class 1 (0.905)

Thus although the tree size is unchanged, the test is now on the word drug instead of coca. As we might expect, this turns out to be a much less generally useful test and the tree only identifies 8 of the relevant articles, actually the minimum retrieved by a Category B system.

The tree for Topic 15 "CEO" is one which shows marked change from the adsba1 version, growing from two decision nodes to twelve, and retrieving 49 relevant documents instead of 29. The tree is:

class 0 (0.000)
chief<=0.50
class 0 (0.000)
executive<=0.50
class 1 (1.000)
company<=0.50
class 0 (0.000)
executive<=1.50
class 0 (0.000)
name<=0.50
class 1 (1.000)
company<=1.50
class 0 (0.000)
resign<=0.50
class 0 (0.000)
chief<=1.50
class 1 (1.000)
name<=3.50
class 0 (0.000)
executive<=9.00
class 0 (0.000)
appoint<=0.50
class 0 (0.000)
ceo<=0.50
class 0 (0.000)

This is a much more complex structure than the other trees illustrated so far. Note that there are three terminal nodes that lead to a document being classified as relevant. Although they are in the same sub-tree defined by the expression:

chief>0 & ceo & ~appoint & executive>0 & executive<10 & name<4

they make minor distinctions based on the words company, name, resign and chief. Thus we have three further tests:

executive<2 & company
executive>1 & company>1 & resign>0 & chief>1
executive>1 & company<2 & name>0

to separate relevant from non-relevant documents.

Finally, the tree for Topic 17 "Measures to Control Agrochemicals" does the best of all the Category B systems, detecting 53 of the 69 relevant documents. The tree has one decision node:

class 0 (0.042)
pesticide<=0.50
class 1 (0.920)

That is, a test for the presence of the word pesticide, a surprisingly simple structure given its performance.

4.3 Sensitivity to the Choice of Optimal Tree

Since the choice of optimal tree from the sequence of nested sub-trees depends on the cross-validation error estimates, and since the number of training samples is rather small, we might expect that the tree selection process is sometimes in error. To explore the sensitivity of the system's performance to errors in selecting the optimal tree we ran an auxiliary experiment for Topic 22 "Counternarcotics". As in the official experiments, we used the adsba1 dataset to generate a sequence of sub-trees, but instead of selecting just one (i.e., T*), we saved all the trees. We then used each tree to classify the test data and tabulated the results.
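In effect, this experiment sweeps the whole pruned sequence rather than trusting the cross-validation choice. A sketch of such a sweep is given below, again with scikit-learn standing in for CART; X_train, y_train, X_test and the per-document test relevance judgements qrels are assumed inputs.

# Illustrative sketch of the auxiliary experiment: every tree in the pruned
# sequence (not just T*) is applied to the test collection and scored at the
# 200-document cut-off.
from sklearn.tree import DecisionTreeClassifier

def sweep_pruned_trees(X_train, y_train, X_test, qrels, cutoff=200):
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)
    results = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        tree.fit(X_train, y_train)
        # Rank test documents by the estimated probability of relevance and
        # count how many of the top `cutoff` are actually relevant.
        scores = tree.predict_proba(X_test)[:, 1]
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])[:cutoff]
        rel_ret = sum(qrels[i] for i in ranked)
        results.append((tree.get_n_leaves(), alpha, rel_ret))
    return results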
The results are shown in Table 4.

Table 4: Performance as a Function of Tree Size

Tree No.   Tree Size   CV Error   Rel-Ret @ 200   Recall @ 200   Precision @ 200
  1            6        0.3932         12             0.1132          0.0600
  2            4        0.5004          1             0.0094          0.0050
  3            3        0.3931         24             0.2264          0.1200
  4 (T*)       2        0.4645         28             0.2642          0.1400
  5            1        0.7866          -                  -               -

The columns of the table record the tree identifier (Tree No.), the size of the tree in terms of the number of terminal nodes (Tree Size), the cross-validation error estimate (CV Error), the number of relevant documents retrieved (Rel-Ret @ 200), and the recall and precision at the cut-off (Recall @ 200 and Precision @ 200).

For this topic, the maximum tree (T1) has six terminal nodes and an estimated error rate of 39%. The minimum tree (T5) has only one terminal node, which in this case classifies all documents as non-relevant, and an estimated error rate of 79%. The optimal tree (T4) has two terminal nodes but an estimated error rate of 46%. We see that the optimal tree did in fact generate the best result, increasing our confidence that the tree selection process is working as intended despite the small sample sizes.

4.4 The Use of Surrogate Split Information

A basic problem with the use of classification trees in the TREC experimental environment is that although we use the probability of being relevant as a mechanism for ordering the system's output, this is a very weak method for generating a ranking. In fact, since we typically have very few "output bins" for any given tree, for most topics we effectively generated no ordering that would discriminate over the top 200 documents. In this situation we need to look for a mechanism for generating a "finer-grained" ranking.

Our second auxiliary experiment was designed to explore the use of the surrogate split information generated by CART as a way of ordering the output. Surrogate splits are a feature of CART used to deal with the problem of missing data. They define alternative splitting criteria in the case that the primary feature measurement is not available. (Of course, in the document routing problem addressed by TREC we do not have the notion of a missing measurement; a word is either present or not. However, it is easy to imagine document routing scenarios in which this does apply, for example when working with noisy transmissions, so that the surrogate split information would be extremely useful.) In TREC trees these appear as additional tests on the frequency of occurrence of words. So, for example, in adsba1 the optimal tree for Topic 22 "Counternarcotics" is:

class 0 (0.050)
coca<=0.50
class 1 (0.862)

and the alternatives to the split on the word coca are:

cocaine<=0.50 [1.00]
colombia<=0.50 [0.84]
drug<=0.50 [0.78]

where the numbers in brackets indicate how correlated the surrogate split is with the optimal split. Thus in this example a split on coca and a split on cocaine are identical in terms of their ability to classify the documents. We also show the next two most highly correlated splits.

To use this information for output ranking we took the output from the adsba1 tree for Topic 22 (i.e., the one shown above) and then, for each document classified as relevant by this tree, we gave it a weighted score based on the number of surrogate split tests it also satisfied. Thus if a document contained only the word coca it received a score of 1.00. If a document contained coca and colombia then it received a score of 1.84. In general, a document received a score that was the sum of the correlation coefficients for those tests that were satisfied. (As yet there is no theoretical justification for this algorithm. It does, however, have the intuitive property that documents that satisfy additional splits get a higher score in proportion to the "power" of those splits.)
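A sketch of this weighting scheme is given below. The surrogate tests and their association measures are assumed to be read from CART's output rather than recomputed here, and the feature indices such as i_cocaine are hypothetical placeholders.

# Illustrative sketch of the surrogate-split re-ranking described above.
# counts[d] is the feature-count vector for document d, relevant(d) is the
# optimal tree's 0/1 decision, and surrogates is a list of
# (feature_index, threshold, weight) tuples taken from CART's output, e.g.
# for Topic 22: [(i_cocaine, 0.5, 1.00), (i_colombia, 0.5, 0.84), (i_drug, 0.5, 0.78)].
def surrogate_rank(doc_ids, counts, relevant, surrogates):
    scored = []
    for d in doc_ids:
        if not relevant(d):
            continue                 # only re-rank documents that T* marks relevant
        score = sum(w for f, thr, w in surrogates if counts[d][f] > thr)
        scored.append((score, d))
    # Highest cumulative surrogate weight first.
    return [d for score, d in sorted(scored, key=lambda x: -x[0])]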
The effect of using this information is shown in Table 5, which gives the impact on system performance for Topic 22 of adsba1.

Table 5: Performance of T* as a Function of Output Ordering

Tree/Output Ordering                       Rel-Ret @ 200   Recall @ 200   Precision @ 200
T*: no additional ordering                       28            0.2642          0.1400
T*: ordering based on surrogate splits           43            0.4057          0.2150

Without any ordering we detected 28 relevant documents in the first 200, but with the ordering scheme just described we were able to improve this to 43 in the first 200. This in turn translated into an increase of 14 points of recall and 7 points of precision at the 200 document point.

The surrogate splits also give us some insights into the overall behavior of the tree selection process as the number of training samples changes. Thus although the optimal tree for Topic 22 in adsba2 was:

class 0 (0.050)
drug<=0.50
class 1 (0.862)

with worse performance than the optimal tree for adsba1, when we look at the top three surrogate splits we see that they are:

cocaine<=0.50 [0.89]
coca<=0.50 [0.88]
government<=0.50 [0.86]

That is, although the optimal split for the augmented training set changed, we still see the importance of the same set of word features, which in turn indicates a certain stability in the underlying feature space. We might predict that as the training set increases in size the splits would also become more stable.

4.5 Commentary

The TREC corpus represents a significant challenge for our system. Our previous results with a small corpus, while encouraging, did not allow us to evaluate how well the technique might do with realistically sized document collections. Our conclusion based on the results we have from TREC is that CART does exhibit some interesting behaviors on a realistic corpus and that, despite the small size of the training sets and the restricted choice of features, for some topics it produces competitive results. So although the overall performance is moderate (relative to the better performing systems at TREC), we believe that the absolute performance (given that the system is totally automatic) is at least encouraging and definitely acceptable in several instances.

Some specific observations on the performance of the current implementation of the CART algorithm are:

* Relying on the re-substitution estimates for the terminal nodes is a very weak method for producing an output ranking. The estimates themselves are not very good and, when combined with optimal trees that emphasize recall over precision, give a largely undifferentiated output. As we noted above, a scheme that makes use of surrogate split information to generate a post hoc ranking shows much promise as a technique for improving our scores in the TREC context.
* While our approach is totally automatic, it is restricted to using as features only those words that appear in the information need statement. This is obviously a limitation, since the use of even simple query expansion techniques (e.g., stemming and/or a synonym dictionary) is likely to provide a richer and more effective set of initial features.
* Using words as features is possibly too "low-level" to ever allow stable, robust classification trees to be produced. At a minimum, we probably need to consider working with concepts rather than individual words.
Not only would this reduce the size of the feature space, but it would probably result in more intuitive trees. The disadvantage of this is that it is not clear where the concepts would come from, other than from a manually constructed knowledge-base of some sort.
* We need to work with much bigger and more representative training sets. Our preliminary experiment in this area shows, not surprisingly, that adding more training examples can lead to dramatic changes in the classification trees.

As a final comment, we would like to suggest that the overall evaluation paradigm used in TREC does not properly assess the performance of systems on the routing task. Although ad hoc retrieval and routing are similar when viewed in terms of the basic technology, systems designed and built to support these two applications have significantly different requirements. In particular, operational routing systems do not usually emphasize output ordering but instead focus on optimizing the trade-off between detection and false alarm rate. In this respect, at least, we believe that recall and fallout are better indicators of routing performance than recall and precision. Furthermore, artificially limiting reported output to the first 200 documents automatically discriminates against those routing systems that actually do attempt to perform the recall/fallout trade-off. A fairer set-up for the routing component of TREC would be to allow systems to report exactly those documents marked as relevant. Comparison of systems would be more complex, since different systems will produce different numbers of documents, but individual scores would give a better picture of routing performance.

5. CART as the Kernel of a Document Routing System

Users today are faced with ever increasing amounts of real-time text that must be sifted for relevant information. This information space is: massive, both in terms of the number of sources and the volume of material available within each source; dynamic, in that sources and their contents are changing constantly, especially within the time horizon of any specific analysis problem; and heterogeneous, in that each source represents information in different ways and independently of the others.

In this environment, users require tools that can perform document filtering and selection and that are easy to learn and use, easily adaptable to changing and ill-defined information needs, portable across sources and analysis domains, and give good precision and recall. With our TREC results in hand, we are in a position to briefly consider the potential role of CART as a central component in such an operational document routing system.

Figure 2 illustrates how such a system might be organized.

[Figure 2: A Document Routing System Concept]

In our scenario, we imagine that documents enter the system in real time and, after preprocessing to extract features, are passed to the detection module, where they are either rejected as being non-relevant to any of the user profiles stored in the system, or are marked as relevant and passed to the router for transmission to users. Note that in our proposed architecture, learning takes place in parallel with the actual operation of the real-time routing system, so that the classification trees can be updated as necessary when new training instances become available or when users' interests change.
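As a purely conceptual sketch of the real-time path in Figure 2, the detector-and-router loop might look like the following; every name here (Profile objects, extract_features, deliver, archive) is a hypothetical placeholder, with each compiled profile assumed to hold an optimal classification tree built off-line by the learning component.

# Conceptual sketch of the real-time path in Figure 2; all names are
# hypothetical placeholders rather than parts of an existing system.
def route(document, profiles, extract_features, deliver, archive):
    vector = extract_features(document)          # Preprocessor
    matched = False
    for profile in profiles:                     # Detector: one compiled tree per profile
        if profile.tree.classify(vector) == 1:   # document judged relevant to this profile
            matched = True
            for user in profile.users:           # Router
                deliver(user, document)
    if not matched:                              # relevant to no stored profile
        archive(document)                        # reject (or archive) the document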
While we would make no claim that CART alone is sufficient to guarantee high-performance detection and routing, we do believe that its ability to work automatically with any size of data set and with any set of specified features means that it can be a very cost-effective component of such a system. Indeed, we believe that it is probably best used as an initial filter to screen out non-relevant documents, and that CART's output might then be fed to a more language-oriented algorithm to decrease the false alarm rate.

To summarize, we consider CART to be an important tool in our arsenal of effective and efficient document detection and routing technologies. While the results from TREC are preliminary, we believe that they do demonstrate that CART has a number of advantages over other approaches, namely:

* classification trees are constructed automatically from specifications of features and a set of examples,
* the learning algorithm generates an "optimal" classifier together with useful auxiliary data and statistics such as the misclassification probability,
* prior class probabilities can be used if known,
* specification of the misclassification cost function provides for direct control of the fallout and recall of the classifier, and
* the classification trees are easily understood and interpreted by end users.

In addition, the CART algorithm is completely language independent, in the sense that it makes no assumptions about the inherent features of the source language of the document; all that it requires are "features" and training examples. Further, the features themselves can be extracted from document externals as well as document internals. In TREC-2 we will explore some of the extensions discussed in Section 4 to show how CART can indeed be integrated into a system to help users who are faced with the task of searching through the "information wilderness."

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA, 1984.
[2] S. L. Crawford. Extensions to the CART Algorithm. International Journal of Man-Machine Studies, 31:197-217, 1989.
[3] S. L. Crawford, R. M. Fung, L. A. Appelbaum, and R. M. Tong. Classification Trees for Information Retrieval. Proceedings of the Eighth International Workshop on Machine Learning (ML91). Morgan Kaufmann, San Mateo, CA, 1991.
[4] C. Fox. A Stop List for General Text. SIGIR Forum, 41:19-35, Fall/Winter 1989/90.