CLASSIFICATION TREES FOR DOCUMENT ROUTING
A Report on the TREC Experiment

Richard Tong, Adam Winkler, Pamela Gage
Advanced Decision Systems (a division of Booz-Allen & Hamilton)
1500 Plymouth Street, Mountain View, CA 94043

1. Introduction

In this paper we describe an approach to document routing on the TREC corpus that employs a technique for the automatic construction of classification trees. The approach makes use of the Classification and Regression Trees (CART) algorithm that has seen application in various areas of machine learning [1,2]. Our initial work with this algorithm [3] has demonstrated that probabilistic structures can be automatically acquired from a training set of documents with respect to a single target concept, or a set of related concepts. These structures can then be applied to individual documents to derive a posterior probability that the document is about a particular target concept. In addition, CART provides the user with a mechanism for explicitly controlling the precision and recall performance trade-offs.

In the CART approach, the task is to provide a direct assessment of the probability that a given concept is in a document given all possible combinations of evidence. Classification trees are used as the probabilistic structure for representing this direct assessment. When used to perform the document routing task, the CART algorithm takes as input a collection of example documents (the training set), each of which consists of a known class assignment and a vector of features (e.g., whether or not the document is about the topic of interest, together with a list of the feature words found in the document). In the general case, the feature values may be boolean, integer, real or nominal and, in addition, may be both noisy and incomplete. The output from CART is a binary decision tree that can be used to classify subsequent documents for which the class is not known. When the training set is noisy, trees of maximal size are invariably too large, in that a smaller tree will perform better in terms of classification accuracy. This is the standard statistical "overfitting" problem. CART addresses the overfitting problem via tree pruning and error estimation algorithms that locate the "right sized" tree, ensuring a parsimonious, accurate classifier.

The remainder of the paper is divided into a number of sections. In Section 2 we briefly describe the processing steps performed by our system, first to generate the optimal classification tree and then to use the tree for classification. In Section 3 we illustrate how the TREC data is processed. In Section 4 we discuss our "official" TREC results. Finally, in Section 5 we offer some observations on the overall value of using CART as the basis of a document routing system.

2. The CART Algorithm

CART has been shown to be useful when one has access to datasets describing known classes of observations, and wishes to obtain rules for classifying future observations of unknown class, exactly as in the document routing problem. CART is particularly attractive when the dataset is "messy" (i.e., is noisy and has many missing values) and thus unsuitable for use with more traditional classification techniques. In addition, and particularly important for the document routing application, if it is important to be able to specify both the misclassification costs and the prior probabilities of class membership, then CART has a direct way of incorporating such information into the tree building process.
Finally, CART can generate auxiliary information, such as the expected misclassification rate for the classifier as a whole and for each terminal node in the tree, that is useful for the document routing problem.

Figure 1 shows how the CART algorithm is used first to construct the optimal classification tree and then to generate a classification decision.

[Figure 1: CART Processing Flow]

The upper part of the diagram shows the four sub-processes used to generate the optimal tree, T*. The "raw" training data (i.e., the original texts of the articles), together with the class specifications (i.e., the training data relevance judgments) and the feature specifications (i.e., the words defined to be features), are input to the Feature Extraction Module. The output is a set of vectors that record the class membership and the features contained in the training data. These vectors, together with class priors and the cost function (these are optional), are input to the Tree Growing Module, which then constructs the maximal tree (Tmax) that characterizes the training data. Since this tree overfits the data, the next step is to construct a series of nested sub-trees by pruning Tmax back to the root. The sequence of sub-trees (Tmax > T1 > T2 > ... > Tn) is input to the Tree Selection Module, which performs a cross-validation analysis on the sequence and selects the tree with the lowest cross-validation error. (The algorithm actually minimizes with respect to both the cross-validation error and the tree complexity, so that if two trees have statistically indistinguishable error rates, the smaller of the two trees is selected as optimal.) This is T*.

The lower part of the diagram shows how T* is used to classify documents of unknown class. As in tree building, the "raw" test data (i.e., the unprocessed texts of the test documents) are passed, together with the feature specifications, to the Feature Extraction Module, which in turn generates a feature vector for each text. These vectors are passed to the Classifier Module, which uses the optimal tree to make the classification decision (i.e., whether the document is relevant or not).
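To make the processing flow concrete, the short sketch below reproduces the grow/prune/select sequence using scikit-learn's CART-style decision trees and minimal cost-complexity pruning. This is an illustration only, not the off-the-shelf C implementation used for the official runs; the feature matrix X, the 0/1 relevance labels y, and the class_weight setting (a rough stand-in for the class priors and misclassification cost function) are assumed inputs.

# Illustrative sketch only: scikit-learn's CART-style trees stand in for the
# C implementation described in the text.  X is an (n_docs x n_features)
# array of feature counts and y the 0/1 relevance labels, both assumed to
# come from the Feature Extraction Module.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def select_optimal_tree(X, y, class_weight={0: 1, 1: 100}):
    # class_weight roughly plays the role of the class priors and the
    # misclassification cost function mentioned above.
    base = DecisionTreeClassifier(class_weight=class_weight, random_state=0)

    # Minimal cost-complexity pruning yields the nested sub-tree sequence
    # Tmax > T1 > ... > Tn, indexed by the complexity parameter alpha
    # (alpha = 0 corresponds to the maximal tree Tmax).
    path = base.cost_complexity_pruning_path(X, y)

    # Cross-validate each sub-tree and keep the one with the lowest error;
    # on ties the larger alpha, i.e. the smaller tree, wins.
    best_alpha, best_error = 0.0, np.inf
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(class_weight=class_weight,
                                      ccp_alpha=alpha, random_state=0)
        error = 1.0 - cross_val_score(tree, X, y, cv=5).mean()
        if error <= best_error:
            best_alpha, best_error = alpha, error

    # Refit the selected sub-tree T* on all of the training data.
    return DecisionTreeClassifier(class_weight=class_weight,
                                  ccp_alpha=best_alpha,
                                  random_state=0).fit(X, y)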
3. Working with the TREC Corpus

Since the application of the CART algorithm to document routing is a relatively unexplored area of research (see [3] for some of our preliminary results with a small test corpus), we chose to participate in TREC as a Category B group. That is, we worked only with the Wall Street Journal articles and used only the first 25 query topics. In addition, since CART is intended to be used with training data, it most readily lends itself to the document routing problem. Thus we did not address the ad hoc retrieval problem in our experiments.

Our primary goal in working with the TREC data was to develop a totally automatic approach to generating document classification trees, working only with the information need statements and the training data provided. That is, we wanted to test the hypothesis that CART would be a valuable tool in situations where the user can formulate an information need in some detail and can provide examples of documents that are relevant and examples of those that are not. Such a scenario is common in organizations that monitor large volumes of real-time electronic documents.

To illustrate how we used the information need statements, consider the following topic description:

Tipster Topic Description
Number: 001
Domain: International Economics
Topic: Antitrust Cases Pending
<desc> Description:
Document discusses a pending antitrust case.
<narr> Narrative:
To be relevant, a document will discuss a pending antitrust case and will identify the alleged violation as well as the government entity investigating the case. Identification of the industry and the companies involved is optional. The antitrust investigation must be a result of a complaint, NOT as part of a routine review.
<con> Concept(s):
1. antitrust suit, antitrust objections, antitrust investigation, antitrust dispute
2. monopoly, bid-rigging, illegal restraint of trade, insider trading, price-fixing
3. acquisition, merger, takeover, buyout
4. Federal Trade Commission (FTC), Interstate Commerce Commission (ICC), Justice Department, U.S. Securities and Exchange Commission (SEC), Japanese Fair Trade Commission
5. NOT antitrust immunity
<fac> Factor(s)
<def> Definition(s):
Antitrust - Laws to protect trade and commerce from unlawful restraints and monopolies or unfair business practices.
Acquisition - The taking over by one company of a controlling interest in another, also called a takeover. The action may be friendly or unfriendly.
Merger - The acquisition by one corporation of the stock of another. The acquiring corporation then retires the other's stock and dissolves that corporation. Therefore, only one corporation retains its identity in a merger.
</top>

Since this is a very comprehensive description it contains many topic-specific words. Recognizing this, our approach is to:

* extract the <desc>, <narr>, <con>, and <def> fields from the topic specification,
* concatenate them, removing the SGML-style tags and the field labels (i.e., the Description:, Narrative:, Concept(s):, and Definition(s): strings), and
* map all the words into lower case, remove duplicates and stop words (we used a stop word list published by Fox [4] that contains 421 words) from the resulting description, and use the remaining list of words as the set of features.

Note that we do no further processing of the topic text. We do no phrasal analysis, no stemming, no term expansion and, in particular, we do nothing with the NOT-clauses. Applying this procedure to the topic above results in the following list of feature words:

acquisition, antitrust, commerce, commission, department, exchange, ftc, fair, federal, icc, identification, interstate, japanese, justice, laws, merger, sec, securities, trade, acquiring, action, alleged, bid, business, buyout, called, companies, company, complaint, controlling, corporation, discuss, dispute, dissolves, document, entity, fixing, friendly, government, identify, identity, illegal, immunity, industry, insider, investigating, investigation, involved, merger, monopolies, monopoly, objections, optional, pending, practices, price, protect, relevant, restraint, restraints, result, retains, retires, review, rigging, routine, stock, suit, takeover, taking, trading, unfair, unfriendly, unlawful

This feature specification is used as input to the Feature Extraction Module (see the upper part of Figure 1), along with the original training data and the relevance judgements associated with the topic.
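The procedure just described is simple enough to sketch in a few lines. The code below is an illustration of the same steps rather than the code used for the official runs; the regular expressions and the small stop-word list are stand-ins (the actual runs used the 421-word Fox stop list [4]).

# Illustrative sketch of the feature-extraction procedure described above.
# The field tags, label strings and stop-word list are stand-ins.
import re

STOP_WORDS = {"a", "an", "and", "are", "as", "be", "by", "for", "from", "in",
              "is", "it", "not", "of", "or", "that", "the", "this", "to",
              "will", "with"}

FIELDS = ("desc", "narr", "con", "def")
LABELS = ("Description:", "Narrative:", "Concept(s):", "Definition(s):")

def topic_features(topic_text):
    # Extract the <desc>, <narr>, <con> and <def> fields and concatenate them.
    parts = []
    for field in FIELDS:
        m = re.search(r"<%s>(.*?)(?=<|$)" % field, topic_text, re.S | re.I)
        if m:
            parts.append(m.group(1))
    text = " ".join(parts)

    # Remove the field labels, map to lower case and split into words.
    for label in LABELS:
        text = text.replace(label, " ")
    words = re.findall(r"[a-z]+", text.lower())

    # Remove duplicates and stop words; the survivors are the feature set.
    return sorted(set(words) - STOP_WORDS)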
A key characteristic of the training data provided for TREC is that there were very few relevance judgments for each topic. Thus we were only able to generate a limited number of training vectors to be used by the CART algorithm. For each document in the training set for which there is a relevance judgement, we determine the number of occurrences of each feature (using only the <text> fields). Thus for each topic we get a set of training vectors that looks like those shown below (augmented here, for presentation purposes, with the topic and document ids):

001 Q0 WSJ870320-0188 (0 (0,1,0,5,0,0,3,...,6))
001 Q0 WSJ870320-0069 (0 (2,0,9,1,0,4,1,...,0))
001 Q0 WSJ870424-0084 (1 (1,3,0,5,2,0,0,...,2))

In each vector the first element indicates whether the document is relevant to the topic (0 means no, 1 means yes) and the remaining elements are the counts of the occurrences of the features in the document. Thus document WSJ870320-0188 is not relevant to Topic 1 and contains zero occurrences of the first feature, one occurrence of the second feature, and so on.

This training data is input to the Tree Growing Module (see Figure 1) along with the cost function and class priors. (In our "official" TREC experiments we set the priors to be P(rel)=0.01 and P(non-rel)=0.99, and the ratio of the cost of misclassifying a relevant document to the cost of misclassifying a non-relevant document to be 100:1. That is, we assumed that there are relatively few relevant documents in the collection and that we want to emphasize recall significantly over precision.) CART then grows the largest tree, prunes this into a set of nested sub-trees and uses cross-validation to suggest the optimal tree. For the example above, the optimal tree is:

class 0 (0.000)
company<=0.50
class 1 (0.981)
federal<=9.00
class 0 (0.000)
takeover<=1.50
class 0 (0.000)
securities<=1.50
class 0 (0.000)
company<=4.50
class 0 (0.000)
fair<=0.50
class 0 (0.000)

This is the binary classification tree against which new documents can be tested. The tree is read from left to right: branches upwards correspond to "yes" answers to the test at a node and branches downwards to "no" answers (so in this listing a node's "yes" sub-tree is printed above it and its "no" sub-tree below it). Class 0 terminals correspond to the "not relevant" decision; Class 1 terminals correspond to the "relevant" decision. The numbers in parentheses are estimates of the probability that the document is relevant (these are generated by CART from re-substitution error estimates and are, therefore, overly optimistic).

Thus in this example, if a document has fewer than 0.5 occurrences of the word company (i.e., the word does not appear) then the document is not relevant. On the other hand, if the document has more than 0.5 occurrences (i.e., the word does appear) then additional features are tested to reach a classification decision. Note that a test often involves multiple occurrences of the feature (e.g., for the word federal). Also note that this tree has only one Class 1 terminal, and that the pattern of features that indicates a relevant document can be written as a logical expression of feature counts. Thus we have:

company>0 & fair & company<5 & securities<2 & takeover<2 & federal<10

as a description of a relevant document.

Other overall system features of note are:

* The exact word is used as the basis of the feature, although in our implementation for TREC the actual test is case insensitive and is a substring match, so the feature fair would match Fair or FAIR or fairs as well as fairground. We do not, though, perform any stemming per se.
* No additional data structures are needed by CART. That is, we did not use any external data (e.g., thesauri or knowledge-bases) to help with tree construction.
* We spent approximately one person-month of effort in developing the system infrastructure needed to handle the TREC data and to produce the official results. We used an "off-the-shelf" version of CART (written in C).
* The data were stored in compressed form and only uncompressed as needed. This was a satisfactory strategy for tree construction, since the training sets involved relatively few documents. However, the time overhead was unacceptable when we came to do the actual test classifications.
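To make the representation concrete, the following short sketch shows one way to produce the count vectors described above; the case-insensitive substring matching follows the implementation note in the list above, and the function names are illustrative rather than part of CART.

# Illustrative sketch of how the training and test vectors are produced.
# Matching is case-insensitive substring matching, as described in the notes
# above (so the feature "fair" also counts "Fair", "FAIR" and "fairground");
# doc_text is the concatenated <text> fields of one article.
def feature_vector(doc_text, features):
    lower = doc_text.lower()
    return [lower.count(f) for f in features]   # occurrence count per feature

def training_vector(doc_text, features, is_relevant):
    # CART input: a known class label (1 = relevant, 0 = not relevant)
    # followed by the feature counts.
    return (1 if is_relevant else 0, feature_vector(doc_text, features))

A new document is then classified by running its count vector down the optimal tree, answering each threshold test in turn.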
4. Official Results and Performance Analysis

ADS submitted two sets of results for the routing queries. In the first set (denoted adsba1) we used the classification trees generated from exactly the information provided by the training data. In the second set (denoted adsba2) we used trees generated from augmented training data. To generate this additional data we randomly selected an additional block of 50 Wall Street Journal articles from the training corpus and then one of us made the relevance judgements with respect to the 25 topics. This data was then added to the original collection of relevance judgements to provide a larger training set from which the second set of trees was grown. As noted above, we used a set of priors that reflects the low density of relevant documents, together with a cost function that encourages recall over precision. We also performed a number of auxiliary tests to help with our interpretation of the official results. These are all described in the following sections.

4.1 The Baseline Experiment

The baseline experiment (adsba1) was designed to explore how well our approach could do with absolutely no manual intervention and with the minimum of training data. So for this experiment we used just those documents in the training set for which there were relevance judgments.

Table 1 shows the performance on the baseline experiment together with the performance from the other Category B systems. We have chosen to show only the number of relevant-retrieved documents at the 200 document cut-off point, since we believe that this gives a more accurate picture of the ability of the system to perform document routing than do the precision and recall numbers (we return to this point in the final section of the paper).

Table 1: Performance on Baseline Experiment (Rel-Ret @ 200)

Topic#   #Rel   adsba1   Max   Median   Min
  1       131        2    67       32     2
  2       172       15    33       21     9
  3       304        3   130       48     3
  4        20        1    18        7     1
  5        55        4    45       29     4
  6        68        4    46       20     4
  7        92        1    46       30     1
  8       133        4    37       16     2
  9       157       13    41       29    13
 10       149       94   109       88    46
 11        61       15    55       26     9
 12        82       14    56       15     3
 13        93        7    93       26     7
 14       156       38    73       52    23
 15       515       29    74       49    23
 16        58        2    44       17     2
 17        69       23    53       23     9
 18        95       38    49       38    14
 19       664       74   147       99    56
 20       274      111   179      121    56
 21        16       12    16       14     0
 22       106       28    79       40     8
 23        30        2    27        7     2
 24       253       37    96       41    29
 25        13        1    12        9     1
Overall performance of the baseline case is mixed, and our analysis does not reveal any obvious correlation between performance and factors such as:

* the number of features extracted from the information need statements (the maximum number was 161, the minimum was 17, with the median being 39),
* the complexity of the topics (some are straightforward, such as Topic 13 "Mitsubishi Heavy Industries Ltd.", whereas others involve complex conditionals, such as Topic 1 "Antitrust Cases Pending"),
* the number of training examples (Table 2 shows the size of the training sets; notice that for adsba1 the set size was approximately 30 for each topic, with approximately 10 relevant instances per topic), and
* the size of the optimal tree (Table 2 also shows the size of the optimal tree generated for each topic; this varied from a tree with only one terminal node to a tree with 10 terminal nodes, with the median being 2 terminal nodes).

In fact, given the paucity of information actually used to generate the classification trees, perhaps the most surprising aspect of these results is how well some of the trees perform. There were many instances in which the adsba1 trees generated the minimum response. However, there were several results around the median score and even one significantly above the median.

As an example of a tree that performed moderately, consider the optimal tree for Topic 22 "Counternarcotics." This had only one decision node, a split on the word coca (recall that our feature extraction algorithm matches coca against any word containing coca as a substring, so coca, cocaine and Coca-Cola would all match). The actual tree is:

class 0 (0.050)
coca<=0.50
class 1 (0.862)

That is, the classification is based on the presence or absence of the word coca. If it is present, then the document is marked as "relevant" and the estimate of the probability of it being relevant is 0.862; if it is not present, then the document is marked as "non-relevant" but in fact still has a small probability (0.050) of being relevant. As we see from the table, this tree identifies 28 out of a possible 106 relevant documents by the 200 document cut-off point.

A tree that performed poorly is the one for Topic 9 "Candidate Sightings." Here the optimal tree had three decision nodes:

class 0 (0.000)
camp<=0.50
class 0 (0.000)
ran<=1.50
class 1 (0.948)
sen<=9.00
class 0 (0.000)

Thus a relevant document is one which contains the word camp, contains the word sen (an abbreviation for Senator) nine times or less, and has at least two occurrences of the word ran. This tree identified only 13 of 157 relevant documents and was the worst performing of the Category B systems.

On the other hand, the tree for Topic 10 "AIDS Treatments" performed very well. It also had only one decision node, but correctly identified 94 of 149 relevant documents:

class 0 (0.126)
azt<=0.50
class 1 (1.000)

Thus relevant documents are those that contain the word azt (the name of a drug for treating AIDS patients).

4.2 The Effect of Additional Training Data

Our second set of official results was designed to investigate the sensitivity of system performance to the size of the training sets.
To provide additional training samples we randomly selected a block of 50 Wall Street Journal articles (the block starting with WSJ870311-0102 and ending with WSJ870324-0001) for which we generated relevance judgements for the first 25 topics. In practice this gave us some additional relevant articles, but mostly contributed to the non-relevant examples. Table 2 shows the effect of adding the additional documents; notice that some articles in the new set were already included in the original training data.

Table 2: Size of Training Sets and Optimal Trees

          Size of Optimal Tree        Size of Training Set
                                      adsba1          adsba2
Topic#    adsba1    adsba2        Total    Rel    Total    Rel
  1          7         7            34       7      83      7 (a)
  2          8        14            31       7      81      8
  3          4         5            30       9      80      9
  4          3         5            29      12      79     12
  5          3         3            30      18      79     18 (a)
  6          3         2            28      15      78     15
  7          2         3            28      10      78     10
  8          1        19            32       7      82      8
  9          4         4            23       4      73      4
 10          2         4            20      12      69     12 (b)
 11          2         2            21      12      71     12
 12          8        11            35       8      85      8
 13          2         2            12       8      62      8
 14         10        13            30       5      80      6
 15          3        13            27       7      77     10
 16          2         3            36       3      86      3
 17          2         2            30      12      80     12
 18          2         3            18      14      68     14
 19          2         3            30      12      80     15
 20          2         3            30      14      80     14
 21          2         2            33      12      82     12 (b)
 22          2         2            28      10      78     10
 23          2         2            29      10      79     10
 24          2         2            48      10      98     10
 25          2         2            39       6      89      6

(a) Some relevant documents in the augmented training set were already in the original training set.
(b) Some non-relevant documents in the augmented training set were already in the original training set.

The new training data did not have a significant impact on the size of the optimal tree, except in those cases where there were additional relevant documents, that is, for Topics 6, 8, 14, 15 and 19. The changes here were quite dramatic. For example, in the case of Topic 8 "Economic Projections" the addition of just one more relevant article changed the optimal tree from one with only one terminal node to one with 19! This suggests, of course, that for this topic the training data do not provide a very representative sample of texts.

Of more interest, however, is whether these additional training data had any effect on the overall performance of the system. Table 3 shows the official results for adsba2. The results for adsba1 are retained for comparison, as are the results for the other Category B systems.

Table 3: Performance with Additional Training Data (Rel-Ret @ 200)

Topic#   #Rel   adsba1   adsba2   Max   Median   Min
  1       131        2       25    67       32     2
  2       172       15        9    33       21     9
  3       304        3       15   130       48     3
  4        20        1        1    18        7     1
  5        55        4       29    45       29     4
  6        68        4       10    46       20     4
  7        92        1       22    46       30     1
  8       133        4        2    37       16     2
  9       157       13       25    41       29    13
 10       149       94       69   109       88    46
 11        61       15        9    55       26     9
 12        82       14        3    56       15     3
 13        93        7       25    93       26     7
 14       156       38       23    73       52    23
 15       515       29       49    74       49    23
 16        58        2       17    44       17     2
 17        69       23       53    53       23     9
 18        95       38       14    49       38    14
 19       664       74       56   147       99    56
 20       274      111       63   179      121    56
 21        16       12        0    16       14     0
 22       106       28        8    79       40     8
 23        30        2        2    27        7     2
 24       253       37       29    96       41    29
 25        13        1        1    12        9     1

As for adsba1, the results are mixed. For 10 topics performance improved; for 13 topics performance got worse; and for 2 topics performance was unchanged. We also do not see any strong correlation between change in performance and change in the number of positive instances in the training set, or change in the optimal tree size.
There is some indication that while the extra positive instances tended to produce significant changes in optimal tree size, these new trees also tend to have poorer performance, suggesting, as we should expect, that the tree construction process is highly sensitive to local changes in these small training sets. Nevertheless there are some interesting results.

For example, the tree for Topic 22 "Counternarcotics" becomes:

class 0 (0.000)
drug<=0.50
class 1 (0.905)

Thus although the tree size is unchanged, the test is now on the word drug instead of coca. As we might expect, this turns out to be a much less generally useful test and the tree only identifies 8 of the relevant articles, actually the minimum retrieved by a Category B system.

The tree for Topic 15 "CEO" is one which shows marked change from the adsba1 version, growing from two decision nodes to twelve, and retrieving 49 relevant documents instead of 29. The tree is:

class 0 (0.000)
chief<=0.50
class 0 (0.000)
executive<=0.50
class 1 (1.000)
company<=0.50
class 0 (0.000)
executive<=1.50
class 0 (0.000)
name<=0.50
class 1 (1.000)
company<=1.50
class 0 (0.000)
resign<=0.50
class 0 (0.000)
chief<=1.50
class 1 (1.000)
name<=3.50
class 0 (0.000)
executive<=9.00
class 0 (0.000)
appoint<=0.50
class 0 (0.000)
ceo<=0.50
class 0 (0.000)

This is a much more complex structure than the other trees illustrated so far. Note that there are three terminal nodes that lead to a document being classified as relevant. Although they are in the same sub-tree defined by the expression:

chief>0 & ceo & ~appoint & executive>0 & executive<10 & name<4

they make minor distinctions based on the words company, name, resign and chief. Thus we have three further tests:

executive<2 & company
executive>1 & company>1 & resign>0 & chief>1
executive>1 & company<2 & name>0

to separate relevant from non-relevant documents.

Finally, the tree for Topic 17 "Measures to Control Agrochemicals" does the best of all the Category B systems, detecting 53 of the 69 relevant documents. The tree has one decision node:

class 0 (0.042)
pesticide<=0.50
class 1 (0.920)

That is, a test for the presence of the word pesticide, a surprisingly simple structure given its performance.

4.3 Sensitivity to the Choice of Optimal Tree

Since the choice of optimal tree from the sequence of nested sub-trees depends on the cross-validation error estimates, and since the number of training samples is rather small, we might expect that the tree selection process is sometimes in error. To explore the sensitivity of the system's performance to errors in selecting the optimal tree we ran an auxiliary experiment for Topic 22 "Counternarcotics". As in the official experiments, we used the adsba1 dataset to generate a sequence of sub-trees, but instead of selecting just one (i.e., T*), we saved all the trees. We then used each tree to classify the test data and tabulated the results.
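In effect, this experiment sweeps the whole pruned sequence rather than trusting the cross-validation choice. A sketch of such a sweep is given below, again with scikit-learn standing in for CART; X_train, y_train, X_test and the per-document test relevance judgements qrels are assumed inputs.

# Illustrative sketch of the auxiliary experiment: every tree in the pruned
# sequence (not just T*) is applied to the test collection and scored at the
# 200-document cut-off.
from sklearn.tree import DecisionTreeClassifier

def sweep_pruned_trees(X_train, y_train, X_test, qrels, cutoff=200):
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
        X_train, y_train)
    results = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0)
        tree.fit(X_train, y_train)
        # Rank test documents by the estimated probability of relevance and
        # count how many of the top `cutoff` are actually relevant.
        scores = tree.predict_proba(X_test)[:, 1]
        ranked = sorted(range(len(scores)), key=lambda i: -scores[i])[:cutoff]
        rel_ret = sum(qrels[i] for i in ranked)
        results.append((tree.get_n_leaves(), alpha, rel_ret))
    return results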
The results are shown in Table 4.

Table 4: Performance as a Function of Tree Size

Tree No.   Tree Size   CV Error   Rel-Ret @ 200   Recall @ 200   Precision @ 200
  1            6        0.3932         12             0.1132          0.0600
  2            4        0.5004          1             0.0094          0.0050
  3            3        0.3931         24             0.2264          0.1200
  4 (T*)       2        0.4645         28             0.2642          0.1400
  5            1        0.7866          -                  -               -

The columns of the table record the tree identifier (Tree No.), the size of the tree in terms of the number of terminal nodes (Tree Size), the cross-validation error estimate (CV Error), the number of relevant documents retrieved (Rel-Ret @ 200), and the recall and precision at the cut-off (Recall @ 200 and Precision @ 200).

For this topic, the maximum tree (T1) has six terminal nodes and an estimated error rate of 39%. The minimum tree (T5) has only one terminal node, which in this case classifies all documents as non-relevant, and an estimated error rate of 79%. The optimal tree (T4) has two terminal nodes but an estimated error rate of 46%. We see that the optimal tree did in fact generate the best result, increasing our confidence that the tree selection process is working as intended despite the small sample sizes.

4.4 The Use of Surrogate Split Information

A basic problem with the use of classification trees in the TREC experimental environment is that although we use the probability of being relevant as a mechanism for ordering the system's output, this is a very weak method for generating a ranking. In fact, since we typically have very few "output bins" for any given tree, for most topics we effectively generated no ordering that would discriminate over the top 200 documents. In this situation we need to look for a mechanism for generating a "finer-grained" ranking.

Our second auxiliary experiment was designed to explore the use of the surrogate split information generated by CART as a way of ordering the output. Surrogate splits are a feature of CART used to deal with the problem of missing data. They define alternative splitting criteria in the case that the primary feature measurement is not available. (Of course, in the document routing problem addressed by TREC we do not have the notion of a missing measurement; a word is either present or not. However, it is easy to imagine document routing scenarios in which this does apply, for example when working with noisy transmissions, so that the surrogate split information would be extremely useful.) In TREC trees these appear as additional tests on the frequency of occurrence of words. So, for example, in adsba1 the optimal tree for Topic 22 "Counternarcotics" is:

class 0 (0.050)
coca<=0.50
class 1 (0.862)

and the alternatives to the split on the word coca are:

cocaine<=0.50 [1.00]
colombia<=0.50 [0.84]
drug<=0.50 [0.78]

where the numbers in brackets indicate how correlated the surrogate split is with the optimal split. Thus in this example a split on coca and a split on cocaine are identical in terms of their ability to classify the documents. We also show the next two most highly correlated splits.

To use this information for output ranking we took the output from the adsba1 tree for Topic 22 (i.e., the one shown above) and then, for each document classified as relevant by this tree, we gave it a weighted score based on the number of surrogate split tests it also satisfied. Thus if a document contained only the word coca it received a score of 1.00. If a document contained coca and colombia then it received a score of 1.84. In general, a document received a score that was the sum of the correlation coefficients for those tests that were satisfied. (As yet there is no theoretical justification for this algorithm. It does, however, have the intuitive property that documents that satisfy additional splits get a higher score in proportion to the "power" of those splits.)
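A sketch of this weighting scheme is given below. The surrogate tests and their association measures are assumed to be read from CART's output rather than recomputed here, and the feature indices such as i_cocaine are hypothetical placeholders.

# Illustrative sketch of the surrogate-split re-ranking described above.
# counts[d] is the feature-count vector for document d, relevant(d) is the
# optimal tree's 0/1 decision, and surrogates is a list of
# (feature_index, threshold, weight) tuples taken from CART's output, e.g.
# for Topic 22: [(i_cocaine, 0.5, 1.00), (i_colombia, 0.5, 0.84), (i_drug, 0.5, 0.78)].
def surrogate_rank(doc_ids, counts, relevant, surrogates):
    scored = []
    for d in doc_ids:
        if not relevant(d):
            continue                 # only re-rank documents that T* marks relevant
        score = sum(w for f, thr, w in surrogates if counts[d][f] > thr)
        scored.append((score, d))
    # Highest cumulative surrogate weight first.
    return [d for score, d in sorted(scored, key=lambda x: -x[0])]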
The effect of using this information is shown in Table 5, which gives the impact on system performance for Topic 22 of adsba1.

Table 5: Performance of T* as a Function of Output Ordering

Tree/Output Ordering                       Rel-Ret @ 200   Recall @ 200   Precision @ 200
T*: no additional ordering                       28            0.2642          0.1400
T*: ordering based on surrogate splits           43            0.4057          0.2150

Without any ordering we detected 28 relevant documents in the first 200, but with the ordering scheme just described we were able to improve this to 43 in the first 200. This in turn translated into an increase of 14 points of recall and 7 points of precision at the 200 document point.

The surrogate splits also give us some insights into the overall behavior of the tree selection process as the number of training samples changes. Thus although the optimal tree for Topic 22 in adsba2 was:

class 0 (0.050)
drug<=0.50
class 1 (0.862)

with worse performance than the optimal tree for adsba1, when we look at the top three surrogate splits we see that they are:

cocaine<=0.50 [0.89]
coca<=0.50 [0.88]
government<=0.50 [0.86]

That is, although the optimal split for the augmented training set changed, we still see the importance of the same set of word features, which in turn indicates a certain stability in the underlying feature space. We might predict that as the training set increases in size the splits would also become more stable.

4.5 Commentary

The TREC corpus represents a significant challenge for our system. Our previous results with a small corpus, while encouraging, did not allow us to evaluate how well the technique might do with realistically sized document collections. Our conclusion based on the results we have from TREC is that CART does exhibit some interesting behaviors on a realistic corpus and that, despite the small size of the training sets and the restricted choice of features, for some topics it produces competitive results. So although the overall performance is moderate (relative to the better performing systems at TREC), we believe that the absolute performance (given that the system is totally automatic) is at least encouraging and definitely acceptable in several instances.

Some specific observations on the performance of the current implementation of the CART algorithm are:

* Relying on the re-substitution estimates for the terminal nodes is a very weak method for producing an output ranking. The estimates themselves are not very good and, when combined with optimal trees that emphasize recall over precision, give a largely undifferentiated output. As we noted above, a scheme that makes use of surrogate split information to generate a post hoc ranking shows much promise as a technique for improving our scores in the TREC context.
* While our approach is totally automatic, it is restricted to using as features only those words that appear in the information need statement. This is obviously a limitation, since the use of even simple query expansion techniques (e.g., stemming and/or a synonym dictionary) is likely to provide a richer and more effective set of initial features.
* Using words as features is possibly too "low-level" to ever allow stable, robust classification trees to be produced. At a minimum, we probably need to consider working with concepts rather than individual words.
Not only would this reduce the size of the feature space, but it would probably result in more intuitive trees. The disadvantage of this is that it is not clear where the concepts would come from, other than from a manually constructed knowledge-base of some sort.
* We need to work with much bigger and more representative training sets. Our preliminary experiment in this area shows, not surprisingly, that adding more training examples can lead to dramatic changes in the classification trees.

As a final comment, we would like to suggest that the overall evaluation paradigm used in TREC does not properly assess the performance of systems on the routing task. Although ad hoc retrieval and routing are similar when viewed in terms of the basic technology, systems designed and built to support these two applications have significantly different requirements. In particular, operational routing systems do not usually emphasize output ordering but instead focus on optimizing the trade-off between detection and false alarm rate. In this respect, at least, we believe that recall and fallout are better indicators of routing performance than recall and precision. Furthermore, artificially limiting reported output to the first 200 documents automatically discriminates against those routing systems that actually do attempt to perform the recall/fallout trade-off. A fairer set-up for the routing component of TREC would be to allow systems to report exactly those documents marked as relevant. Comparison of systems would be more complex, since different systems will produce different numbers of documents, but individual scores would give a better picture of routing performance.

5. CART as the Kernel of a Document Routing System

Users today are faced with ever increasing amounts of real-time text that must be sifted for relevant information. This information space is: massive, both in terms of the number of sources and the volume of material available within each source; dynamic, in that sources and their contents are changing constantly, especially within the time horizon of any specific analysis problem; and heterogeneous, in that each source represents information in different ways and independently of the others.

In this environment, users require tools that can perform document filtering and selection and that are easy to learn and use, easily adaptable to changing and ill-defined information needs, portable across sources and analysis domains, and give good precision and recall. With our TREC results in hand, we are in a position to briefly consider the potential role of CART as a central component in such an operational document routing system.

Figure 2 illustrates how such a system might be organized.

[Figure 2: A Document Routing System Concept]

In our scenario, we imagine that documents enter the system in real time and, after preprocessing to extract features, are passed to the detection module, where they are either rejected as being non-relevant to any of the user profiles stored in the system, or are marked as relevant and passed to the router for transmission to users. Note that in our proposed architecture, learning takes place in parallel with the actual operation of the real-time routing system, so that the classification trees can be updated as necessary when new training instances become available or when users' interests change.
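As a purely conceptual sketch of the real-time path in Figure 2, the detector-and-router loop might look like the following; every name here (Profile objects, extract_features, deliver, archive) is a hypothetical placeholder, with each compiled profile assumed to hold an optimal classification tree built off-line by the learning component.

# Conceptual sketch of the real-time path in Figure 2; all names are
# hypothetical placeholders rather than parts of an existing system.
def route(document, profiles, extract_features, deliver, archive):
    vector = extract_features(document)          # Preprocessor
    matched = False
    for profile in profiles:                     # Detector: one compiled tree per profile
        if profile.tree.classify(vector) == 1:   # document judged relevant to this profile
            matched = True
            for user in profile.users:           # Router
                deliver(user, document)
    if not matched:                              # relevant to no stored profile
        archive(document)                        # reject (or archive) the document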
While we would make no claim that CART alone is sufficient to guarantee high-performance detection and routing, we do believe that its ability to work automatically with any size of data set and with any set of specified features means that it can be a very cost-effective component of such a system. Indeed, we believe that it is probably best used as an initial filter to screen out non-relevant documents, and that CART's output might then be fed to a more language-oriented algorithm to decrease the false alarm rate.

To summarize, we consider CART to be an important tool in our arsenal of effective and efficient document detection and routing technologies. While the results from TREC are preliminary, we believe that they do demonstrate that CART has a number of advantages over other approaches, namely:

* classification trees are constructed automatically from specifications of features and a set of examples,
* the learning algorithm generates an "optimal" classifier together with useful auxiliary data and statistics such as the misclassification probability,
* prior class probabilities can be used if known,
* specification of the misclassification cost function provides for direct control of the fallout and recall of the classifier, and
* the classification trees are easily understood and interpreted by end users.

In addition, the CART algorithm is completely language independent, in the sense that it makes no assumptions about the inherent features of the source language of the document; all that it requires are "features" and training examples. Further, the features themselves can be extracted from document externals as well as document internals. In TREC-2 we will explore some of the extensions discussed in Section 4 to show how CART can indeed be integrated into a system to help users who are faced with the task of searching through the "information wilderness."

References

[1] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove, CA, 1984.
[2] S. L. Crawford. Extensions to the CART Algorithm. International Journal of Man-Machine Studies, 31:197-217, 1989.
[3] S. L. Crawford, R. M. Fung, L. A. Appelbaum, and R. M. Tong. Classification Trees for Information Retrieval. Proceedings of the Eighth International Workshop on Machine Learning (ML91). Morgan Kaufmann, San Mateo, CA, 1991.
[4] C. Fox. A Stop List for General Text. SIGIR Forum, 41:19-35, Fall/Winter 1989/90.