Machine Learning for Knowledge-Based Document Routing
(A Report on the TREC-2 Experiment)

Richard M. Tong, Lee A. Appelbaum
Advanced Decision Systems (a division of Booz-Allen & Hamilton, Inc.)
1500 Plymouth Street, Mountain View, CA 94043

1 Introduction

This paper describes the experiments performed by Advanced Decision Systems as part of the Second Text Retrieval Conference (TREC-2).1 The overall system we have developed for TREC-2 demonstrates how we can combine statistically-oriented machine learning techniques with a commercially available knowledge-based information retrieval system. As in TREC-1, the tool we use for the fully automatic construction of routing queries is based on the Classification and Regression Trees (CART) algorithm.2 However, in a departure from TREC-1, we have expanded our definition of what constitutes a document "feature" within the CART algorithm, and have also explored how the CART output can be used as the basis of topic definitions that can be interpreted by the TOPIC® retrieval system developed by Verity, Inc. of Mountain View, CA.

Section 2 of the paper reviews the CART algorithm itself and the data structures it produces. Section 3 describes the two basic algorithms we have devised for converting CART output into TOPIC-readable files. Section 4 contains a description of our experimental procedure and an analysis of the official results, as well as data from a series of auxiliary experiments. Section 5 contains some general comments on overall performance. We conclude in Section 6 with a brief discussion of possible future research directions.

1. Requests for further information about the TREC-2 experiments should be directed to the authors at the address above, or electronically to either rtong@ads.com or lee@ads.com.
2. A comprehensive discussion of the CART algorithm can be found in [1]. Details of previous work on the use of CART for information retrieval are presented in [2] and in [3].

2 The CART Algorithm

CART has been shown to be useful when one has access to datasets describing known classes of observations, and wishes to obtain rules for classifying future observations of unknown class, which is exactly the situation in the document routing problem. CART is particularly attractive when the dataset is "messy" (i.e., noisy and with many missing values) and thus unsuitable for more traditional classification techniques. In addition, and particularly important for the document routing application, if it is important to be able to specify both the misclassification costs and the prior probabilities of class membership, then CART has a direct way of incorporating such information into the tree-building process. Finally, CART can generate auxiliary information, such as the expected misclassification rate for the classifier as a whole and for each terminal node in the tree, that is useful for the document routing problem.

2.1 CART Processing Flow

Figure 1 shows how the CART algorithm is used to construct the optimal classification tree based on the training data provided to it. The diagram shows the four sub-processes used to generate the optimal tree, T*. The "raw" training data (i.e., the original texts of the articles), together with the class specifications (i.e., the training data relevance judgments) and the feature specifications (i.e., the words defined to be features), are input to the Feature Extraction Module.
The output is a set of vectors that record the class membership and the features contained in the training data. These vectors, together with class priors and the cost function (both optional), are input to the Tree Growing Module, which then constructs the maximal tree Tmax that characterizes the training data. Since this tree overfits the data, the next step is to construct a series of nested sub-trees by pruning Tmax to the root. The sequence of sub-trees (Tmax > T1 > ... > Tn) is input to the Tree Selection Module, which performs a cross-validation analysis on the sequence and selects the tree with the lowest cross-validation error.3 This is T*.

[Figure 1: CART Processing Flow. Raw training data, class specifications and feature specifications feed the Feature Extraction Module, producing the CART training vectors; these, with the class priors and cost function, feed the Tree Growing Module (largest tree Tmax); pruning produces the nested sub-trees Tmax > T1 > T2 > ... > Tn; and the Tree Selection Module outputs the optimal tree T*.]

In our TREC-2 system, we take T* and convert it into a TOPIC outline file as described in Section 3. That is, rather than use CART itself to perform the routing on the unseen documents (as we did in TREC-1), we use the CART trees as skeletons for TOPIC concepts, and then have TOPIC do the routing. An advantage of this is that we can make use of TOPIC's extensively optimized text database capabilities, thus allowing us to easily generate the output files needed for the official scoring program.

2.2 Data Structures Generated by CART

To illustrate the processing that CART performs, we will use an example taken from the TREC-2 corpus. An example query is as follows:

Tipster Topic Description
Number: 097
Domain: Science and Technology
Topic: Fiber Optics Applications
<desc> Description: Document must identify instances of fiber optics technology actually in use.
<narr> Narrative: To be relevant, a document must describe actual operational situations in which fiber optics are being employed, or will be employed. A document describing future fiber optics use will be relevant only if contracts have been signed concerning the future application.
<con> Concept(s):
1. fiber optic, light
2. telephone, LAN, television
<fac> Factor(s)
<def> Definition(s):
1. Fiber optics refers to technology in which information is passed via laser light transmitted through glass or plastic fibers.
</top>

This is a very comprehensive statement of information need and provides a rich set of features that we can use for CART.4 Our basic procedure is to extract from the information need statement all the unique content words and then stem them, which gives the following list:5

ACT APPL CONCERN CONTRACT DESCRIB DOCU EMPLOY FIB FUT GLASS INFORM LAN LAS LIGHT OPER OPT PAS PLAST REF RELEV SIGN SITU TECHNOLOG TELEPHON TELEV TRANSMIT VIA

3. The algorithm actually minimizes with respect to both the cross-validation error and the tree complexity, so that if two trees have statistically indistinguishable error rates, the smaller of the two is selected as optimal.
4. In general, determining what set of features to use for CART is a matter of experience and judgement. For the TREC-2 corpus we have made use of the information need statements, but other approaches, such as using all the unique words in the training set, are equally valid. In fact, it is this freedom of choice of features that gives CART a great deal of its flexibility.
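As a purely illustrative sketch of this processing flow (not the system used in our experiments), the fragment below uses scikit-learn's CART-style DecisionTreeClassifier as a stand-in for CART. The toy stop list, the crude suffix-stripping stemmer, and the five-fold cross-validation are placeholder assumptions; they are not the stop list of [4], the stemmer of [5], or the original tool.

    # Sketch of the Section 2.1 flow: feature extraction -> tree growing ->
    # pruning -> tree selection, using scikit-learn as a stand-in for CART.
    import re
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    STOP_WORDS = {"a", "an", "and", "are", "be", "been", "in", "is", "it",
                  "must", "of", "or", "the", "to", "which", "will"}

    def crude_stem(word):
        # Placeholder stemmer: strips a few common suffixes only.
        for suffix in ("ations", "ation", "ing", "ed", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def extract_features(topic_text, stem=True):
        # Unique content words from the information need statement.
        words = set(re.findall(r"[a-z]+", topic_text.lower())) - STOP_WORDS
        return sorted({crude_stem(w) for w in words} if stem else words)

    def to_vector(document_text, features):
        # Binary presence/absence vector, one position per feature.
        tokens = {crude_stem(w) for w in re.findall(r"[a-z]+", document_text.lower())}
        return [1 if f in tokens else 0 for f in features]

    def grow_and_select(docs, relevance, features):
        X = np.array([to_vector(d, features) for d in docs])
        y = np.array(relevance)          # 1 = relevant, 0 = non-relevant
        # Nested sub-trees via cost-complexity pruning (Tmax down to the root).
        path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
        best_tree, best_score = None, -1.0
        for alpha in path.ccp_alphas:
            tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
            score = cross_val_score(tree, X, y, cv=5).mean()   # cross-validation analysis
            if score >= best_score:      # ties go to the smaller (more pruned) tree
                best_tree, best_score = tree.fit(X, y), score
        return best_tree                 # plays the role of T* in Figure 1

The selected tree plays the role of T* in Figure 1; the weighting of class priors and misclassification costs supported by CART is omitted from this sketch.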
Given this list of features we can generate the CART training vectors by testing all the documents in the training set for the presence or absence of each feature.6 The resulting vectors, together with the ground truth classification, are the input to CART. The output from CART contains information about the sequence of nested decision trees and a specification of the optimal tree.

For our example, the maximal tree is shown in Figure 2. This tree is to be read from left to right. Thus the first test is on the presence of the stem CONTRACT. If the stem is present (i.e., if the test CONTRACT<=0.5 fails) then the tree branches downwards (this is leftward in CART jargon); if the stem is not present (i.e., the test succeeds) then the tree branches upwards (rightward).

class 0 (k=2,Rt=0.230318)
CONCERN<=0.50 (k=2,ac=0,Rt=0.251256)
class 1 (k=2,Rt=0.019586)
TRANSMIT<=0.50 (k=2,ac=0,Rt=0.020938)
class 0 (k=2,Rt=0.000000)
GLASS<=0.50 (k=5,ac=0,Rt=0.261725)
class 1 (k=2,Rt=0.007834)
TRANSMIT<=0.50 (k=2,ac=0,Rt=0.010469)
class 0 (k=2,Rt=0.000000)
CONCERN<=0.50 (k=2,ac=0,Rt=0.010469)
class 0 (k=2,Rt=0.000000)
VIA<=0.50 (k=5,ac=0,Rt=0.303601)
class 1 (k=4,Rt=0.003917)
TRANSMIT<=0.50 (k=5,ac=1,Rt=0.007834)
class 0 (k=4,Rt=0.000000)
SIGN<=0.50 (k=5,ac=0,Rt=0.345477)
class 1 (k=5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=3,Rt=0.000000)
GLASS<=0.50 (k=5,ac=0,Rt=0.031407)
class 0 (k=3,Rt=0.010469)
TRANSMIT<=0.50 (k=3,ac=0,Rt=0.031407)
class 0 (k=3,Rt=0.010469)
VIA<=0.50 (k=3,ac=1,Rt=0.019586)
class 1 (k=3,Rt=0.003917)
CONTRACT<=0.50 (k=7,ac=1,Rt=0.497487)
class 1 (k=4,Rt=0.019586)
SIGN<=0.50 (k=4,ac=1,Rt=0.023503)
class 0 (k=4,Rt=0.000000)
LIGHT<=0.50 (k=6,ac=1,Rt=0.027421)
class 0 (k=4,Rt=0.000000)

Figure 2: Maximal Tree

Associated with each decision node is information that tells:
* what level in the tree the test occurs (the k value): in all cases k ranges from k=1, which corresponds to the maximal tree, to k=kmax, which corresponds to the null tree, where kmax depends on the actual tree grown by CART;
* the class to be assigned to this node if the tree were pruned here (the ac value): in the present case ac=1 for the class that corresponds to a relevant document and ac=0 for the class that corresponds to a non-relevant document; and
* the error rate of the node (the Rt value): this indicates the resubstitution error rate for the specific node.

The terminal nodes in the tree have similar information associated with them. Notice, though, that terminal nodes by definition specify to which class the node corresponds.

CART also generates the following table of error estimates that are used in selecting the optimal tree. Here k is the k-value, |T| is the size of the tree (measured in terms of the number of terminals), R(T) is the resubstitution rate for the overall tree, and Rcv(T) is the cross-validation rate for the overall tree. Recall that CART minimizes with respect to both size and cross-validation rate, so for our example the optimal tree is the one at k=4.

5. We only use the <narr>, <con> and <def> fields, and the stop word list is taken from [4]. The stemming algorithm is taken from [5].
6. In our TREC-1 experiments we also made use of the frequency of occurrence of the features in the documents. Since our ultimate goal here is to generate TOPIC trees we restrict ourselves to just testing for the presence or absence of the feature; this allows us to perform a straightforward conversion between CART and TOPIC.
 k   |T|   R(T)     Rcv(T)
 1   16    0.3100   0.3037
 2   11    0.3140   0.2866
 3    8    0.3206   0.2353
*4    5    0.3323   0.2296
 5    2    0.4043   0.3891
 6    1    0.4975   0.6285

So, pruning the maximal tree so that all remaining nodes have k>4 gives us the optimal tree shown in Figure 3.

class 0 (k=5,Rt=0.261725)
VIA<=0.50 (k=5,ac=0,Rt=0.303601)
class 1 (k=5,Rt=0.007834)
SIGN<=0.50 (k=5,ac=0,Rt=0.345477)
class 1 (k=5,Rt=0.003917)
LIGHT<=0.50 (k=6,ac=0,Rt=0.376884)
class 0 (k=5,Rt=0.031407)
CONTRACT<=0.50 (k=7,ac=1,Rt=0.497487)
class 1 (k=6,Rt=0.027421)

Figure 3: Optimal Tree Using Stemmed Features

Notice that by pruning away the deeper nodes in the tree we are left with a tree that tests on just five of the original 26 stems, and has three paths that lead to class-1 nodes (i.e., a decision that the document is relevant).

The unstemmed version of our procedure is identical except that we do no stemming of the unique words extracted from the information need statement. For the example, this produces the following list of features:

ACTUAL APPLICATION CONCERNING CONTRACTS DESCRIBE DESCRIBING DOCUMENT EMPLOYED FIBER FIBERS FUTURE GLASS INFORMATION LAN LASER LIGHT OPERATIONAL OPTIC OPTICS PASSED PLASTIC REFERS RELEVANT SIGNED SITUATIONS TECHNOLOGY TELEPHONE TELEVISION TRANSMITTED VIA

and the following error table:

 k   |T|   R(T)     Rcv(T)
 1   36    0.0932   0.3168
 2   33    0.0972   0.2763
 3   20    0.1313   0.2935
 4   10    0.1761   0.2758
 5    7    0.1971   0.2815
*6    6    0.2050   0.2929
 7    5    0.2154   0.3101
 8    4    0.2273   0.3101
 9    2    0.2681   0.3328
10    1    0.4975   0.6264

so that the optimal tree is as shown in Figure 4.

class 0 (k=10,Rt=0.062814)
TELEPHONE<=0.50 (k=11,ac=1,Rt=0.497487)
class 1 (k=8,Rt=0.142139)
TRANSMITTED<=0.50 (k=9,ac=1,Rt=0.153984)
class 0 (k=8,Rt=0.000000)
LIGHT<=0.50 (k=10,ac=1,Rt=0.205312)
class 0 (k=7,Rt=0.000000)
TRANSMITTED<=0.50 (k=9,ac=0,Rt=0.010469)
class 1 (k=7,Rt=0.000000)
ACTUAL<=0.50 (k=9,ac=0,Rt=0.031407)
class 1 (k=9,Rt=0.000000)

Figure 4: Optimal Tree Using Unstemmed Features

3 Transforming CART Trees into TOPIC Outline Specifications

The trees produced by CART are not directly usable by TOPIC because they represent the information we need (i.e., the decision function and the associated decision variables) in a form that is incompatible with the TOPIC knowledge representation. The main problem, then, is to define a transformation from the CART representation to the TOPIC representation that at least preserves the decision information and perhaps augments it so that we get improved routing performance.

In this section we explore two possible strategies for constructing these transformations. The first is a strict re-coding of the information in the CART tree. The second generalizes the intent of the CART tree but adds no new information. Both of these techniques are "automatic," in the sense that, once various parameters have been chosen, the algorithms work without human intervention.

3.1 First Canonical Form

The first canonical form completely preserves the decision function generated by CART and involves a simple mapping of the CART tree into TOPIC outline file format.7

To do this, we begin by observing that each path in the CART tree from the root to a class-1 leaf node constitutes a conjunction of tests of decision variables. Since we have constrained the tests to be of the form "Is word X present or not?" we can easily model this conjunction in TOPIC using AND and NOT operators. Since there will generally be multiple paths to class-1 leaf nodes, the algorithm combines the separate paths in the TOPIC definition using the OR operator.
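As a rough sketch of this path-to-conjunction mapping (again using a fitted scikit-learn tree as a stand-in for the CART output, and with schematic weights rather than the 1-Rt and 1-Rcv values described below), the conversion can be outlined as follows; the outline syntax shown is only an approximation of the TOPIC format.

    # Sketch of the first canonical form: collect every root-to-leaf path that
    # ends in class 1 and render the paths as a TOPIC-style outline.
    def class1_paths(tree, feature_names):
        t = tree.tree_
        paths, stack = [], [(0, [])]                 # (node id, tests so far)
        while stack:
            node, tests = stack.pop()
            if t.children_left[node] == -1:          # leaf node
                if t.value[node][0].argmax() == 1:   # class-1 (relevant) leaf
                    paths.append(tests)
                continue
            word = feature_names[t.feature[node]]
            # left branch: "word <= 0.5" succeeds, i.e. the word is absent
            stack.append((t.children_left[node], tests + [(word, False)]))
            # right branch: the test fails, i.e. the word is present
            stack.append((t.children_right[node], tests + [(word, True)]))
        return paths

    def to_topic_outline(topic_id, paths, topic_weight=0.77, path_weight=0.99):
        # topic_weight and path_weight are placeholders; the real weights are
        # derived from 1 - Rcv (overall) and 1 - Rt (per path) respectively.
        lines = [f"Topic_{topic_id} <Or>",
                 f"* {topic_weight:.2f} TopicStyle_{topic_id}_1 <Or>"]
        for i, tests in enumerate(paths, 1):
            terms = " ".join(f"'{w}'" if present else f"<Not> '{w}'"
                             for w, present in tests)
            lines.append(f"** {path_weight:.2f} TopicPath_{topic_id}_1-{i} <And> {terms}")
        return "\n".join(lines)

Applied to the tree returned by grow_and_select in the earlier sketch, together with the corresponding feature list, this yields one <And> line per class-1 path, combined under a single <Or>.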
To illustrate, the optimal tree for our example contains the following conjunctive descriptions of class-1 leaf nodes:

1. CONTRACT
2. SIGN and not LIGHT and not CONTRACT
3. VIA and not SIGN and not LIGHT and not CONTRACT

which leads directly to a TOPIC outline of the form:

Topic_097 <Or>
* 0.77 TopicStyle_097_1 <Or>
** 0.99 TopicPath_097_1-1 <And> 'CONTRACT' 'LIGHT' 'SIGN' 'VIA'
** 1.00 TopicPath_097_1-2 <And> 'CONTRACT' 'LIGHT' 'SIGN'
** 0.97 TopicPath_097_1-3 <And> 'CONTRACT'

Here we use a notation for topic names that ensures uniqueness. Notice that in this model we use the resubstitution rates (actually 1-Rt) to define a weight for the individual conjuncts and the overall cross-validation rate (actually 1-Rcv) as the weight for the disjuncts. Note also that since TOPIC only uses weights defined to two decimal places we have rounded the weights derived from the CART tree.

7. TOPIC uses a representation for concepts that can be recorded in so-called outline format. A collection of such specifications is used by TOPIC to build the concept trees used for retrieval.

3.2 Second Canonical Form

The second canonical form makes use of the decision variables chosen by CART but not the actual decision function. This model is based on two observations:
* first, that the set of variables used by CART is, when taken as a whole, indicative of the general topic of the information need statement, and
* second, that every variable used in the tree is on the path to at least one class-0 node and at least one class-1 node. Thus, from an information retrieval perspective, all the decision variables have some contribution to make to an assessment of the relevance of a document.

So rather than use the specific decision function constructed by CART, we can replace it with one that can be thought of as "the more features present in a document the better." This is modelled straightforwardly in TOPIC using the ACCRUE operator. To illustrate, the maximal tree for our example contains the following set of decision variables:

CONCERN CONTRACT GLASS LIGHT SIGN TRANSMIT VIA

which leads directly to a TOPIC outline of the form:

Topic_097 <Or>
* 0.70 TopicStyle_097_2 <Accrue>
** 0.25 'CONCERN'
** 0.75 'CONTRACT'
** 0.75 'GLASS'
** 0.75 'LIGHT'
** 0.75 'SIGN'
** 0.75 'TRANSMIT'
** 0.75 'VIA'

The weighting scheme we have used gives higher values to variables (features) that are in the optimal tree, intermediate values to variables on the fringe of the optimal tree (there are none of these in the current example), and lower values to features outside the optimal tree.8 At this point the specific values chosen represent our "best guess" at a weighting scheme; further experimentation will undoubtedly reveal a better strategy. As in the first canonical form, the overall weight for the TOPIC tree is based on the cross-validation rate for the maximal tree.

4 The TREC-2 Experiments

For TREC-2 we again focused only on the document routing problem. Since our technique requires training data it does not easily lend itself to the ad hoc retrieval problem, so rather than "force-fit" it we chose to generate four sets of results for the routing queries (Topics 51-100). Each set of results was generated totally automatically. The result sets are labelled ads1, ads2, ads3, and ads4, and the table below shows to which combinations of features and TOPIC models they correspond.
Table 1: Results Identification

Result Set   Word Features   TOPIC Model
ads1         stemmed         model-1
ads2         unstemmed       model-1
ads3         stemmed         model-2
ads4         unstemmed       model-2

4.1 The Experimental Procedure

The experimental procedure for TREC-2 consists of five basic steps. We briefly describe each of these:

* First, we generated the CART training data from the information need statements and the ground truth files (i.e., the qrels) provided by NIST. This produced two feature sets for each topic (corresponding to the stemmed and unstemmed versions of the features), and two sets of training vectors labelled with the ground truth information.9 Since CART is a statistically-oriented classifier, we decided to minimize the "noise" in the training sets by using only the Wall Street Journal articles identified in the qrel files. Further, for all but Topics 80 and 81, we used just the Wall Street Journal articles on Disk 2.
* Second, we grew the CART trees from this training data. Since we had two sets of training data for each topic, we grew two trees for each topic.
* Third, we used the algorithms described in Section 3 to convert the CART trees into a TOPIC-readable form. This produced four TOPIC definitions for each of the information need statements. Table 1 above shows the various combinations.
* Fourth, we ran the TOPIC definitions against the indexed unseen data.10 Again, to minimize noise effects, we used only the Associated Press articles on Disk 3 to generate our official results.
* Fifth, we sorted and merged the results generated by TOPIC and converted them into the TREC format for scoring by NIST.

8. A variable is in the optimal tree if its k-value is greater than k*; it is on the fringe if k=k*; and it is outside the optimal tree if k<k*. Note that in general the individual features appear at multiple locations in the tree. Our strategy is to remove duplicates by retaining the instance with the highest k-value.
9. The feature specification and extraction procedure we used is identical to that used in TREC-1 and is described in detail in the TREC-1 proceedings. The only differences are the addition of a stemmed version of the features and the fact that we do not make use of the feature count information.
10. We are grateful to Verity Inc. for allowing us to have access to their computer systems and databases.

4.2 Discussion of Official Results

The official results for ads1 and ads2, together with the unofficial results for ads3 and ads4, are shown in Table 2. Although we generated four sets of results, resource constraints at NIST resulted in only ads1 and ads2 being officially scored. References in the remainder of the paper to scores associated with ads3 and ads4 are to the unofficial scores generated by us using the TREC-2 scoring program and the published qrels for the routing topics.

Table 2: TREC-2 Results (AP Only)

Run ID   No. Retr.   No. Rel.   Rel. Ret.   Avg. Prec.   Exact Prec.
ads1     40,423      5,677        822       0.0195       0.0390
ads2     33,034      5,677      1,468       0.0821       0.1092
ads3     49,006      5,677      1,182       0.0168       0.0374
ads4     50,000      5,677      1,847       0.0630       0.0868

The first observation is that the trees built using exact words as features (i.e., results ads2 and ads4) had higher precision than those built using word stems. We believe that the explanation for this is that by using stemmed versions of the features we added a significant amount of "noise" to the sample space. That is, given the relatively small size of the training sets, using stems tends to reduce the discriminating power of any given feature with respect to the training sets.
This manifests itself indirectly in two ways. First, the optimal trees built with stems are generally smaller than those built using exact words; and, second, optimal trees built with stems have higher cross-validation error rates than those built using exact words.

The second observation is that the TOPIC trees built using model-2 (i.e., results ads3 and ads4) had better recall than those built using model-1. The explanation for this is straightforward. Since the model-2 trees essentially use all the features extracted from the information need statement in a generalized disjunction, they provide much broader coverage than the model-1 trees, which often use just one or two of the features.

The third observation is, of course, that these are not strong results, since on average all the models performed in the low end of the scores reported by NIST in the summary routing table. This is somewhat disappointing since our results in TREC-1 led us to believe that we might be able to do significantly better.11 Notwithstanding the fact that we did not explore some of the ideas discussed in the TREC-1 paper (e.g., the use of concepts rather than words as features, and the use of surrogate split information), we are now inclined to the view that the output from tools like CART is best used as the basis for manually constructed routing topics. To begin to explore this idea, we performed a number of auxiliary tests that we report in the following section.

11. We do note, however, that for a number of topics we did rather well in comparison with other systems (i.e., Topics 51 and 75), and that in absolute terms we produced a number of trees that had greater than 30% R-precision (i.e., for Topics 52, 58, 78 and 93).

4.3 Auxiliary Experiments

To explore the idea of using the CART output as the "skeleton" for a manually constructed routing query, we selected two model-2 trees to determine whether a minimal set of "edits" could significantly improve their performance. We selected Topics 52 and 54 since they represented one topic for which the automatically generated tree did well (Topic 52) and one for which the automatically generated tree did poorly (Topic 54). The scores for Topic 52 (for the AP corpus only) are shown below:

Queryid (Num): 52
Total number of documents over all queries
  Retrieved: 1000
  Relevant:   345
  Rel_ret:    328
Interpolated Recall - Precision Averages:
  at 0.00   0.6667
  at 0.10   0.4684
  at 0.20   0.4032
  at 0.30   0.4032
  at 0.40   0.4032
  at 0.50   0.4032
  at 0.60   0.4013
  at 0.70   0.4013
  at 0.80   0.4013
  at 0.90   0.4003
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.3828
Precision:
  At    5 docs: 0.4000
  At   10 docs: 0.3000
  At   15 docs: 0.4667
  At   20 docs: 0.6000
  At   30 docs: 0.6333
  At  100 docs: 0.4100
  At  200 docs: 0.3550
  At  500 docs: 0.3820
  At 1000 docs: 0.3280
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.3739

Here we see that this topic tree does well: recall is excellent and precision is sustained even at high recall levels.
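For reference, a minimal sketch of two of the summary measures quoted in these listings, non-interpolated average precision and R-precision, is given below; this illustrates the definitions only and is not the NIST scoring program used to produce the official figures.

    # Illustrative definitions of two reported measures over a ranked result list.
    def average_precision(ranked_doc_ids, relevant_ids):
        relevant_ids = set(relevant_ids)
        hits, precision_sum = 0, 0.0
        for rank, doc_id in enumerate(ranked_doc_ids, start=1):
            if doc_id in relevant_ids:
                hits += 1
                precision_sum += hits / rank     # precision at each relevant document
        return precision_sum / len(relevant_ids) if relevant_ids else 0.0

    def r_precision(ranked_doc_ids, relevant_ids):
        # Precision after R documents, where R is the number of relevant documents.
        relevant_ids = set(relevant_ids)
        r = len(relevant_ids)
        return sum(1 for d in ranked_doc_ids[:r] if d in relevant_ids) / r if r else 0.0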
In contrast, the topic tree for Topic 54 produces the following results:

Queryid (Num): 54
Total number of documents over all queries
  Retrieved: 1000
  Relevant:    65
  Rel_ret:     64
Interpolated Recall - Precision Averages:
  at 0.00   0.1323
  at 0.10   0.1323
  at 0.20   0.1323
  at 0.30   0.1323
  at 0.40   0.1323
  at 0.50   0.1323
  at 0.60   0.1323
  at 0.70   0.1102
  at 0.80   0.1102
  at 0.90   0.1097
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.0907
Precision:
  At    5 docs: 0.0000
  At   10 docs: 0.0000
  At   15 docs: 0.0000
  At   20 docs: 0.0000
  At   30 docs: 0.0000
  At  100 docs: 0.0500
  At  200 docs: 0.0600
  At  500 docs: 0.1080
  At 1000 docs: 0.0640
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.0308

Thus although recall is very good, precision is completely unsatisfactory. Our conjecture is that the automatically constructed model-2 trees, while generally giving good recall, give poor precision because they contain many extraneous features, or features that should be combined. To illustrate this, we considered the model-2 tree for Topic 52 as a starting point for a manually constructed tree. The initial model-2 tree is:

Topic_52 <Or>
* 0.86 Topic_Style_52_2 <Accrue>
** 0.75 "AFRICA"
** 0.25 "AFRICAN"
** 0.75 "APARTHEID"
** 0.25 "ARMS"
** 0.25 "BAN"
** 0.25 "BLACK"
** 0.25 "COMPANY"
** 0.25 "COMPLIANCE"
** 0.25 "CONTRACTS"
** 0.25 "CORPORATE"
** 0.25 "DISCUSS"
** 0.25 "DOCUMENT"
** 0.50 "DOMINATION"
** 0.25 "GOVERNMENT"
** 0.25 "INTERNATIONAL"
** 0.25 "INVESTMENT"
** 0.25 "ORGANIZATION"
** 0.25 "PRESSURE"
** 0.75 "PRETORIA"
** 0.25 "REDUCTION"
** 0.25 "RESPONSE"
** 0.75 "SANCTIONS"
** 0.75 "SOUTH"
** 0.25 "TIES"
** 0.25 "TRADE"
** 0.25 "UNITED"

Obviously there are a number of features here that are basically "noise" (for example the words "COMPANY" and "RESPONSE"), and other words that are clearly elements of a larger phrase (for example the words "SOUTH" and "AFRICA"). Notice that, in general, words with lower scores are always candidates for elimination.

The result of this pruning exercise was the following revised definition for Topic 52:

Topic_52 <Or>
* 0.86 Topic_Style_52_2 <Accrue>
** 0.50 S_Africa <Accrue>
*** 0.50 "SOUTH AFRICA"
*** 0.50 "PRETORIA"
** 0.50 "SANCTIONS"
** 0.20 Topic_52_Support <Accrue>
*** 0.50 "APARTHEID"
*** 0.50 <Near> "BAN" "TRADE"
*** 0.50 <Near> "BAN" "INVESTMENT"

So although we have added no new features, we have combined "SOUTH" and "AFRICA" and used this together with "PRETORIA" to define a concept called S_Africa. We have also used "APARTHEID", and "BAN" with "TRADE" and "INVESTMENT", to define another concept called Topic_52_Support. Finally, we adjusted the weights to give more prominence to S_Africa than to Topic_52_Support.
The results for this modified topic description are:

Queryid (Num): 52
Total number of documents over all queries
  Retrieved: 1000
  Relevant:   345
  Rel_ret:    312
Interpolated Recall - Precision Averages:
  at 0.00   1.0000
  at 0.10   1.0000
  at 0.20   0.9780
  at 0.30   0.9766
  at 0.40   0.9603
  at 0.50   0.9067
  at 0.60   0.8620
  at 0.70   0.8620
  at 0.80   0.8620
  at 0.90   0.7422
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.8305
Precision:
  At    5 docs: 1.0000
  At   10 docs: 1.0000
  At   15 docs: 1.0000
  At   20 docs: 1.0000
  At   30 docs: 1.0000
  At  100 docs: 0.9700
  At  200 docs: 0.8900
  At  500 docs: 0.6220
  At 1000 docs: 0.3120
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.8493

So although recall decreased slightly (we now retrieve 312 rather than 328 of the 345 relevant documents), precision is improved by nearly 50 percentage points. Obviously, this is a significant improvement, and it was achieved with minimal manual input. The time required to make these changes was only of the order of 10 minutes.

We repeated this exercise with Topic 54, the initial model-2 tree for which is:

Topic_54 <Or>
* 0.93 Topic_Path_54_2 <Accrue>
** 0.75 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 "ATLAS"
** 0.75 "COMMERCIAL"
** 0.75 "CONTRACT"
** 0.75 "DELTA"
** 0.75 "DOCUMENT"
** 0.75 "DOUGLAS"
** 0.75 "DYNAMICS"
** 0.75 "INDUSTRY"
** 0.75 "LAUNCH"
** 0.75 "MARTIN"
** 0.75 "MENTION"
** 0.75 "PAYLOAD"
** 0.75 "PRELIMINARY"
** 0.75 "RELEVANT"
** 0.75 "RESERVATION"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SERVICES"
** 0.75 "SPACE"
** 0.75 "TENTATIVE"
** 0.50 "TITAN"

Using the same kinds of procedures (i.e., removing extraneous words and combining words into phrases) we constructed the following modified tree:

Topic_54 <Or>
* 0.93 Topic_Path_54_2 <Accrue>
** 0.10 "AGREEMENT"
** 0.75 "ARIANE"
** 0.75 "ARIANESPACE"
** 0.75 Atlas_Rocket <Sentence> "ATLAS" "ROCKET"
** 0.75 Commercial_Satellite <Sentence> "COMMERCIAL" "SATELLITE"
** 0.75 "DELTA II"
** 0.75 "MCDONNELL DOUGLAS"
** 0.75 "GENERAL DYNAMICS"
** 0.10 "LAUNCH"
** 0.75 "MARTIN MARIETTA"
** 0.75 "PAYLOAD"
** 0.75 "ROCKET"
** 0.75 "SATELLITE"
** 0.75 "SPACE"
** 0.50 "TITAN"
** 0.75 "CONTRACT"
** 0.75 "LAUNCH SERVICE"

Notice that here we actually added words that were part of obvious proper names (i.e., the "GENERAL" of "GENERAL DYNAMICS", the "MARIETTA" of "MARTIN MARIETTA", and the "MCDONNELL" of "MCDONNELL DOUGLAS"), but otherwise nothing was added. We also adjusted the weights on "AGREEMENT" and "LAUNCH" to de-emphasize their importance. The results of running this modified query are:

Queryid (Num): 54
Total number of documents over all queries
  Retrieved: 1000
  Relevant:    65
  Rel_ret:     65
Interpolated Recall - Precision Averages:
  at 0.00   0.5800
  at 0.10   0.5800
  at 0.20   0.5800
  at 0.30   0.5800
  at 0.40   0.5800
  at 0.50   0.4592
  at 0.60   0.4592
  at 0.70   0.4324
  at 0.80   0.3355
  at 0.90   0.1474
  at 1.00   0.0657
Average precision (non-interpolated) over all rel docs: 0.3889
Precision:
  At    5 docs: 0.2000
  At   10 docs: 0.1000
  At   15 docs: 0.3333
  At   20 docs: 0.4000
  At   30 docs: 0.4667
  At  100 docs: 0.4500
  At  200 docs: 0.2700
  At  500 docs: 0.1260
  At 1000 docs: 0.0650
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.4615

So, again for very little manual input, we achieved a significant improvement in precision performance, and this time at no cost to recall. It is interesting to compare our results with Verity's scores for these two topics.
To do this we re-scored Verity's TOPIC2 results on the AP corpus alone.12 For Topic 52, Verity's results were:

Queryid (Num): 52
Total number of documents over all queries
  Retrieved: 1000
  Relevant:   345
  Rel_ret:    317
Interpolated Recall - Precision Averages:
  at 0.00   1.0000
  at 0.10   0.9833
  at 0.20   0.9342
  at 0.30   0.8607
  at 0.40   0.8314
  at 0.50   0.7425
  at 0.60   0.7125
  at 0.70   0.6704
  at 0.80   0.6161
  at 0.90   0.3952
  at 1.00   0.0000
Average precision (non-interpolated) over all rel docs: 0.7159
Precision:
  At    5 docs: 1.0000
  At   10 docs: 1.0000
  At   15 docs: 1.0000
  At   20 docs: 1.0000
  At   30 docs: 1.0000
  At  100 docs: 0.9000
  At  200 docs: 0.7900
  At  500 docs: 0.5820
  At 1000 docs: 0.3170
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.6812

Here we see better recall (317 of the 345 relevant documents retrieved) but with slightly lower precision. The TOPIC2 tree for this topic is much more complex than the one we developed, which explains the better recall. Notice, however, that both trees gave perfect precision for the first 30 documents.

For Topic 54, Verity's TOPIC2 results were:

Queryid (Num): 54
Total number of documents over all queries
  Retrieved: 1000
  Relevant:    65
  Rel_ret:     65
Interpolated Recall - Precision Averages:
  at 0.00   1.0000
  at 0.10   0.9130
  at 0.20   0.9130
  at 0.30   0.9130
  at 0.40   0.9000
  at 0.50   0.7609
  at 0.60   0.5942
  at 0.70   0.5679
  at 0.80   0.5049
  at 0.90   0.2027
  at 1.00   0.0927
Average precision (non-interpolated) over all rel docs: 0.6838
Precision:
  At    5 docs: 1.0000
  At   10 docs: 0.8000
  At   15 docs: 0.8667
  At   20 docs: 0.9000
  At   30 docs: 0.9000
  At  100 docs: 0.5100
  At  200 docs: 0.2900
  At  500 docs: 0.1280
  At 1000 docs: 0.0650
R-Precision (precision after R (= num_rel for a query) docs retrieved):
  Exact: 0.5846

This shows the same recall performance (i.e., all 65 relevant documents were retrieved) but substantially better precision performance. Through the first 30 documents TOPIC2 gave excellent results, whereas our modified model-2 result was only half as good. Again, however, the TOPIC2 tree is much more complex, and required more effort to develop.13

12. We are grateful to Verity for allowing us to examine their TREC-2 results in detail.
13. We do not have precise figures for the amount of effort needed to build the Verity TOPIC2 trees, but in general each topic required several hours of effort.

Overall, we are impressed by the improved performance we were able to achieve with minimal manual effort. These auxiliary experiments provide at least suggestive evidence of the value of automatic generation of initial trees. The extent to which this is consistently achievable will require further investigation, and we hope to report on it in TREC-3.

5 Commentary

The official results of our TREC-2 experiments demonstrate that automatic construction of routing queries from training documents is indeed feasible. The queries produced are in fact binary classification trees that are optimal with respect to size (measured in terms of the number of terminals in the tree) and the estimated error rate of the tree. Unfortunately, however, these trees generally appear to have poor performance. In a few cases the trees were comparable with the results from other sites, but they mostly seemed unsatisfactory. We have not been able to identify any specific conditions under which we could expect the CART trees to perform well. We believe, nevertheless, that further experiments to assess the utility of a variety of extensions to the basic CART technique would still be of interest.
In particular, the use of "low-level" concepts as features, surrogate split information, larger training sets, and features drawn from the complete corpus rather than the information need statements are all likely to assist in the construction of more effective trees.

The main focus of our TREC-2 effort has been to explore the idea that the CART trees can form the basis of a semi-automated approach to building knowledge-based descriptions of routing topics. To that end, we developed two techniques for converting CART output into a form that can be used by the TOPIC retrieval engine from Verity, Inc. In the first technique we perform a "lossless" transformation of the CART tree into a TOPIC tree. In the second we generalize the CART tree, although we add no new features.

As the examples contained in the paper show, the TOPIC trees generated in the first canonical form are "skeletal." This is just as we would expect. Since CART is a parsimonious classifier, rather than a broad-based information retrieval tool, it produces decision trees that are minimally complex with respect to the expected misclassification rate.

From an information retrieval perspective, the second canonical form seems more like the trees we might have built by hand. It uses a range of features and gives them weights based on their ability to discriminate among the training data. In effect, we have used CART to indicate which of the features are of use in defining the topic, and have then generalized the CART decision function using our (external) knowledge of the information retrieval problem.

Using these automatically constructed TOPIC trees as a starting point, we conducted a limited series of tests to assess the impact of performing minimal "editing" of these trees. For the two topics selected, we were able to produce significant performance gains with edits that added no new information (at least at the level of the features used) and that took of the order of only a few minutes to implement.

We are encouraged by these results, and, while generalizing them will require a more carefully controlled series of experiments, we are now of the opinion that the most effective role for machine learning techniques in information retrieval is as a tool for producing candidate descriptions of information need. These candidates can then be reviewed by end-users, who can easily make obvious corrections and modifications. We intend to explore this idea in more detail and report on our results in TREC-3.

6 Future Research

There are a number of directions in which we might develop the basic research ideas presented in this paper. We briefly consider some of them here.

We currently use just two classes (relevant and not relevant), but nothing in CART prevents it from working with multiple classes. For the document routing problem, there is a case to be made for adding a third class: unknown relevance. Adding such a class might allow us to make use of larger training sets without the costs associated with developing large ground truths.

One way of extending the skeleton TOPIC trees produced by our tool is to make use of external lexical resources. For example, we might investigate the use of WordNet as a way to expand each of the classification features into a set of related words.
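A minimal sketch of that expansion idea, assuming NLTK's WordNet interface as the lexical resource (the paper does not commit to any particular implementation), might look as follows.

    # Sketch of WordNet-based feature expansion using NLTK
    # (requires nltk.download('wordnet') to have been run once).
    from nltk.corpus import wordnet

    def expand_feature(word, max_terms=10):
        # Collect synonyms (lemma names) from all synsets of the word.
        related = set()
        for synset in wordnet.synsets(word):
            for lemma in synset.lemmas():
                related.add(lemma.name().replace("_", " ").upper())
        related.discard(word.upper())
        return sorted(related)[:max_terms]

    # e.g. expand_feature("contract") returns a list of related terms that could be
    # added (with low weights) under the corresponding feature in the TOPIC outline;
    # the exact terms depend on the WordNet version installed.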
Similarly, we might investigate the use of TOPIC's own lexical resources (e.g., the thesaurus and Soundex tools) by replacing the unstemmed words in the topic outline files with the appropriate TOPIC operator (e.g., <SOUNDEX> word, or <THESAURUS> word).

Although we have used CART as the module for building the initial classification tree, we might be interested in exploring other tree-building tools that have been used in the machine learning community: for example, the C4.5 algorithm by Quinlan [6], or various algorithms based on Bayesian methods such as Minimum Message Length models and decision graphs [7]. All of these tools generate decision trees from training data but offer different mathematical philosophies to justify their approaches.

Finally, as in all machine learning problems, the initial choice of features over which to learn is extremely critical to the overall success of the process. An investigation of various extended feature definition tools (e.g., recognizing key phrases and proper names), as well as an exploration of the impact of making different assumptions from TOPIC about how the lexical tokens in the texts are to be treated, would almost certainly yield important insights.

References

1. Breiman, L.; Friedman, J. H.; Olshen, R. A.; Stone, C. J. Classification and regression trees. Pacific Grove, CA: Wadsworth & Brooks; 1984.
2. Crawford, S. L.; Fung, R. M.; Appelbaum, L. A.; Tong, R. M. Classification trees for information retrieval. In: Proceedings of the eighth international workshop on machine learning (ML91). San Mateo, CA: Morgan Kaufmann; 1991.
3. Tong, R. M.; Winkler, A.; Gage, P. Classification trees for document routing. In: Harman, D. K., ed. The first text retrieval conference (TREC-1). NIST Special Publication SP-500-207, Gaithersburg, MD; March 1993.
4. Fox, C. A stop list for general text. SIGIR Forum, 24(1-2); Fall/Winter 1989/90.
5. Paice, C. D. Another stemmer. SIGIR Forum, 24(3):56-61; Fall 1990.
6. Quinlan, J. R. C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann; 1992.
7. Oliver, J. J.; Dowe, D. L.; Wallace, C. S. Inferring decision graphs using the minimum message length principle. In: Proceedings of the 5th Australian Joint Conference on Artificial Intelligence; 1992.