Application of the Automatic Message Router to the TIPSTER Collection

Richard L. Jones, Sek Kit Leung and D. Lewis Pape
Centre for Electronic Document Research
c/- Computer Power
PO Box 7126
Canberra Mail Centre ACT 2610
AUSTRALIA

Abstract

The two experiments undertaken (CPGHC and CPGCN) investigated the applicability of technology developed in the Automatic Message Router (AMR) Project to the TIPSTER data. AMR inverts the queries rather than the data; this is claimed to be an appropriate way of handling document routing, given the ephemeral nature of documents in an electronic network. AMR's techniques of automatically identifying good discriminating terms, and of using the relative position of terms in documents as a means of developing a single relevance score, proved very effective in producing good results in the experiment.

Background

The Centre for Electronic Document Research (CEDR) is based in Canberra, Australia. It is a laboratory in the Australian Computers and Communications Institute (ACCI), an organisation whose primary mission is to act as a link between Australian research in I.T. and the market-place. CEDR is the principal contribution to ACCI by Computer Power Group (CPG), Australia's largest computer services company. It also continues an eight-year R and D program by CPG into document analysis and retrieval techniques.

The technology being tested in the TREC experiments was developed in the Automatic Message Router (AMR) Project. AMR was developed over the past three years, extending techniques called IQ, developed earlier by CPG for document retrieval and incorporated in the STATUS document retrieval product since 1988.

The AMR Project

The AMR Project set out to develop viable techniques that operate in an electronic mail or wire service environment. In these environments, the roles of query and document are reversed, compared with document retrieval.
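This reversal of roles — inverting the queries rather than the documents — can be illustrated with a minimal sketch (ours, not the AMR implementation): the inverted index maps terms to the filters that contain them, and each incoming document is matched against that index in a single pass. The filter names and documents below are purely hypothetical.

```python
from collections import defaultdict

# Hypothetical filters: each maps a filter name to its set of terms.
filters = {
    "airline-mergers": {"airline", "merger", "acquisition"},
    "oil-prices":      {"oil", "crude", "opec", "price"},
}

# Invert the filters, not the documents: term -> filters containing it.
index = defaultdict(set)
for name, terms in filters.items():
    for term in terms:
        index[term].add(name)

def route(document: str) -> set:
    """Return the filters sharing at least one term with the document."""
    hits = set()
    for word in document.lower().split():
        hits |= index.get(word, set())
    return hits

print(route("OPEC ministers met to discuss crude oil price cuts"))
# -> {'oil-prices'}
```

Because only the (relatively few, long-lived) filters are indexed, each short-lived document can be scored and discarded in a single pass, which is the property the routing model requires.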
Users have a relatively long-term interest in a topic and wish to receive documents relevant to that topic as soon as possible. Documents, however, are of short-term interest and lose their value rapidly with time, unless they have been routed to someone with a specific interest in them. Of course, all documents may also be routed to a document retrieval system for more general historical access.

AMR reflects this exchange of the roles of query and document by inverting the queries (referred to as filters) and passing the documents against them one at a time. AMR allows filters to be prepared either in a structured form, where each filter term represents a set of synonymous terms, or as a plain English statement of the information that is desired. The former technique was used for all the filters in the experiment. The filters are inverted into memory for performance reasons.

AMR computes the relevance of each document using a set of heuristics that take into account the number of different terms in a filter and their relative positioning at a paragraph level. Each term is automatically weighted by estimating its effectiveness as a discriminator, i.e. its ability to divide the universe of documents into two groups: those that are relevant and those that are not.

From the model of routing defined above, a decision on the fate of a document must be made immediately. Thus the universe of documents is not constant but changes as each document is filtered. To handle this dynamic environment, AMR keeps statistics of the discriminating power of each term, both with respect to the most recent documents seen and as an average over the life-time of the filter. In practice, the weights stabilise after some 40 documents with some degree of relevance to the filter have been processed. Thereafter, the weights are only changed by a group of documents that predominantly discuss a few aspects of a filter.
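One simple way such dynamic discrimination statistics could be maintained is sketched below. This is our illustrative stand-in, not the actual AMR heuristics: a term's weight is estimated as the ratio of its rate of occurrence in documents judged relevant to its rate overall, updated as each document streams past.

```python
class TermWeight:
    """Running estimate of a term's discriminating power: how much more
    often it appears in documents relevant to the filter than in the
    document stream overall. (A simplified stand-in for AMR's heuristics.)"""

    def __init__(self):
        self.docs_seen = 0
        self.docs_with_term = 0
        self.relevant_seen = 0
        self.relevant_with_term = 0

    def update(self, term_present: bool, relevant: bool):
        """Fold one streamed document into the running statistics."""
        self.docs_seen += 1
        self.docs_with_term += term_present
        self.relevant_seen += relevant
        self.relevant_with_term += term_present and relevant

    def weight(self) -> float:
        # Add-one smoothing keeps the ratio defined early in the stream;
        # a weight above 1 marks a term that favours the filter.
        overall = (self.docs_with_term + 1) / (self.docs_seen + 2)
        in_relevant = (self.relevant_with_term + 1) / (self.relevant_seen + 2)
        return in_relevant / overall
```

A term that occurs mostly in relevant documents drifts toward a weight above 1, while a term spread evenly across the stream stays near 1, mirroring the stabilisation behaviour described above.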
The TREC Experiments

The experiments conducted for TREC had four major objectives: to obtain an objective evaluation of AMR on a large document collection; to develop as many different filters as possible for each topic, to see how sensitive the AMR heuristic is to widely differing filters; to investigate AMR's performance and robustness; and to perform tuning and relevance normalisation on the AMR heuristics.

Four sets of filters were run, though because of time constraints only two were submitted to the formal experiment set. These two were:

* CPGHC - Hand-crafted structured form of filters. These were written by an experienced staff member over a period of a week.

* CPGCN - Automated structured form, generated directly from the Concept field of each topic. These were generated by a simple LEX program that converted each entry directly into an AMR term.

Experimental Procedure

The filters were run against a portion of the Disk 1 data. The primary purpose of this run was to tune the internal weights of the terms in each filter. Because of hardware problems, the filters had been submitted to TREC before these experiments were conducted, so no changes were made to the filters in the light of these runs. However, some changes to the AMR algorithms were made to stabilise the dynamic weight modification process described above.

The two sets of filters submitted, together with two other sets generated from the Disk 1 data (200 filters in all), were run against the Disk 2 collection over a weekend. Though AMR provides a measure of the relevance of each document in the range 1 to 100, no cut-off was applied to the results submitted. If a filter returned 200 documents, then these were submitted, regardless of the scores obtained.
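The CPGCN filters were generated by a simple LEX program; the same transformation can be sketched in Python (the Concept-field fragment below is made up for illustration, and the real topic format may differ):

```python
import re

# A made-up fragment in the style of a TREC topic's Concept field.
concept_field = """\
1. airline merger, acquisition
2. takeover bid
3. route network"""

def concepts_to_terms(text: str) -> list:
    """Turn each Concept entry into filter terms: strip the leading
    item number, then split on commas (one term per phrase)."""
    terms = []
    for line in text.splitlines():
        line = re.sub(r"^\s*\d+\.\s*", "", line)  # drop "1. " numbering
        terms.extend(t.strip() for t in line.split(",") if t.strip())
    return terms

print(concepts_to_terms(concept_field))
# -> ['airline merger', 'acquisition', 'takeover bid', 'route network']
```

The appeal of this approach, as the results below suggest, is that a near-trivial mechanical conversion of a well-written Concept field already yields a competitive filter.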
Results Analysis

The results of the experiments were very encouraging, with both the manually generated and the automatically generated sets appearing in the top four routing results presented at TREC, as measured by the 11-point average precision vs recall scores.

The manually generated filters performed better than the automatically generated ones, though not markedly so. This is probably because the Concepts were well described and provided an adequate range of synonyms for most topics. Four of the manually generated filters produced very poor results, due to the insertion of a mandatory term in the filter.

As measured by the precision figures, the filters performed better on the second 25 topics than on the first 25. This may be due to the different nature of the topics: the first half of the topics are more fact-specific, while the second half are more generic and `information retrieval' oriented. However, the 11-point average recall-precision scores do not reflect any significant difference.

The system achieved a throughput of one document a second against 200 filters. This was especially pleasing since the experiments were run on a relatively modest platform: an 80486/33 processor with 8 Mbytes of memory under SCO UNIX.

As measured against the original objectives described above, three of the four have been achieved. AMR techniques have proved to be as accurate as any other in current use; the tests certainly proved that AMR is robust in running a very large collection, and its performance was also quite adequate; and the system was tuned during the first set of runs, an experience that proved very valuable. The only major objective that remains is to analyse AMR's sensitivity to widely differing filter types. Work is proceeding on this.
Centre for Electronic Document Research
TREC Routing Experiments CPGHC and CPGCN
Systems Summary and Timing

I Construction of Indexes

The software does not invert the text. It inverts the queries (or filters) and passes the text through the combined index formed from the queries.

II Query Construction

D Automatically Built Queries (Routing)
1 Concept field used
2 Time to build query: < 5 seconds
3 (a) Terms selected from topic
  (b) Terms weighted with weights based on terms from documents with relevance judgements, and dynamically modified through the training set and the test set
  (c) Phrases extracted from topics
  (d) Automatic addition of Boolean connectors and proximity operators from topics

E Manually Constructed Queries (Routing)
1 All topic fields used
2 Average time to build query: 30 minutes
3 Query builder: system expert
4 Data used to build query: from topic
5 The creation of the query uses Boolean operators and proximity operators.

III Searching

A Total Computer Time
One message through 200 filters per second. This includes searching and ranking.

B Method
Software uses a fuzzy AND and proximity measure to rank documents.

C Factors included in ranking
1 Term frequency
5 Position in document
7 Proximity of terms

IV Machine

The experiments were run on an HP 486/33 with 8 Mbytes under SCO UNIX. The CD-ROM drive was accessed via NFS.

V AMR Software

AMR is commercial-strength software developed by Computer Power Group. Its commercialisation software engineering phase took some three person-years.
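The fuzzy AND and proximity ranking named in section III.B could look roughly like the following sketch. This is our illustration rather than the AMR formula: partial credit is given for the fraction of filter terms present (the fuzzy AND), with a bonus when terms co-occur within a single paragraph, scaled to the 1-100 relevance range mentioned earlier; the 0.7/0.3 blend is an arbitrary choice for illustration.

```python
import re

def fuzzy_and_score(filter_terms, document, max_score=100):
    """Rough sketch of a fuzzy-AND ranking with a paragraph-proximity
    bonus (an illustration, not the actual AMR formula)."""
    # Tokenise each blank-line-separated paragraph into lowercase words.
    paragraphs = [re.findall(r"[a-z0-9]+", p.lower())
                  for p in document.split("\n\n")]
    words = {w for para in paragraphs for w in para}
    terms = [t.lower() for t in filter_terms]

    matched = sum(1 for t in terms if t in words)
    if matched == 0:
        return 0
    coverage = matched / len(terms)          # fuzzy AND: partial credit
    # Proximity: most distinct filter terms found inside one paragraph.
    best = max(sum(1 for t in terms if t in set(para)) for para in paragraphs)
    proximity = best / len(terms)
    return round(max_score * (0.7 * coverage + 0.3 * proximity))
```

For example, a document containing all filter terms in one paragraph scores the maximum, while the same terms scattered across separate paragraphs score lower, reflecting the paragraph-level positioning heuristic described in the paper.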