Application of the Automatic Message Router to the TIPSTER Collection

Richard L. Jones, Sek Kit Leung and D. Lewis Pape
Centre for Electronic Document Research
c/- Computer Power
PO Box 7126
Canberra Mail Centre ACT 2610
AUSTRALIA

Abstract

The two experiments undertaken (CPGHC and CPGCN) investigated the applicability of technology developed in the Automatic Message Router (AMR) Project to the TIPSTER data. AMR inverts the queries rather than the data; this is claimed to be an appropriate way of handling document routing, given the ephemeral nature of documents in an electronic network. AMR's techniques of automatically identifying good discriminating terms, and of using the relative position of terms in documents as a means of developing a single relevance score, proved very effective in producing good results in the experiment.

Background

The Centre for Electronic Document Research (CEDR) is based in Canberra, Australia. It is a laboratory in the Australian Computers and Communications Institute (ACCI), an organisation whose primary mission is to act as a link between Australian research in I.T. and the market-place. CEDR is the principal contribution to ACCI by Computer Power Group (CPG), Australia's largest computer services company. It also continues an eight-year R and D program by CPG into document analysis and retrieval techniques.

The technology being tested in the TREC experiments was developed in the Automatic Message Router (AMR) Project. AMR was developed over the past three years, extending techniques called IQ, developed earlier by CPG for document retrieval and incorporated in the STATUS document retrieval product since 1988.

The AMR Project

The AMR Project set out to develop viable techniques that operate in an electronic mail or wire service environment. In these environments, the roles of query and document are reversed, compared with document retrieval.
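This reversal of roles — inverting the queries rather than the documents — can be illustrated with a minimal sketch (ours, not the AMR implementation): the inverted index maps terms to the filters that contain them, and each incoming document is matched against that index in a single pass. The filter names and documents below are purely hypothetical.

```python
from collections import defaultdict

# Hypothetical filters: each maps a filter name to its set of terms.
filters = {
    "airline-mergers": {"airline", "merger", "acquisition"},
    "oil-prices":      {"oil", "crude", "opec", "price"},
}

# Invert the filters, not the documents: term -> filters containing it.
index = defaultdict(set)
for name, terms in filters.items():
    for term in terms:
        index[term].add(name)

def route(document: str) -> set:
    """Return the filters sharing at least one term with the document."""
    hits = set()
    for word in document.lower().split():
        hits |= index.get(word, set())
    return hits

print(route("OPEC ministers met to discuss crude oil price cuts"))
# -> {'oil-prices'}
```

Because only the (relatively few, long-lived) filters are indexed, each short-lived document can be scored and discarded in a single pass, which is the property the routing model requires.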
Users have a relatively long-term interest in a topic and wish to receive documents relevant to that topic as soon as possible. Documents, however, are of short-term interest and lose their value rapidly with time, unless they have been routed to someone with a specific interest in them. Of course, all documents may also be routed to a document retrieval system for more general historical access.

AMR reflects this exchange of the roles of query and document by inverting the queries (referred to as filters) and passing the documents against them one at a time. AMR allows filters to be prepared either in a structured form, where each filter term represents a set of synonymous terms, or as a plain English statement of the information that is desired. The former technique was used for all the filters in the experiment. The filters are inverted into memory for performance reasons.

AMR computes the relevance of each document using a set of heuristics that take into account the number of different terms in a filter and their relative positioning at a paragraph level. Each term is automatically weighted by estimating its effectiveness as a discriminator, i.e. its ability to divide the universe of documents into two groups: those that are relevant and those that are not.

From the model of routing defined above, a decision on the fate of a document must be made immediately. Thus the universe of documents is not constant but changes as each document is filtered. To handle this dynamic environment, AMR keeps statistics of the discriminating power of each term, both with respect to the most recent documents seen and as an average over the life-time of the filter. In practice, the weights stabilise after some 40 documents with some degree of relevance to the filter have been processed. Thereafter, the weights are only changed by a group of documents that predominantly discuss a few aspects of a filter.
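One simple way such dynamic discrimination statistics could be maintained is sketched below. This is our illustrative stand-in, not the actual AMR heuristics: a term's weight is estimated as the ratio of its rate of occurrence in documents judged relevant to its rate overall, updated as each document streams past.

```python
class TermWeight:
    """Running estimate of a term's discriminating power: how much more
    often it appears in documents relevant to the filter than in the
    document stream overall. (A simplified stand-in for AMR's heuristics.)"""

    def __init__(self):
        self.docs_seen = 0
        self.docs_with_term = 0
        self.relevant_seen = 0
        self.relevant_with_term = 0

    def update(self, term_present: bool, relevant: bool):
        """Fold one streamed document into the running statistics."""
        self.docs_seen += 1
        self.docs_with_term += term_present
        self.relevant_seen += relevant
        self.relevant_with_term += term_present and relevant

    def weight(self) -> float:
        # Add-one smoothing keeps the ratio defined early in the stream;
        # a weight above 1 marks a term that favours the filter.
        overall = (self.docs_with_term + 1) / (self.docs_seen + 2)
        in_relevant = (self.relevant_with_term + 1) / (self.relevant_seen + 2)
        return in_relevant / overall
```

A term that occurs mostly in relevant documents drifts toward a weight above 1, while a term spread evenly across the stream stays near 1, mirroring the stabilisation behaviour described above.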
The TREC Experiments

The experiments conducted for TREC had four major objectives: to obtain an objective evaluation of AMR on a large document collection; to develop as many different filters as possible for each topic, to see how sensitive the AMR heuristic is to widely differing filters; to investigate AMR's performance and robustness; and to perform tuning and relevance normalisation on the AMR heuristics.

Four sets of filters were run, though because of time constraints only two were submitted to the formal experiment set. These two were:

* CPGHC - Hand-crafted structured form of filters. These were written by an experienced staff member over a period of a week.

* CPGCN - Automated structured form, generated directly from the Concept field of each topic. These were generated by a simple LEX program that converted each entry directly into an AMR term.

Experimental Procedure

The filters were run against a portion of the Disk 1 data. The primary purpose of this run was to tune the internal weights of the terms in each filter. Because of hardware problems, the filters had been submitted to TREC before these experiments were conducted, so no changes were made to the filters in the light of these runs. However, some changes to the AMR algorithms were made to stabilise the dynamic weight modification process described above.

The two sets of filters submitted, together with two other sets generated from the Disk 1 data (200 filters in all), were run against the Disk 2 collection over a weekend. Though AMR provides a measure of the relevance of each document in the range 1 to 100, no cut-off was applied to the results submitted. If a filter returned 200 documents, then these were submitted, regardless of the scores obtained.
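The CPGCN filters were generated by a simple LEX program; the same transformation can be sketched in Python (the Concept-field fragment below is made up for illustration, and the real topic format may differ):

```python
import re

# A made-up fragment in the style of a TREC topic's Concept field.
concept_field = """\
1. airline merger, acquisition
2. takeover bid
3. route network"""

def concepts_to_terms(text: str) -> list:
    """Turn each Concept entry into filter terms: strip the leading
    item number, then split on commas (one term per phrase)."""
    terms = []
    for line in text.splitlines():
        line = re.sub(r"^\s*\d+\.\s*", "", line)  # drop "1. " numbering
        terms.extend(t.strip() for t in line.split(",") if t.strip())
    return terms

print(concepts_to_terms(concept_field))
# -> ['airline merger', 'acquisition', 'takeover bid', 'route network']
```

The appeal of this approach, as the results below suggest, is that a near-trivial mechanical conversion of a well-written Concept field already yields a competitive filter.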
Results Analysis

The results of the experiments were very encouraging, with both the manually generated and the automatically generated sets appearing in the top four routing results presented at TREC, as measured by the 11-point average precision vs recall scores.

The manually generated filters performed better than the automatically generated ones, though not markedly so. This is probably because the Concepts were well described and provided an adequate range of synonyms for most topics. Four of the manually generated filters produced very poor results, due to the insertion of a mandatory term in the filter.

As measured by the precision figures, the filters performed better on the second 25 topics than on the first 25. This may be due to the different nature of the topics: the first half of the topics are more fact-specific, while the second half are more generic and `information retrieval' oriented. However, the 11-point average recall-precision scores do not reflect any significant difference.

The system achieved a throughput of one document a second against 200 filters. This was especially pleasing since the experiments were run on a relatively modest platform: an 80486/33 processor with 8 Mbytes of memory under SCO UNIX.

As measured against the original objectives described above, three of the four have been achieved. AMR techniques have proved to be as accurate as any other in current use; the tests certainly proved that AMR is robust in running a very large collection, and its performance was also quite adequate; and the system was tuned during the first set of runs, an experience that proved very valuable. The only major objective that remains is to analyse AMR's sensitivity to widely differing filter types. Work is proceeding on this.
Centre for Electronic Document Research
TREC Routing Experiments CPGHC and CPGCN
Systems Summary and Timing

I Construction of Indexes

The software does not invert the text. It inverts the queries (or filters) and passes the text through the combined index formed from the queries.

II Query Construction

D Automatically Built Queries (Routing)
1 Concept field used
2 Time to build query: < 5 seconds
3 (a) Terms selected from topic
  (b) Terms weighted with weights based on terms from documents with relevance judgements, and dynamically modified through the training set and the test set
  (c) Phrases extracted from topics
  (d) Automatic addition of Boolean connectors and proximity operators from topics

E Manually Constructed Queries (Routing)
1 All topic fields used
2 Average time to build query: 30 minutes
3 Query builder: system expert
4 Data used to build query: from topic
5 The creation of the query uses Boolean operators and proximity operators.

III Searching

A Total Computer Time
One message through 200 filters per second. This includes searching and ranking.

B Method
Software uses a fuzzy AND and proximity measure to rank documents.

C Factors included in ranking
1 Term frequency
5 Position in document
7 Proximity of terms

IV Machine

The experiments were run on an HP 486/33 with 8 Mbytes under SCO UNIX. The CD-ROM drive was accessed via NFS.

V AMR Software

AMR is commercial-strength software developed by Computer Power Group. Its commercialisation software engineering phase took some three person-years.
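The fuzzy AND and proximity ranking named in section III.B could look roughly like the following sketch. This is our illustration rather than the AMR formula: partial credit is given for the fraction of filter terms present (the fuzzy AND), with a bonus when terms co-occur within a single paragraph, scaled to the 1-100 relevance range mentioned earlier; the 0.7/0.3 blend is an arbitrary choice for illustration.

```python
import re

def fuzzy_and_score(filter_terms, document, max_score=100):
    """Rough sketch of a fuzzy-AND ranking with a paragraph-proximity
    bonus (an illustration, not the actual AMR formula)."""
    # Tokenise each blank-line-separated paragraph into lowercase words.
    paragraphs = [re.findall(r"[a-z0-9]+", p.lower())
                  for p in document.split("\n\n")]
    words = {w for para in paragraphs for w in para}
    terms = [t.lower() for t in filter_terms]

    matched = sum(1 for t in terms if t in words)
    if matched == 0:
        return 0
    coverage = matched / len(terms)          # fuzzy AND: partial credit
    # Proximity: most distinct filter terms found inside one paragraph.
    best = max(sum(1 for t in terms if t in set(para)) for para in paragraphs)
    proximity = best / len(terms)
    return round(max_score * (0.7 * coverage + 0.3 * proximity))
```

For example, a document containing all filter terms in one paragraph scores the maximum, while the same terms scattered across separate paragraphs score lower, reflecting the paragraph-level positioning heuristic described in the paper.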