Overview of the First Text REtrieval Conference (TREC-1)

Donna Harman
National Institute of Standards and Technology
Gaithersburg, Md. 20899

1. Introduction

There is a long history of experimentation in information retrieval. Research started with experiments in indexing languages, such as the Cranfield I tests (Cleverdon 1962), and has continued with over 30 years of experimentation with the retrieval engines themselves. The Cranfield II studies (Cleverdon et al. 1966) showed that automatic indexing was comparable to manual indexing, and this finding, together with the availability of computers, created a major interest in the automatic indexing and searching of texts. The Cranfield experiments also emphasized the importance of creating test collections and using these for comparative evaluation. The Cranfield collection, created in the late 1960's, contained 1400 documents and 225 queries, and has been heavily used by researchers since then. Subsequently other collections have been built, such as the CACM collection (Fox 1983) and the NPL collection (Sparck Jones & Webster 1979).

In the 30 or so years of experimentation there have been two missing elements. First, although some research groups have used the same collections, there has been no concerted effort by groups to work with the same data, use the same evaluation techniques, and generally compare results across systems. The importance of this is not to show any system to be superior, but to allow comparison across a very wide variety of techniques, much wider than any one research group would tackle. Karen Sparck Jones commented in 1981 that:

    Yet the most striking feature of the test history of the past two decades is its lack of consolidation. It is true that some very broad generalizations have been endorsed by successive tests: for example...but there has been a real failure at the detailed level to build one test on another. As a result there are no explanations for these generalizations, and hence no means of knowing whether improved systems could be designed (p. 245).

This consolidation is more likely if groups can compare results across the same data, using the same evaluation method, and then meet to discuss openly how their methods differ.

The second missing element, which has become critical in the last 10 years, is the lack of a realistically sized test collection. Evaluation using the small collections currently available may not reflect the performance of systems in large full-text searching, and certainly does not demonstrate any proven ability of these systems to operate in real-world information retrieval environments. This is a major barrier to the transfer of these laboratory systems into the commercial world. Additionally, some techniques, such as the use of phrases and the construction of automatic thesauri, seem intuitively workable, but have repeatedly failed to show improvement in performance on the small collections. Larger collections might demonstrate the effectiveness of these procedures.

The overall goal of the Text REtrieval Conference (TREC) was to address these two missing elements. It is hoped that by providing a very large test collection, and by encouraging interaction with other groups in a friendly evaluation forum, a new thrust in information retrieval will occur. There is also an increased interest in this field within the DARPA community, and TREC is designed to be a showcase of the state-of-the-art in retrieval research.
NIST's goal as co-sponsor of TREC is to encourage communication and technology transfer among academia, industry, and government.

2. The Task

2.1 Introduction

TREC is designed to encourage research in information retrieval using large data collections. Two types of retrieval are being examined -- retrieval using an "adhoc" query such as a researcher might use in a library environment, and retrieval using a "routing" query such as a profile to filter some incoming document stream.

The TREC task is not tied to any given application, and is not concerned with interfaces or optimized response time for searching. However, it is helpful to have some potential user in mind when designing or testing a retrieval system. The model for a user in TREC is a dedicated searcher, not a novice searcher, and the model for the application is one needing monitoring of data streams for information on specific topics (routing), and the ability to do adhoc searches on archived data for new topics. It should be assumed that the users need the ability to do both high precision and high recall searches, and are willing to look at many documents and repeatedly modify queries in order to get high recall. Obviously they would like a system that makes this as easy as possible, but this ease should be reflected in TREC as added intelligence in the system rather than as special interfaces.

Since TREC has been designed to evaluate system performance both in a routing (filtering or profiling) mode and in an adhoc mode, both functions need to be tested. The test design was based on traditional information retrieval models, and evaluation used traditional recall and precision measures. The following diagram of the test design shows the various components of TREC (fig. 1).

[Figure 1. The TREC Task: the four data sets -- training topics, test topics, the 1-gigabyte training documents (D1), and the 1-gigabyte test documents (D2) -- and the three query sets Q1 (training queries), Q2 (routing queries), and Q3 (adhoc queries).]

This diagram reflects the four data sets (2 sets of topics and 2 sets of documents) that were provided to participants. These data sets (along with a set of sample relevance judgments for the 50 training topics) were used to construct three sets of queries. Q1 is the set of queries (probably multiple sets) created to help in adjusting a system to this task, to create better weighting algorithms, and in general to train the system for testing. The results of this research were used to create Q2, the routing queries to be used against the test documents. Q3 is the set of queries created from the test topics as adhoc queries for searching against the combined documents (both training documents and test documents). The results from searches using Q2 and Q3 were the official test results. The documents were full-length texts from various sources such as newspapers, newswires, magazines and journals (see sect. 3.2 for more details).

2.2 Specific Task Guidelines

The various TREC participants used a wide variety of indexing and knowledge base building techniques, and a wide variety of approaches to generate search queries. Therefore it was important to establish clear guidelines for the TREC task and to develop some methods of standardized reporting to allow comparison. The guidelines deal with the methods of indexing/knowledge base construction, and with the methods of generating the queries from the supplied topics.
In general they were constructed to reflect an actual operational environment, and to allow as fair as possible a separation among the diverse query construction approaches.

There were guidelines for constructing and manipulating the system data structures. These structures were defined to consist of the original documents, any new structures built automatically from the documents (such as inverted files, thesauri, conceptual networks, etc.), and any new structures built manually from the documents (such as thesauri, synonym lists, knowledge bases, rules, etc.). The following guidelines were provided to the participants.

1. System data structures can be built using the initial training set (documents D1, training topics, and relevance judgments). They may be modified based on the test documents D2, but not based on the test topics. In particular, the processing of one test topic should not affect the processing of another test topic. For example, it would not be allowed to update a system knowledge base based on the analysis of one test topic in such a way that the interpretation of subsequent test topics was changed in any fashion.

2. There are several parts of the Wall Street Journal and the Ziff material (see sect. 3.2) that contain manually assigned controlled or uncontrolled index terms. These fields are delimited by SGML tags, as specified in the documentation files included with the data. Other parts of the TREC data contain no manual indexing. Since the primary focus of TREC is on retrieval and routing of naturally occurring text, these manually indexed terms should not be indiscriminately used as if they were a normal part of the text. If your group decides to use these terms, they should be part of a specific experiment that utilizes manual indexing terms, and their use should be declared.

3. Special care should be used in handling the routing topics. In a true routing situation, a single document would be indexed and "passed" against the routing topics. Since most of you will be indexing the test documents as a complete set, routing should be simulated by not using any test document information (such as IDF based on the test collection, total frequency based on the test collection, etc.) in the searching. It is perfectly permissible to use training-set collection information, however. If your system bases system data structures on the entire test data and is unable to operate in a proper routing mode, then you should either have a different method for handling routing, or only submit results for the adhoc part of TREC.

Additionally there were guidelines for constructing the queries from the provided topics (see sect. 3.3 for more on the topics). These guidelines were considered of great importance for fair system comparison and were therefore carefully constructed. Three generic categories were defined, based on the amount and kind of manual intervention used.

1. Method 1 -- completely automatic initial query construction.

adhoc queries -- The system will automatically extract information from the topic (the topic fields used should be identified) to construct the query. The query will then be submitted to the system (with no manual modifications) and the results from the system will be the results submitted to NIST. There should be no manual intervention that would affect the results.

routing queries -- The queries should be constructed automatically using the training topics, the training relevance judgments, and the training documents.
The queries should then be submitted to NIST before the test documents are released and should not be modified after that point. The unmodified queries should be run against the test documents and the results submitted to NIST.

2. Method 2 -- manual initial query construction.

adhoc queries -- The query is constructed in some manner from the topic, either manually or using machine assistance. The methods used should be identified, along with the human expertise (both domain expertise and computer expertise) needed to construct a query. Once the query has been constructed, it will be submitted to the system (with no manual intervention), and the results from the system will be the results submitted to NIST. There should be no manual intervention after initial query construction that would affect the results. (Manual intervention is covered by Method 3.)

routing queries -- The queries should be constructed in the same manner as the adhoc queries for Method 2, but using the training topics, relevance judgments, and training documents. They should then be submitted to NIST before the test documents are released and should not be modified after that point. The unmodified queries should be run against the test documents and the results submitted to NIST.

3. Method 3 -- automatic or manual query construction with feedback.

adhoc queries -- The initial query can be constructed using either Method 1 or Method 2. The particular technique used should be described. The query is submitted to the system, and a subset of the retrieved documents is used for manual feedback, i.e., a human makes judgments about the relevance of the documents in this subset. These judgments may be communicated to the system, which may automatically modify the query, or the human may simply choose to modify the query himself. In either case, the expertise of the person or persons examining the documents should be described (both their domain expertise and their experience in online searching), and the manner of system feedback (i.e., automatic system modification of the query or human modification) should also be described. At some point, feedback should end, and the query should be accepted as final.

Three sets of results should be sent to NIST for each topic. The first set should be the results without feedback, i.e., the top 200 documents retrieved from an initial query produced without feedback, whether produced manually or automatically. This set should be exactly the same as the results from Method 1 or Method 2, but should be submitted again as one part of Method 3. The second set should be the results after only one iteration of feedback, with the top X documents used in the first iteration of feedback frozen. For example, if you "used" the top 20 documents for feedback, then the second set of results should have these documents as the top 20 documents, followed by the top 200 documents retrieved based on feedback. The term "used" means all documents for which some information has been seen by the judge, and which are deemed by the system to have been seen. These two sets of results will be used by NIST to calculate a residual evaluation measure. The third set of results should be a record of your feedback, i.e., a list of documents in the exact order they were seen and judged, with an indication of iteration boundaries. For example, if you ran six iterations of feedback, with 10 documents looked at for each iteration, the record would be a list of the 60 documents seen by the "user", marked at 10, 20, 30, etc.
You should also indicate what information the user communicated to your system about each document (relevant/not relevant, too general/too specific/on target, etc.). We will be specifying a format for these record files later. These files will be used to calculate measures based on the total number of relevant documents retrieved, both across iterations and across a given document level.

routing queries -- Method 3 cannot be used for routing queries because routing systems have typically not supported feedback.

In general these guidelines served well, although there was some misunderstanding about what constituted feedback. The guidelines will be clarified for TREC-2.

2.3 The Participants

There were 25 participating systems in TREC-1, using a wide range of retrieval techniques. The participants were able to choose from three levels of participation: Category A, full participation; Category B, full participation using a reduced dataset (25 topics and 1/4 of the full document set); and Category C, evaluation only (to allow commercial systems to protect proprietary algorithms). The program committee selected only 20 Category A and B groups to present talks because of limited conference time, and requested that the rest of the groups present posters. All groups were asked to submit papers for the proceedings.

Each group was provided the data, and asked to turn in either one or two sets of results for each topic. When two sets of results were sent, they could be made using different methods of creating queries (Methods 1, 2, or 3), or by using different parameter settings for one query creation method. Groups could choose to do the routing task, the adhoc task, or both, and were requested to submit the top 200 documents retrieved for each topic for evaluation.

3. The Test Collection

3.1 Introduction

Critical to the success of TREC was the creation of the test collection. Like most traditional retrieval collections, there are three distinct parts to this collection. The first is the documents themselves -- the training set (D1) and the test set (D2). Both were distributed as CD-ROMs with about 1 gigabyte of data each, compressed to fit. The training topics, the test topics and the relevance judgments were supplied by email. TREC-1 used the same test collection (documents and topics) used in the DARPA TIPSTER project. (The DARPA TIPSTER project involves the same tasks as TREC, but with four contractors doing more intense research than is being expected from TREC participants (Harman 1993).) However, a major increase in the number of relevance judgments for this collection became available from the TREC-1 evaluation.

The components of the test collection -- the documents, the topics, and the relevance judgments -- are discussed in the rest of this section.

3.2 The Documents

The documents came from the following sources.

Disk 1
  -- Wall Street Journal (1986, 1987, 1988, 1989)
  -- AP Newswire (1989)
  -- Information from Computer Select disks (Ziff-Davis Publishing)
  -- Federal Register (1989)
  -- Short abstracts from the Department of Energy

Disk 2
  -- Wall Street Journal (1990, 1991, 1992)
  -- AP Newswire (1988)
  -- Information from Computer Select disks (Ziff-Davis Publishing)
  -- Federal Register (1988)

The particular sources were selected because they reflected the different types of documents used in the imagined TREC application. Specifically, they had a varied length, a varied writing style, a varied level of editing, and a varied vocabulary.
All participants were required to sign a detailed user agreement for the data in order to protect the copyrighted source material. The documents were uniformly formatted into an SGML-like structure, as can be seen in the following example.

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications equipment markets. AT&T said it is the first national long-distance carrier to announce prices for specific services under a world-wide standardization plan to upgrade phone networks. By announcing commercial services under the plan, which the industry calls the Integrated Services Digital Network, AT&T will influence evolving communications standards to its advantage, consultants said, just as International Business Machines Corp. has created de facto computer standards favoring its products.
</TEXT>
</DOC>

All documents had beginning and end markers, and a unique DOCNO id field. Additionally, other fields taken from the initial data appeared, but these varied widely across the different sources. The documents also had different amounts of errors, which were not checked or corrected. Not only would this have been an impossible task, but the errors in the data provided a better simulation of the real-world task. Errors such as missing document separators or bad document numbers were screened out, although a few were missed and later reported by participants. Table 1 shows some basic document collection statistics.

TABLE 1. DOCUMENT STATISTICS

Subset of collection                          WSJ       AP        ZIFF      FR        DOE
Size of collection (megabytes)     (disk 1)   295       266       251       258       190
                                   (disk 2)   255       248       188       211
Number of records                  (disk 1)   98,736    84,930    75,180    26,207    226,087
                                   (disk 2)   74,520    79,923    56,920    20,108
Median number of terms per record  (disk 1)   182       353       181       313       82
                                   (disk 2)   218       346       167       315
Average number of terms per record (disk 1)   329       375       412       1017      89
                                   (disk 2)   377       370       394       1073

Note that although the collection sizes are roughly equivalent in megabytes, there is a range of document lengths from very short documents (DOE) to very long (FR). Also, the range of document lengths within a collection varies. For example, the documents from AP are similar in length (the median and the average length are very close), but the WSJ and ZIFF documents have a wider range of lengths. The documents from the Federal Register (FR) have a very wide range of lengths.

The distribution of terms in these subsets shows interesting variations. Table 2 shows some term distribution statistics found using a small stopword list of 25 terms and no stemming. For example, the AP subset has more unique terms than the others, probably reflecting both more proper names and more spelling errors. The DOE collection, while very small, is highly technical and covers many domains, resulting in many specific technical terms.
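The dictionary statistics shown in Table 2 below come from simple counting over the tokenized text. The following is a minimal sketch of that kind of computation, assuming lower-casing, regular-expression tokenization, no stemming, and an illustrative stopword list (the actual 25-word list used for TREC is not reproduced here).

```python
from collections import Counter
import re

# Illustrative stopword list; the actual TREC list contained 25 common words.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "for", "on", "that"}

def term_statistics(documents):
    """Compute dictionary statistics (as in Table 2) for a list of document texts."""
    counts = Counter()
    for text in documents:
        for token in re.findall(r"[a-z0-9&']+", text.lower()):
            if token not in STOPWORDS:       # no stemming, per the TREC-1 statistics
                counts[token] += 1
    once = sum(1 for c in counts.values() if c == 1)
    multi = {t: c for t, c in counts.items() if c > 1}
    return {
        "unique terms": len(counts),
        "occurring once": once,
        "occurring more than once": len(multi),
        "average occurrences (terms occurring > 1)":
            sum(multi.values()) / len(multi) if multi else 0.0,
    }

if __name__ == "__main__":
    sample = ["AT&T introduced a new generation of phone services.",
              "AT&T said it is the first national long-distance carrier."]
    print(term_statistics(sample))
```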
TABLE 2. DICTIONARY STATISTICS

Subset of collection                      WSJ       AP        ZIFF      FR        DOE
Total number of unique terms   (disk 1)   156,298   197,608   173,501   126,258   186,225
                               (disk 2)   153,725   186,500   147,405   116,586
Occurring once                 (disk 1)   64,656    89,627    85,992    58,677    95,782
                               (disk 2)   64,844    83,019    72,053    54,823
Occurring more than once       (disk 1)   91,642    107,981   87,509    67,581    90,443
                               (disk 2)   88,881    103,481   75,352    61,763
Average number of occurrences
for terms occurring > 1        (disk 1)   199       174       165       106       159
                               (disk 2)   178       169       139       91

How does this document set compare with the older collections? Table 3 shows a comparison of these collections with the Cranfield 1400 collection mentioned earlier. Not only has the size of the document collection increased by a factor of about 200, but the average length of the documents has at least doubled, and in some cases (FR) increased by a factor of 10. Also, the dictionary sizes have increased by a factor of 20.

TABLE 3. COMPARISON TO OLDER COLLECTIONS

Subset of collection                 WSJ       AP        ZIFF      FR        Cranfield
Size of collection (megabytes)       295       266       251       258       1.5
Number of records                    98,736    84,930    75,180    26,207    1400
Median number of terms per record    182       353       181       313       79
Average number of terms per record   329       375       412       1017      88
Total number of unique terms         156,298   197,608   173,501   126,258   8226

What does this mean to the TREC task? First, a major portion of the effort for TREC-1 was spent in the system engineering necessary to handle the huge number of documents. This means that little time was left for system tuning or experimental runs, and therefore the TREC-1 results can best be viewed as a baseline for later research. The longer documents also required major adjustments to the algorithms themselves (or a loss of performance). This is particularly true for the very long documents in FR. Since a relevant document might contain only one or two relevant sentences, many algorithms needed adjustment from working with the abstract-length documents found in the old collections. Additionally, many documents were composite stories covering different topics, and this caused problems for most algorithms.

3.3 The Topics

In designing the TREC (and TIPSTER) tasks, there was a conscious decision made to provide "user need" statements rather than more traditional queries. Two major issues were involved in this decision. First, there was a desire to allow a wide range of query construction methods by keeping the topic (the need statement) distinct from the query (the actual text submitted to the system). The second issue was the ability to increase the amount of information available about each topic, in particular to include with each topic a clear statement of what criteria make a document relevant.

The topics were designed to mimic a real user's need, and were written by people who are actual users of a retrieval system. Although the subject domain of the topics was diverse, some consideration was given to the documents to be searched. The topics were constructed by doing trial retrievals against a sample of the document set, and then those topics that had roughly 25 to 100 hits in that sample were used. This created a range of broader and narrower topics. The following is one of the topics used in TREC.

Tipster Topic Description

Number: 066
Domain: Science and Technology
Topic: Natural Language Processing

Description:
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.
Narrative:
A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.

Concept(s):
1. natural language processing
2. translation, language, dictionary, font
3. software applications

Factor(s):
Nationality: U.S.

Definition(s):

Each topic was formatted in the same standard method to allow easier automatic construction of queries. Besides a beginning and an end marker, each topic had a number, a short title, and a one-sentence description. There was a narrative section which was aimed at providing a complete description of document relevance for the assessors. Each topic also had a concepts section with a list of assorted concepts related to the topic. This section was designed to provide a mini-knowledge base about a topic, such as a real searcher might possess. Additionally, each topic could have a definitions section and/or a factors section. The definitions section had one or two of the definitions critical to a human understanding of the topic. The factors section was included to allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant. Two particular factors were used in the TREC-1 topics: a time factor (current, before a given date, etc.) and a nationality factor (either involving only certain countries or excluding certain countries).

While the TREC topics did not present a problem in scaling, the task of either automatically constructing a query, or manually constructing a query with little foreknowledge of its searching capability, was a major challenge for TREC participants. In addition to filtering the relatively large amount of information provided in the topics into queries, the sometimes narrow definition of relevance as stated in the narrative was difficult for most systems to handle. The two narratives shown below illustrate this point.

Number: 051
A relevant document will cite or discuss assistance to Airbus Industrie by the French, German, British or Spanish government(s), or will discuss a trade dispute between Airbus or the European governments and a U.S. aircraft producer, most likely Boeing Co. or McDonnell Douglas Corp., or the U.S. government, over federal subsidies to Airbus.

Number: 058
A relevant document will either report an impending rail strike, describing the conditions which may lead to a strike, or will provide an update on an ongoing strike. To be relevant, the document will identify the location of the strike or potential strike. For an impending strike, the document will report the status of negotiations, contract talks, etc. to enable an assessment of the probability of a strike. For an ongoing strike, the document will report the length of the strike to the current date and the status of negotiations or mediation.

In a preliminary analysis, the narratives and the factors played a strange and unpredictable role in the results for TREC-1. Systems did as well on topics with very restrictive narratives, such as that of topic 58, as on topics with non-restrictive narratives, such as topic 51. The subject and terms in the entire topic were more important in determining success than the restrictiveness of the narrative. The factors also did not play a major role in system performance. This could change in TREC-2 when groups have more time to adjust their systems to the TREC task.
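To make the completely automatic (Method 1) style of query construction concrete, the sketch below turns a topic of the form shown above into a simple weighted term query: the chosen fields are split out, tokenized, stopwords are removed, and the remaining terms are collected with their frequencies. This is only a hypothetical illustration of the general idea, not the query-building procedure of any particular TREC participant, and the stopword list is illustrative.

```python
from collections import Counter
import re

STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "will", "which",
             "is", "or", "be", "being", "by", "for", "etc"}  # illustrative list

def parse_topic(topic_text):
    """Split a TREC-style topic into labeled fields (Number, Topic, Description, ...)."""
    fields, label = {}, None
    for line in topic_text.splitlines():
        m = re.match(r"^(Number|Domain|Topic|Description|Narrative|"
                     r"Concept\(s\)|Factor\(s\)|Definition\(s\)):\s*(.*)", line)
        if m:
            label = m.group(1)
            fields[label] = m.group(2)
        elif label:
            fields[label] += " " + line
    return fields

def build_query(topic_text, use_fields=("Topic", "Description", "Concept(s)")):
    """Method 1 sketch: automatically extract weighted terms from selected topic fields."""
    fields = parse_topic(topic_text)
    terms = Counter()
    for f in use_fields:
        for tok in re.findall(r"[a-z]+", fields.get(f, "").lower()):
            if tok not in STOPWORDS:
                terms[tok] += 1
    return terms          # term -> weight (here, simple within-topic frequency)
```

In an actual system these raw frequencies would normally be replaced by collection-based weights (for example, tf-idf), and for routing queries the weights could be adjusted using the training relevance judgments.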
3.4 The Relevance Judgments

The relevance judgments are of critical importance to a test collection. For each topic it is necessary to compile a list of relevant documents, ideally as comprehensive a list as possible. For the TREC task, three possible methods for finding the relevant documents could have been used. In the first method, full relevance judgments could have been made on all 742,611 documents for each topic, resulting in over 74 million judgments. This was clearly impossible. As a second approach, a random sample of the documents could have been taken, with relevance judgments done on that sample only. The problem with this approach is that a random sample large enough to find on the order of 200 relevant documents per topic is a very large random sample, and a sample of manageable size is likely to result in insufficient relevance judgments. The third method, the one used in TREC, was to make relevance judgments on the sample of documents selected by the various participating systems. This method is known as the pooling method, and has been used successfully in creating other collections. It was the method recommended in a 1975 proposal to the British Library to build a very large test collection (Sparck Jones & van Rijsbergen 1975).

To construct the pool, the following was done.

1. Divide each set of results into results for a given topic.
2. For each topic within a set of results, select the top 200 ranked documents for input to the pool.
3. For each topic, merge results from all systems.
4. For each topic, sort results based on document numbers.
5. For each topic, remove duplicate documents.

Pooling proved to be an effective method. There was little overlap among the 25 systems in their retrieved documents. Table 4 shows the overlap statistics. The first overlap statistics are for the adhoc topics (test topics against both training documents D1 and test documents D2), and the second statistics are for the routing topics (training topics against test documents D2 only).

TABLE 4. OVERLAP OF SUBMITTED RESULTS

                                             Top 200    Top 200   Top 100    Top 100
                                             Possible   Actual    Possible   Actual
Average Number of Unique Documents
Per Topic (Adhoc, 33 runs, 16 groups)        6600       2398.4    3300       1278.86
Average Number of Unique Documents
Per Topic (Routing, 22 runs, 16 groups)      4400       1932.42   2200       1066.86

For example, out of a maximum of 6600 unique documents (33 runs times 200 documents), over one-third were actually unique. The top 100 documents retrieved contained about the same percentage of unique documents. This means that the different systems were finding different documents as likely relevant documents for a topic. Whereas this might be expected (and indeed has been shown to occur, Katzer et al. 1982) for widely differing systems, these overlaps were often between two runs for a given system, or between two systems run on the same basic retrieval engine. One reason for the lack of overlap is the very large number of documents that contain many of the same keywords as the relevant documents, but probably a larger reason is the very different sets of keywords in the constructed queries (this needs further analysis). This lack of overlap should improve the coverage of the relevance set, and it verifies the use of the pooling methodology to produce the sample.
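The five pool-construction steps listed above amount to a per-topic merge and de-duplication of the submitted rankings. A minimal sketch, using hypothetical run and document-number structures purely for illustration, might look like this:

```python
def build_pool(runs, depth=200):
    """Pooling sketch: merge the top `depth` documents per topic from every run.

    `runs` is a list of result sets, each a dict mapping topic id to a ranked
    list of document numbers (hypothetical structure for illustration).
    Returns a dict mapping topic id to a sorted list of unique document numbers.
    """
    pool = {}
    for run in runs:                          # step 1: results are already split by topic
        for topic, ranked_docs in run.items():
            top = ranked_docs[:depth]         # step 2: take the top-ranked documents
            pool.setdefault(topic, set()).update(top)   # steps 3 and 5: merge, drop duplicates
    # step 4: sort by document number for presentation to the assessors
    return {topic: sorted(docs) for topic, docs in pool.items()}

# Example with two tiny hypothetical runs:
run_a = {"051": ["WSJ880406-0090", "AP890101-0001", "FR88107-0009"]}
run_b = {"051": ["AP890101-0001", "ZF07-238-124"]}
print(build_pool([run_a, run_b], depth=200))
# {'051': ['AP890101-0001', 'FR88107-0009', 'WSJ880406-0090', 'ZF07-238-124']}
```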
The merged list of results was then shown to the human assessors. Only the top 100 documents were judged, resulting in an average of 1462.24 documents judged for each topic, ranging from a high of 2893 for topic 74 to a low of 611 for topic 46. Each topic was judged by a single assessor to ensure the best consistency of judgment, and varying numbers of documents were judged relevant to the topics. Figure 3 shows the number of documents judged relevant for each of the 100 topics. The topics are sorted by the number of relevant documents to better show their range and median.

[Figure 3. Number of Relevant Documents on a Per Topic Basis.]

4. Evaluation

4.1 Existing Evaluation Methodology

An important element of TREC was to provide a common evaluation forum. Standard recall/precision figures were calculated for each system, and the tables and graphs for the results are presented in Appendix A. Figure 4 shows a typical recall/precision curve for illustration purposes. The x axis plots the recall values at fixed levels of recall, where

    Recall = number of relevant items retrieved / total number of relevant items in collection.

The y axis plots the average precision values at those given recall values, where precision is calculated by

    Precision = number of relevant items retrieved / total number of items retrieved.

[Figure 4. A Sample Recall/Precision Curve.]

There is a standard table and graph in Appendix A for each run for each system, with the runs identified by their unique tags. A map for matching the tags to the systems is also provided. Note that the tables for the TIPSTER panel are in Appendix B, as those results are not directly comparable to the TREC results. The tables show some total statistics for each run, plus both the recall-level and document-level recall/precision averages. A second type of information about each system is shown in Appendix C. These standardized forms describe system features and system timing, and allow some primitive comparison of the amount of effort needed to produce the results.

4.2 Problems with Evaluation

Since this was the first time that such a large collection of text had been used in evaluation, there were some problems using the existing methods of evaluation. First, groups were asked to send in only the top 200 documents retrieved by their systems. This artificial document cutoff is relatively low, and systems did not retrieve all the relevant documents for most topics within the cutoff. All documents retrieved beyond the 200 were considered nonrelevant by default, and therefore the recall/precision curves become inaccurate after about 40% recall on average. Table 5 shows a comparison of one system using no threshold (so relevant documents found beyond the 200 limit are marked as relevant) versus using the 200 document threshold.

TABLE 5. COMPARISON OF TABLES FROM TIPSTER

                  Full Ranking   Top 200 Ranking
Recall            Precision      Precision
0.0               0.821          0.8208
0.1               0.672          0.6710
0.2               0.581          0.5759
0.3               0.528          0.5030
0.4               0.472          0.3819
0.5               0.424          0.2999
0.6               0.368          0.1773
0.7               0.315          0.1075
0.8               0.244          0.0487
0.9               0.154          0.0117
1.0               0.039          0.0000
11 Pt. average    0.421          0.3271

Full Ranking              Top 200 Ranking
Recall    Precision       Recall    Precision
0.25      0.559           0.20      0.5759
0.50      0.424           0.50      0.2999
0.75      0.280           0.80      0.0487
3 Pt. average   0.421     3 Pt. average   0.3082
It can be seen from these tables that not only are the recall-level statistics beyond about 40% recall inaccurate, but both the 11 pt. and the 3 pt. averages based on this table are also inaccurate. Since all systems were compared using the same measures, this problem is not serious in terms of comparing methods within TREC-1. However, it could be improved by lowering the threshold, and TREC-2 will be run such that at least the top 500 documents are used for evaluation.

A related problem occurred because some systems in TREC-1 worked on a variable thresholding system, with that threshold set for each topic. Documents not matching sufficient system criteria were rejected, even if fewer than 200 were returned. Sometimes as few as 10 documents were sent as results, and the evaluation method again assumed all documents beyond the 10 were not relevant. This hurt performance for these systems badly in some cases, and the individual system papers discuss this. The plans for TREC-2 are to include some additional thresholding tests, so that these systems can evaluate how their thresholding performs and also evaluate the standard ranking as done by other systems.

The third problem was more general in nature. The current recall/precision measures do not include any indication of the collection size. This means that the recall and precision of a system based on a 1400 document collection could be the same as that of a system based on a million document collection, but obviously the discrimination powers on a million document collection would be much greater. This may not have been a problem on the smaller collections, but the discrimination power of systems on TREC-sized collections is very important. Clearly some new evaluation measures are needed for this.

One new measure being tried in TREC-1 is the ROC (Relative Operating Characteristic) curve used in signal processing. These curves are similar to the recall/precision curves, but allow the total size of the collection to influence performance. The two variables being used here are the probability of detection, or probability of a "hit", versus the probability of false alarm, or probability of a "false drop". The x axis plots the probability of false alarm, calculated as

    Probability of false alarm = number of nonrelevant items retrieved / total number of nonrelevant items in collection.

The y axis plots the probability of detection, calculated as

    Probability of detection = number of relevant items retrieved / total number of relevant items in collection.

Note that the probability of detection is the same as recall, and the probability of false alarm is the same as fallout, an older measure in information retrieval (Salton & McGill 1983). These measures are for a single topic, but averages can be computed similarly to the recall-level averages by using the probability of detection at fixed false alarm rates. The tables in Appendix A show both this average ROC curve and the same curve plotted on probability scales (Swets 1969).
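The per-topic measures described in this section are simple ratios over a ranked result list and the set of relevant documents for a topic. The sketch below computes recall, precision, and fallout (the probability of false alarm) at a document cutoff, together with the interpolated 11-point average precision used for the recall-level tables. It is a generic illustration of these standard measures under the stated definitions, not the actual evaluation software used for TREC.

```python
def measures_at_cutoff(ranked, relevant, collection_size, k=200):
    """Recall, precision, and fallout after the top k retrieved documents."""
    retrieved = ranked[:k]
    rel_retrieved = sum(1 for d in retrieved if d in relevant)
    recall = rel_retrieved / len(relevant)
    precision = rel_retrieved / len(retrieved)
    nonrel_in_collection = collection_size - len(relevant)
    fallout = (len(retrieved) - rel_retrieved) / nonrel_in_collection  # prob. of false alarm
    return recall, precision, fallout

def eleven_point_average(ranked, relevant):
    """Interpolated precision averaged at recall levels 0.0, 0.1, ..., 1.0."""
    # recall and precision after each retrieved document
    points, rel_seen = [], 0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            rel_seen += 1
        points.append((rel_seen / len(relevant), rel_seen / i))

    # interpolated precision at recall r = max precision at any recall >= r;
    # unreachable recall levels (e.g. relevant documents beyond the cutoff) score 0
    def interp(r):
        candidates = [p for rec, p in points if rec >= r]
        return max(candidates) if candidates else 0.0

    levels = [i / 10 for i in range(11)]
    return sum(interp(r) for r in levels) / len(levels)
```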
5. Preliminary Results

5.1 Introduction

The results of the TREC-1 conference should be viewed only as a preliminary baseline for what can be expected from systems working with large test collections. There are several reasons for this. First, the deadlines for results were very tight, and most groups had minimal time for experiments. As discussed earlier, the huge scale-up in the size of the document collection required major work from all groups in rebuilding their systems. Much of this work was simply a system engineering task: finding reasonable data structures to use, getting indexing routines to be efficient enough to finish indexing the data, finding enough storage to handle the large inverted files and other structures, etc.

The second reason these results are preliminary is that groups were working blindly as to what constitutes a relevant document. There were no reliable relevance judgments for training, and the use of the long topics was completely new. This means that results were heavily influenced by an almost random selection of what parts of the topic to use. Groups also had to make often primitive adjustments to basic algorithms in order to get results, with little evidence of how well these adjustments were working. The large scale of the whole evaluation precluded any tuning without some relevance judgments, and the relevance judgments that were provided were generally sparse and sometimes inaccurate. These problems particularly affected those systems that needed training for routing.

Many of the papers in the proceedings show some new results from work done in the short amount of time between the conference and the due date of the papers (less than 2 months). Some of the improvements are very significant, and the improvements seen in the TIPSTER results (where the results are a second try at this task) are large. It can be expected that the results seen at the second TREC conference will be much better, and also more indicative of how well a method works. Because these results are preliminary, they should be compared very carefully. Some very broad conclusions can be drawn, but no methods should be conclusively judged inferior or superior at this point.

5.2 Adhoc Results

The adhoc evaluation used new topics (51-100) against the two disks of documents (D1 + D2). There were 33 sets of results for adhoc evaluation in TREC, with 20 of them based on runs for the full data set. Of these, 13 used automatic construction of queries, 6 used manual construction, and 1 used feedback. Figure 5 shows the recall/precision curves for the three TREC-1 runs with the highest 11-point averages using automatic construction of queries. These curves were all based on the use of the Cornell SMART system, but with important variations. The "fuhrp1" results came from using the training data to find parameter weights (see Fuhr & Buckley paper), the "crnlp1" results came from doing local and global term weighting without training data (see Buckley, Salton & Allan paper), and the "siems1" results came from using term expansion with terms from "Wordnet" (see Voorhees & Hou paper).

[Figure 5. The Best Adhoc Results using Automatic Query Construction (runs fuhrp1, crnlp1, siems1).]

Figure 6 shows the recall/precision curves for the three TREC-1 runs with the highest 11-point averages using manual construction of queries. It should be noted that varying amounts of manual intervention were used, and this should be considered when comparing results. These curves show differences in that "clartb" and "gecrd2" have initially a high precision, but lose this precision as recall increases, whereas the "cnqst2" method has a lower initial precision, but higher precision at the higher recall levels. This may be a function of the very different methods being used.
The "clartb" system adds noun phrases found in likely relevant documents to improve the query terms taken from the topic (see Evans paper), whereas the "cnqse" system uses more general thesaurus entries to expand the query (see Nelson paper). The `3gecrd2" system uses a totally different approach of constructing elaborate Boolean pattern matchers (see Jacobs, Krupka & Rau paper). Best Manual Acihoc 0.00 0.10 0.20 0.30 0.40 0.50 0.60 0.70 0.80 0.90 1.00 Recall 0.9 0.8 0.7 0.6 0 ~ 0.5 0.4 0.3 0.2 0.1 0 cnqs~ clartb gecrd2 Figure 6. The Best Adhoc Results using Manual Query Construction. 14 It is useful to contrast the three methods of query construction. Figure 7 shows a comparison of five sets of results, two from automatic query construction, two using manual query construction, and the one relevance feedback run. It should be noted that there is relatively little difference between the results from automatic query construction versus manual query construction, although the relevance feedback results (citym2) were poor in this case. Figure 8 shows a histogram of the same information, but for all adhoc systems working with the full data set. In general it shows that the automatic query construction seems to work well for many systems, and that certainly it can be concluded that for ThEC-l the automatic construction of queries was as effective as the manual construction. Adhoc Manual vs Automatic 0.9 0.8 A 0.7 0.6 0 C,, 0.5 a) ~ 0.4 0.3 0.2 0.1 0 0.00 0.10 0.20 0.30 0.40 0.60 0.70 0.80 0.90 1.00 0.50 Recall fflhrpl crnlpl cnqst2 clarib citym2 Figure 7. A Comparison of Adhoc Results using Different Query Construction Methods. 0.25 Adlioc -- All Category A Systems ~> 0.15 .~ 0.1 0 0.05 "/X/;;/',;~ 0 Ilihi U Automatic U Manual Figure 8. A Comparison of Adhoc Results using Automatic and Manual Query Construction. 15 Figure 9 shows the comparison of automatic and manual query construcLion on a per topic basis. It is interest- ing to note that, for the two systems shown, most topics show equal performance in terms of the percentage of relevant documents retrieved by 100 documents. Some topics, like topic 51, show much better manual perfor- mance, whereas other topics, like topic 69 show better automatic performance. This is somewhat different results from earlier comparisons of Boolean systems (usually manual indexing and manual query construction) versus the automatic systems such as the SMART system. In the Medlars study (Salton 1969) the manual (Boolean) systems seemed to do either very well or very poorly, whereas the automatic systems produced con- sistent t1medium" reSults. The difference in the WEC task is likely that the topics are very long arid complex, and sometimes are easy to express manually, but sometimes very difficult, whereas the automatic construction is hampered by the existence of difficult narratives. This is only a hypothesis and needs further investigation. 1 0.9 0.8 0.7 ~ 0.6 ~ 0.5 ~ 0.4 ~ 0.3 0.2 Adhoc Manual vs Automatic by Topic Ii Iih1L Jj II. hL 0.1 0 - I IL51K L ________ ___ ____________ Hz -~ -~ -~ -~ -~-~-,- . -~ -~ -~ - 1111 ilili JIj~~ A i hullil 51 53 55 57 59 61 63 65 67 69 71 73 75 77 79 81 83 85 87 89 91 93 95 97 99 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 ~ fuhrpl U cnqst2 Figure 9. Adhoc Results using Automatic and Manual Query Construction. 
There were also some Category B results for adhoc, and the best of these are shown in Figure 10, with results from the Cornell system run as a Category B run to provide some comparison. There is a wide spread in the curves here, with widely differing systems being shown. The "pircs4" results represent a very successful relevance feedback method, with "pircs1" being an automatic query construction run using the same system (see Kwok, Papadopoulos & Kwan paper). The "nyuir1" results come from a system using natural language techniques (see Strzalkowski paper).

[Figure 10. Adhoc Results for Category B (runs crnlB, pircs1, pircs4, nyuir1).]

5.3 Routing Results

There were 22 sets of results for routing evaluation, with 16 of them based on runs for the full data set. Note that all routing techniques suffered from the lack of sufficient and accurate training data, and therefore these results are even more preliminary than the adhoc results. Of the 16 systems using the full data set, 8 used automatic construction of queries, and 8 used manual construction.

Figure 11 shows the recall/precision curves for the TREC-1 runs with the highest 11-point averages using automatic construction of queries. Two of the curves, based on the use of the Cornell SMART system, show very different results. The "fuhra2" results came from using probabilistically-based relevance feedback (see Fuhr & Buckley paper), whereas the "crnla2" results came from doing traditional relevance feedback using the vector space model (see Buckley, Salton & Allan paper). The "cityr1" results also came from using traditional relevance feedback, but with a different probabilistic model and term weighting (see Robertson, Walker, Hancock-Beaulieu, Gull & Lau paper). The "cpgcn2" system used filtering methods rather than more traditional information retrieval methods to achieve results similar to the feedback results (see Jones, Leung & Pape paper).

[Figure 11. The Best Routing Results using Automatic Query Construction (runs fuhra2, crnla2, cityr1, cpgcn2).]

Figure 12 shows the recall/precision curves for the TREC-1 runs with the highest 11-point averages using manual construction of queries. The systems used manually-built filters, with the "clartb" and "gecrd2" results done similarly to their corresponding adhoc systems, but using the sample relevant documents as input to the filter-building process. The "paraz1" system used manually-constructed filters based on clusters of interesting terms (see Zimmerman paper). The "cpghc2" group hand-crafted these queries as a contrast to their automatic pattern filtering methods (see Jones, Leung & Pape paper).

[Figure 12. The Best Routing Results using Manual Query Construction.]

Again it is useful to contrast the methods of query construction. Figure 13 shows a comparison of four sets of results, two from automatic query construction and two using manual query construction. Here, unlike the adhoc results, the automatic query building seems to be clearly superior, with the "fuhr1" results having higher performance throughout the significant part of the recall/precision curve.

[Figure 13. A Comparison of Routing Results using Different Query Construction Methods.]
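Several of the automatic routing runs discussed above build their queries by relevance feedback from the training judgments. As a generic point of reference only, the following sketch shows a Rocchio-style feedback update in the vector space model; it is an illustration of that classical technique under assumed term-vector structures, not the specific formulation used by any of the systems named above.

```python
def rocchio_update(query_vec, relevant_vecs, nonrelevant_vecs,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Generic Rocchio feedback: move the query toward the relevant training
    documents and away from the nonrelevant ones.

    All vectors are dicts mapping term -> weight (hypothetical structure);
    alpha, beta, gamma are the usual illustrative mixing coefficients."""
    new_query = {t: alpha * w for t, w in query_vec.items()}
    for vecs, sign, coef in ((relevant_vecs, +1, beta), (nonrelevant_vecs, -1, gamma)):
        if not vecs:
            continue
        for vec in vecs:
            for t, w in vec.items():
                new_query[t] = new_query.get(t, 0.0) + sign * coef * w / len(vecs)
    # negative weights are usually dropped from the expanded query
    return {t: w for t, w in new_query.items() if w > 0}
```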
There were also some Category B results for routing, and the best of these are shown in Figure 14. Again there is a much wider spread in the curves here, with widely differing systems being shown. The "pircs1" and "pircs2" results are, respectively, automatically and manually constructed queries using relevance feedback learning from the training sample (see Kwok, Papadopoulos & Kwan paper). The "fairs1" system uses a combination of different term weighting methods (see Chang, Dediu, Assam & Du paper). The "nyuir1" results come from a system using natural language techniques and probably reflect the short amount of time available to construct this very complicated system (see Strzalkowski paper).

[Figure 14. Routing Results for Category B (runs pircs1, pircs2, fairs1, nyuir1).]

5.4 Summary

The TREC-1 conference demonstrated a wide range of different approaches to the retrieval of text from large document collections. Because of the preliminary nature of the results, very little can be said about which techniques seem to perform the best. It was clear that the simple systems did the task well, but it is too early to pass judgment on the more complicated systems. The automatic construction of queries from the topics did as well as, or better than, manual construction of queries, and this is encouraging for groups supporting the use of simple natural language interfaces for retrieval systems. There will be a second TREC conference in 1993, and all the systems that participated in TREC-1 will be back, along with additional groups. The results from this second conference should better identify the more promising retrieval techniques.

REFERENCES

Cleverdon C.W. (1962). Report on the Testing and Analysis of an Investigation into the Comparative Efficiency of Indexing Systems. College of Aeronautics, Cranfield, England.

Cleverdon C.W., Mills J. and Keen E.M. (1966). Factors Determining the Performance of Indexing Systems, Vol. 1: Design, Vol. 2: Test Results. Aslib Cranfield Research Project, Cranfield, England.

Harman D. (1993). The DARPA TIPSTER Project. SIGIR Forum, 26(2), 26-28.

Katzer J., McGill M.J., Tessier J.A., Frakes W., and DasGupta P. (1982). A Study of the Overlap among Document Representations. Information Technology: Research and Development, 1(2), 261-274.

Fox E. (1983). Characteristics of Two New Experimental Collections in Computer and Information Science Containing Textual and Bibliographic Concepts. Technical Report TR 83-561, Cornell University: Computing Science Department.

Salton G. (1969). A Comparison Between Manual and Automatic Indexing Methods. American Documentation, 20(1).

Salton G. and McGill M. (1983). Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill.

Sparck Jones K. (1981). Information Retrieval Experiment. London, England: Butterworths.

Sparck Jones K. and van Rijsbergen C. (1975). Report on the Need for and Provision of an "Ideal" Information Retrieval Test Collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge.

Sparck Jones K. and Webster (1979).
Research in Relevance Weighting. British Library Research and Development Report 5553, Computer Laboratory, University of Cambridge.

Swets J. (1969). Effectiveness of Information Retrieval Methods. American Documentation, 20(1).