Overview of the Second Text REtrieval Conference (TREC-2)

Donna Harman
National Institute of Standards and Technology
Gaithersburg, MD 20899

1. Introduction

In November of 1992 the first Text REtrieval Conference (TREC-1) was held at NIST [Harman 1993]. The conference, co-sponsored by ARPA and NIST, brought together information retrieval researchers to discuss their system results on a new large test collection (the TIPSTER collection). This was the first time that such groups had ever compared results on the same data using the same evaluation methods, and it represented a breakthrough in cross-system evaluation in information retrieval. It was also the first time that most of these groups had used such a large test collection, and it therefore required a major effort by all groups to scale up their retrieval techniques.

The overall goal of the TREC initiative is to encourage research in information retrieval using large-scale test collections. It is hoped that by providing a very large test collection, and encouraging interaction with other groups in a friendly evaluation forum, new momentum in information retrieval will be generated. Because of the NIST involvement, groups with commercial retrieval products have participated in TREC, leading to increased technological transfer between the research labs and the commercial products. TREC has also provided a state-of-the-art showcase of retrieval methods for ARPA clients.

Whereas the TREC-1 conference demonstrated a wide range of different approaches to the retrieval of text from large document collections, the results should be viewed as very preliminary. Not only were the deadlines for results very tight, but the huge increase in the size of the document collection required significant system rebuilding by most groups. Much of this work was a system engineering task: finding reasonable data structures to use, getting indexing routines to be efficient enough to index all the data, finding enough storage to handle the large inverted files and other structures, etc. Still, the results showed that the systems did the task well, and that automatic construction of queries from the topics did as well as, or better than, manual construction of queries.

The second TREC conference (TREC-2) occurred in August of 1993, less than 10 months after the first conference. In addition to 22 of the TREC-1 groups, nine new groups took part, bringing the total number of participating groups to 31. Many of the original TREC-1 groups were able to "complete" their system rebuilding and tuning, and in general the TREC-2 results show significant improvements over the TREC-1 results.

This paper provides an overview of the TREC-2 conference, including a review of the TREC task, a brief description of the test collection being used, and an overview of the results. The papers from the individual groups should be referred to for more details on specific system approaches.

2. The TREC Task

2.1 Introduction

TREC is designed to encourage research in information retrieval using large data collections. Two types of retrieval are being examined -- retrieval using an "adhoc" query such as a researcher might use in a library environment, and retrieval using a "routing" query such as a profile to filter some incoming document stream. The TREC task is not tied to any given application, and is not primarily concerned with interfaces or optimized response time for searching.
However, it is helpful to have some potential user in mind when designing or testing a retrieval system. The model for a user in TREC is a dedicated searcher, not a novice searcher, and the model for the application is one needing monitoring of data streams for information on specific topics (routing), and the ability to do adhoc searches on archived data for new topics. It should be assumed that the users need the ability to do both high precision and high recall searches, and are willing to look at many documents and repeatedly modify queries in order to get high recall. Obviously they would like a system that makes this as easy as possible, but this ease should be reflected in TREC as added intelligence in the system rather than as special interfaces.

Since TREC has been designed to evaluate system performance both in a routing (filtering or profiling) mode and in an adhoc mode, both functions need to be tested.

[Figure 1. The TREC Task: training topics (1-100), test topics (101-150), the query sets Q1, Q2, and Q3 built from them, the training documents (disks 1 and 2), and the test documents (disk 3).]

The test design was based on traditional information retrieval models, and evaluation used traditional recall and precision measures. The diagram of the test design (fig. 1) shows the various components of TREC. It reflects the four data sets (two sets of topics and two sets of documents) that were provided to participants. These data sets (along with a set of sample relevance judgments for the 100 training topics) were used to construct three sets of queries. Q1 is the set of queries (probably multiple sets) created to help in adjusting a system to this task, to create better weighting algorithms, and in general to train the system for testing. The results of this research were used to create Q2, the routing queries to be used against the test documents. Q3 is the set of queries created from the test topics as adhoc queries for searching against the training documents. The results from searches using Q2 and Q3 were the official test results sent to NIST.

2.2 Specific Task Guidelines

Because the TREC participants used a wide variety of indexing/knowledge base building techniques, and a wide variety of approaches to generate search queries, it was important to establish clear guidelines for the evaluation task. The guidelines deal with the methods of indexing/knowledge base construction, and with the methods of generating the queries from the supplied topics. In general, they were constructed to reflect an actual operational environment, and to allow as fair as possible a separation among the diverse query construction approaches.

There were guidelines for constructing and manipulating the system data structures. These structures were defined to consist of the original documents, any new structures built automatically from the documents (such as inverted files, thesauri, conceptual networks, etc.), and any new structures built manually from the documents (such as thesauri, synonym lists, knowledge bases, rules, etc.). The following guidelines were developed for the TREC task.

1. System data structures should be built using the initial training set (documents from disks 1 and 2, training topics 1-100, and the relevance judgments). They may be modified based on the test documents from disk 3, but not based on the test topics.

2. There are parts of the test collection, such as the Wall Street Journal and the Ziff material, that contain manually assigned controlled or uncontrolled index terms.
These fields are delimited by SGML tags, as specified in the documentation files included with the data. Since the primary focus is on retrieval and routing of naturally occurring text, these manually indexed terms should not be used.

3. Special care should be used in handling the routing task. In a true routing situation, a single document would be indexed and compared against the routing topics. Since the test documents are generally indexed as a complete set, routing should be simulated by not using any information based on the full set of test documents (such as weighting based on the test collection, total frequency based on the test collection, etc.) in the searching. It is permissible, however, to use training-set collection information.

Additionally there were guidelines for constructing the queries from the provided topics. These guidelines were considered of great importance for fair system comparison and were therefore carefully constructed. Three generic categories were defined, based on the amount and kind of manual intervention used.

1. AUTOMATIC (completely automatic initial query construction)

adhoc queries -- The system will automatically extract information from the topic to construct the query. The query will then be submitted to the system (with no manual modifications) and the results from the system will be the results submitted to NIST. There should be no manual intervention that would affect the results.

routing queries -- The queries should be constructed automatically using the training topics, the training relevance judgments and the training documents. The queries should then be submitted to NIST before the test documents are released and should not be modified after that point. The unmodified queries should be run against the test documents and the results submitted to NIST.

2. MANUAL (manual initial query construction)

adhoc queries -- The query is constructed in some manner from the topic, either manually or using machine assistance. Once the query has been constructed, it will be submitted to the system (with no manual intervention), and the results from the system will be the results submitted to NIST. There should be no manual intervention after initial query construction that would affect the results. (Manual intervention is covered by the category labelled FEEDBACK.)

routing queries -- The queries should be constructed in the same manner as the adhoc queries for MANUAL, but using the training topics, relevance judgments, and training documents. They should then be submitted to NIST before the test documents are released and should not be modified after that point. The unmodified queries should be run against the test documents and the results submitted to NIST.

3. FEEDBACK (automatic or manual query construction with feedback)

adhoc queries -- The initial query can be constructed using either AUTOMATIC or MANUAL methods. The query is submitted to the system, and a subset of the retrieved documents is used for manual feedback, i.e., a human makes judgments about the relevance of the documents in this subset. These judgments may be communicated to the system, which may automatically modify the query, or the human may simply choose to modify the query himself. At some point, feedback should end, and the query should be accepted as final. Systems that submit runs using this method must submit several different sets of results to allow tracking of the time/cost benefit of doing relevance feedback.

routing queries -- FEEDBACK cannot be used for routing queries, as routing systems have not supported feedback.
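None of the participants' query builders is reproduced in this paper, but a minimal sketch of what a run in the AUTOMATIC category might do is shown below. The field names and the tiny stopword list are illustrative assumptions, not part of any TREC-2 system.

```python
import re

# A tiny illustrative stopword list; real systems used much larger ones.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "is", "will", "be", "or", "to", "which"}


def automatic_query(topic_fields):
    """Build a bag-of-words query from a parsed topic with no manual intervention.

    `topic_fields` maps field names (e.g. "title", "desc", "con") to their text;
    this representation is an assumption about how a parsed topic might look.
    """
    terms = {}
    for text in topic_fields.values():
        for token in re.findall(r"[a-z]+", text.lower()):
            if token not in STOPWORDS:
                terms[token] = terms.get(token, 0) + 1
    # term -> within-topic frequency, left to the retrieval engine to weight
    return terms


if __name__ == "__main__":
    topic = {
        "title": "Natural Language Processing",
        "desc": "Document will identify a type of natural language processing "
                "technology which is being developed or marketed in the U.S.",
    }
    print(automatic_query(topic))
```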
2.3 The Participants

There were 31 participating systems in TREC-2, using a wide range of retrieval techniques. The participants were able to choose from three levels of participation: Category A, full participation; Category B, full participation using a reduced dataset (1/4 of the full document set); and Category C, evaluation only (to allow commercial systems to protect proprietary algorithms). The program committee selected only 20 category A and B groups to present talks because of limited conference time, and requested that the rest of the groups present posters. All groups were asked to submit papers for the proceedings.

Each group was provided the data and asked to turn in either one or two sets of results for each topic. When two sets of results were sent, they could be made using different methods of creating queries (AUTOMATIC, MANUAL, or FEEDBACK), or by using different parameter settings for one query creation method. Groups could choose to do the routing task, the adhoc task, or both, and were requested to submit the top 1000 documents retrieved for each topic for evaluation.

3. The Test Collection

3.1 Introduction

The creation of the test collection (called the TIPSTER collection) was critical to the success of TREC. Like most traditional retrieval collections, there are three distinct parts to this collection -- the documents, the queries or topics, and the relevance judgments or "right answers." These test collection components are discussed briefly in the rest of this section. For a more complete description of the collection, see [Harman 1994].

3.2 The Documents

The documents needed to mirror the different types of documents used in the theoretical TREC application. Specifically they had to have a varied length, a varied writing style, a varied level of editing and a varied vocabulary. As a final requirement, the documents had to cover different timeframes to show the effects of document date on the routing task. The documents were distributed as CD-ROMs with about 1 gigabyte of data each, compressed to fit. The following shows the actual contents of each disk.

Disk 1
* WSJ -- Wall Street Journal (1987, 1988, 1989)
* AP -- AP Newswire (1989)
* ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
* FR -- Federal Register (1989)
* DOE -- Short abstracts from DOE publications

Disk 2
* WSJ -- Wall Street Journal (1990, 1991, 1992)
* AP -- AP Newswire (1988)
* ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
* FR -- Federal Register (1988)

Disk 3
* SJMN -- San Jose Mercury News (1991)
* AP -- AP Newswire (1990)
* ZIFF -- Articles from Computer Select disks (Ziff-Davis Publishing)
* PAT -- U.S. Patents (1993)

The documents are uniformly formatted into an SGML-like structure, as can be seen in the following example.

<DOC>
<DOCNO> WSJ880406-0090 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> NEW YORK </DATELINE>
<TEXT>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad implications for computer and communications equipment markets.
AT&T said it is the first national long-distance carrier to announce prices for specific services under a world-wide standardization plan to upgrade phone networks. By announcing commercial services under the plan, which the industry calls the Integrated Services Digital Network, AT&T will influence evolving communications standards to its advantage, consultants said, just as International Business Machines Corp. has created de facto computer standards favoring its products.
</TEXT>
</DOC>
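A minimal sketch of how a stream of such SGML-like records might be split into individual documents is shown below. It assumes well-formed <DOC>, <DOCNO>, and <TEXT> fields as in the example above and is not a general SGML parser.

```python
import re

DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")
TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def read_trec_documents(raw):
    """Yield (docno, text) pairs from a concatenated TREC-style collection file."""
    for match in DOC_RE.finditer(raw):
        body = match.group(1)
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        yield (docno.group(1) if docno else None,
               text.group(1).strip() if text else body.strip())


if __name__ == "__main__":
    sample = ("<DOC>\n<DOCNO> WSJ880406-0090 </DOCNO>\n<TEXT>\n"
              "AT&T said it is the first national long-distance carrier ...\n"
              "</TEXT>\n</DOC>")
    for docno, text in read_trec_documents(sample):
        print(docno, text[:50])
```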
All documents have beginning and end markers, and a unique DOCNO id field. Additionally, other fields taken from the initial data appear, but these vary widely across the different sources.

The documents have differing amounts of errors, which were not checked or corrected. Not only would this have been an impossible task, but the errors in the data provide a better simulation of the TREC task. Errors in missing document separators or bad document numbers were screened out, although a few were missed and later reported as errors.

Table 1 shows some basic document collection statistics. Note that although the collection sizes are roughly equivalent in megabytes, there is a range of document lengths from very short documents (DOE) to very long (FR). Also the range of document lengths within a collection varies. For example, the documents from AP are similar in length (the median and the average length are very close), but the WSJ and ZIFF documents have a wider range of lengths. The documents from the Federal Register (FR) have a very wide range of lengths.

Table 1. Document Statistics
(Columns for disks 1 and 2 are WSJ, AP, ZIFF, FR, DOE; for disk 3 the corresponding collections are SJMN, AP, ZIFF, PAT.)

                                      WSJ/SJMN      AP     ZIFF   FR/PAT    DOE
Number of records          (disk 3)     90,257  78,325  161,021    6,711
Median terms per record    (disk 1)        182     353      181      313     82
                           (disk 2)        218     346      167      315
                           (disk 3)        279     358      119     2896
Average terms per record   (disk 1)        329     375      412     1017     89
                           (disk 2)        377     370      394     1073
                           (disk 3)        337     379      263     3543

3.3 The Topics

In designing the TREC task, there was a conscious decision made to provide "user need" statements rather than more traditional queries. Two major issues were involved in this decision. First there was a desire to allow a wide range of query construction methods by keeping the topic (the need statement) distinct from the query (the actual text submitted to the system). The second issue was the ability to increase the amount of information available about each topic, in particular to include with each topic a clear statement of what criteria make a document relevant.

The topics were designed to mimic a real user's need, and were written by people who are actual users of a retrieval system. Although the subject domain of the topics was diverse, some consideration was given to the documents to be searched. The topics were constructed by doing trial retrievals against a sample of the document set, and then those topics that had roughly 25 to 100 hits in that sample were used. This created a range of broader and narrower topics. The following is one of the topics used in TREC.

Tipster Topic Description
Number: 066
Domain: Science and Technology
Topic: Natural Language Processing

<desc> Description:
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.

<smry> Summary:
Document will identify a type of natural language processing technology which is being developed or marketed in the U.S.

<narr> Narrative:
A relevant document will identify a company or institution developing or marketing a natural language processing technology, identify the technology, and identify one or more features of the company's product.

<con> Concept(s):
1. natural language processing
2. translation, language, dictionary, font
3. software applications

<fac> Factor(s):
<nat> Nationality: U.S.
</fac>

<def> Definition(s):
</top>

Each topic is formatted in the same standard method to allow easier automatic construction of queries. Besides a beginning and an end marker, each topic has a number, a short title, a one-sentence description, and a summary sentence or two that can be used as a surrogate for the full topic (often very similar to the one-sentence description). There is a narrative section which is aimed at providing a complete description of document relevance for the assessors. Each topic also has a concepts section with a list of assorted concepts related to the topic. This section is designed to provide a mini-knowledge base about a topic such as a real searcher might possess. Additionally each topic can have a definitions section and/or a factors section. The definitions section has one or two of the definitions critical to a human understanding of the topic. The factors section is included to allow easier automatic query building by listing specific items from the narrative that constrain the documents that are relevant. Two particular factors were used in the TREC-2 topics: a time factor (current, before a given date, etc.) and a nationality factor (either involving only certain countries or excluding certain countries).

While the TREC topics did not present a problem in scaling, the challenge of either automatically constructing a query, or manually constructing a query with little foreknowledge of its searching capability, was a major challenge for TREC participants. In addition to filtering the relatively large amount of information provided in the topics into queries, the sometimes narrow definition of relevance as stated in the narrative was difficult for most systems to handle.

3.4 The Relevance Judgments

The relevance judgments are of critical importance to a test collection. For each topic it is necessary to compile a list of relevant documents, hopefully as comprehensive a list as possible. For the TREC task, three possible methods for finding the relevant documents could have been used. In the first method, full relevance judgments could have been made on over one million documents for each topic, resulting in over 100 million judgments. This was clearly impossible. As a second approach, a random sample of the documents could have been taken, with relevance judgments done on that sample only. The problem with this approach is that a random sample that is large enough to find on the order of 200 relevant documents per topic is a very large random sample, and is likely to result in insufficient relevance judgments. The third method, the one used in TREC, was to make relevance judgments on the sample of documents selected by the various participating systems. This method is known as the pooling method, and has been used successfully in creating other collections [Sparck Jones & van Rijsbergen 1975]. The sample was constructed by taking the top 100 documents retrieved by each system for a given topic and merging them into a pool for relevance assessment. This is a valid sampling method since all the systems used ranked retrieval methods, with those documents most likely to be relevant returned first.

Pooling proved to be an effective method. There was little overlap among the 31 systems in their retrieved documents, although considerably more overlap than in TREC-1.
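The pooling step described above is simple to state in code. The sketch below merges the top 100 documents of each submitted run for one topic into a single assessment pool; the size of that pool, relative to its maximum, is exactly the overlap statistic reported in Table 2. The run representation (a ranked list of document numbers per run tag) is an assumption.

```python
def build_pool(runs_for_topic, depth=100):
    """Merge the top `depth` documents of each run into a judgment pool.

    `runs_for_topic` maps a run tag to its ranked list of document numbers for
    one topic.  Duplicates across runs collapse into a single pool entry, which
    is why the pool is much smaller than (number of runs) * depth.
    """
    pool = set()
    for ranked_docs in runs_for_topic.values():
        pool.update(ranked_docs[:depth])
    return pool


if __name__ == "__main__":
    runs = {
        "runA": ["WSJ870101-0001", "AP890101-0002", "FR880101-0003"],
        "runB": ["AP890101-0002", "ZF07-123-456", "WSJ870101-0001"],
    }
    print(sorted(build_pool(runs, depth=2)))
```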
Table 2. Overlap of Submitted Results

                                                TREC-2              TREC-1
                                             Max    Actual       Max    Actual
Unique documents per topic
  (Adhoc, 40 runs, 23 groups)               4000    1106.0      3300   1278.86
Unique documents per topic
  (Routing, 40 runs, 24 groups)             4000    1465.6      2200   1066.86

Table 2 shows the overlap statistics. The first overlap statistics are for the adhoc topics (test topics against training documents, disks 1 and 2), and the second statistics are for the routing topics (training topics against test documents, disk 3 only). For example, out of a maximum of 4000 possible unique documents (40 runs times 100 documents), over one-fourth of the documents were actually unique. This means that the different systems were finding different documents as likely relevant documents for a topic. Whereas this might be expected (and indeed has been shown to occur [Katzer et al. 1982]) from widely differing systems, these overlaps were often between two runs for the same system. One reason for the lack of overlap is the very large number of documents that contain many of the same terms as the relevant documents, but the major reason is the very different sets of terms in the constructed queries. This lack of overlap should improve the coverage of the relevance set, and verifies the use of the pooling methodology to produce the sample.

The merged list of results was then shown to the human assessors. Each topic was judged by a single assessor to ensure the best consistency of judgment. Varying numbers of documents were judged relevant to the topics. For the TREC-2 adhoc topics (topics 101-150), the median number of relevant documents per topic is 201, down from 277 for topics 51-100 (as used for adhoc topics in TREC-1). Only 11 topics have more than 300 relevant documents, with only 2 topics having more than 500 relevant documents. These topics were deliberately made narrower than topics 51-100 because of a concern that topics with more than 300 relevant documents are likely to have incomplete relevance assessments.

4. Evaluation

An important element of TREC was to provide a common evaluation forum. Standard recall/precision and recall/fallout figures were calculated for each TREC system and these are presented in Appendix A. A chart with additional data about each system is shown in Appendix B. This chart consolidates information provided by the systems that describes features and system timing, and allows some primitive comparison of the amount of effort needed to produce the results.

4.1 Definition of Recall/Precision and Recall/Fallout Curves

Figure 2 shows typical recall/precision curves. The x axis plots the recall values at fixed recall levels, where

    recall = (number of relevant items retrieved) / (total number of relevant items in collection)

The y axis plots the average precision values at those given recall values, where precision is calculated by

    precision = (number of relevant items retrieved) / (total number of items retrieved)

These curves represent averages over the 50 topics. The averaging method was developed many years ago [Salton & McGill 1983] and is well accepted by the information retrieval community. The curves show system performance across the full range of retrieval, i.e., at the early stage of retrieval where the highly-ranked documents give high accuracy or precision, and at the final stage of retrieval where there is usually a low accuracy, but more complete retrieval. Note that the use of these curves assumes a ranked output from a system. Systems that provide an unranked set of documents are known to be less effective and therefore were not tested in the TREC program.
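Both measures reduce to simple counts over a ranked list; the sketch below computes them at a single rank cutoff, assuming the set of judged-relevant documents for the topic is known. (The official curves average such values over the 50 topics at fixed recall levels.)

```python
def recall_precision(ranked_docs, relevant, cutoff):
    """Return (recall, precision) after `cutoff` documents have been retrieved.

    `ranked_docs` is the ranked list of document numbers for one topic and
    `relevant` is the set of documents judged relevant for that topic.
    """
    retrieved = ranked_docs[:cutoff]
    rel_retrieved = sum(1 for doc in retrieved if doc in relevant)
    recall = rel_retrieved / len(relevant) if relevant else 0.0
    precision = rel_retrieved / len(retrieved) if retrieved else 0.0
    return recall, precision


if __name__ == "__main__":
    ranking = ["d3", "d7", "d1", "d9", "d4"]
    relevant = {"d1", "d3", "d8"}
    # After 5 documents: recall = 2/3, precision = 2/5.
    print(recall_precision(ranking, relevant, cutoff=5))
```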
The curves in figure 2 show that system A has a much higher precision at the low recall end of the graph and therefore is more accurate. System B, however, has higher precision at the high recall end of the curve and therefore will give a more complete set of relevant documents, assuming that the user is willing to look further in the ranked list.

A second set of curves was calculated using the recall/fallout measures, where recall is defined as before and fallout is defined as

    fallout = (number of nonrelevant items retrieved) / (total number of nonrelevant items in collection)

Note that recall has the same definition as the probability of detection and that fallout has the same definition as the probability of false alarm, so that the recall/fallout curves are also the ROC (Relative Operating Characteristic) curves used in signal processing. A sample set of curves corresponding to the recall/precision curves is shown in figure 3. These curves show the same order of performance as do the recall/precision curves and are provided as an alternative method of viewing the results. The present version of the curves is experimental, as the curve creation is particularly sensitive to scaling (what range is used for calculating fallout). The high precision section of the curves does not show well in figure 3; the high recall area dominates the curves.

Whereas the recall/precision curves show the retrieval system results as they might be seen by a user (since precision measures the accuracy of each retrieved document as it is retrieved), the recall/fallout curves emphasize the ability of these systems to screen out non-relevant material. In particular the fallout measure shows the discrimination powers of these systems on a large document collection. For example, system A has a fallout of 0.02 at a recall of about 0.48; this means that this system has found almost 50% of the relevant documents, while only retrieving 2% of the non-relevant documents.

[Figure 2. A Sample Recall/Precision Curve (systems A and B).]

[Figure 3. A Sample Recall/Fallout Curve (systems A and B).]

4.2 Single-Value Evaluation Measures

In addition to recall/precision and recall/fallout curves, there were two single-value measures used in TREC-2. The first measure, the non-interpolated average precision, corresponds to the area under an ideal (non-interpolated) recall/precision curve. To compute this average, a precision average for each topic is first calculated. This is done by computing the precision after every retrieved relevant document and then averaging these precisions over the total number of retrieved relevant documents for that topic. These topic averages are then combined (averaged) across all topics in the appropriate set to create the non-interpolated average precision for that set.

The second measure used is an average of the precision for each topic after 100 documents have been retrieved for that topic. This measure is useful because it reflects a clearly comprehended retrieval point. It took on added importance in the TREC environment because only the top 100 documents retrieved for each topic were actually assessed. For this reason it produces a guaranteed evaluation point for each system.
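As a concrete illustration, the two single-value measures can be computed for one topic as follows; this is a sketch of the definitions given above, not the official evaluation code.

```python
def average_precision(ranked_docs, relevant):
    """Non-interpolated average precision for one topic.

    Precision is computed after each relevant document is retrieved and the
    values are averaged over the number of relevant documents retrieved, as
    described above; the per-topic values are then averaged across topics.
    """
    precisions, relevant_seen = [], 0
    for rank, doc in enumerate(ranked_docs, start=1):
        if doc in relevant:
            relevant_seen += 1
            precisions.append(relevant_seen / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def precision_at(ranked_docs, relevant, k=100):
    """Precision after the first `k` documents have been retrieved."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k


if __name__ == "__main__":
    ranking = ["d2", "d5", "d1", "d9", "d7"]
    relevant = {"d1", "d2"}
    # Average precision: (1/1 + 2/3) / 2 = 0.833...; precision at 5: 2/5.
    print(average_precision(ranking, relevant), precision_at(ranking, relevant, k=5))
```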
4.3 Problems with Evaluation

Since this was the first time that such a large collection of text had been used in open system evaluation, there were some problems with the existing methods of evaluation. The major problem concerned a thresholding effect caused by the inability to evaluate ALL documents retrieved by a given system. For TREC-1 the groups were asked to send in only the top 200 documents retrieved by their systems. This artificial document cutoff is relatively low, and systems did not retrieve all the relevant documents for most topics within the cutoff. All documents retrieved beyond the 200 mark were considered nonrelevant by default, and therefore the recall/precision curves became inaccurate after about 40% recall on average. TREC-2 used the top 1000 documents for evaluation. Figure 4 shows the difference in the curves produced by various evaluation thresholds, including a curve for no threshold (similar to the way evaluation has been done on the smaller collections). These curves show that the use of a 1000-document cutoff has solved most of the thresholding problem.

[Figure 4. Effect of evaluation cutoffs (top 200, top 500, top 1000, full) on recall/precision curves.]

Two more issues in evaluation have become important. The first issue involves the need for more statistical evaluation. As will be seen in the results, the recall/precision curves are often close, and there is a need to check whether there are truly any statistically significant differences between two systems' results or two sets of results from the same system. This problem is currently under investigation in collaboration with statistical groups experienced in the evaluation of information retrieval systems.

Another issue involves getting beyond the averages to better understand system performance. Because of the huge number of documents and the long topics, it is very difficult to perform failure analysis on the results to better understand the retrieval processes being tested. Without better understanding of underlying system performance, it will be hard to consolidate research progress. Some preliminary analysis of per-topic performance is provided in section 6, and more attention will be given to this problem in the future.

5. Results

5.1 Introduction

In general the TREC-2 results showed significant improvements over the TREC-1 results. Many of the original TREC-1 groups were able to "complete" their system rebuilding and tuning tasks. The results for TREC-2 therefore can be viewed as the "best first-pass" that most groups can accomplish on this large amount of data. The adhoc results in particular represent baseline results from the scaling-up of current algorithms to large test collections. The better systems produced similar results, results that are comparable to those seen using these algorithms on smaller test collections.

The routing results showed even more improvement over TREC-1 routing results. Some of this improvement was due to the availability of large numbers of accurate relevance judgments for training (unlike TREC-1), but most of the improvements came from new research by participating groups into better ways of using the training data.
For full descriptions of each system discussed in this section, see the individual papers in this proceedings.

5.2 Adhoc Results

The adhoc evaluation used new topics (101-150) against the two disks of training documents (disks 1 and 2). There were 44 sets of results for adhoc evaluation in TREC-2, with 32 of them based on runs for the full data set. Of these, 23 used automatic construction of queries, 9 used manual construction, and 2 used feedback.

Figure 5 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using automatic construction of queries. The results marked "INQ001" are from the INQUERY system from the University of Massachusetts (see Croft, Callan & Broglio paper). This system uses probabilistic term weighting and a probabilistic inference net to combine various topic and document features. The results marked "dortQ2", "Brkly3" and "crnlL2" are all based on the use of the Cornell SMART system, but with important variations. The "crnlL2" run is the basic SMART system from Cornell University (see Buckley, Allan & Salton paper), but using less than optimal term weightings (by mistake). The "dortQ2" results from the University of Dortmund come from using polynomial regression on the training data to find weights for various pre-set term features (see Fuhr, Pfeifer, Bremkamp, Pollmann & Buckley paper). The "Brkly3" results from the University of California at Berkeley come from performing logistic regression analysis to learn optimal weighting for various term frequency measures (see Cooper, Chen & Gey paper). The "CLARTA" system from the CLARIT Corporation expands each topic with noun phrases found in a thesaurus that is automatically generated for each topic (see Evans & Lefferts paper). The "lsiasm" results are from Bellcore (see Dumais paper). This group uses latent semantic indexing to create much larger vectors than the more traditional vector-space models such as SMART. The run marked "lsiasm" represents only the base SMART pre-processing results, however. Due to processing errors the "improved" LSI run produced unexpectedly poor results.

Figure 6 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using manual construction of queries. It should be noted that varying amounts of manual intervention were used. The results marked "INQ002", "siems2", and "CLARTM" are automatically generated queries with manual modifications. The "INQ002" results reflect various manual modifications made to the "INQ001" queries, with those modifications guided by strict rules. The "siems2" results from Siemens Corporate Research, Inc. (see Voorhees paper) are based on the use of the Cornell SMART system, but with the topics manually modified (the "not" phrases removed). These results were meant to be the base run for improvements using WordNet, but the improvements did not materialize. The "CLARTM" results represent manual weighting of the query terms, as opposed to the automatic weighting of the terms that was used in "CLARTA". The results marked "VTcms2", "CnQst2", and "TOPIC2" are produced from queries constructed completely manually. The "VTcms2" results are from Virginia Tech (see Fox & Shaw paper) and show the effects of combining the results from SMART vector-space queries with the results from manually-constructed soft Boolean P-Norm type queries.
The "CnQs~" results, from ConQuest Software (see Nelson paper), use a very large general-purpose semantic net to ald in constructing better queries from the topics, along with sophisticated morphological analysis of the topics. The results marked 19T0P1C219 are from the TOPIC system by Verity Corp. (see Lehman paper) and reflect the use of an expert sys- tem working off specially~onstructed knowledge bases to improve performance. 10 Several comments can be made with respect to these adhoc results. First, the better results (most of the auto- matic results and the three top manual results) are very similar and it is unlikely that there is any statistical differ- ences between them. There is clearly no "best" method, and the fact that these systems have very different approaches to retrieval, including different term weighting schemes, different query construction methods, and differ- ent similarity match methods implies that there is much more to be learned about effective retrieval techniques. As will be seen in section 6, whereas the averages for the systems may be similar, the systems do better on different topics and retrieve different subsets of the relevant docu- ments. A second point that should be made is that the automatic query construction methods continue to perform as well as the manual construction methods. Two groups (the INQUERY system and the CLARIT system) did explicit comparision of manually-modified queries vs those that were not modified and concluded that manual modifica- tion provided no benefits. The three sets of results based on completely manually-generated queries had even poorer performance than the manually-modified queries. Note that this result is specific to the very rich TREC top- ics; it is not clear that this will hold for the short topics normally seen in other retrieval environments. As a final point, it should be noted that these adhoc results represent significant improvements over the results from TREC-1. Figure 7 shows a comparison of results for a typical system in TREC-1 and TREC-2. Some of this improvement is due to improved evaluation, but the differ- ence between the curve marked "TREC-1" and the curve marked "TREC-2 looking at top 200 ouly" shows signifi- cant performance improvement. Wbereas this improvement could represent a difference in topics (the TREC-l curve is for topics 51-100 and the TREC-2 curves are for topics 101-150), the TREC-2 topics are generally felt to be more difficult and therefore this improvement is likely to be an understatement of the actual improvements. Only two groups worked with less than the full document collection. Figure 9 shows the results for the one group with official TREC-2 category B results (the results from UCLA were received after the deadline). This figure shows the best results from New York University (see Strzalkowski & Carballo paper), compared with a cate- gory B version of the Cornell SMART results. The "nyuir3" results reflect a very intensive use of natural lan- guage processing (NLP) techniques, including a parse of the documents to help locate syntactic phrases, context- sensitive expansion of the queries, and other NLP improvements on statistical techniques. 0 1.00 0.80 0.60 0.40 0.20 Best Automatic Adlioc 0 0.00 0.00 0.20 0.40 0.60 0.80 1.00 Recall INQOOl * doitQ2 A Brkly3 CLARTA~ cnilL2 __ isiasm Best Manual Adlioc 1.00 0.80 0.60 0.40 0.20 0.00 0.00 0.20 0.40 Recall 0.60 0.80 1.00 __INQ002 * siems2 CLARTM Vtcms2 __CnQst2 TOPIC2 Figure 5. Best Automatic Acihoc Results. Figw~ 6. 
[Figure 7. Typical Improvements in Adhoc Results: TREC-1, TREC-2 looking at top 200 only, TREC-2.]

[Figure 8. Category B Adhoc Results: crnlVB, nyuir3.]

5.3 Routing Results

The routing evaluation used a subset of the training topics (topics 51-100 were used) against the new disk of test documents (disk 3). There were 40 sets of results for routing evaluation, with 32 of them based on runs for the full data set. Of the 32 systems using the full data set, 23 used automatic construction of queries and 9 used manual construction.

Figure 9 shows the recall/precision curves for the six TREC-2 groups with the highest non-interpolated average precision using automatic construction of the routing queries. Again three systems are based on the Cornell SMART system. The plot marked "crnlC1" is the actual SMART system, using the basic Rocchio relevance feedback algorithms and adding many terms (up to 500) from the relevant training documents to the terms in the topic. The "dortP1" results come from using a probabilistically-based relevance feedback instead of the vector-space algorithm, and adding only 20 terms from the relevant documents to each query. These two systems have the best routing results. The "Brkly5" system uses logistic regression on both the general frequency variables used in their adhoc approach and on the query-specific relevance data available for training with the routing topics. The results marked "cityr2" are from City University, London (see Robertson, Walker, Jones, Hancock-Beaulieu & Gatford paper). This group automatically selected variable numbers of terms (1-25) from the training documents for each topic (the topics themselves were not used as term sources), and then used traditional probabilistic reweighting to weight these terms. The "INQ003" results also use probabilistic reweighting, but use the topic terms, expanded by 30 new terms per topic from the training documents. The results marked "lsir2" are more latent semantic indexing results from Bellcore. This run was made by creating a filter of the singular-value decomposition vector sum, or centroid, of all relevant documents for a topic (and ignoring the topic itself).
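Cornell's and Dortmund's feedback formulas are described in their own papers; purely as an illustration of the general idea, a Rocchio-style expansion of a routing query from judged training documents might look like the sketch below. The vector representation, the alpha/beta/gamma weights, and the 20-term cap are assumptions, not the parameters of any TREC-2 run.

```python
from collections import Counter


def rocchio_expand(query, relevant_docs, nonrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.15, max_new_terms=20):
    """Expand a term-weight dictionary using judged training documents.

    `query` and each document are Counters mapping term -> weight.  This is a
    generic Rocchio sketch, not the exact formula used by any TREC-2 system.
    """
    expanded = Counter({term: alpha * w for term, w in query.items()})
    centroid = Counter()
    for doc in relevant_docs:
        for term, w in doc.items():
            centroid[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            centroid[term] -= gamma * w / len(nonrelevant_docs)
    # Reweight terms already in the query with their centroid contribution ...
    for term in query:
        expanded[term] += centroid.pop(term, 0.0)
    # ... and add only the strongest previously unseen terms, mimicking the
    # "add N terms from the relevant training documents" strategies.
    new_terms = sorted((t for t in centroid if centroid[t] > 0),
                       key=lambda t: -centroid[t])[:max_new_terms]
    for term in new_terms:
        expanded[term] = centroid[term]
    return {term: w for term, w in expanded.items() if w > 0}


if __name__ == "__main__":
    query = Counter({"language": 1.0, "processing": 1.0})
    rel = [Counter({"language": 2, "parser": 3}), Counter({"grammar": 2, "parser": 1})]
    nonrel = [Counter({"football": 4})]
    print(rocchio_expand(query, rel, nonrel))
```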
The "rutcombx" results from Rutgers University (see Belitin, Kantor, Cool & Quatrain paper) come from combining 5 sets of manually generated Boolean queries to optimize performance for each topic. The results marked "TOPIC2" are from the TOPIC system and reflect the use of an expert system working off specially-constructed knowledge bases to improve performance. As was the case with the adhoc topics, the automatic query construction methods continue to perform as well as, or m this case, better than the manual construction methods. A comparision of the two INQRY runs illus- trates this point and shows that all six results with manu- ally generated queries perform worse than the six runs with automatically-generated queries. The availability of the training data allows an automatic tuning of the queries that would be difficult to duplicate manually without extensive analysis. Unlike the adhoc results, there are two runs ("crnlCl" and "dor~1") that are clearly better than the others, with a sig- nificant difference between the "crnlCl" results and the "do~1" results and also significant differences between these results and the rest of the automatically-generated query results. In particular the Cornell group's ability to effectively use many terms (up to 500) for query expan- sion was one of the most interesting findings in I1~EC-2 and represents a departure from past results (see Buckley, Allan, & Salton paper for more on this). As a final point, it should be noted that the routing results also represent significant improvements over the results from ThEC-l. Figure 11 shows a comparison of results for a typical system in JREC- 1 and ThEC-2. Some of this improvement is due to the improved evaluation tech- niques, but the difference between the curve marked "ThEC-1" and the curve marked "TREC-2 looking at top 200 only" shows significant performance improvement. There is even more improvement for the routing results than for the adlioc results, due to better training data (mosfly non-existent for ThEC-1) and to major efforts by many groups in new routing algorithm experiments. Only four groups worked with less than the full document collection. Figure 12 shows the results for two of the groups m category B compared with a category B version of the Cornell SMART results. These curves show the results of runs from New York University (that were done in a similar method as that used for the adhoc results) and results from Dathousie University. 1.00 0.80 0.60 0.40 0.20 Best Automatic Routing 0.00 0.00 0.20 0.40 0.60 0.80 1.00 Recall cnilC 1 dortP 1 city12 INQOO3 Brkly5 lsir2 Best Manual Routing 1.00 0.80 0.60 0.40 0.20 0.00 0.00 0.20 0.40 0.60 0.80 1.00 Recall * INQ004 ~trw2 A gecrdl CLARTM~ nitcombx TOPIC2 Figure 9. Best Automatic Routmg Results. Figure 10. Best Manual Routing Results. 14 1.00 0.80 0.60 0.40 0.20 0.00 0 Petfonnance Improvements In Routing 0.0 0.2 0.4 0.6 0.8 1.0 Recall ~- TREC-1 * TREC-2 looking at top 200 only TREC-2 1.00 0.80 0.60 0.40 0.20 0.00 Routing Category B 0.00 0.20 0.40 0.60 0.80 1.00 Recall cmlRB * nyu~ DalTx2 Figure 11. Typical Improvements in Routing Results. Figure 12. Category B Routing Results. 15 6. Some Preliminary Analysis 6.1 Int~duction The recall~recision curves shown in section 5 represent the average performance of the various systems on the full sets of topics. It is important to look beyond these aver- ages in order to learn more about how a given system is performing and to discover some generalizable principles of retrieval. 
Individual systems are able to do this by performing failure analysis (see the Dumais paper in this proceedings for a good example) and by running specific experiments to test hypotheses on retrieval behavior within a given system. However, additional information can be gained by doing some cross-system comparison: information about specific system behavior and information about generalized information retrieval principles. One way to do this is to examine system behavior with respect to test collection characteristics. A second method is to compare system behavior on a topic-by-topic basis.

6.2 The Effects of Test Collection Characteristics

One particular test collection characteristic is the length of documents, both the average length of documents in a collection and the variation in document length across a collection. Document length has a significant effect on system performance. A term that appears 10 times in a "short" document is likely to be more important to that document than if the same term appeared 10 times in a "long" document. Table 3 shows system performance across the different document subcollections for each of the adhoc topics, listing the total number of documents that were retrieved by the system as well as the number of relevant documents that were retrieved.

Two particular points can be seen from table 3. First, the better systems retrieve about 50% relevant documents from all the subcollections except the Federal Register (FR). For this subcollection the retrieval rates are in the 25% range because the varied length of these documents makes retrieval difficult. The second point concerning table 3 is that the retrieval rate across the subcollections is highly varied among the systems. For example, the "Brkly3" results show that many fewer Federal Register documents and more AP documents were retrieved than for the INQUERY system, whereas the "CLARTA" results show more DOE abstracts and fewer Wall Street Journal documents being retrieved. These "biases" towards particular subcollections reflect the methods used by the systems, such as the length normalization issues, domain concentrations of terminology, and methods used to "merge" results across subcollections (often implicit merges during indexing).

A second test collection characteristic worth examining is the varied broadness and varied difficulty of the topics. An analysis was done [Harman 1994] to find the topics for which the systems retrieved the lowest percentage of the relevant documents on average. These topics are 61, 67, 76, 77, 81, 85, 90, 91, 93, and 98 for the routing topics and 101, 114, 120, 121, 124, 131, 139, 140, 141, and 149 for the adhoc topics. Tables 4 and 5 show the top 8 system runs for the individual topics based on the average precision (non-interpolated). These tables mix automatic, manual, and feedback results for category A, and also category B results, so they should be interpreted carefully. However they do demonstrate that no consistent patterns appear for the "hard" topics. The two best routing runs ("crnlC1" and "dortP1") only do well on about half of these topics, and the adhoc results are even more varied. Often systems that do not perform well on average are the top performing system for a given topic. This verifies that, as usual, the variation across the topics is greater than the variation across systems.

6.3 Cross-System Analysis

Tables 4 and 5 not only show the wide variation in system performance, but also raise several questions about system performance in general.

1. Does better average performance for a system result from better performance on most topics, or from comparable performance on most topics and significantly better performance on other topics?

2. If two systems perform similarly on a given topic, does that mean that they have retrieved a large proportion of the same relevant documents?

3. Do systems that use "similar" approaches have a high overlap in the particular relevant documents they retrieve?

4. And, if number 3 is not true, what are the issues that affect high overlap of relevant documents?

Work is ongoing at NIST on these questions and other related issues.
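Questions 2 and 3 can be probed directly from the submitted rankings and the relevance judgments; the sketch below compares the relevant documents found by two runs for a single topic (the run format is an assumption).

```python
def relevant_overlap(run_a, run_b, relevant, depth=100):
    """Compare the relevant documents retrieved by two runs for one topic.

    `run_a` and `run_b` are ranked lists of document numbers and `relevant`
    is the set of judged-relevant documents for the topic.
    """
    rel_a = set(run_a[:depth]) & relevant
    rel_b = set(run_b[:depth]) & relevant
    union = rel_a | rel_b
    return {
        "only_a": rel_a - rel_b,
        "only_b": rel_b - rel_a,
        "shared": rel_a & rel_b,
        "jaccard": len(rel_a & rel_b) / len(union) if union else 0.0,
    }


if __name__ == "__main__":
    relevant = {"d1", "d2", "d3", "d4"}
    result = relevant_overlap(["d1", "d9", "d2"], ["d2", "d3", "d8"], relevant, depth=3)
    print(sorted(result["shared"]), round(result["jaccard"], 2))
```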
Does better average performance for a system result fro[n better performance on most topics or from comparable performance on most topics and significantly better performance on other topics? 2. if two systems perform similarly on a given topic, does that mean that they have retrieved a large proportion of the same relevant documents? 3. Do systems that use "similar" approaches have a high overlap in the particular relevant documents they retrieve? 4. And, if number 3 is not true, what are the issues that affect high overlap of relevant documents? Work is ongoing at NIST on these questions and oth& related issues. 16 Table 3. Number of Documents Retrieved~~elevant by Document Subcollecfion RwiThg AP DOE FR WSJ ZJH~ Brkly3 2414/1155 293/155 97/22 1847/831 349/121 citril 2152/970 474/179 129/16 1865/770 380/177 citri2 2206/1019 348/156 250/67 1814/756 382/179 cityau 1578/794 93/46 1359/147 1603/661 367/108 citynif 1939/1226 24/21 584/146 2026/812 427/173 CLARTA 2034/1048 403/170 412/95 1795/819 356/190 CLARIM 2131/1087 388/208 315/87 1769/820 397/221 CnQstl 2272/921 254/112 311/74 1763/689 400/181 CnQs~ 2214/980 191/94 453/93 1787/738 355/184 crnlL2 1914/944 648/240 155/33 1774/773 509/213 crnlV2 2164/1083 687/236 79/22 1600/682 470/194 dortL2 2081/1053 305/169 473/78 1818/815 323/166 dor~Q2 2205/1053 357/171 186/44 1924/874 328/171 erirnal 1077/501 85/51 1364/110 2251/752 223/94 erirna2 1267/614 122/68 1219/125 2124/773 268/133 gecr(12 2250/852 294/91 319/68 1952/743 185/77 HNCad1 2140/1042 409/145 164/53 1875/839 412/181 HNCa~ 2163/1286 306/159 171/67 1974/1005 386/237 INQ001 2031/1071 206/107 297/115 2184/1023 282/151 INQ002 2087/1111 201/120 276/111 2141/1010 295/177 isial 2278/771 587/124 124/0 1448/376 563/61 isiasm 2168/1052 711/211 70/17 1607/690 444/183 nyuirl 0/0 0/0 0/0 5000/1360 0/0 nyuir2 0/0 0/0 0/0 5000/1547 0/0 nyuir3 0/0 0/0 0/0 5000/1547 0/0 pircs3 2109/1021 358/152 246/86 1999/835 288/139 pirc84 2108/1014 342/148 254/85 2012/863 284/137 proeol 1099/1024 315/83 1178/205 1377/980 1031/277 proeol 1667/1024 695/83 381/205 1350/980 907/277 rutcombl 1029/368 181/72 112/18 963/312 215/79 ruffined 945/309 131/46 161/9 963/282 200/77 schaul 2038/901 534/189 173/18 1778/706 477/186 siems2 2225/1147 631/218 62/8 1655/770 427/202 siems3 2238/1173 654/208 53/7 1619/764 436/194 TMC8 2054/859 146/44 763/59 1472/526 565/183 TM~ 1923/802 77/29 975/63 1401/507 624/171 TOP102 2292/9% 152/98 344/100 1762/889 384/229 UREKA2 385/215 0/0 4003/87 354/144 258/10 UREKA3 755/405 5/2 2654/67 1045/348 441/22 uicah 1612/628 234/104 797/137 1846/356 511/167 VTcms2 2110/1130 232/107 444/95 1859/894 355/169 totalS 71354/4630 12073/669 21407/396 793%/3929 15504/1154 17 Table 4. 
Table 4. System Rankings (using Average Precision) on Individual Topics (routing topics 51-100)

Topic   Top 8 runs by average precision
51      nyuir2   nyuir1   gecrd1   TOPIC2   ADS2     cityr2   INQ004   INQ003
52      INQ004   INQ003   Brkly4   pircs2   VTcms2   gecrd1   pircs1   trw1
53      gecrd1   nyuir2   trw2     nyuir1   CLARTM   CLARTA   dortP1   INQ003
54      siems1   crnlR1   schau1   Brkly4   INQ003   crnlC1   lsir1    CLARTM
55      dortP1   crnlR1   crnlC1   lsir2    dortV1   CLARTM   cityr1   CLARTA
56      trw2     dortP1   dortV1   INQ003   INQ004   HNCrt1   c~1      crnlC1
57      INQ003   lsir2    INQ004   trw2     crnlC1   TMC6     VTcms2   crnlR1
58      nyuir2   nyuir1   rutcombx INQ003   lsir2    INQ004   gecrd1   Brkly5
59      trw2     Brkly5   gecrd1   lsir1    HNCrt1   VTcms2   HNCrt2   lsir2
60      dortP1   dortV1   rutcombx crnlR1   INQ004   crnlC1   INQ003   TOPIC2
61      TOPIC2   rutcombx Brkly4   idsra2   cityr2   lsir1    INQ004   Brkly5
62      crnlR1   crnlC1   dortP1   lsir1    CLARTA   Brkly4   CLARTM   Brkly5
63      dortV1   crnlC1   pircs2   crnlR1   pircs1   siems1   HNCrt1   dortP1
64      nyuir2   lsir2    INQ004   INQ003   Brkly5   crnlC1   crnlR1   cityr2
65      crnlC1   dortV1   dortP1   HNCrt1   crnlR1   trw2     HNCrt2   lsir1
66      pircs2   pircs1   dortP1   dortV1   crnlR1   crnlC1   siems1   INQ004
67      crnlR1   crnlC1   INQ004   nyuir2   dortP1   cityr2   lsir2    INQ003
68      Brkly5   crnlC1   cityr1   cityr2   trw2     INQ003   lsir2    CLARTA
69      erimr1   Brkly5   dortV1   cityr2   cityr1   erimr2   lsir1    Brkly4
70      TMC6     rutcombx VTcms2   HNCrt2   Brkly5   INQ004   cityr2
71      crnlR1   crnlC1   HNCrt2   siems1   CLARTM   HNCrt1   CLARTA   lsir2
72      crnlR1   crnlC1   dortP1   siems1   INQ003   Brkly5   INQ004   cityr1
73      INQ003   crnlR1   cityr2   INQ004   crnlC1   trw2     dortP1   dortV1
74      crnlR1   rutcombx crnlC1   CLARTA   Brkly5   dortP1   siems1   dortV1
75      crnlC1   ADS2     crnlR1   trw1     lsir2    dortP1   cityr2   nyuir2
76      trw2     cityr2   TOPIC2   TMC6     TMC7     crnlC1   crnlR1   INQ003
77      crnlR1   crnlC1   INQ003   CLARTM   dortV1   dortP1   INQ004   CLARTA
78      rutcombx TOPIC2   INQ004   CLARTM   INQ003   dortV1   pircs2   CLARTA
79      cityr2   crnlR1   crnlC1   INQ004   dortP1   gecrd1   lsir2    INQ003
80      trw1     crnlC1   crnlR1   cityr1   Brkly5   INQ003   INQ004   cityr2
81      gecrd1   TMC6     cityr2   trw2     VTcms2   HNCrt2   cityr1
82      CLARTM   CLARTA   trw2     Brkly5   pircs1   pircs2   dortV1   dortP1
83      TOPIC2   gecrd1   trw1     crnlC1   HNCrt1   crnlR1   cityr2   cityr1
84      dortP1   crnlC1   lsir2    gecrd1   crnlR1   dortV1   trw1     VTcms2
85      crnlR1   crnlC1   dortP1   Brkly5   trw2     nyuir2   dortV1   siems1
86      gecrd1   VTcms2   lsir2    lsir1    cityr1   crnlR1   cityr2   crnlC1
87      lsir2    gecrd1   cityr1   cityr2   HNCrt1   Brkly5   crnlC1   HNCrt2
88      crnlC1   cityr2   crnlR1   Brkly4   dortP1   lsir2    dortV1   Brkly5
89      trw2     nyuir1   TOPIC2   TMC6     HNCrt1   uicr1    HNCrt2   gecrd1
90      gecrd1   trw1     crnlC1   crnlR1   schau1   VTcms2   Brkly5   dortP1
91      trw1     INQ004   schau1   Brkly5   trw2     TOPIC2   HNCrt2   HNCrt1
92      gecrd1   crnlR1   lsir2    crnlC1   CLARTM   CLARTA   nyuir1   INQ003
93      INQ004   rutcombx INQ003   TMC6     trw1     Brkly5   TMC7     gecrd1
94      lsir2    crnlC1   cityr2   gecrd1   INQ004   CLARTM   trw2     cityr1
95      VTcms2   gecrd1   crnlC1   Brkly5   c~1      Brkly4   trw1     siems1
96      dortP1   TOPIC2   cityr1   dortV1   cityr2   lsir2    crnlC1   rutcombx
97      idsra2   HNCrt1   nyuir2   dortP1   HNCrt2   lsir2    crnlC1   TOPIC2
98      HNCrt1   HNCrt2   crnlC1   trw2     DalTx2   INQ004   crnlR1   dortP1
99      lsir2    crnlR1   dortP1   CLARTA   crnlC1   CLARTM   dortV1   cityr2
100     crnlC1   crnlR1   dortP1   lsir2    dortV1   CLARTA   CLARTM   lsir1
Table 5. System Rankings (using Average Precision) on Individual Topics (adhoc topics 101-150)

Topic   Top 8 runs by average precision
101     rutcomb1 VTcms2   crnlV2   INQ002   dortQ2   pircs3   Brkly3   CLARTM
102     crnlL2   crnlV2   VTcms2   siems3   dortL2   INQ002   siems2   CLARTM
103     siems3   siems2   schau1   citri1   crnlV2   lsiasm   HNCad2   HNCad1
104     dortQ2   CLARTM   CLARTA   pircs4   pircs3   dortL2   HNCad2   lsiasm
105     citri2   lsiasm   citri1   siems2   siems3   crnlV2   schau1   crnlL2
106     VTcms2   INQ002   INQ001   TOPIC2   pircs4   pircs3   CLARTM   do~2
107     CnQst1   CnQst2   rutcomb1 TOPIC2   VTcms2   INQ002   rutfined CLARTM
108     citri1   dortQ2   siems3   VTcms2   siems2   HNCad2   schau1   dortL2
109     dortL2   crnlL2   dortQ2   CLARTA   CLARTM   pircs3   crnlV2   pircs4
110     INQ002   INQ001   Brkly3   dortQ2   nyuir3   nyuir2   cityau   siems2
111     CLARTA   CLARTM   INQ001   dortQ2   Brkly3   siems2   siems3   pircs4
112     INQ002   INQ001   VTcms2   nyuir2   nyuir3   HNCad1   HNCad2   CnQst2
113     VTcms2   crnlL2   dortL2   crnlV2   nyuir1   siems2   CLARTM   INQ002
114     INQ002   cityau   VTcms2   INQ001   siems3   siems2   lsia1    TOPIC2
115     nyuir2   nyuir3   nyuir1   siems2   dortL2   crnlV2   siems3   crnlL2
116     VTcms2   CLARTA   HNCad2   HNCad1   siems3   siems2   CLARTM   Brkly3
117     citri2   citri1   dortQ2   INQ001   TMC8     lsiasm   gecrd2   schau1
118     nyuir2   nyuir3   nyuir1   TOPIC2   citymf   dortQ2   CLARTA   INQ001
119     nyuir1   nyuir2   nyuir3   INQ002   INQ001   dortQ2   citymf   VTcms2
120     citymf   nyuir2   nyuir3   nyuir1   CnQst2   CnQst1   VTcms2   eri~
121     TOPIC2   CLARTM   VTcms2   Brkly3   nyuir1   prceo1   INQ002   rutfined
122     siems2   siems3   INQ002   INQ001   dortQ2   Brkly3   CLARTM   crnlV2
123     nyuir1   nyuir2   nyuir3   CLARTA   INQ001   INQ002   CLARTM   pircs4
124     nyuir2   nyuir3   nyuir1   dortL2   dortQ2   INQ001   Brkly3   TMC9
125     crnlV2   Brkly3   crnlL2   CLARTM   siems3   CLARTA   pircs4   pircs3
126     siems3   crnlL2   siems2   Brkly3   crnlV2   INQ002   CLARTM   INQ001
127     cityau   Brkly3   CLARTA   HNCad2   INQ001   INQ002   siems2   siems3
128     VTcms2   CLARTA   siems3   siems2   CLARTM   TOPIC2   citri1   lsiasm
129     INQ001   INQ002   cityau   CLARTM   siems2   Brkly3   crnlL2   CLARTA
130     INQ002   INQ001   dortQ2   crnlL2   pircs4   CLAR~    dortL2   pircs3
131     TOPIC2   VTcms2   HNCad1   HNCad2   siems3   Brkly3   siems2   INQ002
132     dortL2   INQ001   INQ002   citri1   citri2   dortQ2   HNCad2   crnlL2
133     CnQst2   CnQst1   rutcomb1 pircs4   INQ002   pircs3   cityau   INQ001
134     c~2      dortL2   nyuir1   nyuir2   nyuir3   INQ002   INQ001   dortQ2
135     nyuir2   nyuir3   nyuir1   Brkly3   INQ001   INQ002   siems3   siems2
136     VTcms2   CnQst1   CnQst2   CLARTM   pircs4   CLARTA   dortQ2   TOPIC2
137     CLARTA   nyuir2   nyuir3   Brkly3   siems2   siems3   CLARTM   nyuir1
138     nyuir2   nyuir3   rutfined rutcomb1 nyuir1   schau1   gecrd2   citri1
139     nyuir2   nyuir3   nyuir1   VTcms2   dortL2   HNCad2   dortQ2   HNCad1
140     nyuir2   nyuir3   nyuir1   dortQ2   dortL2   INQ002   siems3   siems2
141     VTcms2   INQ002   CnQst2   INQ001   Brkly3   dortL2   dortQ2   CnQst1
142     dortQ2   siems2   crnlL2   VTcms2   siems3   CLARTM   crnlV2   Brkly3
143     INQ002   INQ001   siems2   siems3   crnlL2   crnlV2   nyuir2   nyuir3
144     VTcms2   Brkly3   citymf   crnlV2   siems3   lsiasm   siems2   HNCad2
145     crnlL2   crnlV2   dortL2   CLARTM   nyuir1   siems3   siems2   dortQ2
146     Brkly3   siems3   siems2   lsiasm   crnlV2   schau1   CLARTM   citri1
147     HNCad2   HNCad1   VTcms2   citri1   INQ002   INQ001   citymf   CLARTA
148     lsiasm   crnlL2   crnlV2   siems2   siems3   Brkly3   dortL2   dortQ2
149     nyuir1   CnQst2   TOPIC2   CnQst1   CLARTA   rutfined Brkly3   rutcomb1
150     crnlL2   dortQ2   CLAR~    siems3   INQ002   INQ001   crnlV2   siems2

6.4 Summary

The TREC-2 conference demonstrated a wide range of different approaches to the retrieval of text from large document collections. There was significant improvement in retrieval performance over that seen in TREC-1, especially in the routing task. The availability of large amounts of training data for routing allowed extensive experimentation in the best use of that data, and many different approaches were tried in TREC-2.
The automatic construction of queries from the topics continued to do as well as, or better than, manual construction of queries, and this is encouraging for groups supporting the use of simple natural language interfaces for retrieval systems.

How well is the TREC initiative meeting its goals? There is certainly increased research using a much larger collection than had previously been tested. This leads not only to discovering interesting research problems, but also to developing algorithms that are ripe for transfer into commercial systems. The conference itself provided the opportunity for this; there was open exchange between the research groups in universities and the research groups in commercial organizations, and this is a very critical part of technology transfer. There will be a third TREC conference in 1994, and all the systems that participated in TREC-2 will be back, along with additional groups.

7. References

Harman, D. (1993) (Ed.). The First Text REtrieval Conference (TREC-1). National Institute of Standards and Technology Special Publication 500-207.

Harman, D. (1994). Data Preparation. In: Merchant, R. (Ed.), The Proceedings of the TIPSTER Text Program - Phase I. San Mateo, California: Morgan Kaufmann Publishing Co., 1994.

Katzer, J., McGill, M.J., Tessier, J.A., Frakes, W., and DasGupta, P. (1982). A Study of the Overlap among Document Representations. Information Technology: Research and Development, 1(2), 261-274.

Salton, G. and McGill, M. (1983). Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill.

Sparck Jones, K. and van Rijsbergen, C. (1975). Report on the Need for and Provision of an "Ideal" Information Retrieval Test Collection. British Library Research and Development Report 5266, Computer Laboratory, University of Cambridge.