BACKGROUND NOTE TO:
-------------------
TREC-7 SPOKEN DOCUMENT RETRIEVAL (SDR) TRACK SPECIFICATION 1
-------------------------------------------------------------

Karen Sparck Jones, John Garofolo, Ellen Voorhees

28 April 1998


SUMMARY OF TREC-6
-----------------

The TREC-6 SDR evaluation met its goals in bringing the information retrieval (IR) and speech recognition (SR) communities together, and in debugging the logistics for an SDR evaluation. Further details can be obtained from the track specification (on the TREC home page), the TREC-6 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.

The experiments used the DARPA CSR Broadcast News data as documents, taking the HUB-4 1996 50-hour training data as the training set and the further HUB-4 1997 50-hour training data as the test set. There were approximately 1500 documents - news stories - in each set.

The queries were `known-item' ones, topic specifications designed to recover a previously-seen but only roughly-remembered document. This type of query simplifies data provision for evaluation since there is no need for post-search assessment of retrieved documents for relevance to query subject content, as with the normal adhoc search for documents on some topic: the SDR task was simply to find the required document, ideally delivering it in top position in the ranked search output. There were 5 training topics and 49 test ones. The performance measures used were matched to this specific task, and were not the usual adhoc retrieval ones.

Full details are given in the cited publications, including those of the test design, which was formulated to cover a range of comparisons: across speech processing strategies, across retrieval strategies, and between performance for documents as recognised and performance for their correct (reference transcription) forms.

One important feature of the TREC-6 experiment was the so-called `Baseline': recogniser output from a single system (kindly supplied by IBM), provided for common use. This was both exploited by individual teams who had no recogniser of their own and used in obligatory retrieval runs for all participants. The results were of interest in that the Baseline was neutral, i.e. not geared to the application, but could give good performance, suggesting that working with an `off-the-shelf' recogniser can be worthwhile.

The aims for TREC-7 are to improve on TREC-6's good points and overcome its bad ones. TREC-6 SDR suggested that SDR performance could approach that for text retrieval. However, while SDR performance was good, the TREC-6 tests are only indicative because the document data set was very small, known-item searching is a limited task, and its particular instantiation for the test material was too easy, since the required documents could be delivered at top rank with little effort.


TREC-7 SDR GOALS
----------------

The specific goals in TREC-7 are to evaluate SDR for

a) the usual type of adhoc topic query
b) a larger document file.

The design and style for TREC-7 are like those for TREC-6. As before, there are two modes of participation: SDR for those with speech recognisers and Q(uasi)SDR for those without. QSDR is offered for those working in retrieval who wish to test their retrieval ideas on real recogniser output provided for them. Those in the speech community without retrieval engines of their own may use any available system, including publicly-available ones such as the NIST ZPRISE system.

The new query type implies a change in the evaluation measures used, to those which are standard for TREC adhoc retrieval evaluation, based on Precision and Recall.
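To make the contrast between the two types of measure concrete, the sketch below shows, in Python, the kinds of scores involved: a reciprocal-rank style score for a known-item search (as in TREC-6, where there is a single target document), and Precision, Recall and non-interpolated average precision for an adhoc search (as in TREC-7, where a topic may have many relevant documents). The sketch is purely illustrative; the function names are ours, and the measures actually reported are those defined in the track specification and the cited publications.

    # Illustrative scoring sketch only; the official measures are those
    # defined for the track, not this code.

    def reciprocal_rank(ranking, target_doc):
        """Known-item style score: 1/rank of the single target document,
        or 0 if it is not retrieved at all."""
        for rank, doc in enumerate(ranking, start=1):
            if doc == target_doc:
                return 1.0 / rank
        return 0.0

    def precision_recall_at(ranking, relevant, k):
        """Adhoc style scores at a rank cutoff k."""
        hits = sum(1 for doc in ranking[:k] if doc in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def average_precision(ranking, relevant):
        """Non-interpolated average precision: the mean, over all relevant
        documents, of the precision at the rank at which each relevant
        document is retrieved (unretrieved relevant documents contribute 0)."""
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0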
DATA TERMINOLOGY NOTE
---------------------

The experimental paradigms for IR and SR are different. Combining them leads to an evaluation paradigm which may not be the most obvious from just the IR or the SR point of view.

Thus for *documents*, in IR with adhoc queries the same set of documents can be (and normally is) used for both engine training and engine testing. The crucial test data is provided by new *queries*. For SR, on the other hand, there are no queries and the crucial distinction is between the data on which the engine is trained and that on which it is tested. In principle the SR training data can be anything; in practice, given the cost of providing the reference transcripts needed for training, and also an interest in experimental control for formal evaluations, specific data sets may be designated as training data. However, since (in general) more data makes for better recognition, it is normal to exploit any appropriate available data for acoustic modelling (AM) and for language modelling (LM): LM needs only text, but welcomes lots of it. In addition, in SR, where formal evaluations are in question, and where there is no reason to suppose any training materials - whether specifically designated or whatever has been used for modelling - are very like the test data, it has become common to have so-called `development data', providing a sample-type preview of the test data.

A particular point of note is that, as naturally follows from the differences between the IR and SR cases just mentioned, it is possible to reuse IR test document sets over time, with different query sets and hence without `contaminating' the test data by having used it for training; but in the SR case it is customary in practice to roll each successive data set into the training data, so it cannot (conveniently) be extracted and used again for test purposes. It should also be noted that test (or `evaluation') data for SR can be quite modest in scale, as in HUB-4, but that test data for IR has to be quite large, so that results are statistically valid and tests are manifestly pertinent to (typically huge) operational systems.

For the TREC SDR evaluations we therefore work as follows. We have

Training data
-------------

1) a defined document training set (with whatever generic properties are suitable for IR, e.g. document similarity and difference, and for SR, e.g. speech conditions)

2) a defined query training set appropriate to (1) in topics and in having relevant documents in (1)

Test data
---------

3) a separate defined document test set of the same type of material as the training set, e.g. they are both Broadcast News, though not necessarily sharing the same time frame

4) a separate query test set, similar in type to (2) and also appropriate to (3), especially in having relevant documents in (3).
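Purely as a schematic summary of this arrangement (the names below are illustrative and not part of the specification), the four sets and the constraints relating them can be pictured as:

    # Hypothetical sketch of the SDR data arrangement; names are illustrative.

    training_data = {
        "documents": "defined document training set (1)",
        "queries":   "query training set (2): topics with relevant documents in (1)",
    }

    test_data = {
        "documents": "separate document test set (3): same type of material as (1)",
        "queries":   "separate query test set (4): topics with relevant documents in (3)",
    }

    def valid_split(training_doc_ids, test_doc_ids):
        """The training and test document sets are distinct; the exclusion
        conditions discussed next extend this idea to any additional
        modelling data."""
        return training_doc_ids.isdisjoint(test_doc_ids)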
We also define conditions on any additional material that may be used for SR modelling (AM or LM) - call this `modelling data' to distinguish it from the training data just summarised, though it is of course used for system training. These conditions are typically exclusion conditions, motivated by practical definitions, to ensure that the modelling data does not accidentally contain the test data, though the modelling data may be similar to the test data. Further, since for IR additional document (or query) material may also be helpful for system training, we impose the same exclusion conditions on IR modelling material as on SR modelling data (it may or may not be useful to use the same data for both IR and SR modelling purposes).


TREC-7 MAIN DATA
----------------

The specific data choices for TREC-7 SDR (aka SDR98) are laid out in the track specification. It should be noted, however, that it has proved impossible to obtain a test file for TREC-7 satisfying both the size desiderata for IR (> 5K documents, minimum) and the preprocessing desiderata for SR (supply of transcriptions) within the tight TREC-7 time frame. Thus while the DARPA Topic Detection and Tracking (TDT) data should be suitable for, and will be available for, any future SDR evaluations, it could not be supplied for TREC-7. After intensive discussion, the TREC-7 SDR training and test document sets have been defined as follows.

The documents are stories taken from the Linguistic Data Consortium (LDC) Broadcast News (BN) corpora, which are also used for the DARPA CSR evaluations. The training set for the SDR track consists of about 100 hours (actually a little less, after the removal of commercials). This is the 100 hours used as the training material for the 1997 DARPA HUB-4 CSR evaluation, and it is also the combination of the two 50-hour sets used for training and test respectively in TREC-6 SDR. (It is sometimes referred to as the first 100 hours.) The test set will consist of the (second) 100 hours prepared for the 1998 CSR evaluation. The training material was recorded in 1996, the test material from the middle of 1997 to early 1998. There are about 3000 stories in the training data, and it is expected that there will be a similar number in the test data. A story, i.e. document, is generally defined as a continuous stretch of news material with the same content or theme.

The topics are short sentences: 5 for training (with accompanying relevant document information) and 25 for test, for which relevance assessments will be provided, in normal TREC style, by judging pooled output from submitted searches.
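For reference, pooled assessment works roughly as sketched below: the top-ranked documents from each submitted run for a topic are merged into a pool, which the assessors then judge; documents outside the pool are treated as not relevant when the runs are scored. The pool depth shown is an assumed value for illustration only; the actual pooling procedure and depth are those applied by NIST.

    # Illustrative depth-k pooling sketch; the real procedure and pool depth
    # are those used by NIST for the track.

    def build_pool(submitted_runs, depth=100):   # depth is an assumed value
        """Form the set of documents to be judged for one topic by taking
        the top `depth` documents from each submitted ranked run."""
        pool = set()
        for ranking in submitted_runs:           # each run: doc ids, best first
            pool.update(ranking[:depth])
        return pool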