BACKGROUND NOTE TO:
-------------------
TREC-7 SPOKEN DOCUMENT RETRIEVAL (SDR) TRACK SPECIFICATION 1
-------------------------------------------------------------

Karen Sparck Jones, John Garofolo, Ellen Voorhees

28 April 1998


SUMMARY OF TREC-6
-----------------

The TREC-6 SDR evaluation met its goals in bringing the information retrieval (IR) and speech recognition (SR) communities together, and in debugging the logistics for an SDR evaluation. Further details can be obtained from the track specification (on the TREC home page), the TREC-6 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.

The experiments used the DARPA CSR Broadcast News data as documents, taking the HUB-4 1996 50-hour training data as the training set and the further HUB-4 1997 50-hour training data as the test set. There were approximately 1500 documents - news stories - in each set.

The queries were `known-item' ones, topic specifications designed to recover a previously-seen but only roughly-remembered document. This type of query simplifies data provision for evaluation since there is no need for post-search assessment of retrieved documents for relevance to query subject content, as with the normal adhoc search for documents on some topic: the SDR task was simply to find the required document, ideally delivering it in top position in the ranked search output. There were 5 training topics and 49 test ones. The performance measures used were matched to this specific task, and were not the usual adhoc retrieval ones.

Full details are given in the cited publications, including those of the test design, which was formulated to cover a range of comparisons: across speech processing strategies, across retrieval strategies, and between performance for documents as recognised and performance for their correct (reference transcription) forms.

One important feature of the TREC-6 experiment was the so-called `Baseline': recogniser output from a single system (kindly supplied by IBM), provided for common use. This was both exploited by individual teams who had no recogniser of their own and used in obligatory retrieval runs for all participants. The results were of interest in that the Baseline was neutral, i.e. not geared to the application, but could give good performance, suggesting that working with an `off-the-shelf' recogniser can be worthwhile.

The aims for TREC-7 are to improve on TREC-6's good points and overcome its bad ones. TREC-6 SDR suggested that SDR performance could approach that for text retrieval. However, while SDR performance was good, the TREC-6 tests are only indicative because the document data set was very small, known-item searching is a limited task, and its particular instantiation for the test material was too easy, since the required documents could be delivered at top rank with little effort.


TREC-7 SDR GOALS
----------------

The specific goals in TREC-7 are to evaluate SDR for

a) the usual type of adhoc topic query
b) a larger document file.

The design and style for TREC-7 are like those for TREC-6. As before, there are two modes of participation: SDR for those with speech recognisers and Q(uasi)SDR for those without. QSDR is offered for those working in retrieval who wish to test their retrieval ideas on real recogniser output provided for them. Those in the speech community without retrieval engines of their own may use any available system, including publicly-available ones such as the NIST ZPRISE system.

The new query type implies a change in the evaluation measures used, to those which are standard for TREC adhoc retrieval evaluation, based on Precision and Recall.
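To make the contrast between the two types of measure concrete, the sketch below shows, in Python, the kinds of scores involved: a reciprocal-rank style score for a known-item search (as in TREC-6, where there is a single target document), and Precision, Recall and non-interpolated average precision for an adhoc search (as in TREC-7, where a topic may have many relevant documents). The sketch is purely illustrative; the function names are ours, and the measures actually reported are those defined in the track specification and the cited publications.

    # Illustrative scoring sketch only; the official measures are those
    # defined for the track, not this code.

    def reciprocal_rank(ranking, target_doc):
        """Known-item style score: 1/rank of the single target document,
        or 0 if it is not retrieved at all."""
        for rank, doc in enumerate(ranking, start=1):
            if doc == target_doc:
                return 1.0 / rank
        return 0.0

    def precision_recall_at(ranking, relevant, k):
        """Adhoc style scores at a rank cutoff k."""
        hits = sum(1 for doc in ranking[:k] if doc in relevant)
        precision = hits / k
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    def average_precision(ranking, relevant):
        """Non-interpolated average precision: the mean, over all relevant
        documents, of the precision at the rank at which each relevant
        document is retrieved (unretrieved relevant documents contribute 0)."""
        hits, total = 0, 0.0
        for rank, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0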
DATA TERMINOLOGY NOTE
---------------------

The experimental paradigms for IR and SR are different. Combining them leads to an evaluation paradigm which may not be the most obvious from just the IR or the SR point of view.

Thus for *documents*, in IR with adhoc queries the same set of documents can be (and normally is) used for both engine training and engine testing. The crucial test data is provided by new *queries*. For SR, on the other hand, there are no queries and the crucial distinction is between the data on which the engine is trained and that on which it is tested. In principle the SR training data can be anything; in practice, given the cost of providing the reference transcripts needed for training, and also an interest in experimental control for formal evaluations, specific data sets may be designated as training data. However, since (in general) more data makes for better recognition, it is normal to exploit any appropriate available data for acoustic modelling (AM) and for language modelling (LM): LM needs only text, but welcomes lots of it. In addition, in SR, where formal evaluations are in question, and where there is no reason to suppose any training materials - whether specifically designated or whatever has been used for modelling - are very like the test data, it has become common to have so-called `development data', providing a sample-type preview of the test data.

A particular point of note is that, as naturally follows from the differences between the IR and SR cases just mentioned, it is possible to reuse IR test document sets over time, with different query sets and hence without `contaminating' the test data by having used it for training; but in the SR case it is customary in practice to roll each successive data set into the training data, so it cannot (conveniently) be extracted and used again for test purposes. It should also be noted that test (or `evaluation') data for SR can be quite modest in scale, as in HUB-4, but that test data for IR has to be quite large, so that results are statistically valid and tests are manifestly pertinent to (typically huge) operational systems.

For the TREC SDR evaluations we therefore work as follows. We have

Training data
-------------

1) a defined document training set (with whatever generic properties are suitable for IR, e.g. document similarity and difference, and for SR, e.g. speech conditions)

2) a defined query training set appropriate to (1) in topics and in having relevant documents in (1)

Test data
---------

3) a separate defined document test set of the same type of material as the training set, e.g. they are both Broadcast News, though not necessarily sharing the same time frame

4) a separate query test set, similar in type to (2) and also appropriate to (3), especially in having relevant documents in (3).
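Purely as a schematic summary of this arrangement (the names below are illustrative and not part of the specification), the four sets and the constraints relating them can be pictured as:

    # Hypothetical sketch of the SDR data arrangement; names are illustrative.

    training_data = {
        "documents": "defined document training set (1)",
        "queries":   "query training set (2): topics with relevant documents in (1)",
    }

    test_data = {
        "documents": "separate document test set (3): same type of material as (1)",
        "queries":   "separate query test set (4): topics with relevant documents in (3)",
    }

    def valid_split(training_doc_ids, test_doc_ids):
        """The training and test document sets are distinct; the exclusion
        conditions discussed next extend this idea to any additional
        modelling data."""
        return training_doc_ids.isdisjoint(test_doc_ids)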
We also define conditions on any additional material that may be used for SR modelling (AM or LM) - call this `modelling data' to distinguish it from the training data just summarised, though it is of course used for system training. These conditions are typically exclusion conditions, motivated by practical definitions, to ensure that the modelling data does not accidentally contain the test data, though the modelling data may be similar to the test data. Further, since for IR additional document (or query) material may also be helpful for system training, we impose the same exclusion conditions on IR modelling material as on SR modelling data (it may or may not be useful to use the same data for both IR and SR modelling purposes).


TREC-7 MAIN DATA
----------------

The specific data choices for TREC-7 SDR (aka SDR98) are laid out in the track specification. It should be noted, however, that it has proved impossible to obtain a test file for TREC-7 satisfying both the size desiderata for IR (> 5K documents, minimum) and the preprocessing desiderata for SR (supply of transcriptions) within the tight TREC-7 time frame. Thus while the DARPA Topic Detection and Tracking (TDT) data should be suitable for, and will be available for, any future SDR evaluations, it could not be supplied for TREC-7. After intensive discussion, the TREC-7 SDR training and test document sets have been defined as follows.

The documents are stories taken from the Linguistic Data Consortium (LDC) Broadcast News (BN) corpora, which are also used for the DARPA CSR evaluations. The training set for the SDR track consists of about 100 hours (actually a little less, after the removal of commercials). This is the 100 hours used as the training material for the 1997 DARPA HUB-4 CSR evaluation, and it is also the combination of the two 50-hour sets used for training and test respectively in TREC-6 SDR. (It is sometimes referred to as the first 100 hours.) The test set will consist of the (second) 100 hours prepared for the 1998 CSR evaluation. The training material was recorded in 1996, the test material from the middle of 1997 to early 1998. There are about 3000 stories in the training data, and it is expected that there will be a similar number in the test data. A story, i.e. document, is generally defined as a continuous stretch of news material with the same content or theme.

The topics are short sentences: 5 for training (with accompanying relevant document information) and 25 for test, for which relevance assessments will be provided, in normal TREC style, by judging pooled output from submitted searches.
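For reference, pooled assessment works roughly as sketched below: the top-ranked documents from each submitted run for a topic are merged into a pool, which the assessors then judge; documents outside the pool are treated as not relevant when the runs are scored. The pool depth shown is an assumed value for illustration only; the actual pooling procedure and depth are those applied by NIST.

    # Illustrative depth-k pooling sketch; the real procedure and pool depth
    # are those used by NIST for the track.

    def build_pool(submitted_runs, depth=100):   # depth is an assumed value
        """Form the set of documents to be judged for one topic by taking
        the top `depth` documents from each submitted ranked run."""
        pool = set()
        for ranking in submitted_runs:           # each run: doc ids, best first
            pool.update(ranking[:depth])
        return pool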