1999 TREC-8 Spoken Document Retrieval (SDR) Track Evaluation Specification.

Updated: 20 July 1999
Version: 1.2 [HTML 1.3]

A simple text version can be found here .
John Garofolo, Cedric Auzanne, Ellen Voorhees, Karen Sparck Jones

This is the specification for implementation of the TREC-8 Spoken Document Retrieval (SDR) Track. For other associated documentation regarding the TREC-8 SDR Track.

For information regarding other TREC-8 tracks, see the TREC Website at http://trec.nist.gov

Background from TREC-7
What's New and Different
TREC-8 SDR Track in a Nutshell
Baseline Speech Recognizer
Baseline Retrieval Engine
Spoken Document Test Collection
SDR System Date
Development Test Data
Speech Recognition Training/Model Generation
Retrieval Training, Indexing, and Query Generation
SDR Participation Conditions and Levels
Evaluation Retrieval Conditions
Topics (Queries)
Relevance Assessments
Retrieval (Indexing and Searching) Constraints
Submission Formats
1. Retrieval Submission Format
2. Recognition Submission Format
Scoring
1. Retrieval Scoring
2. Speech Recognition Scoring
Data Licensing and Costs
Reporting Conventions

Appendix A: SDR Corpus File Formats

Appendix B: SDR Corpus Filters

Background from TREC-7

The 1998 TREC-7 SDR evaluation met its goals in bringing the information retrieval (IR) and speech recognition (SR) communities together to implement an ad-hoc-style evaluation of Spoken Document Retrieval technology using an 87-hour broadcast news collection. The TREC-7 SDR evaluation taught us that there is a direct relationship between recognition errors and retrieval accuracy and that certain document expansion approaches applied to moderately accurate recognized transcripts produce results which are nearly comparable results for retrieval using perfect human-generated transcripts. However, the TREC-7 2,866-story collection was still quite small by IR standards. As such, it is impossible to make conclusions about the effectiveness of the technology for realistically large collections.

In TREC-8, we will investigate how the technology scales for a much larger broadcast news collection. We will also permit participants to explore one of the challenges in real spoken document retrieval implementation - retrieval of excerpts with unknown story boundaries.

Further details regarding the TREC-6 SDR Track can be obtained from the TREC-6 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.

Further details regarding the TREC-7 SDR Track can be obtained from the TREC-7 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 28-march 3, 1999.

srt2ltt.pl	This filter transforms the Speech Recognizer Transcription (SRT) format with word times into the Lexical TREC Transcription (LTT) form. This resulting simplified form of the speech recogniser transcription can be used for retrieval if word times are not desired.
srt2ctm.pl	This filter transforms the Speech Recognizer Transcription (SRT) format into the CTM format used by the NIST SCLITE Speech Recognition Scoring Software.
ctm2srt.pl	This filter together with the corresponding NDX file transforms the CTM format used by the NIST SCLITE Speech Recognition Scoring Software into the SDR Speech Recognizer Transcription (SRT) format. Material not specified in the NDX time tags is excluded.

1999 TREC-8 Spoken Document Retrieval (SDR) Track Evaluation Specification.

Larger Test Collection:

Unknown Boundaries Spoke:

Rolling Language Model Option:

Training Collection:

Test Collection:

Participation:

Topics:

Retrieval Conditions:

Recognition Language Models:

Recognition Modes:

Primary Scoring Metrics:

Important Dates:

N O T E