Updated: 20 July 1999
Version: 1.2 [HTML 1.3]
This is the specification for the implementation of the TREC-8 Spoken Document Retrieval (SDR) Track. Other associated documentation regarding the TREC-8 SDR Track is listed below.
For information regarding other TREC-8 tracks, see the TREC Website at http://trec.nist.gov
Appendix A: SDR Corpus File Formats
Appendix B: SDR Corpus Filters
The 1998 TREC-7 SDR evaluation met its goals in bringing the information retrieval (IR) and speech recognition (SR) communities together to implement an ad-hoc-style evaluation of Spoken Document Retrieval technology using an 87-hour broadcast news collection. The TREC-7 SDR evaluation taught us that there is a direct relationship between recognition errors and retrieval accuracy and that certain document expansion approaches applied to moderately accurate recognized transcripts produce retrieval results nearly comparable to those obtained using perfect human-generated transcripts. However, the TREC-7 2,866-story collection was still quite small by IR standards. As such, it is impossible to draw conclusions about the effectiveness of the technology for realistically large collections.
In TREC-8, we will investigate how the technology scales for a much larger broadcast news collection. We will also permit participants to explore one of the challenges in real spoken document retrieval implementation - retrieval of excerpts with unknown story boundaries.
Further details regarding the TREC-6 SDR Track can be obtained from the TREC-6 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.
Further details regarding the TREC-7 SDR Track can be obtained from the TREC-7 Proceedings published by NIST, and the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 28 - March 3, 1999.
Back to the Table of Contents.
The 1999 TREC-8 SDR test collection will be approximately six times the size of the 1998 TREC-7 SDR collection: 550+ hours of speech and approximately 21,500 news stories.
A new optional condition will be included in which the story boundaries in the collection are unknown. To support this, systems must implement both the recognition and retrieval components of the task without access to the reference story boundaries. A version of the baseline recognizer transcripts will be provided without embedded story boundaries to support participation in this condition by Quasi-SDR participants. The new condition will require that retrieval systems output times rather than story IDs. Details regarding the implementation and scoring of the new condition are provided in later sections of this document.
Sites may implement their primary S1 and/or secondary S2 recognition run using a "rolling" language model which adapts to newswire texts from previous days.
Back to the Table of Contents.
No particular training collection is specified or provided for this track. All previous TREC SDR training and test materials may be used for training. (A list of potential training materials will be given on the SDR Website.) In addition, sites may make use of other training material as long as these materials are publicly available and pre-date the test collection.
~550+-hour TDT-2 corpus subset (audio and human/asr transcripts)
Back to the Table of Contents.
As in TREC-7, Baseline recognizer transcripts will be provided for retrieval sites who do not have access to recognition technology. Using these baseline recognizer transcripts, sites without recognizers can participate in the "Quasi-SDR" subset of the Track.
Note that all sites (Full SDR and Quasi-SDR) will be required to implement retrieval runs on the baseline recognizer transcripts. This will provide a valuable "control" condition for retrieval.
This year, one baseline recognizer transcript set will be provided by a NIST instantiation of the Rough 'N Ready BYBLOS recognition engine kindly provided by BBN. The acoustic model training for the baseline recognizer was limited to the 1995 (Marketplace) and 1996/97 (Broadcast News) training sets released by the Linguistic Data Consortium for use in Hub-4 speech recognition evaluations. The language modeling training sources for the baseline recognizer are the following:
Two versions of the baseline recognizer transcript set are provided:
Back to the Table of Contents.
Sites in the speech community without access to retrieval engines may use the NIST ZPRISE retrieval engine.
see http://www-nlpir.nist.gov/works/papers/zp2/zp2.html
Back to the Table of Contents.
The 1999 SDR collection is based on the broadcast news audio portion of the TDT-2 News Corpus which was originally collected by the Linguistic Data Consortium to support the DARPA Topic Detection and Tracking Evaluations. The corpus contains recordings, transcriptions, and associated data for several radio and television news sources broadcast daily between January and June 1998. The 1999 SDR Track will use the February - June subset of the TDT-2 corpus (January is excluded so as not to conflict with Hub-4 recognizers which have been trained on overlapping material from January 1998). The SDR collection consists of approximately 550 hours of recordings and contains approximately 21,500 news stories.
Back to the Table of Contents.
The "documents" in the SDR Track are news stories taken from the Linguistic Data Consortium (LDC) TDT-2 Broadcast News Corpus (February - June 1998 subset), which was also used in the 1998 DARPA TDT-2 evaluation. A story is generally defined as a continuous stretch of news material with the same content or theme (e.g. tornado in the Caribbean, fraud scandal at Megabank), which will have been established by hand segmentation of the news programs (at the LDC).
There are two classifications of "stories" in the TDT-2 Corpus: "NEWS" which are topical and content rich and "MISCELLANEOUS" which are transitional filler or commercials. Only the "NEWS" stories will be included in the SDR collection. Note that a news story is likely to involve more than one speaker, background music, noise etc.
The news stories comprise approximately 385 of the 550 hours of the corpus. (Note that this information may be used for planning purposes, but not for training story-boundaries-unknown systems.)
The collection has been licensed, recorded and transcribed by the Linguistic Data Consortium (LDC). Since use of the collection is controlled by the LDC, all SDR participants must make arrangements directly with the LDC to obtain access to the collection. See below for contact info.
Back to the Table of Contents.
The test collection for the SDR Track consists of digitized NIST SPHERE-formatted waveform (*.sph) files containing recordings of news broadcasts from various radio and television sources aired between 01-February-1998 and 30-June-1998, together with human-transcribed and recognizer-transcribed textual versions of the recordings.
The test collection contains approximately 900 SPHERE-formatted waveform files of recordings of entire broadcasts. Each waveform filename consists of a basename identifying the broadcast and a .sph extension.
The file name format is as follows: 1998MMDD-<STARTTIME>-<ENDTIME>-<NETWORKNAME>-<SHOWNAME>.sph
For example, the filename 19980107-0130-0200-CNN-HDL.sph indicates a recording of CNN Headline News taped on January 7, 1998 between 1:30 AM and 2:00 AM.
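For illustration, the broadcast date, time window, network, and show can be recovered from a waveform filename with a short script. The following is a minimal Python sketch (the function name is hypothetical and not part of any track software):

import re

def parse_sph_filename(name):
    # Expected pattern: YYYYMMDD-STARTTIME-ENDTIME-NETWORK-SHOW.sph
    m = re.match(r'^(\d{8})-(\d{4})-(\d{4})-([A-Z0-9]+)-([A-Z0-9]+)\.sph$', name)
    if m is None:
        raise ValueError('unexpected filename: %s' % name)
    date, start, end, network, show = m.groups()
    return {'date': date, 'start': start, 'end': end,
            'network': network, 'show': show}

# Example: the CNN Headline News recording mentioned above.
print(parse_sph_filename('19980107-0130-0200-CNN-HDL.sph'))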
The following auxiliary file types (with the same basename as the waveform file they correspond to) are also provided:
Most of the filetypes in the collection will be provided by NIST through the LDC unless otherwise specified in later email.
Back to the Table of Contents.
As in past years, story boundaries will be known for the primary Reference, Baseline, Speech, and Cross Recognizer retrieval conditions. As such, all systems will be required to implement the Story Boundaries Known condition for their primary retrieval runs. However, this year an optional Story Boundaries Unknown condition will also be supported for the Baseline, Speech, and Cross Recognizer retrieval conditions. The specifications for the Known Story Boundaries and new Unknown Story Boundaries conditions follow.
Back to the Table of Contents.
For this condition, as in last year's track, the temporal boundary of news stories in the collection will be "known". Boundary times are given in the SGML <Section> tags contained in the Index (*.ndx) files as well as in the LTT and SRT transcript filetypes. The <Section> tags specify the document IDs and start and end times for each story within the collection.
Note that sections of the waveform files containing commercials and other "out of bounds" material which have not been transcribed by the LDC will be excluded from retrieval in the Known Story Boundaries condition. The NDX files for this condition will indicate the proper subset of the corpus to be indexed and retrieved.
Note: Recognition systems developed for the Known Story Boundaries condition may use the story boundary information in segmenting the recordings and in skipping non-news segments. However, participants are encouraged to implement recognition in conformance with the rules for the Unknown Story Boundaries condition (recognize entire broadcast files and ignore the story boundaries for the recognition portion of the task) so that these transcripts can be used for both the Known and Unknown Story Boundaries conditions. NIST will supply a script to create a filtered copy of whole-broadcast recognized transcripts to add embedded story boundaries and remove non-news material so that these can be used for the Known Story Boundaries condition.
Note that except for the time boundaries and Story IDs provided in the <Section> tags, NO OTHER INFORMATION provided in the SGML tags may be used or indexed in any way for the test including any classification or topic information which may be present.
Back to the Table of Contents.
This new condition is being implemented to investigate the retrieval of excerpts where story boundaries are unknown. As such, no <Section> tags will be given for use in this condition. Full-SDR participants in this condition must recognize entire broadcast audio files. The object of this task is for retrieval systems to emit a single time impulse for each relevant story. As such, retrieval systems will emit time-based IDs consisting of the broadcast ID plus a time.
A Time ID consists of two fields separated by a colon (:)
(e.g., 19980104_1130_1200_CNN_HDL:13.45)
In scoring, the TimeIDs will be mapped to Story IDs and duplicates will be eliminated. (See the Retrieval Scoring section for more details on the processing of this condition.)
Back to the Table of Contents.
This is an online recognition/retrospective retrieval task. As such, two speech recognition modes are permitted - each with system date rules:
See Section 9 for details regarding these modes and acoustic and language model training requirements.
The retrieval system date is July 1, 1998.
Back to the Table of Contents.
No Development Test data is specified or provided for the SDR track although this year's training set may be split into the training/test sets used last year for development test purposes.
Back to the Table of Contents.
The 200 hours of LDC Broadcast News data collected in 1996, 1997 and January 1998 is designated as the suggested training material for the 1999 SDR evaluation. This, however, does not preclude the use of other training materials as long as they conform to the restrictions listed later in this section. There is no designated supplementary textual data for SDR language model training. However, sites are encouraged to explore the development of rolling language models using the NewsWire data provided in the TDT-2 corpus. Sites may choose either a "fixed" or "rolling" language model mode as described below for each of their S1 and S2 recognition runs.
"Fixed" language model/vocabulary (FLM) systems: This is the traditional speech recognition evaluation mode in which systems implement fixed (non-time-adaptive) language models for recognition. If sites are implementing this recognition model, for all intents and purposes, the fixed recognition date for this evaluation will be 31 January 1998. Therefore, no acoustic or textual materials broadcast or published after this date may be used in developing either the recognition or retrieval system component. These systems will be referred to as Fixed Language Model (FLM) systems and will be dated 31 January 1998.
"Rolling" language model/vocabulary (RLM) systems: This option is supported to investigate the utility of using automatically-adapted evolving language models/vocabularies for recognition in temporal applications. These systems are permitted to use newswire data (not broadcast transcripts) from previous data days to automatically adapt their language models and vocabularies to implement recognition for the current day. For example, sites are permitted to use newswire material from March 17 to recognize material recorded on March 18 . These systems will be referred to as Rolling Language Model (RLM) systems. The TDT-2 newsire portion of the corpus is available to support this mode. The TDT-2 newswire corpus contains approximately the same number of stories as the audio portion and was collected over the same time period. NIST will re-format this data into a TREC-style SGML format and make it available simultaneously with the waveform files. If possible, additional newswire stories (eliminated from TDT-2 to control the size of the corpus) will also be made available.
Sites are permitted to investigate less frequent adaptation schemes (e.g., weekly, monthly, etc.) so long as the material used for adaptation always predates the current data day by at least one day.
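As a minimal sketch of the date rule above (the data structures here are hypothetical, not part of the track software), a rolling language model run might select its adaptation text like this:

from datetime import date, timedelta

def adaptation_window(newswire_stories, current_day):
    """Return the newswire stories usable for adapting the LM that will
    recognize broadcasts from current_day: everything up through the
    previous day, per the rule above."""
    cutoff = current_day - timedelta(days=1)
    return [s for s in newswire_stories if s['date'] <= cutoff]

# Hypothetical story records keyed by publication date.
stories = [{'date': date(1998, 3, 16), 'text': '...'},
           {'date': date(1998, 3, 17), 'text': '...'},
           {'date': date(1998, 3, 18), 'text': '...'}]

# Recognizing March 18 broadcasts: only March 17 and earlier may be used.
usable = adaptation_window(stories, date(1998, 3, 18))
print(len(usable))   # 2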
Two recognition segmentation modes are included to support the story-boundaries-known and -unknown retrieval conditions:
For story-boundaries-known (SK) systems (required): Systems may make use of story boundary timing information for segmentation purposes. They may also ignore non-news sections. However, this recognition mode is discouraged since the transcripts provided by this mode may not be used in story-boundaries-unknown retrieval conditions. All sites are encouraged to implement recognition of whole broadcasts without story boundaries for the Story Boundaries Unknown condition. As indicated in section 6.3.1, NIST will create a filter to transform these whole-broadcast transcripts to the form used for the Story Boundaries Known condition. This will permit these transcripts to be used in both the Cross Recognizer and Cross Recognizer with Story Boundaries Unknown conditions.
For story-boundaries-unknown (SU) systems (optional): Systems may not use story boundary timing information and must perform recognition on entire broadcast files. Systems are permitted to attempt to AUTOMATICALLY screen out non-news sections such as commercials, but no manual segmentation may be used. These transcripts may be converted into SK-type transcripts with NIST-supplied software and may, therefore, be used for both story-boundaries-unknown and -known retrieval conditions.
The following general rules apply to training for all recognition modes:
The granularity for adaptation for recognition is 1 day. The time of day that an episode (or an excerpt within an episode) was broadcast can be ignored. During recognition of episodes from the "current" day, only language model training data collected up through the "previous" day may be used. However, material for unsupervised acoustic model adaptation from the current day may be used. This implies that audio material to be recognized from the current day may be processed in any order using any adaptation scheme permitted by the above rules.
Note: "Current" refers to the date the episode to be recognized was broadcasted.
Sites are requested to report the training materials and adaptation modes they employed in their site reports and TREC papers.
All acoustic and textual materials used in training must be publicly available at the time of the start of the evaluation.
Back to the Table of Contents.
The SDR track is an automatic ad hoc retrospective retrieval task. This means both that any collection-wide statistics may be used in indexing and that the retrieval system may NOT be tuned using the test topics. Participants may not use statistics generated from the reference transcripts collection in the baseline or recognizer transcript collections. Any auxiliary IR training material or auxiliary data structures such as thesauri that are used must predate the 01-JUL-1998 retrieval date. Likewise, any IR training material which is derived from spoken broadcast sources (transcripts) must predate the test collection (prior to 31-JAN-1998).
All sites are required to implement fully automatic retrieval. Therefore, sites may not perform manual query generation in implementing retrieval for their submitted results.
Participants are, of course, free to perform whatever side experiments they like and report these at TREC as contrasts.
For retrieval training purposes, the 1998 TREC-7 SDR data is available as a set of 23 topics and relevance judgements.
Back to the Table of Contents.
Interested sites are requested to register for the SDR Track as soon as possible. Registration in this case merely indicates your interest and does not imply a commitment to participate in the track. Participants must register via the TREC Call for Participation Website at http://trec.nist.gov/cfp.html
Since this is a TREC track, participants are subject to the TREC conditions for participation, including signing licensing agreements for the data. Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but specific advertising claims based on TREC results are forbidden. The conference held in November is open only to participating groups that submit results and to government sponsors. (Signed-up participants should have received more detailed guidelines.)
Participants implementing Full SDR (speech recognition and retrieval) are exempted from participation in the central TREC Adhoc Task. However, Quasi-SDR (retrieval only) participants must also implement the Adhoc Task.
Participants must implement either Full SDR or Quasi-SDR retrieval as defined below. Note that sites may not participate by simply pipelining the baseline recognizer transcripts and baseline retrieval engine. Participants should implement at least one of the two major system components. As in TREC-7, sites with speech recognition expertise and sites with retrieval expertise are encouraged to team up to implement Full SDR.
The 1999 SDR Track has two participation levels and several retrieval conditions as detailed below. Given the large number of conditions this year, sites are permitted to submit only 1 run per condition.
Participation Levels:
Back to the Table of Contents.
The following are the retrieval conditions for the SDR Track. Note that some retrieval conditions are required and others are optional.
Participants MUST use the SAME retrieval strategy for all conditions (that is, term weighting method, stop word list, use of phrases, retrieval model, etc. must remain constant). Sites implementing S1 and/or S2 using non-word-based recognition (phone, word-spotting, lattice, etc.) should use the closest retrieval strategy possible across conditions.
Sites may not use Word Error Rate or other measures as generated by scoring the recognizer transcripts against the reference transcripts to tune their retrieval algorithms in the S1/S1U, S2/S2U, B1/B1U and CR/CRU retrieval conditions (all conditions where recognized transcripts are used). The reference transcripts may not be used in any form for any retrieval condition except of course for R1.
Back to the Table of Contents.
The TREC-8 SDR Track will have 50 topics (queries) constructed by the NIST assessors. Each topic will consist of a concise word string made up of 1 or more sentences or phrases.
Examples:
What countries have been accused of human rights violations?
Find reports of fatal air crashes.
What are the latest developments in gun control in the U.S.?
In particular, what measures are being taken to protect children from guns?
For SDR, the search topics must be processed automatically, without any manual intervention. Note that participants are welcome to run their own manually-assisted contrastive runs and report on these at TREC. However, these will not be scored or reported by NIST.
The topics will be supplied in written form. Given the number of retrieval conditions this year, spoken versions of the queries will not be included as part of the test set. However, participants are welcome to run their own contrastive spoken input tests and report on these at TREC.
Back to the Table of Contents.
Relevance assessments for the SDR Track will be provided by the NIST assessors. As in the Adhoc Task, the top 100-ranked documents for each topic from each system for the Reference Condition (R1) will be pooled and evaluated for relevance. If time and resources permit, additional documents from other retrieval conditions may be added to the pool as well. Note that this approach is employed to make the assessment task manageable, but may not cover all documents that are relevant to the topics.
Note that because the Cross Recognizer conditions will be run after the R1, B*, S* retrieval results are due, the Cross Recognizer results will not be used in creating the assessment pools this year.
Back to the Table of Contents.
Since the focus of the SDR Track is on the automatic retrieval of spoken documents, manual indexing of documents, manual construction or modification of search topics, and manual relevance feedback may not be used in implementing retrieval runs for scoring by NIST. All submitted retrieval runs must be fully automatic. Note that fully automatic "blind" feedback and similar techniques are permissible and manually-produced reference data such as dictionaries and thesauri may be employed. Note the training and training date constraints specified in Section 10.
Participants are free to perform internal experiments with manual intervention and report on these at TREC.
Back to the Table of Contents.
In order for NIST to automatically log and process all of the many submissions which we expect to receive for this track, participants MUST ensure that their retrieval and recognition submissions meet the following filename and content specs. Incorrectly formatted files will be rejected by NIST.
For retrieval, each submission must have a filename of the following form: <SITE_ID>-<CONDITION>-<RECOGNIZER_ID>.ret where,
The following are some example retrieval submission filenames:
eth-r1.ret
(ETH retrieval using reference transcripts)
cmu-b1u.ret
(CMU retrieval using Baseline 1 no-boundaries recognizer)
shef-b1.ret
(Sheffield retrieval using Baseline 1 recognizer)
att-s1.ret
(AT&T retrieval using AT&T 1 recognizer)
ibm-cr-att1.ret
(IBM retrieval using AT&T 1 recognizer)
umd-cru-att1u.ret
(UMD retrieval using AT&T 1U recognizer)
As in TREC-7, for the story-boundaries-known condition the output of a retrieval run is a ranked list of story (document) IDs as identified in the NDX files and <Section> tags in the R1 and B1 transcripts. These will be submitted to NIST for scoring using the standard TREC submission format (a space-delimited table):
23 Q0 19980104_1130_1200_CNN_HDL.0034 1 4238 ibm-cr-att-s1
23 Q0 19980105_1800_1830_ABC_WNT.0143 2 4223 ibm-cr-att-s1
23 Q0 19980105_1130_1200_CNN_HDL.1120 3 4207 ibm-cr-att-s1
23 Q0 19980515_1630_1700_CNN_HDL.0749 4 4194 ibm-cr-att-s1
23 Q0 19980303_1600_1700_VOA_WRP.0061 5 4189 ibm-cr-att-s1
etc.
Field Content:
The Story IDs are given in the Section (story boundary) tags.
*Note that field 5 MUST be in descending order so that ties may be handled properly. This number (not the rank) will be used to rank the documents prior to scoring. The site-given ranks will be ignored by the 'trec_eval' scoring software.
Participants may submit lists with more than 1000 documents for each topic. However, NIST will truncate the list to 1000 documents.
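As an illustration only (the run tag and function name below are hypothetical), a submission file in this format might be written as follows, with scores in descending order and the list truncated to 1000 entries per topic:

def write_submission(results, run_tag, out_path):
    """results: dict mapping topic number -> list of (story_id, score).
    Writes the space-delimited TREC format shown above, ranked by score
    in descending order and truncated to 1000 entries per topic."""
    with open(out_path, 'w') as out:
        for topic in sorted(results):
            ranked = sorted(results[topic], key=lambda p: p[1], reverse=True)
            for rank, (story_id, score) in enumerate(ranked[:1000], start=1):
                out.write('%d Q0 %s %d %s %s\n' %
                          (topic, story_id, rank, str(score), run_tag))

# Hypothetical usage for topic 23 (run tag invented for illustration):
write_submission({23: [('19980104_1130_1200_CNN_HDL.0034', 4238)]},
                 'xyz-b1', 'xyz-b1.ret')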
For the story-boundaries-unknown condition, field 3 will be an episode/time tag of the form <Episode-ID>:<Time-in-Seconds.Hundredths> for the retrieved excerpt:
23 Q0 19980104_1130_1200_CNN_HDL:39.52 1 4238 ibm-cru-att-s1u
23 Q0 19980105_1800_1830_ABC_WNT:143.69 2 4223 ibm-cru-att-s1u
23 Q0 19980105_1130_1200_CNN_HDL:1120.02 3 4207 ibm-cru-att-s1u
23 Q0 19980515_1630_1700_CNN_HDL:749.81 4 4194 ibm-cru-att-s1u
23 Q0 19980303_1600_1700_VOA_WRP:61.02 5 4189 ibm-cru-att-s1u
etc.
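The only change from the known-boundaries format is field 3. As a minimal illustration (the function name is hypothetical), an episode ID and a time offset in seconds can be joined into the required tag as follows:

def time_id(episode_id, seconds):
    # Format the time to two decimal places, e.g. 39.5 -> "39.50".
    return '%s:%.2f' % (episode_id, seconds)

print(time_id('19980104_1130_1200_CNN_HDL', 39.52))
# -> 19980104_1130_1200_CNN_HDL:39.52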
Sites are to submit their retrieval output to NIST for scoring using standard TREC procedures and ftp protocols. See the TREC Website at http://trec.nist.gov for more details.
Back to the Table of Contents.
As in last year's SDR Track, only 1-Best-algorithm recognizer transcripts will be accepted by NIST for scoring and, if received in time, will be shared across sites for the Cross-Recognizer retrieval conditions. Sites performing Full-SDR not using a 1-Best recognizer are encouraged to self-evaluate their recognizer in their TREC paper.
Since the concept of recognizer transcript sharing for Cross-Recognizer Retrieval experiments appeared to be broadly accepted last year, NIST will assume that all submitted recognizer transcripts are to be scored and made available to other participants for Cross-Recognizer Retrieval. If you would like to submit your recognizer transcripts for scoring, but do NOT want them shared, you must notify NIST (cedric.auzanne@nist.gov) of the system/run to exclude from sharing PRIOR to submission.
Submitted 1-Best recognizer transcripts must be formatted as follows: Each recognizer transcript (one per show) is to have a filename of the following form: <EPISODE>.srt where,
A System Description file must be created for each submitted set of recognizer-produced transcripts which outlines pertinent features of the recognition system used. The file should be named: <RECOGNIZER>-<RUN>.desc where,
Minimally, the system description MUST identify the language model mode which was employed: "Fixed" or "Rolling". If a rolling language model was used, the update period should be identified.
The format for the System Description is as follows:
System ID: (e.g., NIST-S1U)
The SRT files and System Description File should be placed in a directory with the following name: <RECOGNIZER>-<RUN> where,
Submit your SRT files as follows:
A gnu-zipped tar archive of the above directory should then be created (e.g., att-s1.tgz) using the -cvzf options in GNU tar. This file can now be submitted to NIST for scoring/sharing via anonymous ftp to jaguar.ncsl.nist.gov using your email address as the password. Once you are logged in, cd to the "incoming/sdr99" directory. Set your mode to binary and "put" the file. This is a "blind" directory, so you will not be able to "ls" your file. Once you have uploaded the file, send email to cedric.auzanne@nist.gov to indicate that a file is waiting. He will send you a confirmation after the file is successfully extracted and email again later with your SCLITE scores. To keep things simple and file sizes down, please submit separate runs (s1 and s2) in separate tgz files.
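For sites that prefer to script the packaging step, the following minimal Python sketch is equivalent to the GNU tar command above (the directory and archive names are just the examples already used above):

import tarfile

def package_run(run_dir, archive_name):
    """Create a gzip-compressed tar archive of the submission directory,
    equivalent to 'tar -cvzf archive_name run_dir'."""
    with tarfile.open(archive_name, 'w:gz') as tar:
        tar.add(run_dir)

# Hypothetical run directory containing the SRT files and the
# <RECOGNIZER>-<RUN>.desc system description file.
package_run('att-s1', 'att-s1.tgz')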
The submitted output of a 1-Best recognizer must be in the standard SDR Speech Recognizer Transcription (SRT) format. See Appendix A for an example.
Back to the Table of Contents.
The TREC-8 SDR Track retrieval performance will be scored using the NIST "trec_eval" Precision/Recall scoring software. The trec_eval software is available from the following URL:
https://github.com/usnistgov/trec_eval
For TREC-8 SDR, the primary retrieval measure will be Mean Average Precision. Other retrieval measures will include: precision at standard document rank cutoff levels, single-number average precision over all relevant documents, and single-number R-Precision (precision after R documents have been retrieved, where R is the number of documents relevant to the topic).
These measures are defined in Appendices to TREC Proceedings, and may also be found on the TREC Website at http://trec.nist.gov.
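For reference, the following sketch illustrates how average precision and R-Precision are computed from a single ranked list (trec_eval performs the official computation; this is only an illustration with hypothetical document IDs):

def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document
    retrieved; relevant documents never retrieved contribute zero."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precisions.append(hits / float(rank))
    return sum(precisions) / len(relevant) if relevant else 0.0

def r_precision(ranked_ids, relevant_ids):
    """Precision after R documents retrieved, where R is the number of
    documents relevant to the topic."""
    r = len(relevant_ids)
    retrieved = set(ranked_ids[:r])
    return len(retrieved & set(relevant_ids)) / float(r) if r else 0.0

# Toy example: three relevant documents, two retrieved at ranks 1 and 3.
print(average_precision(['d1', 'd5', 'd2'], ['d1', 'd2', 'd3']))  # ~0.556
print(r_precision(['d1', 'd5', 'd2'], ['d1', 'd2', 'd3']))        # ~0.667

Mean Average Precision is simply the mean of the per-topic average precision values over all test topics.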
For the known-story-boundaries condition, NIST will truncate the submitted list to 1000 documents and score it using trec_eval.
For the story-boundaries-unknown (U) retrieval conditions, NIST will programmatically do the following:
The mapping will simply involve converting the time tag to the story ID of the story that the identified time resides within.
NIST will provide a mapping/filtering tool to implement (1 - 2): UIDmatch.pl - convert time-based retrieval output to doc-based output format for trec_eval scoring. See the IR scoring tools page.
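The following Python sketch approximates this mapping (the official tool is the UIDmatch.pl script named above; this illustration assumes the story boundaries have already been read from the SK NDX files, for example with the parsing sketch shown in Appendix A):

def map_time_ids(lines, boundaries):
    """lines: submission rows for one topic, as (time_id, score) pairs
    already ranked by score.  boundaries: dict mapping episode ID to a
    list of (story_id, s_time, e_time) tuples.  Returns story-based rows
    with duplicate stories removed (highest-ranked occurrence kept)."""
    seen, mapped = set(), []
    for time_id, score in lines:
        episode, t = time_id.split(':')
        t = float(t)
        for story_id, s_time, e_time in boundaries.get(episode, []):
            if s_time <= t <= e_time:
                if story_id not in seen:
                    seen.add(story_id)
                    mapped.append((story_id, score))
                break
    return mapped

# Hypothetical usage with one story boundary:
bounds = {'19980630_2130_2200_CNN_HDL':
          [('19980630_2130_2200_CNN_HDL.0075', 75.43, 81.21)]}
print(map_time_ids([('19980630_2130_2200_CNN_HDL:78.00', 4238)], bounds))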
Back to the Table of Contents.
TREC-8 Full SDR participants who use 1-best recognition are encouraged to submit their recognizer transcripts in SRT form for scoring by NIST. NIST will employ its "sclite" Speech Recognition Scoring Software to benchmark the Story Word Error Rate for each submission. These scores will be used to examine the relationship between recognition error rate and retrieval performance. Note that to ensure consistency among all forms of the evaluation collection, all SRTs received for the Story Boundaries Known retrieval conditions will be filtered to remove any speech outside the evaluation sections per the corresponding NDX files.
A randomly-selected 10-hour subset of the SDR collection will be transcribed in Hub-4 form so that the speech recognition transcripts can be scored. This will provide the primary speech recognition measures for the SDR track. If possible, NIST will also create a filtered version of the closed caption transcripts to be used in scoring the entire collection. Because of errors in the closed captions (especially dropouts which are algorithmically uncorrectable), it is assumed that the error rates will be considerably higher. NIST will also use the 10-hour Hub-4-style subset to estimate the closed caption transcript error rate.
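For reference, word error rate is the number of substitutions, deletions, and insertions in a minimum-cost alignment, divided by the number of reference words. The sketch below is only an illustration of that computation; sclite additionally applies its own alignment, text normalization, and per-story reporting:

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference words,
    computed with a standard edit-distance alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edit cost aligning ref[:i] with hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,  # substitution or match
                          d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1)        # insertion
    return d[len(ref)][len(hyp)] / float(len(ref))

print(word_error_rate("it's friday september thirteenth",
                      "his friday's september thirteenth"))  # 0.5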
The NIST SCLITE Scoring software is available via the following URL: http://www.nist.gov/speech/tools/index.htm. This page contains an ftp-able link to SCTK, the NIST Speech Recognition Scoring Toolkit, which contains SCLITE. The SCLITE software may be updated to accommodate large test sets. The SDR email list will be notified as updates become available.
NIST will provide the following additional scripts to permit useful transformations of the SDR speech recognizer transcripts:
See the Speech scoring tools page.
Note that two forms of NDX files will be provided: one set for the Story Boundaries Known (SK) condition and another set for the Story Boundaries Unknown (SU) condition. The ctm2srt.pl filter and the SK NDX file can be used with a CTM file created for the SU condition to create an SRT file for the SK condition.
Since unverified reference transcripts are used in the SDR Track, the SDR Word Error Rates should not be directly compared to those for Hub-4 evaluations which use carefully checked/corrected annotations and special orthographic mapping files.
Back to the Table of Contents.
Participants must make arrangements with the Linguistic Data Consortium to obtain use of the TDT-2 recorded audio and transcriptions used in the SDR Track. The recorded audio data is available in Shorten-compressed form on approximately 75 CD-ROMs. The transcription and associated textual data will be made available via ftp or via CD-ROM by special request.
Participants are asked to give full details in their Workbook/Proceedings papers of the resources used at each stage of processing, as well as details of their SR and IR methods. Participants not using 1-best recognizers for Full-SDR should also provide an appropriate analysis of the performance of the recognition algorithm they used and its effect on retrieval.
Back to the Table of Contents.
Note: All transcription files are SGML-tagged.
.sph - SPHERE waveform: SPHERE-formatted digitized recording of a broadcast, used as input to speech recognition systems. Waveform format is 16-bit linear PCM, 16 kHz sample rate, MSB/LSB byte order.
NIST_1A
(digitized 16-bit waveform follows header)
1024
sample_count -i 27444801
sample_rate -i 16000
channel_count -i 1
sample_byte_format -s2 10
sample_n_bytes -i 2
sample_coding -s3 pcm
sample_min -i -27065
sample_max -i 27159
sample_checksum -i 31575
database_id -s7 Hub4_96
broadcast_id NPR_MKP_960913_1830_1900
sample_sig_bits -i 16
end_head
.
.
.
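As an illustration (the NIST SPHERE library is the reference implementation; this is only a sketch), the header fields shown above can be read with a few lines of code:

def read_sphere_header(path):
    """Parse the 1024-byte NIST SPHERE header into a dict of fields.
    Field lines look like 'sample_rate -i 16000'; the header ends with
    'end_head'."""
    with open(path, 'rb') as f:
        header = f.read(1024).decode('ascii', 'replace')
    fields = {}
    for line in header.splitlines()[2:]:    # skip 'NIST_1A' and the size line
        line = line.strip()
        if line == 'end_head':
            break
        parts = line.split(None, 2)
        if len(parts) == 3:                 # name, type indicator, value
            fields[parts[0]] = parts[2]
        elif len(parts) == 2:               # name, value (no type indicator)
            fields[parts[0]] = parts[1]
    return fields

# fields['sample_rate'] -> '16000', fields['sample_count'] -> '27444801', etc.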
.ltt - Lexical TREC Transcription: ASR-style reference transcription with all SGML tags removed except for Episode and Section. "Non-News" Sections are excluded. This format is used as the source for the Reference Retrieval condition.
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline
News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
it's friday september thirteenth i'm david brancaccio and here's some of what's
happening in business and the world
</Section>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
agricultural products giant archer daniels midland is often described as politically
well connected any connections notwithstanding the federal government is pursuing
a probe into whether the company conspired to fix the price of a key additive
for livestock feed
...
</Section>
...
</Episode>
.ndx - Index: Specifies the <Section> tags in the waveform and establishes story boundaries and IDs. Similar to the LTT format without text. Non-transcribed Sections are excluded.
For the known story boundaries condition, the ndx format will require one Section tag per story as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline
News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081>
...
</Episode>
For the unknown story boundaries condition, the ndx format will require a single "FAKE" Section tag that will encompass the entire Episode as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline
News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
</Episode>
Note that the start time of the fake section is the start time of the first NEWS story and the end time is the end time of the last NEWS story of the show.
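As an illustration (not one of the NIST-supplied filters), the story IDs and boundary times can be pulled out of an NDX or LTT file with a simple regular expression; the resulting boundary table is what the time-to-story mapping sketch in the Retrieval Scoring section assumes:

import re

SECTION_RE = re.compile(
    r'<Section\s+Type=(\w+)\s+S_time=([\d.]+)\s+E_time=([\d.]+)\s+ID=(\S+?)>')

def read_sections(ndx_text):
    """Return (story_id, s_time, e_time) tuples for the NEWS sections of
    one episode, in the order they appear in the NDX (or LTT) file."""
    sections = []
    for stype, s_time, e_time, story_id in SECTION_RE.findall(ndx_text):
        if stype == 'NEWS':
            sections.append((story_id, float(s_time), float(e_time)))
    return sections

# For the known-boundaries example above this yields:
# [('19980630_2130_2200_CNN_HDL.0075', 75.43, 81.21),
#  ('19980630_2130_2200_CNN_HDL.0081', 81.21, 207.31), ...]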
.srt - Speech Recogniser Transcript (contrived example): Output of speech recogniser for a .sph recorded waveform file which will be used as input for retrieval. Each file must contain an <Episode> tag and properly interleaved <Section> tags taken from the corresponding .ndx file. Each <Word> tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word.
For the known story boundaries condition, the Section tags follow the ones specified in the ndx file.
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline
News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075>
<Word S_time=75.52 E_time=75.87>his</Word>
<Word S_time=75.87 E_time=76.36>friday'S</Word>
<Word S_time=76.36 E_time=76.82>september</Word>
<Word S_time=76.82 E_time=77.47>thirteenth</Word>
...
</Section>
...
</Episode>
For the unknown story boundaries condition, the srt format will require a single "FAKE" Section tag that will encompass the entire Episode as follows:
<Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline
News" Language=English Version=1 Version_Date=8-Apr-1999>
<Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL>
...
<Word S_time=75.52 E_time=75.87>his</Word>
<Word S_time=75.87 E_time=76.36>friday'S</Word>
<Word S_time=76.36 E_time=76.82>september</Word>
<Word S_time=76.82 E_time=77.47>thirteenth</Word>
...
</Section>
</Episode>
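In the spirit of the srt2ltt.pl filter described in Appendix B (but not the NIST tool itself), the recognized word stream for each section can be recovered from an SRT file with a sketch like the following:

import re

WORD_RE = re.compile(r'<Word\s+S_time=[\d.]+\s+E_time=[\d.]+>([^<]*)</Word>')
SEC_RE = re.compile(r'<Section[^>]*ID=(\S+?)>(.*?)</Section>', re.DOTALL)

def srt_to_text(srt_text):
    """Return a dict mapping each section ID to its recognized word string."""
    out = {}
    for section_id, body in SEC_RE.findall(srt_text):
        out[section_id] = ' '.join(WORD_RE.findall(body))
    return out

# For the known-boundaries example above, the entry for ...CNN_HDL.0075
# would begin "his friday'S september thirteenth ..."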
Back to the Table of Contents.
srt2ltt.pl | This filter transforms the Speech Recognizer Transcription (SRT) format with word times into the Lexical TREC Transcription (LTT) form. This resulting simplified form of the speech recogniser transcription can be used for retrieval if word times are not desired. |
srt2ctm.pl | This filter transforms the Speech Recognizer Transcription (SRT) format into the CTM format used by the NIST SCLITE Speech Recognition Scoring Software. |
ctm2srt.pl | This filter together with the corresponding NDX file transforms the CTM format used by the NIST SCLITE Speech Recognition Scoring Software into the SDR Speech Recognizer Transcription (SRT) format. Material not specified in the NDX time tags is excluded. |
Back to the Table of Contents.