1998 TREC-7 SPOKEN DOCUMENT RETRIEVAL (SDR) TRACK : Specification 3
-------------------------------------------------------------------
Updated 30-Jun-1998
Karen Sparck Jones, John Garofolo, Ellen Voorhees

This is the specification for implementation of the TREC-7 Spoken Document Retrieval (SDR) Track.

Contents
--------
 1. Background from TREC-6
 2. What's New and Different
 3. TREC-7 SDR Track in a Nutshell
 4. Baseline Speech Recognizer
 5. Baseline Retrieval Engine
 6. Spoken Document Test Collection
    6.1 Collection Documents
    6.2 Collection File Types
    6.3 Story Boundaries
 7. SDR System Date
 8. Development Test Data
 9. Speech Recognition Training/Model Generation
10. Retrieval Training, Indexing, and Query Generation
11. SDR Participation Conditions and Levels
12. Evaluation Retrieval Conditions
13. Topics (Queries)
14. Relevance Assessments
15. Retrieval (Indexing and Searching) Constraints
16. Submission Formats
    16.1 Retrieval Submission Format
    16.2 Recognition Submission Format
17. Scoring
    17.1 Retrieval Scoring
    17.2 Speech Recognition Scoring
18. Schedule
19. Data Licensing and Costs
20. Reporting Conventions
21. Contacts
Appendix A: SDR Corpus File Formats
Appendix B: SDR Corpus Filters

1. Background from TREC-6
-------------------------
The 1997 TREC-6 SDR evaluation met its goals in bringing the information retrieval (IR) and speech recognition (SR) communities together, in exploring the feasibility of SDR implementation, and in debugging the logistics for an SDR evaluation. In the process, we learned that usable SDR could be implemented for a known-item retrieval task with a small 50-hour spoken document collection and that the task could successfully be measured and evaluated. But we also learned from the initial SDR Track that the document collection needs to be significantly larger for realistic retrieval experiments and that an adhoc (rather than known-item) form of evaluation would be more effective in gauging the technology. Fuller details on TREC-6 can be obtained from the track report in the TREC-6 Proceedings published by NIST, and from the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.

2. What's New and Different
---------------------------
Larger Test Collection: The 1998 TREC-7 SDR test collection will be approximately double the size of the 1997 TREC-6 SDR collection: about 100 hours of speech and close to 3,000 news stories.

TREC adhoc-style topics and assessments: The SDR system output for TREC-7 will be essentially the same as last year. However, rather than scoring "percent retrieved at rank 1" and "mean reciprocal rank", which were used in last year's Known-Item task, NIST will implement a full adhoc-style relevance assessment evaluation with Precision and Recall measures. NIST will create a document "pool" for each topic which assessors will then evaluate for relevance. The pool will be the union of the top 100 documents for the Reference retrieval condition from all participants. If it is feasible, we may also add documents from other retrieval conditions to the pool.

Well-defined training data cutoffs: Rules and cutoff dates have been established for recognizer Acoustic and Language Model training.

See the remainder of the document for details and additional information.
3. TREC-7 SDR Track in a Nutshell
---------------------------------
Training Collection: 100-hour Broadcast News Corpus subset (audio and transcripts)
                     (Training Indices (NDX), Waveforms (SPH), Human Transcripts (LTT), Recognizer Transcripts (SRT))
                     See details for training data date cutoffs
Test Collection:     100-hour Broadcast News Corpus subset (audio and transcripts)
                     (Test Indices (NDX), Waveforms (SPH), Human Transcripts (LTT), Recognizer Transcripts (SRT))
Participation:
   Full-SDR:  speech recognition + retrieval
              R1, B1, B2, S1 retrieval conditions required
              TREC Adhoc Task not required
   Quasi-SDR: retrieval only using supplied recognizer transcripts
              R1, B1, B2 retrieval conditions required
              TREC Adhoc Task required
Topics: 25 (reduced to 23; see Section 13), fully automatic processing required
Retrieval Conditions:
   R1:  Reference Retrieval using human-generated "perfect" transcripts
   B1:  Baseline Retrieval using medium error recognizer transcripts
   B2:  Baseline Retrieval using high error recognizer transcripts
   S1:  Full SDR using own recognizer
   S2:  Full SDR using own secondary recognizer
   CR-: Cross-Recognizer Retrieval using other participants' recognizer transcripts
Scoring:
   Retrieval: Pooled relevance assessment, Precision/Recall measures
   Speech Recognition: Story Word Error Rate
Important Dates:
   SDR Track Registration: ASAP
   Begin recognition (Full SDR): May 7, 1998
   Begin retrieval (Full SDR and Quasi-SDR): July 1, 1998
   All Runs Due: August 30, 1998
   TREC notebook papers due: October 21, 1998 (tentative)
   TREC-7: November 9-11, 1998

4. Baseline Speech Recognizer
-----------------------------
As in TREC-6, Baseline recognizer transcripts will be provided for retrieval sites that do not have access to recognition technology. Using these baseline recognizer transcripts, sites without recognizers can participate in the "Quasi-SDR" subset of the Track. Note that all sites (Full SDR and Quasi-SDR) will be required to implement retrieval runs on the baseline recognizer transcripts. This will provide a valuable "control" condition for retrieval. This year, two baseline recognizer transcripts will be provided by NIST:

   1. Medium Error Baseline Recognizer - ~35% Word Error Rate transcript
   2. High Error Baseline Recognizer   - ~50% Word Error Rate transcript

The NIST recognizer used to produce baselines 1 and 2 is an instantiation of the SPHINX-3 recognition engine kindly provided by CMU. NIST will also provide recognizer transcripts submitted for sharing by Full SDR participants. If time and resources permit, NIST will also produce an "optimized" recognizer transcript using its ROVER software over the shared recognizer transcripts.

5. Baseline Retrieval Engine
----------------------------
Sites in the speech community without access to retrieval engines may use the NIST ZPRISE retrieval engine. See http://www-nlpir.nist.gov/prise/prise.html

6. Spoken Document Test Collection
----------------------------------
6.1 Collection Documents

The "documents" in the SDR Track are news stories taken from the Linguistic Data Consortium (LDC) Broadcast News (BN) corpora, which are also used for the DARPA CSR evaluations. A story is generally defined as a continuous stretch of news material with the same content or theme (e.g. tornado in the Caribbean, fraud scandal at Megabank), which will have been established by hand segmentation of the news programmes (at the LDC). However, there are two types of "stories" in the test collection: "reports", which are topical and content rich, and "filler", which are transitional segments.
Also note that some report-type stories, such as news summaries, may contain topically varying material. Also note that a story is likely to involve more than one speaker, background music, noise, etc.

6.2 Collection File Types

The test collection for the SDR Track will consist of digitized NIST SPHERE-formatted waveform (*.sph) files containing recordings of news broadcasts from various radio and television sources aired between June 1997 and January 1998, and several human-transcribed and recognizer-transcribed textual versions of the recordings. The collection has been licensed, recorded, and transcribed by the Linguistic Data Consortium (LDC). Note that this set of data is also being used as new Hub-4 training data this year.

NB. Since use of the collection is controlled by the LDC, all SDR participants must make arrangements directly with the LDC to obtain access to the collection. See below for contact info.

The test collection contains 114 waveform files. Each waveform filename consists of a basename identifying the broadcast and a .sph extension. The following auxiliary file types (with the same basename as the waveform file they correspond to) are also provided:

*.ndx - Spoken Document Index. Index file specifying document IDs and regions to recognize for the Full SDR retrieval condition.

*.dtt - Detailed TREC Transcript. These SGML transcripts are the original LDC Hub-4 transcripts with SDR document IDs added in the
Section tags. Note that these are to be used only for diagnostic purposes and will not be used in the evaluation.

*.ltt - Lexical TREC Transcript. This SGML format provides the textual version of the collection used in the Reference Retrieval condition and is considered to be "ground truth" for this track. The LTT format is a simplified derivation of the DTT format with non-lexical tags and special characters removed and is designed to simulate the output of a "perfect" word recognizer. Note that, unlike in Hub-4, no additional verification of the transcripts will be performed, so transcription errors are possible.

*.srt - Speech Recognizer Transcript. This SGML format is to be produced by one-best speech recognizers for the Full SDR retrieval condition and is the format to be submitted to NIST for pooling and scoring. This format will also be used for distribution of the Baseline recognizer output. A PERL script (srt2ltt.pl) will be provided to convert these into the LTT format for retrieval runs. NIST will also provide srt2ctm.pl and ctm2srt.pl scripts to allow conversion between the SRT format and the CTM format NIST uses to score recognizer output.

The LDC will provide the SPH files and DTT and LTT transcripts to the SDR Track participants. NIST will provide the NDX files and Baseline and contributed SRT files directly to participants. Information will be provided to the SDR mail list regarding how to obtain these files.

6.3 Story Boundaries

To eliminate complexity, the temporal boundary of news stories in the collection will be "known" in all test conditions. This information is given in the Index (*.ndx) files (as well as in all the other text filetypes) in the SGML Section
tags which specify the document IDs and start and end times for each story within the collection. Note that sections of the waveform files containing commercials and other "out of bounds" material which have not been transcribed by the LDC will be excluded from the SDR track. As such, these sections will be excluded from the NDX files. So, only the sections specified in the index files should be recognized and retrieved. Note that except for the time boundaries and Story IDs provided in the Section
tags, NO OTHER INFORMATION provided in the SGML tags may be used or indexed in any way for the test, including any classification or topic information which may be present.

7. SDR System Date
------------------
For all intents and purposes, the "current" recognition/retrieval date for this evaluation will be 1 May 1998. Therefore, no acoustic or textual materials broadcast or published after this date may be used in developing either the recognition or retrieval system component.

8. Development Test Data
------------------------
No Development Test data is specified or provided for the SDR track, although this year's training set may be split into the training/test sets used last year for development test purposes.

9. Speech Recognition Training/Model Generation
-----------------------------------------------
The first 100 hours of LDC Broadcast News data collected in 1996 are designated as the suggested training material for the 1998 SDR evaluation. This, however, does not preclude the use of other training materials as long as they conform to the restrictions listed later in this section. There is no designated supplementary textual data for SDR language model training. Language models developed for the 1997 Hub-4 evaluation may be employed in recognizing the SDR98 test collection. However, sites are encouraged to explore the development of more suitable language models for the SDR98 task as long as the materials used conform to the restrictions below.

1. No acoustic or transcription material for broadcast excerpts appearing in the SDR98 test collection may be used for acoustic or language model training. However, for manageability, this does not preclude use of excerpts from other sources which happen to be duplicated in the test material.

2. No acoustic or transcription material from radio or television news sources broadcast after 05/31/97 may be used for acoustic or language model training. This should effectively prevent the possibility of "cross-contamination" of training/test material and eliminates the duplication problem in 1.

3. Any other acoustic or textual data not excluded in 2, such as newswire texts, Web articles, etc., published prior to 05/01/98 may be used. This permits the creative development of up-to-date language models for the SDR task.

Sites are requested to report the training materials they employed in their site reports and TREC papers. Sites may use acoustic or textual materials which are not publicly available in 1998. However, sites using these materials MUST make arrangements to make these materials available at reasonable cost by 1 March 1999 for use by all sites in the next SDR evaluation. Of course, earlier availability of these materials is encouraged. Note that no special arrangements need to be made for materials which are publicly available at the time of the evaluation such as Newswire texts and Web documents.

10. Retrieval Training, Indexing, and Query Generation
------------------------------------------------------
For retrieval, the SDR track is an automatic ad hoc task. This means both that any collection-wide statistics may be used in indexing and that the retrieval system may NOT be tuned using the test topics. Participants may not use statistics generated from the reference transcript collection when indexing or retrieving from the baseline or recognizer transcript collections. As in recognition training, any auxiliary IR training material or auxiliary data structures such as thesauri that are used must predate the May 1, 1998 recognition/retrieval date.
Likewise, any IR training material which is derived from spoken broadcast sources (transcripts) must predate the test collection (prior to May-31-97).

A set of 5 training topics will be released with the other training data. These topics have a statement of relevance criteria and relevance judgements (i.e., relevant document lists) produced by doing PRISE searches on the first 50 hours (training set 1) of the training collection only. The judgements may therefore be incomplete.

For the SDR Track, all sites are required to implement fully automatic retrieval. Therefore, sites may not perform manual query generation in implementing retrieval for their submitted results. Participants are, of course, free to perform whatever side experiments they like and report these at TREC as contrasts.

11. SDR Participation Conditions and Levels
-------------------------------------------
Interested sites are requested to register for the SDR Track as soon as possible so that they are added to the SDR mailing list. Registration in this case merely indicates your interest and does not imply a commitment to participate in the track. Since this is a TREC track, participants are subject to the TREC conditions for participation, including signing licensing agreements for the data. Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but specific advertising claims based on TREC results are forbidden. The conference held in November is open only to participating groups that submit results and to government sponsors. (Signed-up participants should have received more detailed guidelines.)

Participants implementing Full SDR (speech recognition and retrieval) are exempted from participation in the central TREC Adhoc Task. However, Quasi-SDR (retrieval only) participants must also implement the Adhoc Task. Participants must implement either Full SDR or Quasi-SDR retrieval as defined below. Note that sites may not participate by simply pipelining the baseline recognizer transcripts and baseline retrieval engine. Participants should implement at least one of the two major system components. As in TREC-6, sites with speech recognition expertise and sites with retrieval expertise are encouraged to team up to implement Full SDR.

The 1998 SDR Track has two participation levels and several retrieval conditions as detailed below. Given the large number of conditions this year, sites are permitted to submit only 1 run per condition.

Participation Levels:

1. Full SDR
   Required Retrieval Runs: S1, B1, B2, R1 (see below)
   Sites choosing to participate in Full SDR must produce a ranked document list for each test topic from the recorded audio waveforms. This participation level requires the implementation of both speech recognition and retrieval. In addition, Full SDR participants must implement the Baseline and Reference retrieval conditions. Participants may submit an optional second Full SDR run using an alternate recognizer (see below for requirements). Participants may also submit optional Cross-Recognizer runs as described below.

2. Quasi-SDR
   Required Retrieval Runs: B1, B2, R1 (see below)
   Sites without access to speech recognition technology may participate in the "Quasi-SDR" subset of the test by implementing retrieval on provided recognizer-produced transcripts. In addition, Quasi-SDR participants must implement the Reference retrieval condition. Participants may submit optional Cross-Recognizer runs as described below.
12. Evaluation Retrieval Conditions
-----------------------------------
The following are the retrieval conditions for the SDR Track. Note that some retrieval conditions are required and others are optional.

SPEECH (S1,S2): (S1 is Required for Full SDR Participants)
Systems process audio input (*.sph) specified in the test indices (*.ndx) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors. If sites use 1-best recognition, they are encouraged to submit their recognition output (*.srt) to NIST for scoring and sharing with other participants for cross-component retrieval runs. If sites use other forms of recognition such as lattices, phone strings, etc., they are encouraged to report their approach, including an analysis of recognition performance, in their TREC papers. Sites may perform an optional second run with an alternate recognizer (S2). NIST encourages sites to submit their S1 recognizer transcripts in time for NIST to redistribute them to other participants for the Cross-Recognizer retrieval condition (see Section 18, Schedule). The optional retrieval run with a secondary recognizer (S2) has been added to permit sites to continue development of both their recognition and retrieval algorithms up to the due date for retrieval results while still permitting them to share their S1 transcripts for the Cross-Recognizer condition. Therefore, sites do not need to submit their S2 recognizer transcripts until the final retrieval due date. Note, however, that sites are encouraged to submit both their S1 and S2 recognizer transcripts at their earliest convenience to facilitate their use in experiments by the other participants.

BASELINE (B1,B2): (Required for all Participants)
Systems process provided pre-recognized transcripts of audio (*.srt) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors. This condition provides a control condition using a "standard" fixed recognizer. It also provides recognizer data for sites without access to recognition technology who wish to participate in the Quasi-SDR subset of the Track. Two BASELINE conditions are anticipated:
   Medium Error (B1): NIST-provided *.srt files with ~35% Word Error Rate
   High Error (B2):   NIST-provided *.srt files with ~50% Word Error Rate
(Although the estimated Word Error Rates are given here, they should not be used in tuning retrieval unless they can be automatically estimated without using the reference transcripts.)

REFERENCE (R1): (Required for all Participants)
Systems process human-generated transcripts of audio (*.ltt) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors. This condition provides a control condition using "perfect" recognition. Note that the LTT files have been filtered to remove any material outside the evaluation per the NDX files.

CROSS-RECOGNIZER (CR-ROVER, CR-CMU-S1, CR-IBM-S1, ...): (Optional)
Systems process other participants' S1 recognizer transcripts (*.srt) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors.
This condition provides a cross-component control condition and provides sites with access to additional recognizer transcripts. Note that the shared SRTs will be filtered to remove any material outside the evaluation per the NDX files. In addition to providing the shared recognizer transcripts, NIST may produce a set of "optimized" low-error-rate recognizer transcripts using its ROVER software with all the submitted S1 recognizer transcripts.

Participants MUST use the SAME retrieval strategy for all conditions (that is, term weighting method, stop word list, use of phrases, retrieval model, etc. must remain constant). Sites implementing S1 and/or S2 using non-word-based recognition (phone, word-spotting, lattice, etc.) should use the closest retrieval strategy possible across conditions. Sites may not use Word Error Rate or other measures generated by scoring the recognizer transcripts against the reference transcripts to tune their retrieval algorithms in the S1, S2, B1, B2, and CR retrieval conditions (all conditions where recognized transcripts are used). The reference transcripts may not be used in any form for any retrieval condition except, of course, for R1.

13. Topics (Queries)
--------------------
The TREC-7 SDR Track will have 23* topics (queries) constructed by the NIST assessors. Each topic will consist of a concise 1-sentence string. For SDR, the search topics must be processed automatically, without any manual intervention. Note that participants are welcome to run their own manually-assisted contrastive runs and report on these at TREC. However, these will not be scored or reported by NIST.

* Because of the limited size, scope, and temporal distribution of the test collection, the assessors were unable to develop the targeted 25 topics. The small reduction should not adversely affect the evaluation.

The topics will be supplied in written form. Given the number of retrieval conditions this year, spoken versions of the queries will not be included as part of the test set. However, participants are welcome to run their own contrastive spoken input tests and report on these at TREC.

14. Relevance Assessments
-------------------------
Relevance assessments for the SDR Track will be provided by the NIST assessors. As in the Adhoc Task, the top 100-ranked documents for each topic from each system for the Reference Condition (R1) will be pooled and evaluated for relevance. If time and resources permit, additional documents from other retrieval conditions may be added to the pool as well. Note that this approach is employed to make the assessment task manageable, but may not cover all documents that are relevant to the topics. A minimal sketch of this pooling procedure is given after Section 15.

15. Retrieval (Indexing and Searching) Constraints
--------------------------------------------------
Since the focus of the SDR Track is on the automatic retrieval of spoken documents, manual indexing of documents, manual construction or modification of search topics, and manual relevance feedback may not be used in implementing retrieval runs for scoring by NIST. All submitted retrieval runs must be fully automatic. Note that fully automatic "blind" feedback and similar techniques are permissible, and manually-produced reference data such as dictionaries and thesauri may be employed. Participants are free to perform internal experiments with manual intervention and report on these at TREC.
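To make the pooling procedure of Section 14 concrete, the following is a minimal, unofficial sketch (it is not NIST's assessment tooling). It assumes run files in the retrieval submission format of Section 16.1; the run filenames and pool depth are illustrative only.

   # Unofficial sketch of Section 14's pooling: union of each run's top-100
   # documents per topic, read from run files in the Section 16.1 format.
   from collections import defaultdict

   def read_run(path):
       """Return {topic_id: [(score, story_id), ...]} from a .ret file."""
       runs = defaultdict(list)
       with open(path) as f:
           for line in f:
               fields = line.split()
               if len(fields) != 6:
                   continue                    # skip malformed lines
               topic, _q0, story, _rank, score, _runid = fields
               runs[topic].append((float(score), story))
       return runs

   def build_pool(run_paths, depth=100):
       """Pool the top-`depth` stories per topic across all runs."""
       pool = defaultdict(set)
       for path in run_paths:
           for topic, scored in read_run(path).items():
               # Rank by the retrieval score (field 5), highest first, as
               # trec_eval does; the site-given ranks are ignored.
               scored.sort(key=lambda pair: pair[0], reverse=True)
               pool[topic].update(story for _score, story in scored[:depth])
       return pool

   if __name__ == "__main__":
       pool = build_pool(["eth-r1.ret", "cmu-r1.ret"])   # example filenames
       for topic in sorted(pool):
           print(topic, len(pool[topic]), "documents to assess")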
16. Submission Formats
----------------------
In order for NIST to automatically log and process all of the many submissions which we expect to receive for this track, participants MUST ensure that their retrieval and recognition submissions meet the following filename and content specs. Incorrectly formatted files will be rejected by NIST.

16.1 Retrieval Submission Format
--------------------------------
For retrieval, each submission must have a filename of the following form:

   SITE_ID-CONDITION[-RECOGNIZER].ret

where,

   SITE_ID is a brief but informative lowercase string containing no whitespace, hyphens, or periods which uniquely identifies your site. For team efforts, the SITE_ID should identify the retrieval site only unless you have a team name. The same SITE_ID must be used in all retrieval submissions from your site or team. (e.g., att, city, clarit, cmu, cu, dublin, eth, glasgow, ibm, nsa, rmit, shef, umass, umd)

   CONDITION is a lowercase identifier for the retrieval condition used. (e.g., r1, b1, b2, s1, s2, cr)

   RECOGNIZER is a lowercase string containing no whitespace, hyphens, or periods which provides a unique identifier for the recognizer transcript set used in the Full SDR (s1 and s2) and Cross-Recognizer (cr) retrieval conditions. The number at the end of the RECOGNIZER string should correspond to the Full SDR condition (s1 or s2) for which the recognizer was originally used. Note that RECOGNIZER should NOT be listed for the Baseline Retrieval conditions (b1/b2). e.g., ibm1, cmu1, shef2, rover1, etc.

The following are some example retrieval submission filenames:

   eth-r1.ret      (ETH retrieval using reference transcripts)
   cmu-b1.ret      (CMU retrieval using Baseline 1 recognizer)
   shef-b2.ret     (Sheffield retrieval using Baseline 2 recognizer)
   att-s1.ret      (AT&T retrieval using AT&T 1 recognizer)
   ibm-cr-att1.ret (IBM retrieval using AT&T 1 recognizer)

As in TREC-6, the output of a retrieval run is a ranked list of story (document) ids. These will be submitted to NIST for scoring using the standard TREC submission format (a space-delimited table):

   23 Q0 k960913.4 1 4238 ibm-cr-att1
   23 Q0 k960825.2 2 4223 ibm-cr-att1
   23 Q0 k960514.7 3 4207 ibm-cr-att1
   23 Q0 k961012.1 4 4194 ibm-cr-att1
   23 Q0 k960913.3 5 4189 ibm-cr-att1
   etc.

   Field  Content
   1      Topic ID
   2      Currently unused (must be "Q0")
   3      Story ID of retrieved document
   4      Document rank
   5      *Retrieval system score (INT or FP) which generated the rank
   6      Site/Run ID (should be same as file basename)

*Note that field 5 MUST be in descending order so that ties may be handled properly. This number (not the rank) will be used to rank the documents prior to scoring. The site-given ranks will be ignored by the 'trec_eval' scoring software.

Participants should submit such a list with not more than 1000 documents for each topic. The protocol for submission of retrieval results to NIST for scoring was not yet ready at the time this document was released. It will be made available in later email.
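To illustrate the submission format above, here is a minimal, unofficial sketch that writes one topic's results as .ret lines. It is not a NIST tool; the run ID, output filename, and example results are placeholders, and it simply enforces the checks described above ("Q0" in field 2, scores in descending order, at most 1000 documents per topic).

   # Unofficial sketch: write one topic's results in the Section 16.1 format.
   # The run ID, output filename, and example results are placeholders.

   def write_ret_lines(out, topic_id, results, run_id, max_docs=1000):
       """results: iterable of (story_id, score); higher score = better."""
       ranked = sorted(results, key=lambda r: r[1], reverse=True)[:max_docs]
       for rank, (story_id, score) in enumerate(ranked, start=1):
           # Fields: Topic ID, "Q0", Story ID, rank, score, Site/Run ID
           out.write("%s Q0 %s %d %g %s\n" % (topic_id, story_id, rank, score, run_id))

   if __name__ == "__main__":
       example = [("k960913.4", 4238), ("k960825.2", 4223), ("k960514.7", 4207)]
       with open("ibm-cr-att1.ret", "w") as f:    # filename per the spec above
           write_ret_lines(f, "23", example, "ibm-cr-att1")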
16.2 Recognition Submission Format
----------------------------------
As in last year's SDR Track, only 1-Best-algorithm recognizer transcripts will be accepted by NIST for scoring. This year, with the addition of the Cross-Recognizer condition, these will also be shared across sites. Sites performing Full SDR without a 1-Best recognizer are encouraged to self-evaluate their recognizer in their TREC paper.

Since the concept of recognizer transcript sharing for Cross-Recognizer Retrieval experiments appeared to be broadly accepted last year, NIST will assume that all submitted recognizer transcripts are to be scored and made available to other participants for Cross-Recognizer Retrieval. If you would like to submit your recognizer transcripts for scoring, but do NOT want them shared, you must notify NIST (cedric.auzanne@nist.gov) of the system/run to exclude from sharing PRIOR to submission. Note that we will be turning around the submissions very quickly as they are received and making them available for sharing.

Submitted 1-Best recognizer transcripts must be formatted as follows. Each recognizer transcript must have a filename of the following form:

   EPISODE.srt

where,

   EPISODE is the ID (basename) for the corresponding NDX and SPHERE file recognized (e.g., ea980107).

Each submission must contain 114 of these files - one per NDX/SPHERE file. The 114 SRT files should be placed in a directory with the following name:

   RECOGNIZER-RUN

where,

   RECOGNIZER is a brief but informative lowercase string containing no whitespace, hyphens, or periods which identifies the source of the recognizer (e.g., att, cmu, dragon, ibm, etc.)

   RUN identifies the recognizer used (s1 or s2)

Submit your SRT files as follows: a gnu-zipped tar archive of the above directory should then be created (e.g., att-s1.tgz) using the -cvzf options in GNU tar. This file can now be submitted to NIST for scoring/sharing via anonymous ftp to jaguar.ncsl.nist.gov using your email address as the password. Once you are logged in, cd to the "incoming/sdr98" directory. Set your mode to binary and "put" the file. This is a "blind" directory, so you will not be able to "ls" your file. Once you have uploaded the file, send email to cedric.auzanne@nist.gov to indicate that a file is waiting. He will send you a confirmation after the file is successfully extracted and email again later with your SCLITE scores. To keep things simple and file sizes down, please submit separate runs (s1 and s2) in separate tgz files.

The output of a 1-Best recognizer is the standard SGML SDR Speech Recognizer Transcription (SRT) format of the form:
   [example SRT markup: an Episode tag, interleaved Section tags from the corresponding .ndx file, and one time-stamped word tag per recognized word; see Appendix A]
Each SRT file must contain the Episode tag and properly interleaved Section tags from the corresponding .ndx file which specify the source and story boundaries. Recognized material temporally outside of the given Section
tags should NOT be included so that it will not corrupt retrieval. Each word tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word. The words are to be case-insensitive in Lexical Standard Normal Orthographic Representation (Lexical-SNOR) form. However, lower case is suggested for convenience. See Appendix A for an example.

NOTE: ALL SRT files received will be automatically filtered to remove all words with time boundaries occurring outside test material designated in the NDX files. Please recognize AND retrieve from ONLY the material specified in the NDX files so that your results are comparable with others.

17. Scoring
-----------
17.1 Retrieval Scoring

The TREC-7 SDR Track retrieval performance will be scored using the NIST "trec_eval" Precision/Recall scoring software. A "shar" file containing the trec_eval software is available via anonymous ftp from the following URL:

   ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar

Since pooled relevance assessments (as in the TREC Adhoc Task) are to be employed, the Known-Item Retrieval measures used in the TREC-6 SDR Track will not be used. For TREC-7 SDR, the retrieval measures will include:

   Precision at standard document rank cutoff levels,
   single number Average Precision over all relevant documents, and
   single number R-Precision, the precision after R documents have been retrieved, where R is the number of relevant documents for the topic.

These measures are defined in Appendices to TREC Proceedings, and may also be found on the TREC Website at http://trec.nist.gov. (An illustrative sketch of the Average Precision and R-Precision computations appears at the end of this section.)

17.2 Speech Recognition Scoring

TREC-7 Full SDR participants who use 1-best recognition are encouraged to submit their recognizer transcripts in SRT form for scoring by NIST. NIST will employ its "sclite" Speech Recognition Scoring Software to benchmark the Story Word Error Rate for each submission. These scores will be used to examine the relationship between recognition error rate and retrieval performance. Note that to ensure consistency among all forms of the evaluation collection, all SRTs received will be filtered to remove any speech outside the evaluation per the corresponding NDX files. The NIST SCLITE Scoring software is available via the following URL: http://www.nist.gov/speech/software.htm. This page contains an ftp-able link to SCTK, the NIST Speech Recognition Scoring Toolkit, which contains SCLITE.

NIST will provide the following additional scripts to permit useful transformations of the SDR speech recognizer transcripts:

   srt2ctm.pl - convert SRT format to SCLITE CTM (for SR scoring)
   srt2ltt.pl - convert SRT format to LTT format (for retrieval)
   ctm2srt.pl - convert SCLITE CTM format to SRT format

********************* N O T E *********************
Since unverified reference transcripts are used in the SDR Track, the SDR Word Error Rates should not be directly compared to those for Hub-4 evaluations, which use carefully checked/corrected annotations and special orthographic mapping files.
****************************************************

If NIST has sufficient time, it may also apply experimental "content word" recognition scoring measures to explore approaches for modelling the effect of key content word errors on retrieval performance.
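For readers unfamiliar with these measures, the following is a small, unofficial sketch of Average Precision and R-Precision for a single topic; trec_eval remains the definitive implementation, and the ranked list and relevant-document set below are placeholders.

   # Unofficial illustration of the Section 17.1 retrieval measures for one
   # topic; trec_eval is the definitive scorer.

   def average_precision(ranked_ids, relevant):
       """Mean of the precision values at the rank of each relevant document."""
       hits, precisions = 0, []
       for rank, doc_id in enumerate(ranked_ids, start=1):
           if doc_id in relevant:
               hits += 1
               precisions.append(hits / float(rank))
       return sum(precisions) / len(relevant) if relevant else 0.0

   def r_precision(ranked_ids, relevant):
       """Precision after R documents are retrieved, R = number of relevant docs."""
       r = len(relevant)
       retrieved = ranked_ids[:r]
       return len([d for d in retrieved if d in relevant]) / float(r) if r else 0.0

   if __name__ == "__main__":
       ranked = ["k960913.4", "k960825.2", "k960514.7", "k961012.1"]  # placeholder run
       relevant = {"k960825.2", "k961012.1"}                          # placeholder judgements
       print("AvgP = %.3f  R-Prec = %.3f" % (
           average_precision(ranked, relevant), r_precision(ranked, relevant)))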
18. Schedule
------------
   Site registration                                    : ASAP
   Training data                                        : 01 May 1998
      100 hours TREC-6 training documents (SPH, DTT, LTT, NDX)
      5 training topics
   Test data:
      Speech Waveform files (SPH)                       : Apr (from LDC)
      NDX (Test Indices with story bounds)              : 07 May 1998
      DTT Detailed transcriptions                       : 01 Jul 1998
      LTT Reference transcriptions                      : 01 Jul 1998
      SRT Baseline transcriptions                       : 01 Jul 1998
      Topics                                            : 01 Jul 1998
      Participant-shared recognizer transcriptions (SRT): 20 Jul 1998
   Search results to NIST                               : 30 Aug 1998
   Relevance judgements released by NIST                : 05 Oct 1998
   Scored Retrieval Results released by NIST            : 07 Oct 1998
   Conference workbook papers to NIST                   : 21 Oct 1998 (tentative)
   TREC-7 Conference                                    : 9-11 Nov 1998

19. Data Licensing and Costs
----------------------------
Participants must make arrangements with the Linguistic Data Consortium to obtain use of the speech waveforms and transcriptions used in the SDR Track.

20. Reporting Conventions
-------------------------
Participants are asked to give full details in their Workbook/Proceedings papers of the resources used at each stage of processing, as well as details of their SR and IR methods. Participants not using 1-best recognizers for Full SDR should also provide an appropriate analysis of the performance of the recognition algorithm they used and its effect on retrieval.

============================================================================
APPENDIX A: SDR Corpus File Formats
============================================================================

Note: All transcription files are SGML-tagged.

--------------------------------------------------------------------------
.sph - SPHERE waveform: SPHERE-formatted digitized recording of a broadcast, used as input to speech recognition systems. Waveform format is 16-bit linear PCM, 16 kHz sample rate, MSB/LSB byte order.

   NIST_1A
      1024
   sample_count -i 27444801
   sample_rate -i 16000
   channel_count -i 1
   sample_byte_format -s2 10
   sample_n_bytes -i 2
   sample_coding -s3 pcm
   sample_min -i -27065
   sample_max -i 27159
   sample_checksum -i 31575
   database_id -s7 Hub4_96
   broadcast_id NPR_MKP_960913_1830_1900
   sample_sig_bits -i 16
   end_head
   (digitized 16-bit waveform follows header)
   . . .

--------------------------------------------------------------------------
.dtt - Detailed TREC Transcription: LDC-produced Broadcast News transcription with absolute Section (story) IDs added. This format is used only for the 1997 100-hour data set now designated as training. The LDC has produced the 100-hour data set used as the 1998 test collection in a new format (DTT2). See below for the .dtt2 format. This form of the data is for archive/research purposes only and is not used in the evaluation.
   . . .
it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
agricultural products giant archer daniels midland is often described as politically well connected {breath} any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed {breath} . . . . . .
. . .
--------------------------------------------------------------------------
.dtt2 - Detailed TREC Transcription 2: Revised LDC-produced Broadcast News transcription with absolute Section (story) IDs added. This new format is used only for the 1998 100-hour data set now designated as the 1998 SDR test collection. This form of the data is for archive/research purposes only and is not used in the evaluation.
. . .
. . .
--------------------------------------------------------------------------
.ltt - Lexical TREC Transcription: Detailed TREC Transcription with all SGML tags removed except for Episode and Section. "Out-of-bounds" and non-transcribed Sections are excluded. This format is used as the source for the Reference Retrieval condition and as the reference for speech recognition scoring. Note that both the .dtt and .dtt2 formats will be filtered by NIST into this single .ltt form.
   . . .
it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
agricultural products giant archer daniels midland is often described as politically well connected any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed . . .
. . .
--------------------------------------------------------------------------
.ndx - Index: Specifies the regions in the waveform to be recognized and establishes story boundaries and IDs. Similar to the LTT format without text. "Out-of-bounds" and non-transcribed Sections are excluded.
--------------------------------------------------------------------------
.srt - Speech Recogniser Transcription (contrived example): Output of a speech recogniser for a .sph recorded waveform file which will be used as input for retrieval. Each file must contain an Episode tag and properly interleaved Section tags taken from the corresponding .ndx file. Each word tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word.
   . . .
his friday'S september thirteenth i'm david bran cat she toe here'S some of what'S happening in business in the world
agricultural produce giant archer daniels middle is often described as politically well connected any connections not wash standing the fed rail government is pursuing a pro into weather the company conspired two affect the price of the key altitude for live stop feet . . .
. . .
==========================================================================
APPENDIX B: SDR Corpus Filters
==========================================================================

srt2ltt.pl - This filter transforms the Speech Recognizer Transcription (SRT) format with word times into the Lexical TREC Transcription (LTT) form. The resulting simplified form of the speech recogniser transcription can be used for retrieval if word times are not desired.
   . . .
HIS FRIDAY'S SEPTEMBER THIRTEENTH I'M DAVID BRAN CAT SHE TOE HERE'S SOME OF WHAT'S HAPPENING IN BUSINESS IN THE WORLD
AGRICULTURAL PRODUCE GIANT ARCHER DANIELS MIDDLE IS OFTEN DESCRIBED AS POLITICALLY WELL CONNECTED ANY CONNECTIONS NOT WASH STANDING THE FED RAIL GOVERNMENT IS PURSUING A PRO INTO WEATHER THE COMPANY CONSPIRED TWO AFFECT THE PRICE OF THE KEY ALTITUDE FOR LIVE STOP FEET . . .
. . .
--------------------------------------------------------------------------
srt2ctm.pl - This filter transforms the Speech Recognizer Transcription (SRT) format into the CTM format used by the NIST SCLITE Speech Recognition Scoring Software.
--------------------------------------------------------------------------
ctm2srt.pl - This filter, together with the corresponding SDR .ndx file, transforms the CTM format used by the NIST SCLITE Speech Recognition Scoring Software into the SDR Speech Recognizer Transcription (SRT) format.
--------------------------------------------------------------------------
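In the same spirit as the filters above, the following is a small, unofficial sketch of the time-boundary filtering that NIST describes in Sections 16.2 and 17.2 (removing recognized words whose time boundaries fall outside the material designated in the NDX files). It does not parse the actual SGML formats; story boundaries and word hypotheses are assumed to have already been read into simple tuples.

   # Unofficial sketch of NDX time-boundary filtering (Sections 16.2 and 17.2):
   # keep only word hypotheses that fall inside a designated story region.
   # SGML parsing is not shown; inputs are assumed to be plain tuples.

   def filter_words(stories, words):
       """stories: list of (story_id, s_time, e_time);
       words: list of (s_time, e_time, word).
       Returns {story_id: [word, ...]} with out-of-bounds words dropped."""
       kept = {story_id: [] for story_id, _s, _e in stories}
       for w_start, w_end, word in words:
           for story_id, s_start, s_end in stories:
               if w_start >= s_start and w_end <= s_end:
                   kept[story_id].append(word)
                   break      # each word belongs to at most one story
       return kept

   if __name__ == "__main__":
       stories = [("ea980107.1", 50.00, 147.80)]        # example story boundary
       words = [(49.10, 49.60, "commercial"),           # outside: dropped
                (50.25, 50.60, "agricultural"),         # inside: kept
                (50.61, 51.02, "products")]
       print(filter_words(stories, words))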