<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>1999 TREC-8 Spoken Document Retrieval (SDR) Track Evaluation Specification.</title>
</head>

<body>
<h1>1999 TREC-8 Spoken Document Retrieval (SDR) Track 
Evaluation Specification.</h1>

<br>
<p>
<i>Updated: 20 July 1999
<br>Version: 1.2 [HTML 1.3]</i></p>

<p><a href="../pages/changes.htm">Update History</a>
</p>
A simple text version can be found <a href="sdr99_spec.txt">here</a>.<br>

<b>John Garofolo, Cedric Auzanne, Ellen Voorhees, Karen Sparck Jones</b>
<br>

<p>This is the specification for the implementation of the TREC-8 Spoken Document 
  Retrieval (SDR) Track. Other associated documentation regarding the TREC-8 
  SDR Track is available from the SDR Website.</p>

<p>For information regarding other TREC-8 tracks, see the TREC Website at
<a href="http://trec.nist.gov">http://trec.nist.gov</a>
</p>

<h2><a name="CONTENTS">Contents</a></h2>
<hr>
<ol type="1">
<li> <a href="#SECTION1">Background from TREC-7</a>
</li><li> <a href="#SECTION2">What's New and Different</a>
</li><li> <a href="#SECTION3">TREC-8 SDR Track in a Nutshell</a>

</li><li> <a href="#SECTION4">Baseline Speech Recognizer</a>
</li><li> <a href="#SECTION5">Baseline Retrieval Engine</a>
</li><li> <a href="#SECTION6">Spoken Document Test Collection</a>
    <ol type="1">
	<li> <a href="#SECTION6-1">Collection Documents</a>
    	</li><li> <a href="#SECTION6-2">Collection File Types</a>

    	</li><li> <a href="#SECTION6-3">Story Boundaries Conditions</a>
	<ol type="1">
		<li> <a href="#SECTION6-3-1">Known Story Boundaries</a>
    		</li><li> <a href="#SECTION6-3-2">Unknown Story Boundaries</a>
	</li></ol>
    </li></ol>

</li><li> <a href="#SECTION7">SDR System Date</a>
</li><li> <a href="#SECTION8">Development Test Data</a>
</li><li> <a href="#SECTION9">Speech Recognition Training/Model Generation</a>
</li><li> <a href="#SECTION10">Retrieval Training, Indexing, and Query Generation</a>
</li><li> <a href="#SECTION11">SDR Participation Conditions and Levels</a>
</li><li> <a href="#SECTION12">Evaluation Retrieval Conditions</a>

</li><li> <a href="#SECTION13">Topics (Queries)</a>
</li><li> <a href="#SECTION14">Relevance Assessments</a>
</li><li> <a href="#SECTION15">Retrieval (Indexing and Searching) Constraints</a>
</li><li> <a href="#SECTION16">Submission Formats</a>
    <ol type="1">
	<li> <a href="#SECTION16-1">Retrieval Submission Format</a>

    	</li><li> <a href="#SECTION16-2">Recognition Submission Format</a>
    </li></ol>
</li><li> <a href="#SECTION17">Scoring</a>
    <ol type="1">
	<li> <a href="#SECTION17-1">Retrieval Scoring</a>
    	</li><li> <a href="#SECTION17-2">Speech Recognition Scoring</a>

    </li></ol>
</li><li> <a href="#SECTION19">Data Licensing and Costs</a> 
</li><li> <a href="#SECTION20">Reporting Conventions</a>
</li></ol>

<p><a href="#APPENDIXA">Appendix A: SDR Corpus File Formats</a></p>

<p><a href="#APPENDIXB">Appendix B: SDR Corpus Filters</a></p>

<br>
<br>
<hr>

<p>
</p><ol>
  <h2> 
    <li><a name="SECTION1">Background from TREC-7</a>
  </li></h2>
  <hr>

  <p>The 1998 TREC-7 SDR evaluation met its goals in bringing the information 
    retrieval (IR) and speech recognition (SR) communities together to implement 
    an ad-hoc-style evaluation of Spoken Document Retrieval technology using an 
    87-hour broadcast news collection. The TREC-7 SDR evaluation taught us that 
    there is a direct relationship between recognition errors and retrieval accuracy 
    and that certain document expansion approaches applied to moderately accurate 
    recognized transcripts produce results nearly comparable to those obtained 
    with perfect human-generated transcripts. However, the TREC-7 
    2,866-story collection was still quite small by IR standards. As such, it 
    is impossible to draw conclusions about the effectiveness of the technology 
    for realistically large collections. </p>
  <p>In TREC-8, we will investigate how the technology scales for a much larger 
    broadcast news collection. We will also permit participants to explore one 
    of the challenges in real spoken document retrieval implementation - retrieval 
    of excerpts with unknown story boundaries. </p>
  <p>Further details regarding the TREC-6 SDR Track can be obtained from 
     the TREC-6 Proceedings published by NIST, and the Proceedings 
    of the DARPA Broadcast News Transcription and Understanding Workshop, February 
    8-11, 1998. </p>
  <p>Further details regarding the TREC-7 SDR Track can be obtained from
    the TREC-7 Proceedings published by NIST, and the Proceedings 
    of the DARPA Broadcast News Transcription and Understanding Workshop, February 
    28 - March 3, 1999. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION2">What's New and Different</a>
  </li></h2>
  <hr>
  <h3>Larger Test Collection:</h3>
  <p>The 1999 TREC-8 SDR test collection will be approximately six times the size 
    of the 1998 TREC-7 SDR collection: 550+ hours of speech and approximately 
    21,500 news stories. </p>
  <h3>Unknown Boundaries Spoke:</h3>

  <p>A new optional condition will be included in which the story boundaries in 
    the collection are unknown. To support this, systems must implement both the 
    recognition and retrieval components of the task without access to the reference 
    story boundaries. A version of the baseline recognizer transcripts will be 
    provided without embedded story boundaries to support participation in this 
    condition by Quasi-SDR participants. The new condition will require that retrieval 
    systems output times rather than story IDs. Details regarding the implementation 
    and scoring of the new condition are provided in later sections of this document. 
  </p>
  <h3>Rolling Language Model Option:</h3>
  <p>Sites may implement their primary S1 and/or secondary S2 recognition run 
    using a "rolling" language model which adapts to newswire texts from previous 
    days. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION3">TREC-8 SDR Track in a Nutshell</a>

  </li></h2>
  <hr>
  <h3>Training Collection:</h3>
  <p>No particular training collection is specified or provided for this track. 
    All previous TREC SDR training and test materials may be used for training. 
    (A list of potential training materials will be given on the SDR Website.) In addition, 
    sites may make use of other training material as long as these materials are 
    publicly available and pre-date the test collection. </p>
  <h3>Test Collection:</h3>
  <p>~550+-hour TDT-2 corpus subset (audio and human/asr transcripts)</p>
  <h3>Participation:</h3>

  <ul type="disc">
    <li> Full-SDR: speech recognition + retrieval<br>
      (R1,B1,S1 retrieval conditions required) 
      <ul type="square">
        <li> TREC Adhoc Task not required<br>
          <ul type="circle">
            <li>Optional unknown boundaries spoke: B1U, S1U
          </li></ul>

        </li><li> Optional second recognition/retrieval run (S2/S2U)<br>
          <ul type="circle">
            <li>Optional Cross-Recognizer spokes, known and unknown story boundaries: 
              CR, (CRU optional)
          </li></ul>
      </li></ul>
    </li><li>Quasi-SDR: retrieval only using supplied recognizer transcripts<br>
      (R1,B1 retrieval conditions required) 
      <ul>

        <li> TREC Adhoc Task required 
        </li><li> Optional unknown boundaries spoke: B1U 
        </li><li> Optional Cross-Recognizer spokes, known and unknown 
          story boundaries: CR, (CRU optional) 
      </li></ul>
  </li></ul>
  <h3>Topics:</h3>
  50 ad-hoc topics; fully automatic processing required. 
  <h3>Retrieval Conditions:</h3>

  <ul type="disc">
    <li><b>R1</b>: Reference Retrieval using human-generated "perfect" transcripts 
    </li><li><b>B1</b>: Baseline Retrieval using provided recognizer transcripts 
    </li><li><b>B1U</b>: same as B1 without embedded story boundary tags 
    </li><li><b>S1</b>: Speech Retrieval using own recognizer 
    </li><li><b>S1U</b>: same as S1 without embedded story boundary tags 
    </li><li><b>S2</b>: Speech Retrieval using own secondary recognizer 
    </li><li><b>S2U</b>: same as S2 without embedded story boundary tags 
    </li><li><b>CR-&lt;SYS_NAME&gt;</b>: Cross-Recognizer Retrieval using other participants' 
      recognizer transcripts 
    </li><li><b>CRU-&lt;SYS_NAME&gt;</b>: Cross-Recognizer Retrieval without known 
      story boundaries 
  </li></ul>

  <h3>Recognition Language Models:</h3>
  <ul type="disc">
    <li><b>FLM</b>: Fixed language model/vocabulary predating test epoch 
    </li><li><b>RLM</b>: Rolling language model/vocabulary using daily newswire adaptation 
  </li></ul>
  <br>
  (Choice of the above LM mode is at the site's discretion.) 
  <h3>Recognition Modes:</h3>

  <ul type="disc">
    <li><b>SK</b>: Story boundaries known 
    </li><li><b>SU</b>: Story boundaries unknown (Required for "U" retrieval modes) 
  </li></ul>
  <h3>Primary Scoring Metrics: </h3>
  <ul type="disc">
    <li>Retrieval: Pooled relevance assessment, Mean Average Precision 
    </li><li>Speech Recognition: Word Error Rate/Mean Story Word Error Rate 
  </li></ul>

  <h3>Important Dates:</h3>
  <ul type="disc">
    <li>SDR Track Registration: <b>ASAP</b> 
    </li><li>Begin recognition (Full SDR): <b>03-MAY-1999</b> 
    </li><li>Begin R1, B*, S* retrieval (Full SDR and Quasi-SDR): <b>19-JUL-1999</b> 
    </li><li>Recognition transcripts due (for CR Condition): <b>9am EDT 16-AUG-1999</b> 
    </li><li>R1, B*, S* Retrieval due: <b>9am EDT 30-AUG-1999</b> 
    </li><li>Begin CR* Retrieval: <b>07-SEP-1999</b> 
    </li><li>CR* Retrieval due: <b>9am EDT 28-SEP-1999</b> 
    </li><li>TREC notebook papers due: <b>27-OCT-1999</b> (estimated) 
    </li><li>TREC-8: <b>November 17-19 1999</b> 
  </li></ul>

  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION4">Baseline Speech Recognizer</a>
  </li></h2>
  <hr>
  <p>As in TREC-7, Baseline recognizer transcripts will be provided for retrieval 
    sites who do not have access to recognition technology. Using these baseline 
    recognizer transcripts, sites without recognizers can participate in the "Quasi-SDR" 
    subset of the Track. </p>
  <p>Note that all sites (Full SDR and Quasi-SDR) will be required to implement 
    retrieval runs on the baseline recognizer transcripts. This will provide a 
    valuable "control" condition for retrieval. </p>

  <p>This year, one baseline recognizer transcript set will be provided by a NIST 
    instantiation of the Rough 'N Ready BYBLOS recognition engine kindly provided 
    by BBN. The acoustic model training for the baseline recognizer was limited 
    to the 1995 (Marketplace) and 1996/97 (Broadcast News) training sets released 
    by the Linguistic Data Consortium for use in Hub-4 speech recognition evaluations. 
    The language modeling training sources for the baseline recognizer are the 
    following: </p>
  <ul type="disc">
    <li>LDC LM training corpora: 
      <ul type="circle">
        <li>131M 1992-96 BN, official Hub-4 release 
        </li><li>254M 1988-94 WSJ 
        </li><li>45M 1994-95 North American News 
      </li></ul>
    </li><li>LDC acoustic training corpora: 
      <ul type="circle">
        <li>1M 1997 BN 80 hours acoustic training
      </li></ul>

    </li><li>PSM corpora: 
      <ul type="circle">
        <li>50M 1997 BN
      </li></ul>
  </li></ul>
  <p>Two versions of the baseline recognizer transcript set are provided: 
  </p><ul type="disc">

    <li><b>B1</b> - <a href="ftp://jaguar.ncsl.nist.gov/sdr99/SDR99-nist-b1.tgz">baseline 
      1</a> : srt format with story boundaries and non-news excluded 
    </li><li><b>B1U</b> - <a href="ftp://jaguar.ncsl.nist.gov/sdr99/SDR99-nist-b1U.tgz">baseline 
      1U</a> : srt format with whole shows and no story boundaries 
  </li></ul>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION5">Baseline Retrieval Engine</a>
  </li></h2>
  <hr>
  <p>Sites in the speech community without access to retrieval engines may use 
    the NIST ZPRISE retrieval engine. </p>
  <p>See <a href="http://www-nlpir.nist.gov/works/papers/zp2/zp2.html">http://www-nlpir.nist.gov/works/papers/zp2/zp2.html</a></p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION6">Spoken Document Test Collection</a>
  </li></h2>
  <hr>
  <p>The 1999 SDR collection is based on the broadcast news audio portion of the 
    TDT-2 News Corpus which was originally collected by the Linguistic Data Consortium 
    to support the DARPA Topic Detection and Tracking Evaluations. The corpus 
    contains recordings, transcriptions, and associated data for several radio 
    and television news sources broadcast daily between January and June 1998. 
    The 1999 SDR Track will use the February - June subset of the TDT-2 corpus 
    (January is excluded so as not to conflict with Hub-4 recognizers which have 
    been trained on overlapping material from January 1998). The SDR collection 
    consists of approximately 550 hours of recordings and contains approximately 
    21,500 news stories. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <ol>

    <h2> 
      <li><a name="SECTION6-1">Collection Documents</a>
    </li></h2>
    <p>The "documents" in the SDR Track are news stories taken from the Linguistic 
      Data Consortium (LDC) TDT-2 Broadcast News Corpus (February - June 1998 
      subset), which was also used in the 1998 DARPA TDT-2 evaluation. A story 
      is generally defined as a continuous stretch of news material with the same 
      content or theme (e.g. tornado in the Caribbean, fraud scandal at Megabank), 
      which will have been established by hand segmentation of the news programs 
      (at the LDC). </p>
    <p>There are two classifications of "stories" in the TDT-2 Corpus: "NEWS" 
      which are topical and content rich and "MISCELLANEOUS" which are transitional 
      filler or commercials. Only the "NEWS" stories will be included in the SDR 
      collection. Note that a news story is likely to involve more than one speaker, 
      background music, noise etc. </p>
    <p>The news stories comprise approximately 385 of the 550 hours of the corpus. 
      (Note that this information may be used for planning purposes, but not in 
      training story-boundaries-unknown systems.) </p>
    <p>The collection has been licensed, recorded and transcribed by the Linguistic 
      Data Consortium (LDC). Since use of the collection is controlled by the 
      LDC, all SDR participants must make arrangements directly with the LDC to 
      obtain access to the collection. See below for contact info. </p>

    <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
    <h2> 
      <li><a name="SECTION6-2">Collection File Types</a>
    </li></h2>
    <p>The test collection for the SDR Track consists of digitized NIST SPHERE-formatted 
      waveform (*.sph) files containing recordings of news broadcasts from various 
      radio and television sources aired between 01-February-1998 and 30-June-1998 
      and human-transcribed and recognizer-transcribed textual versions of the 
      recordings. </p>
    <p>The test collection contains approximately 900 SPHERE-formatted waveform 
      files of recordings of entire broadcasts. Each waveform filename consists 
      of a basename identifying the broadcast and a .sph extension. </p>

    <p>The file name format is as follows: 1998MMDD-&lt;STARTTIME&gt;-&lt;ENDTIME&gt;-&lt;NETWORKNAME&gt;-&lt;SHOWNAME&gt;.sph 
    </p>
    <p>E.g., the filename 19980107-0130-0200-CNN-HDL.sph indicates a recording 
      of CNN Headline News taped on January 7, 1998 between 1:30 AM and 2:00 AM. 
    </p>
    <p>The following auxiliary file types (with the same basename as the waveform 
      file they correspond to) are also provided: 
    </p><ul type="disc">

      <li>*.ndx - Spoken Document Index. Index file specifying document IDs and 
        regions (stories) to recognize for the Full SDR retrieval condition. 
      </li><li>*.ltt - Lexical TREC Transcript. This SGML format provides the textual 
        version of the collection used in the Reference Retrieval condition and 
        is considered to be "ground truth" for this track. The LTT format contains 
        a simplified form of the closed caption transcripts provided for the TDT-2 
        corpus in which non-lexical tags and special characters have been removed 
        to simulate the output of a near-perfect word recognizer. (Note that these 
        closed caption quality transcripts will be less accurate than the Hub-4 
        transcripts used in last year's SDR track.) 
      </li><li>*.srt - Speech Recognizer Transcript. This SGML format is to be produced 
        by one-best speech recognizers for the Full SDR retrieval condition and 
        is the format to be submitted to NIST for scoring and sharing in the Cross-Recognizer 
        condition. This format will also be used for distribution of the Baseline 
        recognizer transcript set with known story boundaries. 
    </li></ul>
    <p>Most of the filetypes in the collection will be provided by NIST through 
      the LDC unless otherwise specified in later email. </p>
    <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
    <h2> 
      <li><a name="SECTION6-3">Story Boundaries Conditions</a>

    </li></h2>
    <p>As in past years, story boundaries will be known for the primary Reference, 
      Baseline, Speech, and Cross Recognizer retrieval conditions. As such, all 
      systems will be required to implement the Story Boundaries Known condition 
      for their primary retrieval runs. However, this year an optional Story Boundaries 
      Unknown condition will also be supported for the Baseline, Speech, and Cross 
      Recognizer retrieval conditions. The specifications for the Known Story 
      Boundaries and new Unknown Story Boundaries conditions follow. </p>
    <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
    <ol>
      <h3> 
        <li><a name="SECTION6-3-1">Known Story Boundaries Condition (required)</a>
      </li></h3>

      <p>For this condition, as in last year's track, the temporal boundary of 
        news stories in the collection will be "known". Boundary times are given 
        in the SGML &lt;Section&gt; tags contained in the Index (*.ndx) files 
        as well as in the LTT and SRT transcript filetypes. The &lt;Section&gt; 
        tags specify the document IDs and start and end times for each story within 
        the collection. </p>
      <p>Note that sections of the waveform files containing commercials and other 
        "out of bounds" material which have not been transcribed by the LDC will 
        be excluded from retrieval in the Known Story Boundary Condition. The NDX 
        files for this condition will indicate the proper subset of the corpus 
        to be indexed and retrieved. </p>
      <p><u>Note:</u> Recognition systems developed for the Known Story Boundaries 
        condition may use the story boundary information in segmenting the recordings 
        and in skipping non-news segments. However, participants are encouraged 
        to implement recognition in conformance with the rules for the Unknown 
        Story Boundaries condition (recognize entire broadcast files and ignore 
        the story boundaries for the recognition portion of the task) so that 
        these transcripts can be used for both the Known and Unknown Story Boundaries 
        conditions. NIST will supply a script to create a filtered copy of whole-broadcast 
        recognized transcripts to add embedded story boundaries and remove non-news 
        material so that these can be used for the Known Story Boundaries condition. 
      </p>

      <p>Note that except for the time boundaries and Story IDs provided in the 
        &lt;Section&gt; tags, NO OTHER INFORMATION provided in the SGML tags may 
        be used or indexed in any way for the test including any classification 
        or topic information which may be present. </p>
      <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
      <h3> 
        <li><a name="SECTION6-3-2">Unknown Story Boundaries Condition (optional)</a>
      </li></h3>

      <p>This new condition is being implemented to investigate the retrieval 
        of excerpts where story boundaries are unknown. As such, no &lt;Section&gt; 
        tags will be given for use in this condition. Full-SDR participants in 
        this condition must recognize entire broadcast audio files. The object 
        of this task is for retrieval systems to emit a single time impulse 
        for each relevant story. As such, retrieval systems will emit time-based 
        IDs consisting of the broadcast ID plus a time. </p>
      <p>A Time ID consists of 2 fields separated by a colon (:) 
      </p><ul type="disc">
        <li>the show ID (for instance 19980104_1130_1200_CNN_HDL) 
        </li><li>a time stamp to hundredths of a second (for instance 13.45 is 13 seconds 
          and 45/100 of a second) 
      </li></ul>
      <p>(e.g., 19980104_1130_1200_CNN_HDL:13.45)</p>
      <p>In scoring, the TimeIDs will be mapped to Story IDs and duplicates will 
        be eliminated. (See the Retrieval Scoring section for more details on 
        the processing of this condition.) </p>

    </ol>
  </ol>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION7">SDR System Date</a>
  </li></h2>
  <hr>

  <p>This is an online recognition/retrospective retrieval task. As such, two 
    speech recognition modes are permitted - each with system date rules: 
  </p><ul type="disc">
    <li>For Fixed Language Model/Vocabulary (FLM) Systems: 31 January 1998 
    </li><li>For Rolling Language Model/Vocabulary (RLM) Systems: Each data day may 
      be trained from parallel (non-BN-derived) newswire material from previous 
      days. 
  </li></ul>
  <p></p>
  <p>See <a href="#SECTION9">Section 9</a> for details regarding these modes and 
    acoustic and language model training requirements. </p>
  <p>The retrieval system date is July 1, 1998.</p>

  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION8">Development Test Data</a>
  </li></h2>
  <hr>
  <p>No Development Test data is specified or provided for the SDR track although 
    this year's training set may be split into the training/test sets used last 
    year for development test purposes. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION9">Speech Recognition Training/Model Generation</a>
  </li></h2>
  <hr>
  <p>The 200 hours of LDC Broadcast News data collected in 1996, 1997 and January 
    1998 is designated as the suggested training material for the 1999 SDR evaluation. 
    This, however, does not preclude the use of other training materials as long 
    as they conform to the restrictions listed later in this section. There is 
    no designated supplementary textual data for SDR language model training. 
    However, sites are encouraged to explore the development of rolling language 
    models using the NewsWire data provided in the TDT-2 corpus. Sites may choose 
    either a "fixed" or "rolling" language model mode as described below for each 
    of their S1 and S2 recognition runs. </p>
  <p>"Fixed" language model/vocabulary (FLM) systems: This is the traditional 
    speech recognition evaluation mode in which systems implement fixed (non-time-adaptive) 
    language models for recognition. If sites are implementing this recognition 
    model, for all intents and purposes, the fixed recognition date for this evaluation 
    will be 31 January 1998. Therefore, no acoustic or textual materials broadcast 
    or published after this date may be used in developing either the recognition 
    or retrieval system component. These systems will be referred to as Fixed 
    Language Model (FLM) systems and will be dated 31 January 1998. </p>
  <p>"Rolling" language model/vocabulary (RLM) systems: This option is supported 
    to investigate the utility of using automatically-adapted evolving language 
    models/vocabularies for recognition in temporal applications. These systems 
    are permitted to use newswire data (not broadcast transcripts) from previous 
    data days to automatically adapt their language models and vocabularies to 
    implement recognition for the current day. For example, sites are permitted 
    to use newswire material from March 17 to recognize material recorded on March 
    18. These systems will be referred to as Rolling Language Model (RLM) systems. 
    The TDT-2 newswire portion of the corpus is available to support this mode. 
    The TDT-2 newswire corpus contains approximately the same number of stories 
    as the audio portion and was collected over the same time period. NIST will 
    re-format this data into a TREC-style SGML format and make it available simultaneously 
    with the waveform files. If possible, additional newswire stories (eliminated 
    from TDT-2 to control the size of the corpus) will also be made available. 
  </p>

  <p>Sites are permitted to investigate less frequent adaptation schemes (e.g., 
    weekly, monthly, etc.) so long as the material used for adaptation always predates 
    the current data day by at least one day. </p>
  <p>Two recognition segmentation modes are included to support the story-boundaries-known 
    and -unknown retrieval conditions: </p>
  <p><a name="SKDEF"></a>For story-boundaries-known (SK) systems (required): Systems 
    may make use of story boundary timing information for segmentation purposes. 
    They may also ignore non-news sections. However, this recognition mode is 
    discouraged since the transcripts provided by this mode may not be used in 
    story-boundaries-unknown retrieval conditions. All sites are encouraged to 
    implement recognition of whole broadcasts without story boundaries for the 
    Story Boundaries Unknown condition. As indicated in section 6.3.1, NIST will 
    create a filter to transform these whole-broadcast transcripts to the form 
    used for the Story Boundaries Known condition. This will permit these transcripts 
    to be used in both the Cross Recognizer and Cross Recognizer with Story Boundaries 
    Unknown conditions. </p>
  <p><a name="SUDEF"></a>For story-boundaries-unknown (SU) systems (optional): Systems 
    may not use story boundary timing information and must perform recognition 
    on entire broadcast files. Systems are permitted to attempt to AUTOMATICALLY 
    screen out non-news sections such as commercials, but no manual segmentation 
    may be used. These transcripts may be converted into SK-type transcripts with 
    NIST-supplied software and may, therefore, be used for both story-boundaries-unknown 
    and -known retrieval conditions. </p>
  <p><a name="RULES"></a>The following general rules apply to training for all recognition 
    modes: 
  </p><ol>
    <li> No acoustic or transcription material from radio or television news sources 
      broadcast after 31-JAN-98 other than from the SDR99 test collection may 
      be used for any purpose. 
    </li><li>No manual transcriptions of broadcast excerpts appearing in the SDR99 
      test collection may be used for acoustic or language model training. 
    </li><li>All material used for language model training/adaptation must predate 
      (non-inclusive) the broadcast date of the episode to be recognized. 
    </li><li>All material used for acoustic model training/adaptation must be contemporaneous 
      with (inclusive) or predate the broadcast date of the episode to be recognized. 
    </li><li>Any other acoustic or textual data not excluded above such as newswire 
      texts, Web articles, etc. published prior to the day of the episode to be 
      transcribed may be used for training/adaptation. 
  </li></ol>

  <p> The granularity for adaptation for recognition is 1 day. The time of day 
    that an episode (or an excerpt within an episode) was broadcast can be ignored. 
    During recognition of episodes from the "current" day, only language model 
    training data collected up through the "previous" day may be used. However, 
    material for unsupervised acoustic model adaptation from the current day may 
    be used. This implies that audio material to be recognized from the current 
    day may be processed in any order using any adaptation scheme permitted by 
    the above rules. 
  </p><p><i>Note: "Current" refers to the date the episode to be recognized was broadcast.</i> 
  </p><p>Sites are requested to report the training materials and adaptation modes 
    they employed in their site reports and TREC papers. </p>
  <p>All acoustic and textual materials used in training must be publicly available 
    at the time of the start of the evaluation. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION10">Retrieval Training, Indexing, and Query Generation</a>

  </li></h2>
  <hr>
  <p>The SDR track is an automatic ad hoc retrospective retrieval task. This means 
    both that any collection-wide statistics may be used in indexing and that 
    the retrieval system may NOT be tuned using the test topics. Participants 
    may not use statistics generated from the reference transcripts collection 
    in the baseline or recognizer transcript collections. Any auxiliary IR training 
    material or auxiliary data structures such as thesauri that are used must 
    predate the 01-JUL-1998 retrieval date. Likewise, any IR training material 
    which is derived from spoken broadcast sources (transcripts) must predate 
    the test collection (prior to 31-JAN-1998). </p>
  <p>All sites are required to implement fully automatic retrieval. Therefore, 
    sites may not perform manual query generation in implementing retrieval for 
    their submitted results. </p>
  <p>Participants are, of course, free to perform whatever side experiments they 
    like and report these at TREC as contrasts. </p>
  <p>For retrieval training purposes, the 1998 TREC-7 SDR data is available as 
    a set of 23 topics and relevance judgements. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION11">SDR Participation Conditions and Levels</a>
  </li></h2>
  <hr>
  <p>Interested sites are requested to register for the SDR Track as soon as possible. 
    Registration in this case merely indicates your interest and does not imply 
    a commitment to participate in the track. Participants must register via 
    the TREC Call for Participation Website at <a href="http://trec.nist.gov/cfp.html">http://trec.nist.gov/cfp.html</a> 
  </p>
  <p>Since this is a TREC track, participants are subject to the TREC conditions 
    for participation, including signing licensing agreements for the data. Dissemination 
    of TREC work and results other than in the (publicly available) conference 
    proceedings is welcomed, but specific advertising claims based on TREC results 
    are forbidden. The conference held in November is open only to participating 
    groups that submit results and to government sponsors. (Signed-up participants 
    should have received more detailed guidelines.) </p>
  <p>Participants implementing Full SDR (speech recognition and retrieval) are 
    exempted from participation in the central TREC Adhoc Task. However, Quasi-SDR 
    (retrieval only) participants must also implement the Adhoc Task. </p>

  <p>Participants must implement either Full SDR or Quasi-SDR retrieval as defined 
    below. Note that sites may not participate by simply pipelining the baseline 
    recognizer transcripts and baseline retrieval engine. Participants should 
    implement at least one of the two major system components. As in TREC-7, sites 
    with speech recognition expertise and sites with retrieval expertise are encouraged 
    to team up to implement Full SDR. </p>
  <p>The 1999 SDR Track has two participation levels and several retrieval conditions 
    as detailed below. Given the large number of conditions this year, sites are 
    permitted to submit only 1 run per condition. </p>
  <p>Participation Levels: 
  </p><ol>
    <li>Full SDR Required Retrieval Runs: S1, B1, R1 required (see below). Sites 
      choosing to participate in Full SDR must produce a ranked document list 
      for each test topic from the recorded audio waveforms. This participation 
      level requires the implementation of both speech recognition and retrieval. 
      In addition, Full SDR participants must implement the Baseline and Reference 
      retrieval conditions. Participants may submit an optional second Full SDR 
      run using an alternate recognizer (see below for requirements). Participants 
      may also submit optional Cross-Recognizer runs and Story-Boundaries-Unknown 
      runs as described below. 
    </li><li>Quasi-SDR Required Retrieval Runs: B1, R1 required (see below). Sites without 
      access to speech recognition technology may participate in the "Quasi-SDR" 
      subset of the test by implementing retrieval on provided recognizer-produced 
      transcripts. In addition, Quasi-SDR participants must implement the Reference 
      retrieval condition. Participants may submit optional Cross-Recognizer and 
      Story-Boundaries-Unknown runs as described below. 
  </li></ol>
  <p></p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION12">Evaluation Retrieval Conditions</a>
  </li></h2>
  <hr>
  <p>The following are the retrieval conditions for the SDR Track. Note that some 
    retrieval conditions are required and others are optional. </p>
  <ul type="disc">
    <li> <b>SPEECH (S1,S2)</b>: (S1 is Required for Full SDR Participants) Systems 
      process audio input (*.sph) specified in the test indices (*.ndx) and perform 
      retrieval against the test topics to produce a ranked document list for 
      each topic. The ranked document lists will be evaluated against the reference 
      lists developed by the NIST assessors.<br>

      Note that ALL sites are encouraged to implement recognition of whole broadcasts 
      without story boundaries for the Story Boundaries Unknown condition. As 
      indicated in section 6.3.1, NIST will create a filter to transform these 
      whole-broadcast transcripts to the form used for the Story Boundaries Known 
      condition. This will permit these transcripts to be used in both the Cross 
      Recognizer and Cross Recognizer with Story Boundaries Unknown conditions.<br>
      <br>
    </li><li> <b>SPEECH STORY BOUNDARIES UNKNOWN (S1U,S2U)</b>: (optional) Systems 
      process whole audio input (*.sph) files and perform retrieval against the 
      test topics to produce a ranked time-tag list for each topic. The time-tag 
      lists will be mapped to document lists with duplicates removed and will 
      be evaluated against the reference lists developed by the NIST assessors. 
      <br>
      If sites use 1-best recognition, they are encouraged to submit their recognition 
      output (*.srt) to NIST for scoring and sharing with other participants for 
      cross-recognizer retrieval runs. If sites use other forms of recognition 
      such as lattices, phone strings, etc., they are encouraged to report their 
      approach including an analysis of recognition performance in their TREC 
      papers. Sites may implement recognition using either a fixed or rolling 
      language model as described in the Speech Recognition Training/Model Generation 
      section. Sites may also perform an optional second run with an alternate 
      recognizer (S2/S2U).<br>
      NIST encourages sites to submit their recognizer transcripts in time for 
      NIST to redistribute them to other participants for the Cross-Recognizer 
      retrieval condition (See <a href="#SECTION18">Section 18</a>, Schedule). The 
      schedule has been set this year so that this due date is just before the 
      retrieval due date to allow the maximum time for sites to complete the recognition.<br>

      <br>
    </li><li> <b>BASELINE (B1)</b>: (Required for all Participants) Systems process 
      provided pre-recognized transcripts of audio (*.srt) and perform retrieval 
      against the test topics to produce a ranked time-tag list for each topic. 
      The time-tag lists will be mapped to document lists with duplicates removed 
      and will be evaluated against the reference lists developed by the NIST 
      assessors.<br>
      <br>
    </li><li> <b>BASELINE STORY-BOUNDARIES-UNKNOWN (B1U)</b>: (optional) Systems process 
      provided pre-recognized transcripts of audio without embedded story boundaries 
      (*.srt) and perform retrieval against the test topics to produce a ranked 
      time-tag list for each topic. The time-tag lists will be mapped to document 
      lists with duplicates removed and will be evaluated against the reference 
      lists developed by the NIST assessors.<br>
      This condition provides a control condition using a "standard" fixed recognizer. 
      It also provides recognizer data for sites without access to recognition 
      technology who wish to participate in the Quasi-SDR subset of the Track.<br>

      <br>
    </li><li> <b>REFERENCE (R1)</b>: (Required for all Participants) Systems process 
      human-generated transcripts of audio (*.ltt) and perform retrieval against 
      the test topics to produce a ranked document list for each topic. The ranked 
      document lists will be evaluated against the reference lists developed by 
      the NIST assessors. This condition provides a control condition using human-generated, 
      closed-caption-quality transcripts. Note that the LTT files have been filtered 
      to remove any material outside the evaluation per the NDX files.<br>
      <br>
    </li><li> <b>CROSS-RECOGNIZER (CR-CMU-S1,CR-IBM-S1,...)</b>: (Optional) Systems 
      process other participants' S1 recognizer transcripts (*.srt) and perform 
      retrieval against the test topics to produce a ranked document list for 
      each topic. The ranked document lists will be evaluated against the reference 
      lists developed by the NIST assessors. This condition provides a cross-component 
      control condition and provides sites with access to additional recognizer 
      transcripts. Note that the shared SRTs will be filtered to remove any material 
      outside the evaluation per the NDX files.<br>
      <br>
    </li><li> <b>CROSS-RECOGNIZER STORY-BOUNDARIES-UNKNOWN (CRU-CMU-S1U,CRU-IBM-S1U,...)</b>: 
      (Optional) Systems process other participants' unsegmented S1U recognizer 
      transcripts (*.srt) and perform retrieval against the test topics to produce 
      a ranked time-tag list for each topic. The time-tag lists will be mapped 
      to document lists with duplicates removed and will be evaluated against 
      the reference lists developed by the NIST assessors. 
  </li></ul>

  <p>Participants MUST use the SAME retrieval strategy for all conditions (that 
    is, term weighting method, stop word list, use of phrases, retrieval model, 
    etc. must remain constant). Sites implementing S1 and/or S2 using non-word-based 
    recognition (phone, word-spotting, lattice, etc.) should use the closest 
    retrieval strategy possible across conditions. </p>
  <p>Sites may not use Word Error Rate or other measures as generated by scoring 
    the recognizer transcripts against the reference transcripts to tune their 
    retrieval algorithms in the S1/S1U, S2/S2U, B1/B1U and CR/CRU retrieval conditions 
    (all conditions where recognized transcripts are used). The reference transcripts 
    may not be used in any form for any retrieval condition except of course for 
    R1. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION13">Topics (Queries)</a>
  </li></h2>

  <hr>
  <p>The TREC-8 SDR Track will have 50 topics (queries) constructed by the NIST 
    assessors. Each topic will consist of a concise word string made up of 1 or 
    more sentences or phrases. </p>
  <p>Examples:<br>
    What countries have been accused of human rights violations?<br>
    <br>
    Find reports of fatal air crashes.<br>
    <br>

    What are the latest developments in gun control in the U.S.?<br>
    In particular, what measures are being taken to protect children from guns?<br>
  </p>
  <p>For SDR, the search topics must be processed automatically, without any manual 
    intervention. Note that participants are welcome to run their own manually-assisted 
    contrastive runs and report on these at TREC. However, these will not be scored 
    or reported by NIST. </p>
  <p>The topics will be supplied in written form. Given the number of retrieval 
    conditions this year, spoken versions of the queries will not be included 
    as part of the test set. However, participants are welcome to run their own 
    contrastive spoken input tests and report on these at TREC. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION14">Relevance Assessments</a>
  </li></h2>
  <hr>
  <p>Relevance assessments for the SDR Track will be provided by the NIST assessors. 
    As in the Adhoc Task, the top 100-ranked documents for each topic from each 
    system for the Reference Condition (R1) will be pooled and evaluated for relevance. 
    If time and resources permit, additional documents from other retrieval conditions 
    may be added to the pool as well. Note that this approach is employed to make 
    the assessment task manageable, but may not cover all documents that are relevant 
    to the topics. </p>
  <p>Note that because the Cross Recognizer conditions will be run after the R1, 
    B*, S* retrieval results are due, the Cross Recognizer results will not be 
    used in creating the assessment pools this year. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION15">Retrieval (Indexing and Searching) Constraints</a>
  </li></h2>
  <hr>
  <p>Since the focus of the SDR Track is on the automatic retrieval of spoken 
    documents, manual indexing of documents, manual construction or modification 
    of search topics, and manual relevance feedback may not be used in implementing 
    retrieval runs for scoring by NIST. All submitted retrieval runs must be fully 
    automatic. Note that fully automatic "blind" feedback and similar techniques 
    are permissible and manually-produced reference data such as dictionaries 
    and thesauri may be employed. Note the training and training date constraints 
    specified in <a href="#SECTION10">Section 10</a>. </p>
  <p>Participants are free to perform internal experiments with manual intervention 
    and report on these at TREC. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION16">Submission Formats</a>
  </li></h2>
  <hr>
  <p>In order for NIST to automatically log and process the many submissions 
    expected for this track, participants MUST ensure that their 
    retrieval and recognition submissions meet the following filename and content 
    specifications. Incorrectly formatted files will be rejected by NIST. </p>
  <ol>
    <h2> 
      <li><a name="SECTION16-1">Retrieval Submission Format</a>

    </li></h2>
    <p>For retrieval, each submission must have a filename of the following form: 
      &lt;SITE_ID&gt;-&lt;CONDITION&gt;-&lt;RECOGNIZER_ID&gt;.ret where, 
    </p><ul type="disc">
      <li>SITE_ID is a brief but informative lowercase string containing no whitespace, 
        hyphens, or periods which uniquely identifies your site. For team efforts, 
        the SITE_ID should identify the retrieval site only unless you have a 
        team name. The same SITE_ID must be used in all retrieval submissions 
        from your site or team. (e.g., att, city, clarit, cmu, cu, dublin, eth, 
        glasgow, ibm, nsa, rmit, shef, umass, umd) <br>
      </li><li>CONDITION is a lowercase identifier for the retrieval condition used. 
        (e.g., r1, b1, b1u, s1, s1u, s2, s2u, cr, cru)<br>

      </li><li>RECOGNIZER_ID is a lowercase string containing no whitespace, hyphens, 
        or periods which provides a unique identifier for the recognizer transcript 
        set used in the Full SDR (s1/s1u and s2/s2u) and Cross-Recognizer (cr/cru) 
        retrieval conditions. The number at the end of the string should correspond 
        to the Full SDR condition (s1/s1u or s2/s2u) for which the recognizer 
        was originally used. Note that RECOGNIZER_ID should NOT be included for the 
        Reference (r1) and Baseline (b1/b1u) retrieval conditions.<br>
        e.g., ibm1, cmu1u, shef2, rover1, etc. 
    </li></ul>
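    <p>These naming rules can be checked mechanically before submission. The following 
      is a minimal sketch in Python; the regular expression and the per-condition 
      recognizer rules are our reading of the rules above, not an official NIST 
      validator:</p>

```python
import re

# Condition codes from the list above; longer alternatives listed first.
FILENAME_RE = re.compile(
    r"^(?P<site>[a-z0-9]+)"               # SITE_ID: lowercase, no '-', '.', whitespace
    r"-(?P<cond>r1|b1u|b1|s1u|s1|s2u|s2|cru|cr)"
    r"(?:-(?P<recognizer>[a-z0-9]+))?"    # RECOGNIZER_ID, when applicable
    r"\.ret$"
)

def valid_submission_name(name: str) -> bool:
    """Check a filename against the <SITE_ID>-<CONDITION>[-<RECOGNIZER_ID>].ret scheme."""
    m = FILENAME_RE.match(name)
    if m is None:
        return False
    cond = m.group("cond")
    has_rec = m.group("recognizer") is not None
    if cond in ("cr", "cru"):
        return has_rec      # cross-recognizer runs must name the transcript set
    if cond in ("r1", "b1", "b1u"):
        return not has_rec  # reference/baseline runs must not
    return True             # s1/s1u/s2/s2u: recognizer ID optional
```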
    <p>The following are some example retrieval submission filenames:<br>
      <code>eth-r1.ret</code> (ETH retrieval using reference transcripts)<br>
      <code>cmu-b1u.ret</code> (CMU retrieval using Baseline 1 no-boundaries recognizer)<br>

      <code>shef-b1.ret</code> (Sheffield retrieval using Baseline 1 recognizer)<br>
      <code>att-s1.ret</code> (AT&amp;T retrieval using AT&amp;T 1 recognizer)<br>
      <code>ibm-cr-att1.ret</code> (IBM retrieval using AT&amp;T 1 recognizer)<br>

      <code>umd-cru-att1u.ret</code> (UMD retrieval using AT&amp;T 1U recognizer)<br>
    </p><p>As in TREC-7, for the story-boundaries-known condition the output of a 
      retrieval run is a ranked list of story (document) ids as identified in 
      the NDX files and &lt;Section&gt; tags in the R1 and B1 transcripts. These 
      will be submitted to NIST for scoring using the standard TREC submission 
      format (a space-delimited table):<br>
      <code>23 Q0 19980104_1130_1200_CNN_HDL.0034 1 4238 ibm-cr-att-s1<br>
      23 Q0 19980105_1800_1830_ABC_WNT.0143 2 4223 ibm-cr-att-s1<br>

      23 Q0 19980105_1130_1200_CNN_HDL.1120 3 4207 ibm-cr-att-s1<br>
      23 Q0 19980515_1630_1700_CNN_HDL.0749 4 4194 ibm-cr-att-s1<br>
      23 Q0 19980303_1600_1700_VOA_WRP.0061 5 4189 ibm-cr-att-s1<br>
      etc.</code><br>
    </p>
    <p>Field Content: 
    </p><ol>

      <li>Topic ID 
      </li><li>Currently unused (must be "Q0") 
      </li><li>Story ID of retrieved document 
      </li><li>Document rank 
      </li><li>*Retrieval system score (INT or FP) which generated the rank. 
      </li><li>Site/Run ID (should be same as file basename) 
    </li></ol>
    <p></p>
    <p>The Story IDs are given in the Section (story boundary) tags.</p>
    <p>*Note that field 5 MUST be in descending order so that ties may be handled 
      properly. This number (not the rank) will be used to rank the documents 
      prior to scoring. The site-given ranks will be ignored by the 'trec_eval' 
      scoring software. </p>
    <p>Participants may submit lists with more than 1000 documents for each topic. 
      However, NIST will truncate each list to 1000 documents. </p>
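    <p>The field layout and score-ordering requirements above can be sanity-checked 
      before submission. A minimal sketch in Python; the function name and the 
      per-topic truncation are illustrative, mirroring the checks described above 
      rather than reproducing the NIST pipeline:</p>

```python
from collections import defaultdict

def check_and_truncate(lines, limit=1000):
    """Validate TREC-format result lines and keep at most `limit` per topic.

    Checks the six-field layout, the literal "Q0" in field 2, and that the
    field-5 scores are non-increasing within each topic.
    Returns {topic_id: [(story_id, score), ...]}.
    """
    per_topic = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            raise ValueError("expected 6 fields: %r" % line)
        topic, q0, story_id, _rank, score, _run_id = fields
        if q0 != "Q0":
            raise ValueError("field 2 must be 'Q0': %r" % line)
        per_topic[topic].append((story_id, float(score)))
    for topic, entries in per_topic.items():
        scores = [s for _, s in entries]
        if scores != sorted(scores, reverse=True):
            raise ValueError("scores not descending for topic %s" % topic)
        per_topic[topic] = entries[:limit]
    return dict(per_topic)
```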

    <p>For the story-boundaries-unknown condition, field 3 will be an episode/time 
      tag of the form: &lt;Episode-ID&gt;:&lt;Time-in-Seconds.Hundredths&gt; 
      for the retrieved excerpt:<br>
      <code>23 Q0 19980104_1130_1200_CNN_HDL:39.52 1 4238 ibm-cru-att-s1u<br>
      23 Q0 19980105_1800_1830_ABC_WNT:143.69 2 4223 ibm-cru-att-s1u<br>
      23 Q0 19980105_1130_1200_CNN_HDL:1120.02 3 4207 ibm-cru-att-s1u<br>

      23 Q0 19980515_1630_1700_CNN_HDL:749.81 4 4194 ibm-cru-att-s1u<br>
      23 Q0 19980303_1600_1700_VOA_WRP:61.02 5 4189 ibm-cru-att-s1u<br>
      etc.</code><br>
    </p>
    <p>Sites are to submit their retrieval output to NIST for scoring using standard 
      TREC procedures and ftp protocols. See the TREC Website at <a href="http://trec.nist.gov">http://trec.nist.gov</a> 
      for more details. </p>

    <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
    <h2> 
      <li><a name="SECTION16-2">Recognition Submission Format</a>
    </li></h2>
    <p>As in last year's SDR Track, only 1-Best-algorithm recognizer transcripts 
      will be accepted by NIST for scoring and, if received in time, will be shared 
      across sites for the Cross-Recognizer retrieval conditions. Sites performing 
      Full-SDR not using a 1-Best recognizer are encouraged to self-evaluate their 
      recognizer in their TREC paper. </p>
    <p>Since the concept of recognizer transcript sharing for Cross-Recognizer 
      Retrieval experiments appeared to be broadly accepted last year, NIST will 
      assume that all submitted recognizer transcripts are to be scored and made 
      available to other participants for Cross-Recognizer Retrieval. If you would 
      like to submit your recognizer transcripts for scoring, but do NOT want 
      them shared, you must notify NIST (<a href="mailto:cedric.auzanne@nist.gov">cedric.auzanne@nist.gov</a>) 
      of the system/run to exclude from sharing PRIOR to submission. </p>

    <p>Submitted 1-Best recognizer transcripts must be formatted as follows: Each 
      recognizer transcript (one per show) is to have a filename of the following 
      form: &lt;EPISODE&gt;.srt where,<br>
    </p><ul type="disc">
      <li>EPISODE is the ID (basename) for the corresponding NDX and SPHERE file 
        recognized (e.g., 19980304_1600_1700_VOA_WRP).
    </li></ul>
    <p></p>
    <p>A System Description file must be created for each submitted set of recognizer-produced 
      transcripts which outlines pertinent features of the recognition system 
      used. The file should be named: &lt;RECOGNIZER&gt;-&lt;RUN&gt;.desc where,<br>

    </p><ul type="disc">
      <li>RECOGNIZER is a brief but informative lower-case string containing no 
        whitespace, hyphens, or periods which identifies the source of the recognizer 
        (e.g., att, cmu, dragon, ibm, etc.),<br>
      </li><li>RUN identifies the recognizer used (s1, s1u, s2, s2u).
    </li></ul>
    <p></p>
    <p>Minimally, the system description MUST identify the language model mode 
      which was employed: "Fixed" or "Rolling". If a rolling language model was 
      used, the update period should be identified. </p>
    <p>The format for the System Description is as follows:<br>
      System ID: (e.g., NIST-S1U)<br>

    </p><ol>
      <li>SYSTEM DESCRIPTION: 
      </li><li>ACOUSTIC TRAINING: 
      </li><li>GRAMMAR TRAINING: (e.g., Fixed or Rolling with N-Day Periodic Update) 
      </li><li>RECOGNITION LEXICON DESCRIPTION: 
      </li><li>DIFFERENCES FROM S1 (if S2): 
      </li><li>REFERENCES: 
    </li></ol>
    <p></p>
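    <p>For illustration, a filled-in description might look as follows. Every 
      entry below is a hypothetical placeholder, not a description of any real 
      system:</p>

```
System ID: NIST-S1U
1. SYSTEM DESCRIPTION: single-pass HMM recognizer, 65k-word vocabulary (placeholder)
2. ACOUSTIC TRAINING: 1997 Hub-4 acoustic training data (placeholder)
3. GRAMMAR TRAINING: Rolling with 7-Day Periodic Update
4. RECOGNITION LEXICON DESCRIPTION: 65k most frequent words in the LM training texts (placeholder)
5. DIFFERENCES FROM S1 (if S2): N/A
6. REFERENCES: (citations for the recognizer)
```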
    <p>The SRT files and System Description File should be placed in a directory 
      with the following name: &lt;RECOGNIZER&gt;-&lt;RUN&gt; where, 
    </p><ul type="disc">

      <li>RECOGNIZER is a brief but informative lower-case string containing no 
        whitespace, hyphens, or periods which identifies the source of the recognizer 
        (e.g., att, cmu, dragon, ibm, etc.), 
      </li><li>RUN identifies the recognizer used (s1, s1u, s2, s2u).
    </li></ul>
    <p></p>
    <p>Submit your SRT files as follows:</p>
    <p>A gnu-zipped tar archive of the above directory should then be created 
      (e.g., att-s1.tgz) using the -cvzf options in GNU tar. This file can now 
      be submitted to NIST for scoring/sharing via anonymous ftp to jaguar.ncsl.nist.gov 
      using your email address as the password. Once you are logged in, cd to 
      the "incoming/sdr99" directory. Set your mode to binary and "put" the file. 
      This is a "blind" directory, so you will not be able to "ls" your file. 
      Once you have uploaded the file, send email to cedric.auzanne@nist.gov to 
      indicate that a file is waiting. He will send you a confirmation after the 
      file is successfully extracted and email again later with your SCLITE scores. 
      To keep things simple and file sizes down, please submit separate runs (s1 
      and s2) in separate tgz files. </p>
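    <p>The archive can equally be built programmatically. A minimal sketch using 
      Python's standard tarfile module, equivalent to the GNU tar command above; 
      the helper name is illustrative:</p>

```python
import tarfile
from pathlib import Path

def package_run(run_dir):
    """Bundle an SRT run directory (e.g. att-s1/) into att-s1.tgz,
    equivalent to `tar -cvzf att-s1.tgz att-s1`; one run per archive."""
    run = Path(run_dir)
    archive = run.parent / (run.name + ".tgz")
    with tarfile.open(archive, "w:gz") as tar:
        # Keep the run directory itself as the archive root.
        tar.add(run, arcname=run.name)
    return str(archive)
```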
    <p>The submitted output of a 1-Best recognizer must be in the standard SDR 
      Speech Recognizer Transcription (SRT) format. See Appendix A for an example.</p>
  </ol>

  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <h2> 
    <li><a name="SECTION17">Scoring</a>
  </li></h2>
  <hr>
  <ol>
    <h2> 
      <li><a name="SECTION17-1">Retrieval Scoring</a>

    </li></h2>
    <p>The TREC-8 SDR Track retrieval performance will be scored using the NIST 
      "trec_eval" Precision/Recall scoring software. The trec_eval software is 
      available from the following 
      URL:<br>
      <a href="https://github.com/usnistgov/trec_eval"> https://github.com/usnistgov/trec_eval</a> 
    </p>
    <p>For TREC-8 SDR, the primary retrieval measure will be Mean Average Precision. 
      Other retrieval measures will include: Precision at standard Document rank 
      cutoff levels, single number Average Precision over all relevant documents, 
      and single number R-Precision, precision after R relevant documents retrieved. 
    </p>
    <p>These measures are defined in Appendices to TREC Proceedings, and may also 
      be found on the TREC Website at <a href="http://trec.nist.gov">http://trec.nist.gov</a>. 
    </p>

    <p>For the known-story-boundaries condition, NIST will truncate the submitted 
      list to 1000 documents and score it using trec_eval. </p>
    <p>For the story-boundaries-unknown (U) retrieval conditions, NIST will programmatically 
      do the following : 
    </p><ol>
      <li>truncate the list to 1000 documents.<br>
        <br>
      </li><li>map all time tags to unique story IDs. Note that ALL of the recorded 
        time in the collection will have assigned story IDs, including both legitimate 
        retrievable stories and non-stories such as commercials, filler, etc. 
        If a lower ranked story ID is a duplicate of a higher ranked story ID, 
        then a sequence number will be appended to the duplicate (e.g., &lt;story-id&gt;.1). 
        All of these duplicates will therefore be scored as non-relevant. This 
        same procedure will be applied to both story and non-story material. Therefore, 
        duplication of "hits" within stories and non-stories will be equally penalized.<br>
        <br>
      </li><li>score using trec_eval. 
    </li></ol>
    <p></p>
    <p>The mapping will simply involve converting the time tag to the story ID 
      of the story that the identified time resides within. </p>
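    <p>The mapping step can be illustrated with a short sketch. This is a simplified 
      stand-in for the behavior described above, not the NIST tool itself, and the 
      data layout for the story spans is an assumption:</p>

```python
def map_time_tags(time_tags, stories):
    """Map ranked (episode, time) tags to story IDs, suffixing duplicates.

    `stories` maps episode -> list of (s_time, e_time, story_id) spans.
    A time tag resolves to the story whose span contains it; repeated hits
    on the same story get a numeric suffix so they score as non-relevant."""
    seen = {}
    mapped = []
    for episode, t in time_tags:
        story_id = None
        for s_time, e_time, sid in stories.get(episode, []):
            if s_time <= t < e_time:
                story_id = sid
                break
        if story_id is None:
            story_id = "%s:%.2f" % (episode, t)  # time outside any known span
        n = seen.get(story_id, 0)
        seen[story_id] = n + 1
        mapped.append(story_id if n == 0 else "%s.%d" % (story_id, n))
    return mapped
```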
    <p>NIST will provide a mapping/filtering tool to implement steps 1 and 2: <b>UIDmatch.pl</b> 
      - converts time-based retrieval output to document-based output format for trec_eval 
      scoring. See the <a href="../pages/FAQ/IRT_FAQ.htm#TOOLS">IR scoring tools</a> 
      page. </p>

    <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
    <h2> 
      <li><a name="SECTION17-2">Speech Recognition Scoring</a>
    </li></h2>
    <p>TREC-8 Full SDR participants who use 1-best recognition are encouraged 
      to submit their recognizer transcripts in SRT form for scoring by NIST. 
      NIST will employ its "sclite" Speech Recognition Scoring Software to benchmark 
      the Story Word Error Rate for each submission. These scores will be used 
      to examine the relationship between recognition error rate and retrieval 
      performance. Note that to ensure consistency among all forms of the evaluation 
      collection, all SRTs received for the Story Boundaries Known retrieval 
      conditions will be filtered to remove any speech outside the evaluation 
      per the corresponding NDX files. </p>
    <p>A randomly-selected 10-hour subset of the SDR collection will be transcribed 
      in Hub-4 form so that the speech recognition transcripts can be scored. 
      This will provide the primary speech recognition measures for the SDR track. 
      If possible, NIST will also create a filtered version of the closed caption 
      transcripts to be used in scoring the entire collection. Because of errors 
      in the closed captions (especially dropouts which are algorithmically uncorrectable), 
      it is assumed that the error rates will be considerably higher. NIST will 
      also use the 10-hour Hub-4-style subset to estimate the closed caption transcript 
      error. </p>

    <p>The NIST SCLITE Scoring software is available via the following URL: <a href="http://www.nist.gov/speech/tools/index.htm">http://www.nist.gov/speech/tools/index.htm</a>. 
      This page contains an ftp-able link to SCTK, the NIST Speech Recognition 
      Scoring Toolkit, which contains SCLITE. The SCLITE software may be updated 
      to accommodate large test sets. The SDR email list will be notified as updates 
      become available. </p>
    <p>NIST will provide the following additional scripts to permit useful transformations 
      of the SDR speech recognizer transcripts: 
    </p><ul type="disc">
      <li><b>srt2ctm.pl</b> - convert SRT format to SCLITE CTM (for SR scoring) 
      </li><li><b>srt2ltt.pl</b> - convert SRT format to LTT format (for retrieval) 
      </li><li><b>ctm2srt.pl</b> - convert SCLITE CTM format to SRT format 
    </li></ul>

    <p></p>
    <p>See the <a href="../pages/FAQ/SRT_FAQ.htm#CONVERT">Speech scoring tools</a> 
      page. </p>
    <p>Note that two forms of NDX files will be provided: 1 set for the Story 
      Boundaries Known (SK) condition and another set for the Story Boundaries 
      Unknown (SU) Condition. The ctm2srt.pl filter and SK NDX file can be used 
      with a CTM file created for the SU condition to create an SRT file for the 
      SK condition as follows: 
    </p><ul type="disc">
      <li>Use <b>srt2ctm.pl</b> to create a CTM version of your ASR transcript 
      </li><li>Use <b>ctm2srt.pl</b> with the resulting CTM file and SK index (NDX) 
        file to create a Story Boundaries Known version of your ASR transcript. 
    </li></ul>

    <p></p>
  </ol>
  <center>
    <h2><font color="#ff0000">N O T E</font></h2>
  </center>
  <p><b>Since unverified reference transcripts are used in the SDR Track, the 
    SDR Word Error Rates should not be directly compared to those for Hub-4 evaluations 
    which use carefully checked/corrected annotations and special orthographic 
    mapping files. </b></p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <h2> 
    <li><a name="SECTION19">Data Licensing and Costs</a>
  </li></h2>
  <hr>
  <p>Participants must make arrangements with the Linguistic Data Consortium to 
    obtain use of the TDT-2 recorded audio and transcriptions used in the SDR 
    Track. The recorded audio data is available in Shorten-compressed form on 
    approximately 75 CD-ROMs. The transcription and associated textual data will 
    be made available via ftp or via CD-ROM by special request. </p>
  <h2> 
    <li><a name="SECTION20">Reporting Conventions</a>

  </li></h2>
  <hr>
  <p>Participants are asked to give full details in their Workbook/Proceedings 
    papers of the resources used at each stage of processing, as well as details 
    of their SR and IR methods. Participants not using 1-best recognizers for 
    Full-SDR should also provide an appropriate analysis of the performance of 
    the recognition algorithm they used and its effect on retrieval. </p>
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <br>

  <hr>
  <h2><a name="APPENDIXA">APPENDIX A: SDR Corpus File Formats</a></h2>
  <hr>
  <br>
  <p><b>Note: All transcription files are SGML-tagged.</b></p>
  <hr>
  <p><b>.sph</b> - SPHERE waveform: SPHERE-formatted digitized recording of a 
    broadcast, used as input to speech recognition systems. Waveform format is 
    16-bit linear PCM, 16 kHz sample rate, MSB/LSB byte order. </p>

  <p> <code> NIST_1A<br>
    1024<br>
    sample_count -i 27444801<br>
    sample_rate -i 16000<br>
    channel_count -i 1<br>

    sample_byte_format -s2 10<br>
    sample_n_bytes -i 2<br>
    sample_coding -s3 pcm<br>
    sample_min -i -27065<br>
    sample_max -i 27159<br>
    sample_checksum -i 31575<br>

    database_id -s7 Hub4_96<br>
    broadcast_id NPR_MKP_960913_1830_1900<br>
    sample_sig_bits -i 16<br>
    end_head<br>
    </code> (digitized 16-bit waveform follows header)<br>
    .<br>

    .<br>
    .<br>
  </p>
  <hr>
  <p><a name="LTTDEF"></a><b>.ltt</b> - Lexical TREC Transcription: ASR-style 
    reference transcription with all SGML tags removed except for Episode and 
    Section. "Non-News" Sections are excluded. This format is used as the source 
    for the Reference Retrieval condition. </p>
  <code> &lt;Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline 
  News" Language=English Version=1 Version_Date=8-Apr-1999&gt;<br>

  &lt;Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075&gt;<br>
  it's friday september thirteenth i'm david brancaccio and here's some of what's 
  happening in business and the world<br>
  &lt;/Section&gt;<br>
  &lt;Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081&gt;<br>
  agricultural products giant archer daniels midland is often described as politically 
  well connected any connections notwithstanding the federal government is pursuing 
  a probe into whether the company conspired to fix the price of a key additive 
  for livestock feed<br>
  ...<br>

  &lt;/Section&gt;<br>
  ...<br>
  &lt;/Episode&gt;<br>
  </code><p></p>
   
  <hr>
  <p><a name="NDXDEF"></a><b>.ndx</b> - Index: Specifies &lt;Sections&gt; in waveform 
    and establishes story boundaries and IDs. Similar to the LTT format without text. 
    Non-transcribed Sections are excluded. </p>

  <p><a name="NDX"></a>For the known story boundaries condition, the ndx format 
    will require one Section tag per story as follows:<br>
  </p>
  <code> &lt;Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline 
  News" Language=English Version=1 Version_Date=8-Apr-1999&gt;<br>
  &lt;Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075&gt;<br>
  &lt;Section Type=NEWS S_time=81.21 E_time=207.31 ID=19980630_2130_2200_CNN_HDL.0081&gt;<br>
  ...<br>

  &lt;/Episode&gt;<br>
  </code> 
  <p><a name="NDXU"></a>For the unknown story boundaries condition, the ndx format 
    will require a single "FAKE" Section tag that will encompass the entire Episode 
    as follows:<br>
  </p>
  <code> &lt;Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline 
  News" Language=English Version=1 Version_Date=8-Apr-1999&gt;<br>
  &lt;Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL&gt;<br>
  &lt;/Episode&gt;<br>

  </code> 
  <p>Note that the start time of the fake section is the start time of the first 
    NEWS story and the end time is the end time of the last NEWS story of the 
    show.<br>
  </p>
  <hr>
  <p><a name="SRTDEF"></a><b>.srt</b> - Speech Recognizer Transcription (contrived 
    example): Output of a speech recognizer for a .sph recorded waveform file, which 
    will be used as input for retrieval. Each file must contain an &lt;Episode&gt; 
    tag and properly interleaved &lt;Section&gt; tags taken from the corresponding 
    .ndx file. Each &lt;Word&gt; tag contains the start-time and end-time (in 
    seconds with two decimal places) and the recognized word. </p>

  <p>For the known story boundaries condition, the Section tags follow the ones 
    specified in the ndx file. </p>
  <code> &lt;Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline 
  News" Language=English Version=1 Version_Date=8-Apr-1999&gt;<br>
  &lt;Section Type=NEWS S_time=75.43 E_time=81.21 ID=19980630_2130_2200_CNN_HDL.0075&gt;<br>
  &lt;Word S_time=75.52 E_time=75.87&gt;his&lt;/Word&gt;<br>
  &lt;Word S_time=75.87 E_time=76.36&gt;friday'S&lt;/Word&gt;<br>

  &lt;Word S_time=76.36 E_time=76.82&gt;september&lt;/Word&gt;<br>
  &lt;Word S_time=76.82 E_time=77.47&gt;thirteenth&lt;/Word&gt;<br>
  ...<br>
  &lt;/Section&gt;<br>
  ...<br>

  &lt;/Episode&gt;<br>
  </code><p></p>
   
  <p>For the unknown story boundaries condition, the srt format will require a 
    single "FAKE" Section tag that will encompass the entire Episode as follows: 
  </p>
  <code> &lt;Episode Filename="19980630_2130_2200_CNN_HDL" Program="CNN Headline 
  News" Language=English Version=1 Version_Date=8-Apr-1999&gt;<br>
  &lt;Section Type=FAKE S_time=58.67 E_time=1829.47 ID=19980630_2130_2200_CNN_HDL&gt;<br>
  ...<br>

  &lt;Word S_time=75.52 E_time=75.87&gt;his&lt;/Word&gt;<br>
  &lt;Word S_time=75.87 E_time=76.36&gt;friday'S&lt;/Word&gt;<br>
  &lt;Word S_time=76.36 E_time=76.82&gt;september&lt;/Word&gt;<br>
  &lt;Word S_time=76.82 E_time=77.47&gt;thirteenth&lt;/Word&gt;<br>

  ...<br>
  &lt;/Section&gt;<br>
  &lt;/Episode&gt;<br>
  </code><p></p>
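  <p>The &lt;Word&gt; tags above can be read in a similar way. This Python sketch 
    is a hypothetical helper, not part of the track distribution; it recovers 
    (start time, end time, word) triples from .srt text so they can later be 
    grouped into stories by Section time range.</p>

```python
import re

# Hypothetical helper (not track-supplied code): each <Word> tag carries a
# start time, an end time, and the recognized word.
WORD_RE = re.compile(r'<Word S_time=([\d.]+) E_time=([\d.]+)>([^<]+)</Word>')

def read_srt_words(text):
    """Return a list of (start, end, word) tuples from .srt text."""
    return [(float(m.group(1)), float(m.group(2)), m.group(3))
            for m in WORD_RE.finditer(text)]

srt = ('<Word S_time=76.36 E_time=76.82>september</Word>'
       '<Word S_time=76.82 E_time=77.47>thirteenth</Word>')
print(read_srt_words(srt))
# -> [(76.36, 76.82, 'september'), (76.82, 77.47, 'thirteenth')]
```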
   
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>
  <br>

  <hr>
  <h2><a name="APPENDIXB">APPENDIX B: SDR Corpus Filters</a></h2>
  <hr>
  <br>
  <table>
    <tbody><tr> 
      <td valign="top"><a href="ftp://jaguar.ncsl.nist.gov/sdr99/srt2ltt.pl"><b>srt2ltt.pl</b></a> 
      </td><td>This filter transforms the Speech Recognizer Transcription (SRT) format 
        with word times into the Lexical TREC Transcription (LTT) format. The resulting 
        simplified form of the speech recognizer transcription can be used for 
        retrieval if word times are not desired. 
    </td></tr>

    <tr> 
      <td valign="top"><a href="ftp://jaguar.ncsl.nist.gov/sdr99/srt2ctm.pl"><b>srt2ctm.pl</b></a> 
      </td><td>This filter transforms the Speech Recognizer Transcription (SRT) format 
        into the CTM format used by the NIST SCLITE Speech Recognition Scoring 
        Software. 
    </td></tr>
    <tr> 
      <td valign="top"><a href="ftp://jaguar.ncsl.nist.gov/sdr99/ctm2srt.pl"><b>ctm2srt.pl</b></a> 
      </td><td>This filter, together with the corresponding NDX file, transforms the 
        CTM format used by the NIST SCLITE Speech Recognition Scoring Software 
        into the SDR Speech Recognizer Transcription (SRT) format. Material not 
        specified in the NDX time tags is excluded. 
    </td></tr>
  </tbody></table>
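  <p>To illustrate the direction handled by srt2ctm, each SRT &lt;Word&gt; becomes 
    one CTM record of the form "file channel start duration word". The Python 
    sketch below is an illustration only, not the actual srt2ctm.pl filter; the 
    channel value "1" and the lower-casing of words are assumptions.</p>

```python
# Hypothetical sketch of the SRT-to-CTM conversion: one <Word> tag becomes
# one CTM record ("file channel start duration word") for SCLITE scoring.
# The channel "1" and the lower-casing are assumptions, not taken from the
# actual srt2ctm.pl filter.
def word_to_ctm(filename, s_time, e_time, word, channel="1"):
    """Format a single recognized word as a CTM record."""
    return "%s %s %.2f %.2f %s" % (
        filename, channel, s_time, e_time - s_time, word.lower())

line = word_to_ctm("19980630_2130_2200_CNN_HDL", 76.36, 76.82, "september")
print(line)
# -> 19980630_2130_2200_CNN_HDL 1 76.36 0.46 september
```

  <p>Note that CTM carries a duration rather than an end time, so the converter 
    subtracts the start time from the end time.</p>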
  <p>Back to the <a href="#CONTENTS">Table of Contents</a>.</p>

  <br>
  <hr>
  <br>
  If you have any remarks or questions regarding the <b>FORMAT</b> of this document, 
  not the <b>CONTENT</b>, please contact <a href="mailto:christophe.laprun@nist.gov">christophe.laprun@nist.gov</a>. 
  <br>

</ol>



</body>
</html>