1998 TREC-7 SPOKEN DOCUMENT RETRIEVAL (SDR) TRACK : Specification 3
-------------------------------------------------------------------
Updated 30-Jun-1998
Karen Sparck Jones, John Garofolo, Ellen Voorhees

This is the specification for implementation of the TREC-7 Spoken Document Retrieval (SDR) Track.

Contents
--------
 1. Background from TREC-6
 2. What's New and Different
 3. TREC-7 SDR Track in a Nutshell
 4. Baseline Speech Recognizer
 5. Baseline Retrieval Engine
 6. Spoken Document Test Collection
    6.1 Collection Documents
    6.2 Collection File Types
    6.3 Story Boundaries
 7. SDR System Date
 8. Development Test Data
 9. Speech Recognition Training/Model Generation
10. Retrieval Training, Indexing, and Query Generation
11. SDR Participation Conditions and Levels
12. Evaluation Retrieval Conditions
13. Topics (Queries)
14. Relevance Assessments
15. Retrieval (Indexing and Searching) Constraints
16. Submission Formats
    16.1 Retrieval Submission Format
    16.2 Recognition Submission Format
17. Scoring
    17.1 Retrieval Scoring
    17.2 Speech Recognition Scoring
18. Schedule
19. Data Licensing and Costs
20. Reporting Conventions
21. Contacts
Appendix A: SDR Corpus File Formats
Appendix B: SDR Corpus Filters

1. Background from TREC-6
-------------------------
The 1997 TREC-6 SDR evaluation met its goals in bringing the information retrieval (IR) and speech recognition (SR) communities together, in exploring the feasibility of SDR implementation, and in debugging the logistics for an SDR evaluation. In the process, we learned that usable SDR could be implemented for a known-item retrieval task with a small 50-hour spoken document collection and that the task could successfully be measured and evaluated. But we also learned from the initial SDR Track that the document collection needs to be significantly larger for realistic retrieval experiments and that an adhoc (rather than known-item) form of evaluation would be more effective in gauging the technology. Fuller details on TREC-6 can be obtained from the track report in the TREC-6 Proceedings published by NIST, and from the Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, February 8-11, 1998.

2. What's New and Different
---------------------------
Larger Test Collection: The 1998 TREC-7 SDR test collection will be approximately double the size of the 1997 TREC-6 SDR collection: about 100 hours of speech and close to 3,000 news stories.

TREC adhoc-style topics and assessments: The SDR system output for TREC-7 will be essentially the same as last year. However, rather than scoring "percent retrieved at rank 1" and "mean reciprocal rank", which were used in last year's Known-Item task, NIST will implement a full adhoc-style relevance assessment evaluation with Precision and Recall measures. NIST will create a document "pool" for each topic which assessors will then evaluate for relevance. The pool will be the union of the top 100 documents for the Reference retrieval condition from all participants. If it is feasible, we may also add documents from other retrieval conditions to the pool.

Well-defined training data cutoffs: Rules and cutoff dates have been established for recognizer Acoustic and Language Model training.

See the remainder of the document for details and additional information.
3. TREC-7 SDR Track in a Nutshell
---------------------------------
Training Collection: 100-hour Broadcast News Corpus subset (audio and transcripts)
                     (Training Indices (NDX), Waveforms (SPH), Human Transcripts (LTT), Recognizer Transcripts (SRT))
                     See details for training data date cutoffs
Test Collection:     100-hour Broadcast News Corpus subset (audio and transcripts)
                     (Test Indices (NDX), Waveforms (SPH), Human Transcripts (LTT), Recognizer Transcripts (SRT))
Participation:
   Full-SDR:  speech recognition + retrieval
              R1, B1, B2, S1 retrieval conditions required
              TREC Adhoc Task not required
   Quasi-SDR: retrieval only using supplied recognizer transcripts
              R1, B1, B2 retrieval conditions required
              TREC Adhoc Task required
Topics: 25 (reduced to 23; see Section 13), fully automatic processing required
Retrieval Conditions:
   R1:  Reference Retrieval using human-generated "perfect" transcripts
   B1:  Baseline Retrieval using medium error recognizer transcripts
   B2:  Baseline Retrieval using high error recognizer transcripts
   S1:  Full SDR using own recognizer
   S2:  Full SDR using own secondary recognizer
   CR-: Cross-Recognizer Retrieval using other participants' recognizer transcripts
Scoring:
   Retrieval: Pooled relevance assessment, Precision/Recall measures
   Speech Recognition: Story Word Error Rate
Important Dates:
   SDR Track Registration: ASAP
   Begin recognition (Full SDR): May 7, 1998
   Begin retrieval (Full SDR and Quasi-SDR): July 1, 1998
   All Runs Due: August 30, 1998
   TREC notebook papers due: October 21, 1998 (tentative)
   TREC-7: November 9-11, 1998

4. Baseline Speech Recognizer
-----------------------------
As in TREC-6, Baseline recognizer transcripts will be provided for retrieval sites that do not have access to recognition technology. Using these baseline recognizer transcripts, sites without recognizers can participate in the "Quasi-SDR" subset of the Track. Note that all sites (Full SDR and Quasi-SDR) will be required to implement retrieval runs on the baseline recognizer transcripts. This will provide a valuable "control" condition for retrieval. This year, two baseline recognizer transcripts will be provided by NIST:

   1. Medium Error Baseline Recognizer - ~35% Word Error Rate transcript
   2. High Error Baseline Recognizer   - ~50% Word Error Rate transcript

The NIST recognizer used to produce baselines 1 and 2 is an instantiation of the SPHINX-3 recognition engine kindly provided by CMU. NIST will also provide recognizer transcripts submitted for sharing by Full SDR participants. If time and resources permit, NIST will also produce an "optimized" recognizer transcript using its ROVER software over the shared recognizer transcripts.

5. Baseline Retrieval Engine
----------------------------
Sites in the speech community without access to retrieval engines may use the NIST ZPRISE retrieval engine. See http://www-nlpir.nist.gov/prise/prise.html

6. Spoken Document Test Collection
----------------------------------
6.1 Collection Documents

The "documents" in the SDR Track are news stories taken from the Linguistic Data Consortium (LDC) Broadcast News (BN) corpora, which are also used for the DARPA CSR evaluations. A story is generally defined as a continuous stretch of news material with the same content or theme (e.g. tornado in the Caribbean, fraud scandal at Megabank), which will have been established by hand segmentation of the news programmes (at the LDC). However, there are two types of "stories" in the test collection: "reports", which are topical and content rich, and "filler", which are transitional segments.
Also note that some report-type stories, such as news summaries, may contain topically varying material. Also note that a story is likely to involve more than one speaker, background music, noise, etc.

6.2 Collection File Types

The test collection for the SDR Track will consist of digitized NIST SPHERE-formatted waveform (*.sph) files containing recordings of news broadcasts from various radio and television sources aired between June 1997 and January 1998, and several human-transcribed and recognizer-transcribed textual versions of the recordings. The collection has been licensed, recorded, and transcribed by the Linguistic Data Consortium (LDC). Note that this set of data is also being used as new Hub-4 training data this year.

NB. Since use of the collection is controlled by the LDC, all SDR participants must make arrangements directly with the LDC to obtain access to the collection. See below for contact info.

The test collection contains 114 waveform files. Each waveform filename consists of a basename identifying the broadcast and a .sph extension. The following auxiliary file types (with the same basename as the waveform file they correspond to) are also provided:

*.ndx - Spoken Document Index. Index file specifying document IDs and regions to recognize for the Full SDR retrieval condition.

*.dtt - Detailed TREC Transcript. These SGML transcripts are the original LDC Hub-4 transcripts with SDR document IDs added in the
Section tags. Note that these are to be used only for diagnostic purposes and will not be used in the evaluation.

*.ltt - Lexical TREC Transcript. This SGML format provides the textual version of the collection used in the Reference Retrieval condition and is considered to be "ground truth" for this track. The LTT format is a simplified derivation of the DTT format with non-lexical tags and special characters removed and is designed to simulate the output of a "perfect" word recognizer. Note that, unlike in Hub-4, no additional verification of the transcripts will be performed, so transcription errors are possible.

*.srt - Speech Recognizer Transcript. This SGML format is to be produced by one-best speech recognizers for the Full SDR retrieval condition and is the format to be submitted to NIST for pooling and scoring. This format will also be used for distribution of the Baseline recognizer output. A PERL script (srt2ltt.pl) will be provided to convert these into the LTT format for retrieval runs. NIST will also provide srt2ctm.pl and ctm2srt.pl scripts to allow conversion between the SRT format and the CTM format NIST uses to score recognizer output.

The LDC will provide the SPH files and DTT and LTT transcripts to the SDR Track participants. NIST will provide the NDX files and Baseline and contributed SRT files directly to participants. Information will be provided to the SDR mail list regarding how to obtain these files.

6.3 Story Boundaries

To eliminate complexity, the temporal boundary of news stories in the collection will be "known" in all test conditions. This information is given in the Index (*.ndx) files (as well as in all the other text filetypes) in the SGML Section
tags which specify the document IDs and start and end times for each story within the collection. Note that sections of the waveform files containing commercials and other "out of bounds" material which have not been transcribed by the LDC will be excluded from the SDR track. As such, these sections will be excluded from the NDX files. So, only the sections specified in the index files should be recognized and retrieved. Note that except for the time boundaries and Story IDs provided in the Section
tags, NO OTHER INFORMATION provided in the SGML tags may be used or indexed in any way for the test, including any classification or topic information which may be present.

7. SDR System Date
------------------
For all intents and purposes, the "current" recognition/retrieval date for this evaluation will be 1 May 1998. Therefore, no acoustic or textual materials broadcast or published after this date may be used in developing either the recognition or retrieval system component.

8. Development Test Data
------------------------
No Development Test data is specified or provided for the SDR track, although this year's training set may be split into the training/test sets used last year for development test purposes.

9. Speech Recognition Training/Model Generation
-----------------------------------------------
The first 100 hours of LDC Broadcast News data collected in 1996 are designated as the suggested training material for the 1998 SDR evaluation. This, however, does not preclude the use of other training materials as long as they conform to the restrictions listed later in this section. There is no designated supplementary textual data for SDR language model training. Language models developed for the 1997 Hub-4 evaluation may be employed in recognizing the SDR98 test collection. However, sites are encouraged to explore the development of more suitable language models for the SDR98 task as long as the materials used conform to the restrictions below.

1. No acoustic or transcription material for broadcast excerpts appearing in the SDR98 test collection may be used for acoustic or language model training. However, for manageability, this does not preclude use of excerpts from other sources which happen to be duplicated in the test material.

2. No acoustic or transcription material from radio or television news sources broadcast after 05/31/97 may be used for acoustic or language model training. This should effectively prevent the possibility of "cross-contamination" of training/test material and eliminates the duplication problem in 1.

3. Any other acoustic or textual data not excluded in 2, such as newswire texts, Web articles, etc., published prior to 05/01/98 may be used. This permits the creative development of up-to-date language models for the SDR task.

Sites are requested to report the training materials they employed in their site reports and TREC papers. Sites may use acoustic or textual materials which are not publicly available in 1998. However, sites using these materials MUST make arrangements to make these materials available at reasonable cost by 1 March 1999 for use by all sites in the next SDR evaluation. Of course, earlier availability of these materials is encouraged. Note that no special arrangements need to be made for materials which are publicly available at the time of the evaluation such as Newswire texts and Web documents.

10. Retrieval Training, Indexing, and Query Generation
------------------------------------------------------
For retrieval, the SDR track is an automatic ad hoc task. This means both that any collection-wide statistics may be used in indexing and that the retrieval system may NOT be tuned using the test topics. Participants may not use statistics generated from the reference transcript collection when indexing or retrieving from the baseline or recognizer transcript collections. As in recognition training, any auxiliary IR training material or auxiliary data structures such as thesauri that are used must predate the May 1, 1998 recognition/retrieval date.
Likewise, any IR training material which is derived from spoken broadcast sources (transcripts) must predate the test collection (prior to May-31-97).

A set of 5 training topics will be released with the other training data. These topics have a statement of relevance criteria and relevance judgements (i.e., relevant document lists) produced by doing PRISE searches on the first 50 hours (training set 1) of the training collection only. The judgements may therefore be incomplete.

For the SDR Track, all sites are required to implement fully automatic retrieval. Therefore, sites may not perform manual query generation in implementing retrieval for their submitted results. Participants are, of course, free to perform whatever side experiments they like and report these at TREC as contrasts.

11. SDR Participation Conditions and Levels
-------------------------------------------
Interested sites are requested to register for the SDR Track as soon as possible so that they are added to the SDR mailing list. Registration in this case merely indicates your interest and does not imply a commitment to participate in the track. Since this is a TREC track, participants are subject to the TREC conditions for participation, including signing licensing agreements for the data. Dissemination of TREC work and results other than in the (publicly available) conference proceedings is welcomed, but specific advertising claims based on TREC results are forbidden. The conference held in November is open only to participating groups that submit results and to government sponsors. (Signed-up participants should have received more detailed guidelines.)

Participants implementing Full SDR (speech recognition and retrieval) are exempted from participation in the central TREC Adhoc Task. However, Quasi-SDR (retrieval only) participants must also implement the Adhoc Task. Participants must implement either Full SDR or Quasi-SDR retrieval as defined below. Note that sites may not participate by simply pipelining the baseline recognizer transcripts and baseline retrieval engine. Participants should implement at least one of the two major system components. As in TREC-6, sites with speech recognition expertise and sites with retrieval expertise are encouraged to team up to implement Full SDR.

The 1998 SDR Track has two participation levels and several retrieval conditions as detailed below. Given the large number of conditions this year, sites are permitted to submit only 1 run per condition.

Participation Levels:

1. Full SDR
   Required Retrieval Runs: S1, B1, B2, R1 (see below)
   Sites choosing to participate in Full SDR must produce a ranked document list for each test topic from the recorded audio waveforms. This participation level requires the implementation of both speech recognition and retrieval. In addition, Full SDR participants must implement the Baseline and Reference retrieval conditions. Participants may submit an optional second Full SDR run using an alternate recognizer (see below for requirements). Participants may also submit optional Cross-Recognizer runs as described below.

2. Quasi-SDR
   Required Retrieval Runs: B1, B2, R1 (see below)
   Sites without access to speech recognition technology may participate in the "Quasi-SDR" subset of the test by implementing retrieval on provided recognizer-produced transcripts. In addition, Quasi-SDR participants must implement the Reference retrieval condition. Participants may submit optional Cross-Recognizer runs as described below.
12. Evaluation Retrieval Conditions
-----------------------------------
The following are the retrieval conditions for the SDR Track. Note that some retrieval conditions are required and others are optional.

SPEECH (S1,S2): (S1 is Required for Full SDR Participants)
Systems process audio input (*.sph) specified in the test indices (*.ndx) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors. If sites use 1-best recognition, they are encouraged to submit their recognition output (*.srt) to NIST for scoring and sharing with other participants for cross-component retrieval runs. If sites use other forms of recognition such as lattices, phone strings, etc., they are encouraged to report their approach, including an analysis of recognition performance, in their TREC papers. Sites may perform an optional second run with an alternate recognizer (S2). NIST encourages sites to submit their S1 recognizer transcripts in time for NIST to redistribute them to other participants for the Cross-Recognizer retrieval condition (see Section 18, Schedule). The optional retrieval run with a secondary recognizer (S2) has been added to permit sites to continue development of both their recognition and retrieval algorithms up to the due date for retrieval results while still permitting them to share their S1 transcripts for the Cross-Recognizer condition. Therefore, sites do not need to submit their S2 recognizer transcripts until the final retrieval due date. Note, however, that sites are encouraged to submit both their S1 and S2 recognizer transcripts at their earliest convenience to facilitate their use in experiments by the other participants.

BASELINE (B1,B2): (Required for all Participants)
Systems process provided pre-recognized transcripts of audio (*.srt) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors. This condition provides a control condition using a "standard" fixed recognizer. It also provides recognizer data for sites without access to recognition technology who wish to participate in the Quasi-SDR subset of the Track. Two BASELINE conditions are anticipated:
   Medium Error (B1): NIST-provided *.srt files with ~35% Word Error Rate
   High Error (B2):   NIST-provided *.srt files with ~50% Word Error Rate
(Although the estimated Word Error Rates are given here, they should not be used in tuning retrieval unless they can be automatically estimated without using the reference transcripts.)

REFERENCE (R1): (Required for all Participants)
Systems process human-generated transcripts of audio (*.ltt) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors. This condition provides a control condition using "perfect" recognition. Note that the LTT files have been filtered to remove any material outside the evaluation per the NDX files.

CROSS-RECOGNIZER (CR-ROVER, CR-CMU-S1, CR-IBM-S1, ...): (Optional)
Systems process other participants' S1 recognizer transcripts (*.srt) and perform retrieval against the test topics to produce a ranked document list for each topic. The ranked document lists will be evaluated against the reference lists developed by the NIST assessors.
This condition provides a cross-component control condition and provides sites with access to additional recognizer transcripts. Note that the shared SRTs will be filtered to remove any material outside the evaluation per the NDX files. In addition to providing the shared recognizer transcripts, NIST may produce a set of "optimized" low-error-rate recognizer transcripts using its ROVER software with all the submitted S1 recognizer transcripts.

Participants MUST use the SAME retrieval strategy for all conditions (that is, term weighting method, stop word list, use of phrases, retrieval model, etc. must remain constant). Sites implementing S1 and/or S2 using non-word-based recognition (phone, word-spotting, lattice, etc.) should use the closest retrieval strategy possible across conditions. Sites may not use Word Error Rate or other measures generated by scoring the recognizer transcripts against the reference transcripts to tune their retrieval algorithms in the S1, S2, B1, B2, and CR retrieval conditions (all conditions where recognized transcripts are used). The reference transcripts may not be used in any form for any retrieval condition except, of course, for R1.

13. Topics (Queries)
--------------------
The TREC-7 SDR Track will have 23* topics (queries) constructed by the NIST assessors. Each topic will consist of a concise 1-sentence string. For SDR, the search topics must be processed automatically, without any manual intervention. Note that participants are welcome to run their own manually-assisted contrastive runs and report on these at TREC. However, these will not be scored or reported by NIST.

* Because of the limited size, scope, and temporal distribution of the test collection, the assessors were unable to develop the targeted 25 topics. The small reduction should not adversely affect the evaluation.

The topics will be supplied in written form. Given the number of retrieval conditions this year, spoken versions of the queries will not be included as part of the test set. However, participants are welcome to run their own contrastive spoken input tests and report on these at TREC.

14. Relevance Assessments
-------------------------
Relevance assessments for the SDR Track will be provided by the NIST assessors. As in the Adhoc Task, the top 100-ranked documents for each topic from each system for the Reference Condition (R1) will be pooled and evaluated for relevance. If time and resources permit, additional documents from other retrieval conditions may be added to the pool as well. Note that this approach is employed to make the assessment task manageable, but may not cover all documents that are relevant to the topics. A minimal sketch of this pooling procedure is given after Section 15.

15. Retrieval (Indexing and Searching) Constraints
--------------------------------------------------
Since the focus of the SDR Track is on the automatic retrieval of spoken documents, manual indexing of documents, manual construction or modification of search topics, and manual relevance feedback may not be used in implementing retrieval runs for scoring by NIST. All submitted retrieval runs must be fully automatic. Note that fully automatic "blind" feedback and similar techniques are permissible, and manually-produced reference data such as dictionaries and thesauri may be employed. Participants are free to perform internal experiments with manual intervention and report on these at TREC.
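To make the pooling procedure of Section 14 concrete, the following is a minimal, unofficial sketch (it is not NIST's assessment tooling). It assumes run files in the retrieval submission format of Section 16.1; the run filenames and pool depth are illustrative only.

   # Unofficial sketch of Section 14's pooling: union of each run's top-100
   # documents per topic, read from run files in the Section 16.1 format.
   from collections import defaultdict

   def read_run(path):
       """Return {topic_id: [(score, story_id), ...]} from a .ret file."""
       runs = defaultdict(list)
       with open(path) as f:
           for line in f:
               fields = line.split()
               if len(fields) != 6:
                   continue                    # skip malformed lines
               topic, _q0, story, _rank, score, _runid = fields
               runs[topic].append((float(score), story))
       return runs

   def build_pool(run_paths, depth=100):
       """Pool the top-`depth` stories per topic across all runs."""
       pool = defaultdict(set)
       for path in run_paths:
           for topic, scored in read_run(path).items():
               # Rank by the retrieval score (field 5), highest first, as
               # trec_eval does; the site-given ranks are ignored.
               scored.sort(key=lambda pair: pair[0], reverse=True)
               pool[topic].update(story for _score, story in scored[:depth])
       return pool

   if __name__ == "__main__":
       pool = build_pool(["eth-r1.ret", "cmu-r1.ret"])   # example filenames
       for topic in sorted(pool):
           print(topic, len(pool[topic]), "documents to assess")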
16. Submission Formats
----------------------
In order for NIST to automatically log and process all of the many submissions which we expect to receive for this track, participants MUST ensure that their retrieval and recognition submissions meet the following filename and content specs. Incorrectly formatted files will be rejected by NIST.

16.1 Retrieval Submission Format
--------------------------------
For retrieval, each submission must have a filename of the following form:

   SITE_ID-CONDITION[-RECOGNIZER].ret

where,

   SITE_ID is a brief but informative lowercase string containing no whitespace, hyphens, or periods which uniquely identifies your site. For team efforts, the SITE_ID should identify the retrieval site only unless you have a team name. The same SITE_ID must be used in all retrieval submissions from your site or team. (e.g., att, city, clarit, cmu, cu, dublin, eth, glasgow, ibm, nsa, rmit, shef, umass, umd)

   CONDITION is a lowercase identifier for the retrieval condition used. (e.g., r1, b1, b2, s1, s2, cr)

   RECOGNIZER is a lowercase string containing no whitespace, hyphens, or periods which provides a unique identifier for the recognizer transcript set used in the Full SDR (s1 and s2) and Cross-Recognizer (cr) retrieval conditions. The number at the end of the RECOGNIZER string should correspond to the Full SDR condition (s1 or s2) for which the recognizer was originally used. Note that RECOGNIZER should NOT be listed for the Baseline Retrieval conditions (b1/b2). e.g., ibm1, cmu1, shef2, rover1, etc.

The following are some example retrieval submission filenames:

   eth-r1.ret      (ETH retrieval using reference transcripts)
   cmu-b1.ret      (CMU retrieval using Baseline 1 recognizer)
   shef-b2.ret     (Sheffield retrieval using Baseline 2 recognizer)
   att-s1.ret      (AT&T retrieval using AT&T 1 recognizer)
   ibm-cr-att1.ret (IBM retrieval using AT&T 1 recognizer)

As in TREC-6, the output of a retrieval run is a ranked list of story (document) ids. These will be submitted to NIST for scoring using the standard TREC submission format (a space-delimited table):

   23 Q0 k960913.4 1 4238 ibm-cr-att1
   23 Q0 k960825.2 2 4223 ibm-cr-att1
   23 Q0 k960514.7 3 4207 ibm-cr-att1
   23 Q0 k961012.1 4 4194 ibm-cr-att1
   23 Q0 k960913.3 5 4189 ibm-cr-att1
   etc.

   Field  Content
   1      Topic ID
   2      Currently unused (must be "Q0")
   3      Story ID of retrieved document
   4      Document rank
   5      *Retrieval system score (INT or FP) which generated the rank
   6      Site/Run ID (should be same as file basename)

*Note that field 5 MUST be in descending order so that ties may be handled properly. This number (not the rank) will be used to rank the documents prior to scoring. The site-given ranks will be ignored by the 'trec_eval' scoring software.

Participants should submit such a list with not more than 1000 documents for each topic. The protocol for submission of retrieval results to NIST for scoring was not yet ready at the time this document was released. It will be made available in later email.
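To illustrate the submission format above, here is a minimal, unofficial sketch that writes one topic's results as .ret lines. It is not a NIST tool; the run ID, output filename, and example results are placeholders, and it simply enforces the checks described above ("Q0" in field 2, scores in descending order, at most 1000 documents per topic).

   # Unofficial sketch: write one topic's results in the Section 16.1 format.
   # The run ID, output filename, and example results are placeholders.

   def write_ret_lines(out, topic_id, results, run_id, max_docs=1000):
       """results: iterable of (story_id, score); higher score = better."""
       ranked = sorted(results, key=lambda r: r[1], reverse=True)[:max_docs]
       for rank, (story_id, score) in enumerate(ranked, start=1):
           # Fields: Topic ID, "Q0", Story ID, rank, score, Site/Run ID
           out.write("%s Q0 %s %d %g %s\n" % (topic_id, story_id, rank, score, run_id))

   if __name__ == "__main__":
       example = [("k960913.4", 4238), ("k960825.2", 4223), ("k960514.7", 4207)]
       with open("ibm-cr-att1.ret", "w") as f:    # filename per the spec above
           write_ret_lines(f, "23", example, "ibm-cr-att1")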
16.2 Recognition Submission Format
----------------------------------
As in last year's SDR Track, only 1-Best-algorithm recognizer transcripts will be accepted by NIST for scoring. This year, with the addition of the Cross-Recognizer condition, these will also be shared across sites. Sites performing Full SDR without a 1-Best recognizer are encouraged to self-evaluate their recognizer in their TREC paper.

Since the concept of recognizer transcript sharing for Cross-Recognizer Retrieval experiments appeared to be broadly accepted last year, NIST will assume that all submitted recognizer transcripts are to be scored and made available to other participants for Cross-Recognizer Retrieval. If you would like to submit your recognizer transcripts for scoring, but do NOT want them shared, you must notify NIST (cedric.auzanne@nist.gov) of the system/run to exclude from sharing PRIOR to submission. Note that we will be turning around the submissions very quickly as they are received and making them available for sharing.

Submitted 1-Best recognizer transcripts must be formatted as follows. Each recognizer transcript must have a filename of the following form:

   EPISODE.srt

where,

   EPISODE is the ID (basename) for the corresponding NDX and SPHERE file recognized (e.g., ea980107).

Each submission must contain 114 of these files - one per NDX/SPHERE file. The 114 SRT files should be placed in a directory with the following name:

   RECOGNIZER-RUN

where,

   RECOGNIZER is a brief but informative lowercase string containing no whitespace, hyphens, or periods which identifies the source of the recognizer (e.g., att, cmu, dragon, ibm, etc.)

   RUN identifies the recognizer used (s1 or s2)

Submit your SRT files as follows: a gnu-zipped tar archive of the above directory should then be created (e.g., att-s1.tgz) using the -cvzf options in GNU tar. This file can now be submitted to NIST for scoring/sharing via anonymous ftp to jaguar.ncsl.nist.gov using your email address as the password. Once you are logged in, cd to the "incoming/sdr98" directory. Set your mode to binary and "put" the file. This is a "blind" directory, so you will not be able to "ls" your file. Once you have uploaded the file, send email to cedric.auzanne@nist.gov to indicate that a file is waiting. He will send you a confirmation after the file is successfully extracted and email again later with your SCLITE scores. To keep things simple and file sizes down, please submit separate runs (s1 and s2) in separate tgz files.

The output of a 1-Best recognizer is the standard SGML SDR Speech Recognizer Transcription (SRT) format of the form:
   [example SRT markup: an Episode tag, interleaved Section tags from the corresponding .ndx file, and one time-stamped word tag per recognized word; see Appendix A]
Each SRT file must contain the Episode tag and properly interleaved Section tags from the corresponding .ndx file which specify the source and story boundaries. Recognized material temporally outside of the given Section
tags should NOT be included so that it will not corrupt retrieval. Each word tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word. The words are to be case-insensitive in Lexical Standard Normal Orthographic Representation (Lexical-SNOR) form. However, lower case is suggested for convenience. See Appendix A for an example.

NOTE: ALL SRT files received will be automatically filtered to remove all words with time boundaries occurring outside test material designated in the NDX files. Please recognize AND retrieve from ONLY the material specified in the NDX files so that your results are comparable with others.

17. Scoring
-----------
17.1 Retrieval Scoring

The TREC-7 SDR Track retrieval performance will be scored using the NIST "trec_eval" Precision/Recall scoring software. A "shar" file containing the trec_eval software is available via anonymous ftp from the following URL:

   ftp://ftp.cs.cornell.edu/pub/smart/trec_eval.v3beta.shar

Since pooled relevance assessments (as in the TREC Adhoc Task) are to be employed, the Known-Item Retrieval measures used in the TREC-6 SDR Track will not be used. For TREC-7 SDR, the retrieval measures will include:

   Precision at standard document rank cutoff levels,
   single number Average Precision over all relevant documents, and
   single number R-Precision, the precision after R documents have been retrieved, where R is the number of relevant documents for the topic.

These measures are defined in Appendices to TREC Proceedings, and may also be found on the TREC Website at http://trec.nist.gov. (An illustrative sketch of the Average Precision and R-Precision computations appears at the end of this section.)

17.2 Speech Recognition Scoring

TREC-7 Full SDR participants who use 1-best recognition are encouraged to submit their recognizer transcripts in SRT form for scoring by NIST. NIST will employ its "sclite" Speech Recognition Scoring Software to benchmark the Story Word Error Rate for each submission. These scores will be used to examine the relationship between recognition error rate and retrieval performance. Note that to ensure consistency among all forms of the evaluation collection, all SRTs received will be filtered to remove any speech outside the evaluation per the corresponding NDX files. The NIST SCLITE Scoring software is available via the following URL: http://www.nist.gov/speech/software.htm. This page contains an ftp-able link to SCTK, the NIST Speech Recognition Scoring Toolkit, which contains SCLITE.

NIST will provide the following additional scripts to permit useful transformations of the SDR speech recognizer transcripts:

   srt2ctm.pl - convert SRT format to SCLITE CTM (for SR scoring)
   srt2ltt.pl - convert SRT format to LTT format (for retrieval)
   ctm2srt.pl - convert SCLITE CTM format to SRT format

********************* N O T E *********************
Since unverified reference transcripts are used in the SDR Track, the SDR Word Error Rates should not be directly compared to those for Hub-4 evaluations, which use carefully checked/corrected annotations and special orthographic mapping files.
****************************************************

If NIST has sufficient time, it may also apply experimental "content word" recognition scoring measures to explore approaches for modelling the effect of key content word errors on retrieval performance.
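For readers unfamiliar with these measures, the following is a small, unofficial sketch of Average Precision and R-Precision for a single topic; trec_eval remains the definitive implementation, and the ranked list and relevant-document set below are placeholders.

   # Unofficial illustration of the Section 17.1 retrieval measures for one
   # topic; trec_eval is the definitive scorer.

   def average_precision(ranked_ids, relevant):
       """Mean of the precision values at the rank of each relevant document."""
       hits, precisions = 0, []
       for rank, doc_id in enumerate(ranked_ids, start=1):
           if doc_id in relevant:
               hits += 1
               precisions.append(hits / float(rank))
       return sum(precisions) / len(relevant) if relevant else 0.0

   def r_precision(ranked_ids, relevant):
       """Precision after R documents are retrieved, R = number of relevant docs."""
       r = len(relevant)
       retrieved = ranked_ids[:r]
       return len([d for d in retrieved if d in relevant]) / float(r) if r else 0.0

   if __name__ == "__main__":
       ranked = ["k960913.4", "k960825.2", "k960514.7", "k961012.1"]  # placeholder run
       relevant = {"k960825.2", "k961012.1"}                          # placeholder judgements
       print("AvgP = %.3f  R-Prec = %.3f" % (
           average_precision(ranked, relevant), r_precision(ranked, relevant)))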
18. Schedule
------------
   Site registration                                    : ASAP
   Training data                                        : 01 May 1998
      100 hours TREC-6 training documents (SPH, DTT, LTT, NDX)
      5 training topics
   Test data:
      Speech Waveform files (SPH)                       : Apr (from LDC)
      NDX (Test Indices with story bounds)              : 07 May 1998
      DTT Detailed transcriptions                       : 01 Jul 1998
      LTT Reference transcriptions                      : 01 Jul 1998
      SRT Baseline transcriptions                       : 01 Jul 1998
      Topics                                            : 01 Jul 1998
      Participant-shared recognizer transcriptions (SRT): 20 Jul 1998
   Search results to NIST                               : 30 Aug 1998
   Relevance judgements released by NIST                : 05 Oct 1998
   Scored Retrieval Results released by NIST            : 07 Oct 1998
   Conference workbook papers to NIST                   : 21 Oct 1998 (tentative)
   TREC-7 Conference                                    : 9-11 Nov 1998

19. Data Licensing and Costs
----------------------------
Participants must make arrangements with the Linguistic Data Consortium to obtain use of the speech waveforms and transcriptions used in the SDR Track.

20. Reporting Conventions
-------------------------
Participants are asked to give full details in their Workbook/Proceedings papers of the resources used at each stage of processing, as well as details of their SR and IR methods. Participants not using 1-best recognizers for Full SDR should also provide an appropriate analysis of the performance of the recognition algorithm they used and its effect on retrieval.

============================================================================
APPENDIX A: SDR Corpus File Formats
============================================================================

Note: All transcription files are SGML-tagged.

--------------------------------------------------------------------------
.sph - SPHERE waveform: SPHERE-formatted digitized recording of a broadcast, used as input to speech recognition systems. Waveform format is 16-bit linear PCM, 16 kHz sample rate, MSB/LSB byte order.

   NIST_1A
      1024
   sample_count -i 27444801
   sample_rate -i 16000
   channel_count -i 1
   sample_byte_format -s2 10
   sample_n_bytes -i 2
   sample_coding -s3 pcm
   sample_min -i -27065
   sample_max -i 27159
   sample_checksum -i 31575
   database_id -s7 Hub4_96
   broadcast_id NPR_MKP_960913_1830_1900
   sample_sig_bits -i 16
   end_head
   (digitized 16-bit waveform follows header)
   . . .

--------------------------------------------------------------------------
.dtt - Detailed TREC Transcription: LDC-produced Broadcast News transcription with absolute Section (story) IDs added. This format is used only for the 1997 100-hour data set now designated as training. The LDC has produced the 100-hour data set used as the 1998 test collection in a new format (DTT2). See below for the .dtt2 format. This form of the data is for archive/research purposes only and is not used in the evaluation.
   . . .
it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
agricultural products giant archer daniels midland is often described as politically well connected {breath} any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed {breath} . . . . . .
. . .
--------------------------------------------------------------------------
.dtt2 - Detailed TREC Transcription 2: Revised LDC-produced Broadcast News transcription with absolute Section (story) IDs added. This new format is used only for the 1998 100-hour data set now designated as the 1998 SDR test collection. This form of the data is for archive/research purposes only and is not used in the evaluation.
. . .
. . .
--------------------------------------------------------------------------
.ltt - Lexical TREC Transcription: Detailed TREC Transcription with all SGML tags removed except for Episode and Section. "Out-of-bounds" and non-transcribed Sections are excluded. This format is used as the source for the Reference Retrieval condition and as the reference for speech recognition scoring. Note that both the .dtt and .dtt2 formats will be filtered by NIST into this single .ltt form.
   . . .
it's friday september thirteenth i'm david brancaccio and here's some of what's happening in business and the world
agricultural products giant archer daniels midland is often described as politically well connected any connections notwithstanding the federal government is pursuing a probe into whether the company conspired to fix the price of a key additive for livestock feed . . .
. . .
--------------------------------------------------------------------------
.ndx - Index: Specifies the regions in the waveform to be recognized and establishes story boundaries and IDs. Similar to the LTT format without text. "Out-of-bounds" and non-transcribed Sections are excluded.
--------------------------------------------------------------------------
.srt - Speech Recogniser Transcription (contrived example): Output of a speech recogniser for a .sph recorded waveform file which will be used as input for retrieval. Each file must contain an Episode tag and properly interleaved Section tags taken from the corresponding .ndx file. Each word tag contains the start-time and end-time (in seconds with two decimal places) and the recognized word.
   . . .
his friday'S september thirteenth i'm david bran cat she toe here'S some of what'S happening in business in the world
agricultural produce giant archer daniels middle is often described as politically well connected any connections not wash standing the fed rail government is pursuing a pro into weather the company conspired two affect the price of the key altitude for live stop feet . . .
. . .
==========================================================================
APPENDIX B: SDR Corpus Filters
==========================================================================

srt2ltt.pl - This filter transforms the Speech Recognizer Transcription (SRT) format with word times into the Lexical TREC Transcription (LTT) form. The resulting simplified form of the speech recogniser transcription can be used for retrieval if word times are not desired.
   . . .
HIS FRIDAY'S SEPTEMBER THIRTEENTH I'M DAVID BRAN CAT SHE TOE HERE'S SOME OF WHAT'S HAPPENING IN BUSINESS IN THE WORLD
AGRICULTURAL PRODUCE GIANT ARCHER DANIELS MIDDLE IS OFTEN DESCRIBED AS POLITICALLY WELL CONNECTED ANY CONNECTIONS NOT WASH STANDING THE FED RAIL GOVERNMENT IS PURSUING A PRO INTO WEATHER THE COMPANY CONSPIRED TWO AFFECT THE PRICE OF THE KEY ALTITUDE FOR LIVE STOP FEET . . .
. . .
--------------------------------------------------------------------------
srt2ctm.pl - This filter transforms the Speech Recognizer Transcription (SRT) format into the CTM format used by the NIST SCLITE Speech Recognition Scoring Software.
--------------------------------------------------------------------------
ctm2srt.pl - This filter, together with the corresponding SDR .ndx file, transforms the CTM format used by the NIST SCLITE Speech Recognition Scoring Software into the SDR Speech Recognizer Transcription (SRT) format.
--------------------------------------------------------------------------
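In the same spirit as the filters above, the following is a small, unofficial sketch of the time-boundary filtering that NIST describes in Sections 16.2 and 17.2 (removing recognized words whose time boundaries fall outside the material designated in the NDX files). It does not parse the actual SGML formats; story boundaries and word hypotheses are assumed to have already been read into simple tuples.

   # Unofficial sketch of NDX time-boundary filtering (Sections 16.2 and 17.2):
   # keep only word hypotheses that fall inside a designated story region.
   # SGML parsing is not shown; inputs are assumed to be plain tuples.

   def filter_words(stories, words):
       """stories: list of (story_id, s_time, e_time);
       words: list of (s_time, e_time, word).
       Returns {story_id: [word, ...]} with out-of-bounds words dropped."""
       kept = {story_id: [] for story_id, _s, _e in stories}
       for w_start, w_end, word in words:
           for story_id, s_start, s_end in stories:
               if w_start >= s_start and w_end <= s_end:
                   kept[story_id].append(word)
                   break      # each word belongs to at most one story
       return kept

   if __name__ == "__main__":
       stories = [("ea980107.1", 50.00, 147.80)]        # example story boundary
       words = [(49.10, 49.60, "commercial"),           # outside: dropped
                (50.25, 50.60, "agricultural"),         # inside: kept
                (50.61, 51.02, "products")]
       print(filter_words(stories, words))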