SDR 2000 - Test Collection
Based on the LDC TDT-2 Corpus
- 4 sources (TV: ABC, CNN, Radio: PRI, VOA)
- February through June 1998 subset, 902 broadcasts
- 557.5 hours, 21,754 stories, 6,755 filler and commercial segments (~55 hours)
- Reference transcripts
- Human-annotated story boundaries
- Full broadcast word transcription
- News segments hand-transcribed (same as in '99)
- Commercials and non-news filler transcribed by applying NIST ROVER to 3 automatic recognizer transcript sets
- Word times provided by LIMSI forced alignment
- Automatic recognition of non-lexical information (commercials, repeats, gender, bandwidth, non-speech, signal energy, and combinations) provided by CU
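To illustrate how several recognizer outputs can be combined into a single transcript for the filler segments, here is a minimal Python sketch of slot-by-slot majority voting. It is a simplification and not the NIST ROVER tool itself: ROVER first aligns the hypotheses and then votes, whereas this sketch assumes the three word sequences are already aligned to equal length. The `GAP` marker, the function name `vote_aligned`, and the sample words are all hypothetical.

```python
from collections import Counter

GAP = None  # hypothetical marker: this system produced no word in this slot


def vote_aligned(hypotheses):
    """Pick the most frequent entry in each slot of pre-aligned hypotheses.

    Assumption: every hypothesis has the same length, with GAP filling
    slots where a system output nothing. Ties go to the earliest system.
    """
    if len({len(h) for h in hypotheses}) != 1:
        raise ValueError("hypotheses must be pre-aligned to equal length")

    output = []
    for slot in zip(*hypotheses):
        counts = Counter(slot)
        best_count = max(counts.values())
        # First entry (in system order) that reaches the top count wins.
        winner = next(w for w in slot if counts[w] == best_count)
        if winner is not GAP:
            output.append(winner)
    return output


# Made-up example with three pre-aligned recognizer outputs.
sys_a = ["call", "now", "for", "a", "free", "trial"]
sys_b = ["call", "now", "four", "a", "free", "trial"]
sys_c = ["all", "now", "for", GAP, "free", "trial"]

combined = vote_aligned([sys_a, sys_b, sys_c])
print(" ".join(combined))
```

With three systems, a word survives whenever at least two agree on it, so an error made by a single recognizer is outvoted.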