This is the document set used in the TREC-5 confusion track. It consists of the 1994 edition of the Federal Register. The United States Government Printing Office (GPO) prints the Federal Register as a record of the transactions of the government. One issue is published each business day and contains notices of Federal agencies and organizations, executive orders and proclamations, proposed rules and regulations, etc. The Federal Register was selected for these experiments because it is a large collection for which both hardcopy and electronic versions are readily available.

The corpus contains 395MB of text divided into approximately 55,600 documents. There are three different text versions of the corpus:

  - "original": derived from the typesetting files provided by the GPO. This is regarded as the ground-truth text version of the collection and was used to design the questions.
  - "degrade5": the output obtained by scanning the hardcopy version of the corpus. The estimated character error rate of this version is 5%.
  - "degrade20": obtained by downsampling the page images produced for the degrade5 scan and then scanning the downsampled images. The estimated character error rate of this version is 20%.

NOTE: In January 1999, NIST issued page images and true text versions of part of the 1994 Federal Register as Standard Reference Database 25. While the same edition of the Federal Register is included here and in Database 25, the segmentation into documents is completely different. Standard Reference Database 25 is *not* simply a subset of the corpus used in the confusion track.

Some input files were mistakenly truncated when producing the degraded versions of the text, so all three collections were restricted to the intersection of the three document sets. A list of the documents omitted from the test set is given in the file "omitted_docs".
One of the omitted documents was the target item for topic 29, so the official test set of topics excludes topic 29 (i.e., results in the track were computed over only 49 topics).
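A consumer of this corpus that starts from the full document sets must apply the same restriction. As a minimal sketch, assuming "omitted_docs" lists one document identifier per line (the source does not specify the file's format) and that documents are held as (docno, text) pairs, the filtering step might look like:

```python
def load_omitted(path="omitted_docs"):
    """Read the set of omitted document identifiers, one per line.

    The one-identifier-per-line format is an assumption; adjust the
    parsing if the actual file differs.
    """
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def restrict_collection(docs, omitted):
    """Keep only documents whose identifier is not in the omitted set."""
    return [(docno, text) for docno, text in docs if docno not in omitted]

if __name__ == "__main__":
    omitted = load_omitted()
    # docs would be an iterable of (docno, text) pairs from one of the
    # three text versions; the same filter is applied to all three.
    docs = []
    kept = restrict_collection(docs, omitted)
    print(f"kept {len(kept)} documents")
```

The same `restrict_collection` call is applied to the original, degrade5, and degrade20 versions so that all three cover an identical document set, as the track required.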