This is the document set used in the TREC-5 confusion track. It consists of the 1994 edition of the Federal Register. The United States Government Printing Office (GPO) prints the Federal Register as a record of the transactions of the government. One issue is published each business day and contains notices of Federal agencies and organizations, executive orders and proclamations, proposed rules and regulations, etc. The Federal Register was selected for these experiments because it is a large collection for which both hardcopy and electronic versions are readily available.

The corpus contains 395MB of text divided into approximately 55,600 documents. There are three different text versions of the corpus:

  - "original": derived from the typesetting files provided by the GPO. This is regarded as the ground-truth text version of the collection and was used to design the questions.
  - "degrade5": the output obtained by scanning the hardcopy version of the corpus. The estimated character error rate of this version is 5%.
  - "degrade20": obtained by downsampling the page images produced for the degrade5 scan and then scanning the downsampled images. The estimated character error rate of this version is 20%.

NOTE: In January 1999, NIST issued page images and true text versions of part of the 1994 Federal Register as Standard Reference Database 25. While the same edition of the Federal Register is included here and in Database 25, the segmentation into documents is completely different. Standard Reference Database 25 is *not* simply a subset of the corpus used in the confusion track.

Some input files were mistakenly truncated when producing the degraded versions of the text, so all three collections were restricted to the intersection of the three document sets. A list of the documents omitted from the test set is given in the file "omitted_docs".
One of the omitted documents was the target item for topic 29, so the official test set of topics excludes topic 29 (i.e., results in the track were computed over only 49 topics).
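A consumer of this corpus that starts from the full document sets must apply the same restriction. As a minimal sketch, assuming "omitted_docs" lists one document identifier per line (the source does not specify the file's format) and that documents are held as (docno, text) pairs, the filtering step might look like:

```python
def load_omitted(path="omitted_docs"):
    """Read the set of omitted document identifiers, one per line.

    The one-identifier-per-line format is an assumption; adjust the
    parsing if the actual file differs.
    """
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def restrict_collection(docs, omitted):
    """Keep only documents whose identifier is not in the omitted set."""
    return [(docno, text) for docno, text in docs if docno not in omitted]

if __name__ == "__main__":
    omitted = load_omitted()
    # docs would be an iterable of (docno, text) pairs from one of the
    # three text versions; the same filter is applied to all three.
    docs = []
    kept = restrict_collection(docs, omitted)
    print(f"kept {len(kept)} documents")
```

The same `restrict_collection` call is applied to the original, degrade5, and degrade20 versions so that all three cover an identical document set, as the track required.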