TREC Washington Post Corpus
The TREC Washington Post Corpus contains 608,180 news articles and blog
posts from January 2012 through August 2017. The articles are
stored in JSON format, and include:
- date of publication
- kicker (a section header)
- article text broken into paragraphs
- links to embedded images and multimedia
Compressed, the tarball is about 1.5GB; decompressed,
the data is 6.9GB.
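As a rough illustration of working with the fields listed above, the following Python sketch parses one JSON record and pulls out the publication date, kicker, paragraphs, and media links. The field names (`id`, `published_date`, `contents`, `type`, `content`, `imageURL`), the millisecond timestamp, and the sample record itself are assumptions made for this example, not the documented schema; check them against the files you receive.

```python
import json
from datetime import datetime, timezone

# Hypothetical record. Every field name here is an assumption for illustration.
sample_line = json.dumps({
    "id": "example-0001",
    "published_date": 1325376000000,  # assumed: milliseconds since the Unix epoch
    "contents": [
        {"type": "kicker", "content": "Politics"},
        {"type": "paragraph", "content": "First paragraph."},
        {"type": "paragraph", "content": "Second paragraph."},
        {"type": "image", "imageURL": "https://example.org/photo.jpg"},
    ],
})


def summarize(line):
    """Extract the date, kicker, paragraphs, and media links from one record."""
    record = json.loads(line)
    items = record.get("contents") or []

    # Convert the assumed millisecond timestamp to a UTC calendar date.
    millis = record.get("published_date")
    published = None
    if millis is not None:
        published = datetime.fromtimestamp(millis / 1000, tz=timezone.utc).date().isoformat()

    kickers = [i["content"] for i in items if i.get("type") == "kicker"]
    return {
        "id": record.get("id"),
        "published": published,
        "kicker": kickers[0] if kickers else None,
        "paragraphs": [i["content"] for i in items if i.get("type") == "paragraph"],
        "media": [i["imageURL"] for i in items if "imageURL" in i],
    }


summary = summarize(sample_line)
print(summary["published"], summary["kicker"], len(summary["paragraphs"]))
```

Missing fields are returned as `None` or empty lists rather than raising, since not every article is guaranteed to carry every field.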
NEW: We have found that there are duplicate document
identifiers in the data. Contact NIST to obtain version 2 of the
data and scripts to rectify the issue.
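To see whether a copy of the data is affected, a check along the following lines may help. This is a minimal sketch, not the NIST script: it assumes one JSON object per line and an identifier field named `id`, both of which are assumptions to verify against your files, and `find_duplicate_ids` is a hypothetical helper name.

```python
import json
from collections import Counter


def find_duplicate_ids(lines, id_field="id"):
    """Return {identifier: count} for identifiers that occur more than once.

    `lines` is any iterable of JSON strings, such as an open file handle.
    The default `id_field` is an assumption about the record layout.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        counts[json.loads(line)[id_field]] += 1
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Made-up sample records standing in for lines of the corpus file.
sample = [
    '{"id": "a1", "title": "first"}',
    '{"id": "b2", "title": "second"}',
    '{"id": "a1", "title": "first, repeated"}',
]
print(find_duplicate_ids(sample))
```

This only reports duplicates; deciding which copy to keep is left to version 2 of the data and the accompanying scripts.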
You can get this dataset by sending a request to NIST with the
signed organizational agreement linked below.
- Organizational agreement
  - This agreement must be signed by the person responsible for the data
    at your organization, and sent to NIST.
- Individual agreement
  - This agreement must be signed by all researchers using the TREC
    Washington Post Collection at your organization, and kept on file at your organization.
Getting the corpus
- Download and print the Organizational and Individual agreements.
- Send a scan of the signed Organizational agreement to NIST at:
In your email include the following:
Subject: request for Washington Post corpus
- Complete and keep the individual agreement form on file at your organization.
- Subject to our approval, a download URL, login, and password will
be sent via email. Please allow seven business days for a response.