TREC Washington Post Corpus

The TREC Washington Post Corpus contains 671,947 news articles and blog posts from January 2012 through December 2019. The articles are stored in JSON format, and include:

  • title
  • byline
  • date of publication
  • kicker (a section header)
  • article text broken into paragraphs
  • links to embedded images and multimedia (for 2012-2017 documents)

Compressed, the tarball is about 1.8GB; decompressed, the data is 6.8GB.
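
As a rough illustration of working with the fields listed above, the following Python sketch reads articles one at a time instead of loading the full 6.8GB into memory. It assumes, hypothetically, that the decompressed data holds one JSON object per line and that the keys are named "title", "byline", "date", "kicker", and "paragraphs"; none of these names are confirmed by this page, so check them against the actual files before relying on them.

```python
import json
from typing import Iterable, Iterator


def iter_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one summary dict per JSON-encoded article line.

    Assumption: one JSON object per line, with the hypothetical keys
    "title", "byline", "date", "kicker", and "paragraphs".
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc = json.loads(line)
        yield {
            # .get() returns None when a field is missing from a document
            "title": doc.get("title"),
            "byline": doc.get("byline"),
            "date": doc.get("date"),
            "kicker": doc.get("kicker"),
            # treat a missing or null paragraph list as empty
            "paragraphs": list(doc.get("paragraphs") or []),
        }
```

Because the function accepts any iterable of strings, an open file handle can be passed directly (for example, `iter_articles(open(path, encoding="utf-8"))`, where `path` points to wherever the data was decompressed), so only one article is held in memory at a time.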

NEW: We have removed exact and near-duplicate documents from the collection. Please get version 3 below!

You can get this dataset by sending a request to NIST with the signed organizational agreement linked below.

Organizational agreement
This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
Individual agreement
This agreement must be signed by all researchers using the TREC Washington Post Collection at your organization, and kept on file at your organization.

Getting the corpus

  1. Download and print the Organizational and Individual agreement forms above.
  2. Send a scan of the Organizational form to NIST to:
    In your email include the following:
    Subject: request for Washington Post corpus
  3. Complete the Individual agreement form and keep it on file at your organization.
  4. Subject to our approval, a download URL, login, and password will be sent via email. Please allow seven business days for a response.

This page was created on November 8, 2017
Last updated on Thursday, 23-Apr-2020 15:08:16 MDT