Reuters Corpora

In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community.

In the fall of 2004, NIST took over distribution of RCV1 and any future Reuters Corpora. You can now obtain these datasets by sending a request to NIST and signing the agreements below.

What's available

RCV1

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 (Release date 2000-11-03, Format version 1, correction level 0)

This is distributed on two CDs and contains about 810,000 English-language Reuters news stories. The uncompressed files require about 2.5 GB of storage.

RCV2

Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 (Release date 2005-05-31, Format version 1, correction level 0)

This is distributed on one CD and contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.

The stories in the Reuters Corpus are under the copyright of Reuters Ltd, and their use is governed by the following agreements:

Organizational agreement
This agreement must be signed by someone at your organization, and is kept on file at NIST.
Individual agreement
This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file by the contact person indicated in the organizational agreement.

Getting the corpus

To request a copy of the Reuters Corpus:

  1. Download and print out the organizational and individual agreement forms above.
  2. Fill out the organizational form and send it by mail or fax to the address or fax number indicated on the form.
  3. Send an email to reuters-request@nist.gov letting us know that your form has been mailed or faxed. Include the complete address to which the CDs should be mailed, and indicate whether you are requesting RCV1, RCV2, or both.
  4. Complete and keep the individual agreement form at your organization.
Subject to our approval, you will then receive the corpus CDs by mail.

If you have already obtained RCV1 and want to get RCV2, send an email to reuters-request@nist.gov. Please provide the name of your organization and indicate that you have RCV1 and are interested in getting RCV2. An organizational agreement must be on file at NIST for you to receive RCV2.

Reuters Corpus Resources

The article,

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

provides an extensive description of RCV1 and its category coding. Several on-line appendices, including tokenized versions of the collection, can be found via David Lewis' website.





This page was created on November 2, 2004
Last updated on Wednesday, 06-Sep-06 07:39:32
Contact: reuters-request@nist.gov