Reuters Corpora

In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection heavily used in the text classification community.

In fall 2004, NIST took over distribution of RCV1 and any future Reuters Corpora. You can now get these datasets by signing the agreements below and sending a request to NIST.

What's available

RCV1

Reuters Corpus, Volume 1, English language, 1996-08-20 to 1997-08-19 (Release date 2000-11-03, Format version 1, correction level 0)

This is distributed on two CDs and contains about 810,000 English-language Reuters news stories. The uncompressed files require about 2.5 GB of storage.

RCV2

Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to 1997-08-19 (Release date 2005-05-31, Format version 1, correction level 0)

This is distributed on one CD and contains over 487,000 Reuters News stories in thirteen languages (Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish). These stories are contemporaneous with RCV1, but some languages do not cover the entire time period.

The stories in the Reuters Corpus are under the copyright of Reuters Ltd, and their use is governed by the following agreements:

Organizational agreement
This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
Individual agreement
This agreement must be signed by all researchers using the Reuters Corpus at your organization, and kept on file at your organization.

Getting the corpus

  1. Download and print the Organizational and Individual agreement forms above.
  2. Send the Organizational form to NIST by one of the methods listed below:

    send a scanned PDF file
    Complete the Reuters Organizational form and send a PDF file of the form to:
    reuters-request@nist.gov
    In your email include the following:
    Subject: request for Reuters corpus
    In the body of the message include: your name, your complete postal address, and whether you are requesting RCV1, RCV2, or both.
    (do not include other correspondence in this message)

    mail forms
    ATTN: TREC Data Supplier, Reuters corpus request
    National Institute of Standards and Technology
    100 Bureau Drive
    Stop 8940
    Gaithersburg, MD 20899-8940 USA

    fax forms
    ATTN: Lori Buckland, NIST
    Fax # 301-975-5287

    When you mail or fax forms to NIST, also send an email to reuters-request@nist.gov that includes: your name, your complete postal address, and whether you are requesting RCV1, RCV2, or both.

  3. Complete the Individual agreement form and keep it on file at your organization.
  4. Subject to our approval, you will receive the corpus CDs by mail.
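
For groups that script their correspondence, the email request described in step 2 could be assembled roughly as follows. This is only a sketch using Python's standard email library: the function name, the body layout ("Name:", "Postal address:", "Requesting:"), and the attachment filename are illustrative assumptions, not a format required by NIST. Only the recipient address and the subject line come from this page. The message is built but not sent.

```python
from email.message import EmailMessage

REQUEST_ADDRESS = "reuters-request@nist.gov"   # address given on this page
REQUEST_SUBJECT = "request for Reuters corpus"  # subject line given on this page
VALID_CORPORA = {"RCV1", "RCV2"}


def build_request_email(name, postal_address, corpora, agreement_pdf, sender):
    """Build (but do not send) a request email with the signed form attached.

    The body layout and attachment filename are hypothetical choices.
    """
    wanted = sorted(set(corpora))
    if not wanted or not set(wanted) <= VALID_CORPORA:
        raise ValueError("corpora must be RCV1, RCV2, or both")

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = REQUEST_ADDRESS
    msg["Subject"] = REQUEST_SUBJECT
    # The page asks for name, complete postal address, and which corpora,
    # with no other correspondence in the message.
    msg.set_content(
        f"Name: {name}\n"
        f"Postal address:\n{postal_address}\n"
        f"Requesting: {' and '.join(wanted)}\n"
    )
    # Attach the scanned, signed Organizational agreement as a PDF.
    msg.add_attachment(
        agreement_pdf,
        maintype="application",
        subtype="pdf",
        filename="organizational_agreement.pdf",  # hypothetical filename
    )
    return msg
```

The returned message can be reviewed and then sent with whatever mail client or SMTP setup your organization uses.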

If you have already obtained RCV1 and want to get RCV2, send an email to reuters-request@nist.gov. Please provide the name of your organization and the month and year you requested RCV1, and state that you are interested in receiving RCV2. An Organizational agreement must be on file at NIST for you to receive RCV2.



Reuters Corpus Resources

The article,

Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf

provides an extensive description of RCV1 and its category coding. Several online appendices, including tokenized versions of the collection, can be found via David Lewis' website.





This page created on November 2, 2004
Last updated on Thursday, 30-Apr-2009 14:15:20 EDT
Contact: reuters-request@nist.gov