Reuters Corpora (RCV1, RCV2, TRC2)
In 2000, Reuters Ltd made available a large collection of Reuters News
stories for use in research and development of natural language
processing, information retrieval, and machine learning systems. This
corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly
larger than the older, well-known Reuters-21578 collection heavily
used in the text classification community.
In the fall of 2004, NIST took over distribution of RCV1 and any future
Reuters Corpora. You can now get these datasets by signing the agreements
below and sending a request to NIST.
What's available
RCV1
Reuters Corpus, Volume 1, English language, 1996-08-20 to
1997-08-19 (Release date 2000-11-03, Format version 1, correction
level 0)
This is distributed via web download and contains about 810,000
English-language Reuters news stories. The uncompressed files require
about 2.5 GB of storage.
RCV2
Reuters Corpus, Volume 2, Multilingual Corpus, 1996-08-20 to
1997-08-19 (Release date 2005-05-31, Format version 1, correction
level 0)
This is distributed via web download and contains over 487,000 Reuters
News stories in thirteen languages (Dutch, French, German, Chinese,
Japanese, Russian, Portuguese, Spanish, Latin American Spanish,
Italian, Danish, Norwegian, and Swedish). The stories are NOT
PARALLEL: they were written by local reporters in each language rather
than translated. They are contemporaneous with RCV1, but some languages
do not cover the entire time period.
TRC2
Thomson Reuters Text Research Collection (TRC2)
The TRC2 corpus comprises 1,800,370 news stories (2,871,075,221 bytes) covering the period from 2008-01-01 00:00:03 to 2009-02-28 23:54:14. It was initially made available to participants of the 2009 blog track at the Text Retrieval Conference (TREC) to supplement the BLOGS08 corpus, which contains the results of a large blog crawl carried out at the University of Glasgow. TRC2 is distributed via web download.
The stories in the Reuters Corpus are under the copyright of
Reuters Ltd and/or Thomson Reuters, and their use is governed by the following agreements:
- Organizational agreement: This agreement must be signed by the person
responsible for the data at your organization, and sent to NIST.
- Individual agreement: This agreement must be signed by all researchers
using the Reuters Corpus at your organization, and kept on file at your
organization.
Getting the corpus
- Download and print the Organizational and Individual agreement
forms above.
- Send the Organizational form to NIST by one of the methods listed below:
Send a scanned PDF file
Complete the Reuters Organizational form and send a PDF file of the form to:
[email protected]
In your email include the following:
Subject: request for Reuters corpus
In the body of the message include: your name, your complete postal address, and whether you are requesting RCV1, RCV2, TRC2, or all three.
(do not include other correspondence in this message)
- Complete and keep the individual agreement form on file at your organization.
- Subject to our approval, we will send you a download URL, login, and password via email. Please allow seven business days for a response.
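The request email described in the steps above can be drafted programmatically. The following is a minimal sketch using Python's standard `email` package; the function name `build_request`, the attachment filename, and the body layout are illustrative assumptions, not a format prescribed by NIST. The recipient address is passed in by the caller and is not filled in here.

```python
from email.message import EmailMessage

# Corpus names as listed on this page.
VALID_CORPORA = ("RCV1", "RCV2", "TRC2")


def build_request(to_addr, name, postal_address, corpora, form_pdf):
    """Draft the request email (hypothetical helper).

    `form_pdf` is the scanned, signed Organizational agreement as bytes.
    """
    unknown = [c for c in corpora if c not in VALID_CORPORA]
    if not corpora or unknown:
        raise ValueError(f"corpora must be drawn from {VALID_CORPORA}")

    msg = EmailMessage()
    msg["To"] = to_addr
    msg["Subject"] = "request for Reuters corpus"
    # Body carries only the requested details, no other correspondence.
    msg.set_content(
        f"Name: {name}\n"
        f"Postal address: {postal_address}\n"
        f"Corpora requested: {', '.join(corpora)}\n"
    )
    # Attach the scanned form; the filename is an assumption.
    msg.add_attachment(
        form_pdf,
        maintype="application",
        subtype="pdf",
        filename="organizational_agreement.pdf",
    )
    return msg
```

The returned message can be inspected or saved before sending with whatever mail client or SMTP setup your organization uses; sending is deliberately left out of the sketch.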
If you have already obtained some of the Reuters corpora and wish to obtain others, send an email to
[email protected].
Please provide the name of your organization, the month and year in which you requested RCV1, RCV2, or TRC2,
and the corpus you are interested in receiving. An Organizational agreement must be on
file at NIST.
Reuters Corpora Resources
The article,
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine
Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf.
provides an extensive description of RCV1 and its category coding.
Several on-line appendices, including tokenized versions of the
collection, can be found via David
Lewis' website.