Reuters Corpora
In 2000, Reuters Ltd made available a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems. This corpus, known as "Reuters Corpus, Volume 1" or RCV1, is significantly larger than the older, well-known Reuters-21578 collection that is heavily used in the text classification community. In the fall of 2004, NIST took over distribution of RCV1 and any future Reuters Corpora. You can now get these datasets by sending a request to NIST and signing the agreements below.
What's available
The stories in the Reuters Corpus are under the copyright of Reuters Ltd, and their use is governed by the following agreements:
Getting the corpus
To request a copy of the Reuters Corpus,
If you have already obtained RCV1 and want to get RCV2, send email to reuters-request@nist.gov. Please provide the name of your organization and indicate that you have RCV1 and are interested in getting RCV2. An Organization agreement must be on file at NIST for you to receive RCV2.
Reuters Corpus Resources
The article
Lewis, D. D.; Yang, Y.; Rose, T.; and Li, F. RCV1: A New Benchmark Collection for Text Categorization Research. Journal of Machine Learning Research, 5:361-397, 2004. http://www.jmlr.org/papers/volume5/lewis04a/lewis04a.pdf
provides an extensive description of RCV1 and its category coding. Several on-line appendices, including tokenized versions of the collection, can be found via David Lewis' website.
This page created on November 2, 2004
Last updated on Wednesday, 06-Sep-06 07:39:32
Contact: reuters-request@nist.gov