As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere; that is, both important and spam tweets are included.
The Tweets2011 corpus is unusual in that it is distributed as a list of tweet identifiers rather than as the tweets themselves; the actual tweets are downloaded directly from Twitter using the open-source twitter-tools. To obtain the lists of tweets to be downloaded (the "tweet lists"), you must first sign a data usage agreement and email it back to NIST, which will then provide a username and password for downloading the tweet lists (in the form of a .tar.gz file).
Obtaining the collection
Download and sign the TREC 2011 Microblog Dataset Usage Agreement. Note that this agreement also requires you to act within the Twitter terms of service; in particular, you agree not to redistribute the data and to delete any tweets that are later marked as deleted. The twitter-corpus-tools package provides support for removing deleted tweets from your copy of the corpus.
Email the signed agreement, as a PDF file, to Lori Buckland <firstname.lastname@example.org>. In the body of your email,
We will respond to your request with a URL, a username, and a password with which you can download the tweet lists. Please allow seven business days for a response.
Once you have downloaded and decompressed the tweet lists from NIST, obtain and run the twitter-corpus-tools corpus downloader. For further instructions on downloading and using the corpus downloader, consult the twitter-corpus-tools documentation.
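The decompression step above can be done with any standard tar utility. As an illustration only, the following minimal Python sketch unpacks a .tar.gz archive using the standard library. The function name extract_tweet_lists is a hypothetical helper and is not part of twitter-corpus-tools, and the sketch makes no assumption about the names or internal format of the files inside the archive.

```python
import shutil
import tarfile
from pathlib import Path


def extract_tweet_lists(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted file paths.

    Hypothetical helper for illustration; not part of twitter-corpus-tools.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            # Only regular files are written; directories are created on demand.
            if not member.isfile():
                continue
            target = (dest / member.name).resolve()
            # Skip any entry whose path would escape the destination directory.
            if dest not in target.parents:
                continue
            target.parent.mkdir(parents=True, exist_ok=True)
            source = archive.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return sorted(extracted)
```

A call such as extract_tweet_lists("tweet-lists.tar.gz", "tweet-lists") would unpack the archive into a local directory, where "tweet-lists.tar.gz" stands in for whatever file name NIST actually provides. The extracted files are then the input to the corpus downloader.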
This page created on August 30, 2011
Last updated on Tuesday, 05-Mar-2013 16:02:20 EST