| Tweets2011

As part of the TREC 2011 microblog track, Twitter provided identifiers for approximately 16 million tweets sampled between January 23rd and February 8th, 2011. The corpus is designed to be a reusable, representative sample of the twittersphere, i.e. both important and spam tweets are included.

The Tweets2011 corpus is unusual in that what you get is a list of tweet identifiers; the actual tweets are downloaded directly from Twitter using the open-source twitter-tools. However, to obtain the lists of tweets to be downloaded (i.e. the "tweet lists"), a data usage agreement must be signed. Once signed, the agreement must be emailed back to NIST, who will provide you with a username/password to download the tweet lists (in the form of a .tar.gz file).

Obtaining the collection

Download and sign the TREC 2011 Microblog Dataset Usage Agreement. Please note that this agreement also requires you to act within the Twitter terms of service; in particular, you agree not to redistribute the data and to delete tweets that are marked deleted in the future. twitter-tools provides support for removing deleted tweets from your copy of the corpus.

Email the signed agreement, as a PDF file, to Angela Ellis <[email protected]>. In the body of your email, 
 We will respond to your request with a URL, a username, and a password with which you can download the tweet lists. Please allow seven business days for a response.

Once you have downloaded and decompressed the tweet lists from NIST, you should obtain and run the corpus downloader. For further instructions on downloading and using the twitter-tools corpus downloader, see twitter-tools.

You MUST NOT redistribute the tweet lists or the corpus obtained by using the tweet lists, as this breaks both the Tweets2011 corpus license agreement and the Twitter Terms of Use.

Note that it can take several days to download your copy of the Tweets2011 corpus. | 
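The decompression step mentioned above can be sketched in Python using only the standard library. This is a minimal illustration, not part of twitter-tools: the function name `extract_tweet_lists` and the archive and directory names in the usage comment are hypothetical, and the actual name and layout of the archive supplied by NIST may differ. It only unpacks the .tar.gz file; fetching the tweets themselves is still done with the twitter-tools corpus downloader.

```python
import tarfile
from pathlib import Path


def extract_tweet_lists(archive_path, dest_dir):
    """Unpack a .tar.gz archive into dest_dir and return the extracted files.

    Hypothetical helper for illustration; not part of twitter-tools.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            target = (dest / member.name).resolve()
            # Refuse entries that would be written outside dest_dir.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    # Return regular files only, in a stable order.
    return sorted(p for p in dest.rglob("*") if p.is_file())


# Example usage (file and directory names are assumptions):
# for path in extract_tweet_lists("tweet-lists.tar.gz", "tweet-lists"):
#     print(path)
```

The extracted tweet lists fall under the same agreement as the archive itself, so the destination directory should not be shared or redistributed.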
| This page was created on August 30, 2011. Contact: [email protected] |   |