BLOG06 README
=============

Contents of BLOG06 Test Collection
----------------------------------

BLOG06 consists of a crawl of 100,649 RSS and Atom feeds, over an 11 week
period (a total of 77 days). The collection consists of one directory for
each day of the collection. In each day's directory, there are the
following files:

1. feeds-*.gz
   These contain the feeds downloaded on a particular day. Each file
   contains 1000 feeds, compressed by gzip. The tags used in the
   feeds-*.gz files are as follows:
   - The start and end of each feed are delimited by DOC tags.
   - The FEEDNO identifies the feed number of that feed.
     Example: BLOG06-feed-000001
   - The FEEDURL identifies the URL from which the feed was downloaded.
   - BLOGHPNO and BLOGHPURL denote the corresponding homepage of the
     feed. These will be empty if the blog homepage was not downloaded
     on that day.
     Example BLOGHPNO: BLOG06-bloghp-000001
   - PERMALINKS denotes the URLs and DOCNOs of the permalink documents
     that were newly found in this feed.
   - DOCHDR contains similar information to previous TREC Web
     collections, including the HTTP headers received from the
     webserver.

2. bloghps-*.gz
   These contain the HTML homepages for the feeds downloaded on a
   particular day. Each file contains 1000 homepages, compressed by
   gzip. The documents are identified by their BLOGHPNO and BLOGHPURL.
   The tags used are identical to those in the feeds files.

3. permalinks-*.gz
   These contain the HTML permalink documents newly found on the given
   day. Each document was fetched at least 2 weeks after the permalink
   was discovered in a feed. The tags used in the permalinks-*.gz files
   are as follows:
   - The start and end of each permalink document are delimited by DOC
     tags.
   - The DOCNO identifies the permalink document. It contains three
     components: the date the link was discovered, the file number, and
     the offset in the uncompressed file.
     Example: BLOG06-20051206-000-0000000000 is the first document in
     file 000 of the first day of the collection.
   - The corresponding FEEDNO and FEEDURL of the feed that this
     permalink came from.
   - The corresponding BLOGHPNO and BLOGHPURL of the homepage of this
     blog, if it was crawled.
   - PERMALINK is the URL of this permalink document.
   - DATE_XML is the date of issue of the permalink, as stated in the
     RSS or Atom feed. As such tags are optional in the feeds, this
     information is not always present. Should you choose to use this
     information, you should make your own decision on how to
     supplement it when it is not present for a document.
   - DOCHDR contains similar information to previous TREC Web
     collections, including the HTTP headers received from the
     webserver.

4. peramalinks.txt.gz
   This file contains the list of feeds, and associated homepages and
   permalink documents for the day of the collection.

5. redirects_feeds_homes.gz
   A mapping (URL->URL) of HTTP redirects experienced on this day of
   the feeds and homepages crawl.

extra directory
---------------

The extra folder contains auxiliary files that might be of use when
using this collection. It contains the following files:

README                  - this file
docno2url.gz            - mapping from permalinks DOCNOs to URLs
docno2date.gz           - mapping from permalinks DOCNOs to DATE_XML
                          (see above).
feedno2url.gz           - mapping from feeds FEEDNOs to URLs
bloghpno2url.gz         - mapping from blog homepage BLOGHPNOs to URLs
redirects_permalinks.gz - mapping (URL->URL) of redirects experienced
                          while fetching the permalinks.
md5sums                 - checksums of the gzip files in the collection.
                          Verify your collection using the command
                              md5sum --check extras/md5sums

URLs are of the form:      http://server_name/path#
DOCNOs are of the form:    BLOG06-20051206-000-0000000000
FEEDNOs are of the form:   BLOG06-feed-000001
BLOGHPNOs are of the form: BLOG06-bloghp-000001

Disclaimer
----------

While all reasonable attempts have been made to build this collection,
some days of data are missing due to technical matters outwith our
control. In some of these scenarios, the days following missing days may
include the missing feeds.

-----------------------------------------------
Information Retrieval Group - Test Collections
University of Glasgow
test_collections@dcs.gla.ac.uk
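
The identifier forms listed above can be split into their components
along the following lines. This is only a minimal sketch and not part of
the collection's tooling: the function and class names (parse_docno,
parse_feedno, parse_bloghpno, PermalinkDocno) are hypothetical, and the
patterns are inferred solely from the example identifiers in this README
(an 8-digit date, then a numeric file number and a numeric offset), so
they may need adjusting against the real data.

```python
import re
from datetime import date, datetime
from typing import NamedTuple


class PermalinkDocno(NamedTuple):
    """Components of a permalink DOCNO, as described in this README."""
    discovered: date   # date the link was discovered
    file_number: int   # number of the permalinks-*.gz file
    offset: int        # offset in the uncompressed file


# Assumption: patterns are inferred from the README examples only.
_DOCNO_RE = re.compile(r"BLOG06-(\d{8})-(\d+)-(\d+)")
_FEEDNO_RE = re.compile(r"BLOG06-feed-(\d+)")
_BLOGHPNO_RE = re.compile(r"BLOG06-bloghp-(\d+)")


def parse_docno(docno: str) -> PermalinkDocno:
    """Split a DOCNO such as BLOG06-20051206-000-0000000000."""
    match = _DOCNO_RE.fullmatch(docno.strip())
    if match is None:
        raise ValueError(f"not a BLOG06 permalink DOCNO: {docno!r}")
    day, file_number, offset = match.groups()
    discovered = datetime.strptime(day, "%Y%m%d").date()
    return PermalinkDocno(discovered, int(file_number), int(offset))


def parse_feedno(feedno: str) -> int:
    """Return the numeric part of a FEEDNO such as BLOG06-feed-000001."""
    match = _FEEDNO_RE.fullmatch(feedno.strip())
    if match is None:
        raise ValueError(f"not a BLOG06 FEEDNO: {feedno!r}")
    return int(match.group(1))


def parse_bloghpno(bloghpno: str) -> int:
    """Return the numeric part of a BLOGHPNO such as BLOG06-bloghp-000001."""
    match = _BLOGHPNO_RE.fullmatch(bloghpno.strip())
    if match is None:
        raise ValueError(f"not a BLOG06 BLOGHPNO: {bloghpno!r}")
    return int(match.group(1))
```

For instance, parse_docno("BLOG06-20051206-000-0000000000") yields the
discovery date 2005-12-06 with file number 0 and offset 0. Malformed
identifiers raise ValueError rather than being silently skipped, which
seems the safer default when checking a collection for gaps.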