Text REtrieval Conference (TREC) Knowledge Base Acceleration Track


The document set used in the TREC Knowledge Base Acceleration (KBA) track is hosted in Amazon Web Services (AWS) S3 (see below for details). The corpus is governed by a User Agreement, which must be signed to receive the decryption key for the corpus. The document collections are available to anyone. TREC KBA annotations and queries from previous years are also available to everyone. The TREC KBA queries and annotations for the current year are available only to people registered for the current year's TREC.

To obtain the key, your organization must complete the TREC KBA Organizational User Agreement. The TREC KBA Individual User Agreement form is to be completed and retained by your organization.

Organizational User Agreement: This agreement must be signed by the person responsible for the data at your organization, and sent to NIST.
Individual User Agreement: This agreement must be signed by all researchers using the data at your organization, and kept on file at your organization.

Send a scanned PDF of your completed Organization Application to use the TREC KBA Information-Retrieval Text Research Collections to trec (at) nist (dot) gov. In your email, include the following:

Subject: TREC KBA Form
In the body of the message, include:
- your name
- your organization name
- the fact that you are requesting the decryption key for the KBA corpus

If you are unable to send a PDF file, contact the TREC address for other options.

Requests for data are handled in the order they are received.

TREC KBA Corpora

The TREC KBA StreamCorpora are available in Amazon Web Services (AWS) S3.

We recommend using Amazon's Elastic Compute Cloud (EC2) and Elastic MapReduce (EMR) tools to process the corpus. While this will incur compute charges, Amazon hosts the corpus for free in this bucket:
```
s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/
```
A useful tool for interacting with the corpus in S3 is s3cmd (http://s3tools.org/s3cmd).
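The bucket path above and the wget URLs given later on this page both point into the same public bucket, once as an s3:// URI and once as a path-style HTTP URL. The sketch below shows that mapping. The function name `s3_to_http` is hypothetical, and the assumption that every object under the 2012 path is reachable over plain HTTP in the same way as the 2013 paths is not confirmed by this page.

```shell
# Hypothetical helper: rewrite an s3:// URI as a path-style HTTP URL
# on s3.amazonaws.com (the host used by the wget examples on this page).
s3_to_http() {
    case "$1" in
        s3://*)
            # Strip the "s3://" scheme; what remains is "bucket/key".
            printf 'http://s3.amazonaws.com/%s\n' "${1#s3://}"
            ;;
        *)
            printf 'not an s3:// URI: %s\n' "$1" >&2
            return 1
            ;;
    esac
}
```

For example, `s3_to_http s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/` prints the HTTP form of the bucket path. With s3cmd installed and configured, the s3:// form can be listed directly with `s3cmd ls s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/`.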

You can also retrieve the corpus using wget. For example, you can retrieve the 2013 version of the corpus using the commands below. See the TREC KBA discussion forum and the StreamCorpus discussion forum for more details.

## Fetch the list of directory names -- date-hour strings
wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/dir-names.txt

## Use GNU parallel to make multiple wget requests in parallel.
## The --continue flag makes this restartable.

cat dir-names.txt | parallel -j 10 --eta 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/{}/index.html'
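Before starting the full download, it can be useful to preview the URLs that the parallel command will request. The following is a minimal sketch, assuming dir-names.txt holds one date-hour directory name per line. The function names `kba_index_url` and `kba_index_urls` are hypothetical helpers, not part of any TREC KBA tooling, and the directory name in the usage comment is only an illustrative value.

```shell
# Base URL of the 2013 corpus, as used in the wget commands above.
KBA_BASE_URL="http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0"

# Hypothetical helper: print the index URL for one date-hour directory name.
kba_index_url() {
    printf '%s/%s/index.html\n' "$KBA_BASE_URL" "$1"
}

# Hypothetical helper: read directory names from stdin (one per line)
# and print one index URL per name, skipping blank lines.
kba_index_urls() {
    # The "|| [ -n ... ]" clause also handles a final line with no newline.
    while IFS= read -r kba_dir || [ -n "$kba_dir" ]; do
        if [ -n "$kba_dir" ]; then
            kba_index_url "$kba_dir"
        fi
    done
}

# Usage (nothing is downloaded; URLs are only printed):
#   kba_index_urls < dir-names.txt | head -n 3
#   kba_index_url 2012-02-13-05    # illustrative directory name
```

Printing the URLs first is a cheap check that dir-names.txt was fetched intact before committing to a long recursive download.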

TREC KBA topics, relevance judgments, and scripts

See the data sets at the TREC KBA data web site.

Date created: July 17, 2011