Knowledge Base Acceleration Track
The document set used in the TREC Knowledge Base Acceleration (KBA) track is available in Amazon Web Services (AWS) S3 (see below for details). The corpus is governed by a User Agreement, which must be signed to receive the decryption key for the corpus. The document collections are available to anyone. TREC KBA annotations and queries from previous years are also available to everyone. The TREC KBA queries and annotations for the current year are available only to people registered for the current year's TREC.
To obtain the key, your organization must complete the TREC KBA Organizational User Agreement. The TREC KBA Individual User Agreement form is to be completed and retained by your organization.
Send a scanned PDF file of your completed Organization Application to use the TREC KBA Information-Retrieval Text Research Collections to trec (at) nist (dot) gov. In your email include the following:
If you are unable to send a PDF file, contact the TREC address for other options.
Requests for data are handled in the order they are received.
The TREC KBA StreamCorpora are available in Amazon Web Services (AWS) S3:
s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/
A useful tool for interacting with the corpus in S3 is http://s3tools.org/s3cmd.
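As a rough illustration of how s3cmd might be used against the corpus location given above, the following sketch wraps two common s3cmd calls in small shell functions. The function names `kba_ls` and `kba_get_dir` are hypothetical, and the sketch assumes s3cmd is already installed and configured. The files it downloads would presumably still need the decryption key described earlier.

```shell
#!/bin/sh
# Corpus location as given above.
CORPUS="s3://aws-publicdatasets/trec/kba/kba-stream-corpus-2012/"

# Hypothetical helper: list the top-level entries of the corpus.
kba_ls() {
    s3cmd ls "$CORPUS"
}

# Hypothetical helper: fetch one corpus directory into a local directory.
#   $1 = directory name under the corpus
#   $2 = local destination root (defaults to the current directory)
kba_get_dir() {
    dest="${2:-.}"
    # Create the local target so s3cmd treats it as a directory.
    mkdir -p "$dest/$1"
    s3cmd get --recursive "$CORPUS$1/" "$dest/$1/"
}
```

After sourcing the file, `kba_ls` shows what is available, and `kba_get_dir NAME ./corpus` fetches a single directory, where `NAME` stands for one of the listed directory names.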
    ## Fetch the list of directory names -- date-hour strings
    wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/dir-names.txt

    ## Use GNU parallel to make multiple wget requests in parallel.
    ## The --continue flag makes this restartable.
    cat dir-names.txt | parallel -j 10 --eta 'wget --recursive --continue --no-host-directories --no-parent --reject "index.html*" http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0/{}/index.html'
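Before starting a long parallel download, it can help to preview which URLs the wget command above would request. The sketch below is a dry run only: it reads directory names from standard input and prints the matching index URL for each, without downloading anything. The function name `kba_index_urls` is hypothetical, and the base URL is copied from the commands above.

```shell
#!/bin/sh
# Base URL copied from the wget commands above.
BASE_URL="http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0"

# Hypothetical helper: read directory names on stdin, one per line,
# and print the index URL that would be requested for each.
kba_index_urls() {
    # The "|| [ -n ... ]" clause keeps a final line that lacks a newline.
    while IFS= read -r dir || [ -n "$dir" ]; do
        # Skip blank lines.
        [ -n "$dir" ] || continue
        printf '%s/%s/index.html\n' "$BASE_URL" "$dir"
    done
}
```

For example, `head -n 3 dir-names.txt | kba_index_urls` prints the first three URLs, which is a cheap way to check the list before handing it to GNU parallel.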