2010 Chemical IR Track


The following data are available for research use provided the creators are acknowledged. Please cite: Mihai Lupu, John Tait, Jimmy Huang, and Jianhan Zhu. TREC-CHEM 2010 : Notebook Report. Proceedings of TREC 2010. NIST Special Publication SP 500-294. 2011. trec.nist.gov/pubs/trec19/papers/CHEM.OVERVIEW.pdf.

Data

All the files present on this page are also listed here.

Scientific Articles

The TREC-CHEM'10 collection includes all scientific articles that were in TREC-CHEM'09, but adds the images that were missing from those articles (where these images exist), as well as new articles from different publishers and from PubMed Central. In total, there are 176,528 articles (xml or nxml files).

The archives below are listed by source, and within each source they are organized by file type. If you are only interested in text, you can download only the XML or NXML archives; if you can process images, download the TIF, JPG or GIF archives; if you can process PDF, download those as well, and so forth.
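Each archive in the tables below is paired with a .md5 checksum file, so a download can be checked before unpacking. The following is a minimal sketch of such a check. It assumes the .md5 file follows the common md5sum layout, where the first whitespace-separated token is the hex digest; the page does not state the exact layout, and the function names are illustrative only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_md5(md5_path):
    """Read the expected digest from a .md5 file.

    Assumption: the first whitespace-separated token is the hex digest,
    as written by the md5sum tool.
    """
    text = Path(md5_path).read_text().strip()
    return text.split()[0].lower()


def verify_archive(archive_path, md5_path=None):
    """Return True if the archive matches its companion checksum file.

    By default the checksum file is looked up next to the archive,
    under the archive's name with '.md5' appended.
    """
    archive_path = Path(archive_path)
    if md5_path is None:
        md5_path = archive_path.with_name(archive_path.name + ".md5")
    return md5_of_file(archive_path) == expected_md5(md5_path)
```

For example, verify_archive("Hindawi_xml.tar.gz") would look for Hindawi_xml.tar.gz.md5 in the same directory and compare the two digests.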

Scientific Articles from Hindawi Publishing

File  Size  Checksum  Comments

Hindawi_jpg.tar.gz 38M Hindawi_jpg.tar.gz.md5  
Hindawi_pdf.tar.gz 201M Hindawi_pdf.tar.gz.md5  
Hindawi_xml.tar.gz 4.0M Hindawi_xml.tar.gz.md5  


Scientific Articles from the International Union of Crystallography

File  Size  Checksum  Comments

IUCrJnls_cif.tar.gz 18M IUCrJnls_cif.tar.gz.md5  The CIF filetype represents a "Crystallographic Information File", as described here
IUCrJnls_gif.tar.gz 2.5M IUCrJnls_gif.tar.gz.md5  
IUCrJnls_hkl.tar.gz 210M IUCrJnls_hkl.tar.gz.md5  Measurements of reflections. See description here
IUCrJnls_html.tar.gz 36M IUCrJnls_html.tar.gz.md5  
IUCrJnls_pdf.tar.gz 2.4G IUCrJnls_pdf.tar.gz.md5  
IUCrJnls_rtv.tar.gz 1.5M IUCrJnls_rtv.tar.gz.md5  IUCr Rietveld powder data
IUCrJnls_tif.tar.gz 6.9G IUCrJnls_tif.tar.gz.md5  
IUCrJnls_xml.tar.gz 8.4M IUCrJnls_xml.tar.gz.md5  


Scientific Articles from Oxford Publishing (not already in PMC)

File  Size  Checksum  Comments

jpeg_OxfordNAR.tar.gz 3.8M jpeg_OxfordNAR.tar.gz.md5  
sgml_OxfordNAR.tar.gz 820K sgml_OxfordNAR.tar.gz.md5  Older articles from Oxford Publishers are in SGML rather than XML
xml_OxfordNAR.tar.gz 608K xml_OxfordNAR.tar.gz.md5  Newer articles


Scientific Articles from PubMed Central (open access)

File  Size  Checksum  Comments

bmp_PMC.tar.gz 14M bmp_PMC.tar.gz.md5  
cdx_PMC.tar.gz 6.5M cdx_PMC.tar.gz.md5  Chemical Structure Exchange Files
cif_PMC.tar.gz 132K cif_PMC.tar.gz.md5  Crystallographic Information File
eps_PMC.tar.gz 2.0G eps_PMC.tar.gz.md5  
html_PMC.tar.gz 15M html_PMC.tar.gz.md5  
jpg_PMC.tar.gz 31G jpg_PMC.tar.gz.md5  
nxml_PMC.tar.gz 1.8G nxml_PMC.tar.gz.md5  These are the main files for this part of the PMC collection
pdf_PMC.tar.gz 89G pdf_PMC.tar.gz.md5  
png_PMC.tar.gz 330M png_PMC.tar.gz.md5  
tif_PMC.tar.gz 17G tif_PMC.tar.gz.md5  


Scientific Articles from the Royal Society of Chemistry

File  Size  Checksum  Comments

tif_RSC.tar.gz 94G tif_RSC.tar.gz.md5  
xml_RSC.tar.gz 604M xml_RSC.tar.gz.md5  


Scientific Articles from Molecular Diversity Preservation International (MDPI) (not already in PMC)

File  Size  Checksum  Comments

pdf_MDPI.tar.gz 15M pdf_MDPI.tar.gz.md5  
tif_MDPI.tar.gz 70M tif_MDPI.tar.gz.md5  
xml_MDPI.tar.gz 440K xml_MDPI.tar.gz.md5  


Patent Data

The TREC-CHEM'10 collection includes all patent documents from the EPO, USPTO and WIPO which have been classified at least once in category C of the IPC or in class A61K of the IPC. All documents included in this collection contain more than just bibliographical data (i.e. they contain claims plus an abstract, a description, or both). The unique document identifiers of this collection are the full UCIDs, not just the country-number pairs used in 2009. In total, there are 1,277,467 xml files (UCIDs).

File  Size  Checksum  Comments

tif_US020060.tar.gz 7.2G tif_US020060.tar.gz.md5  
tif_US020080.tar.gz 3.2G tif_US020080.tar.gz.md5  
tif_US000006.tar.gz 22G tif_US000006.tar.gz.md5  
tif_US00000D.tar.gz 72K tif_US00000D.tar.gz.md5  
tif_US000000.tar.gz 8.1M tif_US000000.tar.gz.md5  
tif_WO001978.tar.gz 108K tif_WO001978.tar.gz.md5  
tif_WO001980.tar.gz 14M tif_WO001980.tar.gz.md5  
tif_WO001981.tar.gz 9.5M tif_WO001981.tar.gz.md5  
tif_WO001983.tar.gz 20M tif_WO001983.tar.gz.md5  
tif_WO001985.tar.gz 29M tif_WO001985.tar.gz.md5  
tif_WO001986.tar.gz 37M tif_WO001986.tar.gz.md5  
tif_WO001988.tar.gz 52M tif_WO001988.tar.gz.md5  
tif_WO001989.tar.gz 90M tif_WO001989.tar.gz.md5  
tif_WO001991.tar.gz 157M tif_WO001991.tar.gz.md5  
tif_WO001993.tar.gz 301M tif_WO001993.tar.gz.md5  
tif_WO001994.tar.gz 344M tif_WO001994.tar.gz.md5  
tif_WO001996.tar.gz 397M tif_WO001996.tar.gz.md5  
tif_WO001997.tar.gz 518M tif_WO001997.tar.gz.md5  
tif_WO001999.tar.gz 840M tif_WO001999.tar.gz.md5  
tif_WO002001.tar.gz 1.7G tif_WO002001.tar.gz.md5  
tif_WO002002.tar.gz 2.7G tif_WO002002.tar.gz.md5  
tif_WO002004.tar.gz 2.8G tif_WO002004.tar.gz.md5  
tif_WO002005.tar.gz 2.8G tif_WO002005.tar.gz.md5  
tif_WO002007.tar.gz 2.0G tif_WO002007.tar.gz.md5  
nb_US020040.tar.gz 844K nb_US020040.tar.gz.md5  Mathematica Notebook File
nb_US020010.tar.gz 108K nb_US020010.tar.gz.md5  
nb_US020030.tar.gz 924K nb_US020030.tar.gz.md5  
nb_US020050.tar.gz 916K nb_US020050.tar.gz.md5  
nb_US020070.tar.gz 580K nb_US020070.tar.gz.md5  
nb_US000006.tar.gz 1.6M nb_US000006.tar.gz.md5  
nb_US000007.tar.gz 580K nb_US000007.tar.gz.md5  
nb_US00000R.tar.gz 16K nb_US00000R.tar.gz.md5  
xmlSEQ_US020080.tar.gz 19M xmlSEQ_US020080.tar.gz.md5  Genetic sequences. NOTE: the file extension is 'xml', just like the main files
xmlSEQ_US020050.tar.gz 68M xmlSEQ_US020050.tar.gz.md5  
cdx_US020040.tar.gz 801M cdx_US020040.tar.gz.md5  
cdx_US020020.tar.gz 263M cdx_US020020.tar.gz.md5  
cdx_US020050.tar.gz 824M cdx_US020050.tar.gz.md5  
cdx_US020060.tar.gz 841M cdx_US020060.tar.gz.md5  
cdx_US020080.tar.gz 339M cdx_US020080.tar.gz.md5  
cdx_US000006.tar.gz 1.2G cdx_US000006.tar.gz.md5  
cdx_US000000.tar.gz 16K cdx_US000000.tar.gz.md5  
mol_US020040.tar.gz 273M mol_US020040.tar.gz.md5  Molecular Structure Data files
mol_US020010.tar.gz 16M mol_US020010.tar.gz.md5  
mol_US020030.tar.gz 174M mol_US020030.tar.gz.md5  
mol_US020050.tar.gz 271M mol_US020050.tar.gz.md5  
mol_US020070.tar.gz 276M mol_US020070.tar.gz.md5  
mol_US000006.tar.gz 432M mol_US000006.tar.gz.md5  
mol_US000007.tar.gz 298M mol_US000007.tar.gz.md5  
mol_US00000R.tar.gz 3.1M mol_US00000R.tar.gz.md5  
xmlSEQ_US000006.tar.gz 21M xmlSEQ_US000006.tar.gz.md5  
xmlSEQ_US00000R.tar.gz 192K xmlSEQ_US00000R.tar.gz.md5  
xmlSEQ_US020060.tar.gz 48M xmlSEQ_US020060.tar.gz.md5  
tif_EP000000.tar.gz 3.8G tif_EP000000.tar.gz.md5  
tif_US020040.tar.gz 11G tif_US020040.tar.gz.md5  
tif_US020010.tar.gz 861M tif_US020010.tar.gz.md5  
tif_US020020.tar.gz 6.5G tif_US020020.tar.gz.md5  
tif_US020030.tar.gz 35G tif_US020030.tar.gz.md5  
tif_US020050.tar.gz 9.5G tif_US020050.tar.gz.md5  
tif_US020070.tar.gz 6.7G tif_US020070.tar.gz.md5  
tif_US000007.tar.gz 13G tif_US000007.tar.gz.md5  
tif_US00000R.tar.gz 72M tif_US00000R.tar.gz.md5  
tif_WO001979.tar.gz 4.0M tif_WO001979.tar.gz.md5  
tif_WO001982.tar.gz 15M tif_WO001982.tar.gz.md5  
tif_WO001984.tar.gz 28M tif_WO001984.tar.gz.md5  
tif_WO001987.tar.gz 34M tif_WO001987.tar.gz.md5  
tif_WO001990.tar.gz 79M tif_WO001990.tar.gz.md5  
tif_WO001992.tar.gz 253M tif_WO001992.tar.gz.md5  
tif_WO001995.tar.gz 318M tif_WO001995.tar.gz.md5  
tif_WO001998.tar.gz 619M tif_WO001998.tar.gz.md5  
tif_WO002000.tar.gz 950M tif_WO002000.tar.gz.md5  
tif_WO002003.tar.gz 3.4G tif_WO002003.tar.gz.md5  
tif_WO002006.tar.gz 1.9G tif_WO002006.tar.gz.md5  
tif_WO002008.tar.gz 1.3G tif_WO002008.tar.gz.md5  
nb_US020020.tar.gz 464K nb_US020020.tar.gz.md5  
nb_US020060.tar.gz 552K nb_US020060.tar.gz.md5  
nb_US020080.tar.gz 228K nb_US020080.tar.gz.md5  
nb_US000000.tar.gz 4.0K nb_US000000.tar.gz.md5  
cdx_US020010.tar.gz 39M cdx_US020010.tar.gz.md5  
cdx_US020030.tar.gz 512M cdx_US020030.tar.gz.md5  
cdx_US020070.tar.gz 788M cdx_US020070.tar.gz.md5  
cdx_US000007.tar.gz 875M cdx_US000007.tar.gz.md5  
cdx_US00000R.tar.gz 8.9M cdx_US00000R.tar.gz.md5  
mol_US020020.tar.gz 97M mol_US020020.tar.gz.md5  
mol_US020060.tar.gz 275M mol_US020060.tar.gz.md5  
mol_US020080.tar.gz 118M mol_US020080.tar.gz.md5  
mol_US000000.tar.gz 8.0K mol_US000000.tar.gz.md5  
xmlSEQ_US000000.tar.gz 24K xmlSEQ_US000000.tar.gz.md5  
xmlSEQ_US000007.tar.gz 62M xmlSEQ_US000007.tar.gz.md5  
xmlSEQ_US020070.tar.gz 40M xmlSEQ_US020070.tar.gz.md5  
xml_US000000.tar.gz 5.3M xml_US000000.tar.gz.md5  
xml_US000003.tar.gz 151M xml_US000003.tar.gz.md5  
xml_US000004.tar.gz 49M xml_US000004.tar.gz.md5  
xml_WO001991.tar.gz 69M xml_WO001991.tar.gz.md5  
xml_US000005.tar.gz 52M xml_US000005.tar.gz.md5  
xml_US000006.tar.gz 41M xml_US000006.tar.gz.md5  
xml_US000007.tar.gz 18M xml_US000007.tar.gz.md5  
xml_US00000D.tar.gz 4.0K xml_US00000D.tar.gz.md5  
xml_US020010.tar.gz 50M xml_US020010.tar.gz.md5  
xml_US020020.tar.gz 75M xml_US020020.tar.gz.md5  
xml_US020030.tar.gz 112M xml_US020030.tar.gz.md5  
xml_US020040.tar.gz 55M xml_US020040.tar.gz.md5  
xml_US020050.tar.gz 80M xml_US020050.tar.gz.md5  
xml_US020060.tar.gz 62M xml_US020060.tar.gz.md5  
xml_US020070.tar.gz 1.4M xml_US020070.tar.gz.md5  
xml_US020080.tar.gz 61M xml_US020080.tar.gz.md5  
xml_US00000R.tar.gz 30M xml_US00000R.tar.gz.md5  
xml_WO001978.tar.gz 56K xml_WO001978.tar.gz.md5  
xml_WO001980.tar.gz 4.7M xml_WO001980.tar.gz.md5  
xml_WO001982.tar.gz 6.5M xml_WO001982.tar.gz.md5  
xml_WO001984.tar.gz 8.7M xml_WO001984.tar.gz.md5  
xml_WO001986.tar.gz 17M xml_WO001986.tar.gz.md5  
xml_WO001988.tar.gz 2.4M xml_WO001988.tar.gz.md5  
xml_WO001990.tar.gz 40M xml_WO001990.tar.gz.md5  
xml_WO001992.tar.gz 24M xml_WO001992.tar.gz.md5  
xml_WO001994.tar.gz 92M xml_WO001994.tar.gz.md5  
xml_WO001996.tar.gz 77M xml_WO001996.tar.gz.md5  
xml_WO001998.tar.gz 87M xml_WO001998.tar.gz.md5  
xml_WO002000.tar.gz 81M xml_WO002000.tar.gz.md5  
xml_WO002002.tar.gz 90M xml_WO002002.tar.gz.md5  
xml_WO002004.tar.gz 67M xml_WO002004.tar.gz.md5  
xml_WO002006.tar.gz 43M xml_WO002006.tar.gz.md5  
xml_WO001979.tar.gz 1.6M xml_WO001979.tar.gz.md5  
xml_WO001981.tar.gz 5.2M xml_WO001981.tar.gz.md5  
xml_WO001983.tar.gz 6.9M xml_WO001983.tar.gz.md5  
xml_WO001985.tar.gz 12M xml_WO001985.tar.gz.md5  
xml_WO001987.tar.gz 18M xml_WO001987.tar.gz.md5  
xml_WO001989.tar.gz 35M xml_WO001989.tar.gz.md5  
xml_WO001993.tar.gz 82M xml_WO001993.tar.gz.md5  
xml_WO001995.tar.gz 49M xml_WO001995.tar.gz.md5  
xml_WO001997.tar.gz 87M xml_WO001997.tar.gz.md5  
xml_WO001999.tar.gz 83M xml_WO001999.tar.gz.md5  
xml_WO002001.tar.gz 62M xml_WO002001.tar.gz.md5  
xml_WO002003.tar.gz 71M xml_WO002003.tar.gz.md5  
xml_WO002005.tar.gz 41M xml_WO002005.tar.gz.md5  
xml_WO002007.tar.gz 1.7M xml_WO002007.tar.gz.md5  
xml_WO002008.tar.gz 58M xml_WO002008.tar.gz.md5  
xml_EP000000.tar.gz 25M xml_EP000000.tar.gz.md5  
xml_EP000001.tar.gz 17M xml_EP000001.tar.gz.md5  
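
The patent archive names in the table above appear to follow the pattern <file type>_<office><six-character bucket>.tar.gz (for example tif_US020060.tar.gz or xml_WO001978.tar.gz). The sketch below uses that observed pattern to pick out only the archives of a given file type, for instance the main xml files without the xmlSEQ sequence archives. The pattern is inferred from the listing rather than documented, and the function names are illustrative.

```python
import re

# Pattern inferred from the listing: <kind>_<office><bucket>.tar.gz,
# where office is EP, US or WO and bucket is six digits or capital letters.
ARCHIVE_RE = re.compile(
    r"^(?P<kind>[A-Za-z]+)_(?P<office>EP|US|WO)(?P<bucket>[0-9A-Z]{6})\.tar\.gz$"
)


def parse_archive_name(name):
    """Split a patent archive name into (kind, office, bucket).

    Returns None for names that do not match the inferred pattern,
    such as the scientific-article archives.
    """
    match = ARCHIVE_RE.match(name)
    if match is None:
        return None
    return match.group("kind"), match.group("office"), match.group("bucket")


def select_archives(names, kinds=("xml",), offices=None):
    """Keep the archive names whose file type is in `kinds`.

    The comparison is exact, so 'xml' does not select 'xmlSEQ'.
    If `offices` is given, also restrict to those patent offices.
    """
    selected = []
    for name in names:
        parsed = parse_archive_name(name)
        if parsed is None:
            continue
        kind, office, _bucket = parsed
        if kind in kinds and (offices is None or office in offices):
            selected.append(name)
    return selected
```

Calling select_archives(names, kinds=("xml",), offices=("EP",)) on the file names above would keep only xml_EP000000.tar.gz and xml_EP000001.tar.gz.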


Topics

PATopics.tar.gz (md5) contains the 1000 topics of the Prior Art Task. Here is a list of the files within the archive. All the topics are patent application documents which are already in the collection.

PASmallTopics.tar.gz (md5) contains the 100 topics of the Small Prior Art Task. Here is a list of the files within the archive. Unlike 2009, these are NOT the first 100 of the full set.

TSTopics2010.zip (md5) contains topics TS-18 through TS-47. Topics TS-46 and TS-47 are structure search topics and are introduced here as a preview of what was used in TREC 2011. They were not used in the evaluation of participating systems in 2010.



Documentation

Most of the xml files from the scientific articles follow DTDs from PubMed Central. The archive of these DTDs is available here. Processing these files may require the use of catalogs. This file is a Catalog Manager properties file containing links to the DTDs in the archive.

The scientific articles from the Royal Society of Chemistry follow a different DTD, available here. Here are some guidelines for these articles: RSC Guidelines

The patent documents follow (mostly) the same DTDs as in 2009. They are available here. The Field-by-field_Content_Description.pdf file describes the content of the patent documents.

Unique Identifiers: DOI numbers in the case of scientific articles, UCIDs in the case of patents. Note that this year we use the full UCID for patents, not just the country-document_number pair as in 2009.

PA training qrels

Based on the qrels from 2009, we have also created a set of qrels where, instead of referring to a document by its patent number (e.g. EP-123456), we expand each reference to all the actual documents that are available in the collection (e.g. EP-123456-A1, EP-123456-B1, etc.). This new set of qrels is available here.
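
The expansion described above can be sketched as follows. This is an illustration only, not the script used to produce the released file. It assumes the qrels use the usual four-column TREC layout (topic, iteration, document, relevance) and that a UCID has the form country-number-kind, as in the EP-123456-A1 example; the topic identifier in the usage note is hypothetical.

```python
from collections import defaultdict


def patent_number(ucid):
    """Reduce a full UCID to its 2009-style country-number pair,
    e.g. 'EP-123456-A1' becomes 'EP-123456'."""
    country, number, _kind = ucid.split("-", 2)
    return f"{country}-{number}"


def index_collection(ucids):
    """Group the UCIDs present in the collection by patent number."""
    by_number = defaultdict(list)
    for ucid in ucids:
        by_number[patent_number(ucid)].append(ucid)
    return by_number


def expand_qrels(qrels_lines, ucids):
    """Rewrite patent-number qrels as UCID qrels.

    Each input line is assumed to read 'topic iteration document relevance'.
    A line is repeated once per UCID in the collection that shares the
    patent number; lines whose patent number has no document in the
    collection, and lines without four fields, are dropped.
    """
    by_number = index_collection(ucids)
    expanded = []
    for line in qrels_lines:
        fields = line.split()
        if len(fields) != 4:
            continue
        topic, iteration, document, relevance = fields
        for ucid in sorted(by_number.get(document, [])):
            expanded.append(f"{topic} {iteration} {ucid} {relevance}")
    return expanded
```

With a collection containing EP-123456-A1 and EP-123456-B1, the line "PA-1 0 EP-123456 1" would become two lines, one for each of the two UCIDs.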


Last updated: Thursday, 30-Aug-2018 07:30:29 MDT
Date created: August 30, 2018
trec@nist.gov