Most of the XML files for the scientific articles follow DTDs from PubMed Central. The archive of these DTDs is available here. Processing these files may require the use of XML catalogs. This file is a Catalog Manager properties file containing links to the DTDs in the archive.
The scientific articles from the Royal Society of Chemistry follow a different DTD, available here. Here are some guidelines for these articles: RSC Guidelines
The patent documents follow (mostly) the same DTDs as in 2009. They are available here. The Field-by-field_Content_Description.pdf file describes the content of the patent documents.
Unique identifiers: DOI numbers in the case of scientific articles, UCIDs in the case of patents. Note that this year we use the full UCID for patents, not just the country-document_number pair as in 2009.
PA training qrels
Based on the qrels from 2009, we have also created a set of qrels where, instead of referring to a document by its patent number (e.g. EP-123456), we expand each entry to all the actual documents that are available in the collection (e.g. EP-123456-A1, EP-123456-B1, etc.). This new set of qrels is available here.
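The expansion described above can be sketched roughly as follows. This is only an illustration, not the script used to build the released file. It assumes that a UCID is the patent number followed by a hyphen and a kind code (as in EP-123456-A1), and that each qrels entry is a (topic, document, relevance) triple. The function names and the topic ID in the example are hypothetical.

```python
from collections import defaultdict


def patent_number(ucid):
    """Drop the trailing kind code (assumed format): 'EP-123456-A1' -> 'EP-123456'."""
    return ucid.rsplit("-", 1)[0]


def expand_qrels(qrels, collection_ucids):
    """Replace each patent-number entry with one entry per matching UCID.

    qrels: iterable of (topic, patent_number, relevance) triples (assumed layout).
    collection_ucids: iterable of full UCIDs present in the collection.
    Entries whose patent number has no document in the collection are dropped.
    """
    # Group the collection's UCIDs by their country-document_number prefix.
    by_number = defaultdict(list)
    for ucid in sorted(collection_ucids):
        by_number[patent_number(ucid)].append(ucid)

    expanded = []
    for topic, number, relevance in qrels:
        for ucid in by_number.get(number, []):
            expanded.append((topic, ucid, relevance))
    return expanded


# Hypothetical example data, using the IDs mentioned in the text.
example_qrels = [("topic-1", "EP-123456", 1)]
example_collection = ["EP-123456-B1", "EP-123456-A1", "EP-654321-A1"]
example_expanded = expand_qrels(example_qrels, example_collection)
```

Grouping the collection once by patent number keeps the expansion to a single pass over the qrels, which matters when the collection holds many documents per patent.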