Data - Non-English Documents


Below are titles of non-English collections that have been used for TREC purposes. In many cases, these collections have been created for a specific task, and are not as broadly available or supported as the English collections.

Topics and relevance assessments are also available.

The Arabic collection consists of articles selected from the Agence France Presse (AFP) Arabic newswire. Collection LDC2001T55 (Arabic Newswire Part 1) must be obtained from the Linguistic Data Consortium.

The Chinese collection consists of articles selected from the People's Daily newspaper and the Xinhua newswire. Collection LDC2000T52 (TREC Mandarin) must be obtained from the Linguistic Data Consortium. Do not use LDC95T13 (Mandarin Chinese News Text); it is a different version.

The Spanish collection consists of a Mexican newspaper from Monterrey (El Norte) and additional text from the 1994 Agence France Presse newswire. Collection LDC2000T51 (Spanish News Text) must be obtained from the Linguistic Data Consortium.

The set of documents used in the TREC 6-8 Cross-Language track, consisting of documents from Schweizerische Depeschenagentur (English, German, French, and Italian), is no longer available.
Last updated: Monday, 15-Apr-2019 10:19:08 EDT
Date created: Tuesday, 01-Aug-00
trec@nist.gov