HARD 2004 Topics and Annotations
|Item Name:||HARD 2004 Topics and Annotations|
|Author(s):||Stephanie Strassel, Meghan Glenn|
|LDC Catalog No.:||LDC2005T29|
|Release Date:||December 20, 2005|
|Application(s):||automatic content extraction, information detection, information extraction, information retrieval, topic detection and tracking|
LDC User Agreement for Non-Members
|Online Documentation:||LDC2005T29 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Strassel, Stephanie, and Meghan Glenn. HARD 2004 Topics and Annotations LDC2005T29. Web Download. Philadelphia: Linguistic Data Consortium, 2005.|
The HARD 2004 Text Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 225 million tokens of English text.
This corpus contains source data for the 2004 TREC HARD (High Accuracy Retrieval from Documents) Evaluation. HARD 2004 was a track within the NIST Text REtrieval Conference (TREC). Its objective was to achieve high-accuracy retrieval from documents by leveraging additional information about the searcher and/or the search context, through techniques such as passage retrieval and targeted interaction with the searcher.
The current corpus was previously distributed to HARD Participants as LDC2004E30. The topics and annotations that correspond to this release are distributed as HARD 2004 Topics and Annotations (LDC2005T29). This corpus was created with support from the DARPA TIDES Program and LDC.
The corpus comprises eight English newswire and web text sources published from January through December 2003. The sources and their data volumes appear in the table below:
|Source||Code||Stories||Total Tokens||Avg. Tokens/Story|
|Agence France Presse - English||AFE||226,515||71,829,978||317|
|Associated Press Newswire||APE||237,067||93,294,584||393|
|Central News Agency Taiwan - English||CNE||3,674||797,194||217|
|Los Angeles Times/Washington Post||LAT||18,287||12,576,721||687|
|New York Times||NYT||28,190||16,673,028||591|
|Ummah Press - English||UME||2,607||782,064||299|
|Xinhua News Agency - English||XIE||117,854||24,016,670||203|
Files are organized by source and by day. Each file contains multiple documents, each identified by a unique document ID of the form "SRCyyyymmdd.nnnn", where "SRC" is the source code, "yyyymmdd" is the date, and "nnnn" is a sequential number starting from "0001" for each source/day. In addition, each document has some or all of the following components:
- Keyword (optional), surrounded by tags
- Date/time (optional), surrounded by tags
- Headline, surrounded by tags
- Main part, surrounded by tags. Tags are used within this part to identify paragraph boundaries.
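The document ID format described above can be split into its parts with a short script. The following is a minimal sketch, not part of the corpus tooling: the names `DocId` and `parse_doc_id` are hypothetical, the example ID is made up for illustration, and the pattern assumes a three-letter uppercase source code (as in the table above) and exactly four sequence digits.

```python
import re
from datetime import date
from typing import NamedTuple

# Assumed layout: three uppercase letters (source code), eight digits
# (yyyymmdd), a dot, then a four-digit sequence number.
DOC_ID_RE = re.compile(r"([A-Z]{3})(\d{4})(\d{2})(\d{2})\.(\d{4})")


class DocId(NamedTuple):
    """Hypothetical container for the parts of a document ID."""
    source: str
    day: date
    seq: int


def parse_doc_id(doc_id: str) -> DocId:
    """Split an ID such as 'APE20030115.0001' into source, date, and sequence.

    Raises ValueError if the string does not match the assumed layout
    or if the date portion is not a valid calendar date.
    """
    m = DOC_ID_RE.fullmatch(doc_id)
    if m is None:
        raise ValueError(f"not a recognized document ID: {doc_id!r}")
    src, year, month, day, seq = m.groups()
    # date() itself raises ValueError for impossible dates such as month 13.
    return DocId(src, date(int(year), int(month), int(day)), int(seq))


if __name__ == "__main__":
    # Made-up ID used only to exercise the parser.
    print(parse_doc_id("APE20030115.0001"))
```

Parsing the date into a `date` object rather than keeping it as a string makes it straightforward to group or filter documents by day, which mirrors how the files themselves are organized.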
For more information please visit the HARD Project website.
For an example of the data in this corpus, please view this sample (TXT).