English Gigaword Third Edition
|Item Name:||English Gigaword Third Edition|
|Author(s):||David Graff, Junbo Kong, Ke Chen, Kazuaki Maeda|
|LDC Catalog No.:||LDC2007T07|
|Release Date:||May 17, 2007|
|Application(s):||natural language processing, language modeling, information retrieval|
LDC User Agreement for Non-Members
|Online Documentation:||LDC2007T07 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Graff, David, et al. English Gigaword Third Edition LDC2007T07. Web Download. Philadelphia: Linguistic Data Consortium, 2007.|
The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the third edition of the English Gigaword Corpus.
This edition includes all of the contents of the previous edition (LDC2005T12), as well as new data from the same five sources presented there, covering the 24-month period from January 2005 through December 2006. In addition, a sixth data source (the Los Angeles Times/Washington Post newswire service) has been added in this edition.
The six distinct international sources of English newswire included in this edition are the following:
|Agence France-Presse, English Service||(afp_eng)|
|Associated Press Worldstream, English Service||(apw_eng)|
|Central News Agency of Taiwan, English Service||(cna_eng)|
|Los Angeles Times/Washington Post Newswire Service||(ltw_eng)|
|New York Times Newswire Service||(nyt_eng)|
|Xinhua News Agency, English Service||(xin_eng)|
The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code ("eng"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the new ISO 639-3 standard.
The seven-character codes are used both as the names of the directories where the data files are found and as the prefix at the beginning of every data file name.
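The naming convention described above can be sketched in Python as follows. This is only an illustration: the `SOURCES` table is copied from the list above, but the helper name `split_code` and the sample file name `nyt_eng_200501.gz` are hypothetical, since this page does not specify what follows the prefix in a data file name.

```python
import re

# Source abbreviations and names as listed on this page.
SOURCES = {
    "afp": "Agence France-Presse, English Service",
    "apw": "Associated Press Worldstream, English Service",
    "cna": "Central News Agency of Taiwan, English Service",
    "ltw": "Los Angeles Times/Washington Post Newswire Service",
    "nyt": "New York Times Newswire Service",
    "xin": "Xinhua News Agency, English Service",
}

# Three-character source code, underscore, three-character language code.
CODE_RE = re.compile(r"^([a-z]{3})_([a-z]{3})(?![a-z])")


def split_code(name):
    """Return (source, language) from a directory or file name prefix."""
    match = CODE_RE.match(name.lower())
    if match is None:
        raise ValueError(f"no source/language prefix in {name!r}")
    return match.group(1), match.group(2)


# Hypothetical file name; only the prefix convention comes from this page.
source, language = split_code("nyt_eng_200501.gz")
print(source, language, SOURCES[source])
```

The same function works on a bare directory name such as `afp_eng`, because only the leading seven characters are examined.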
As with other Gigaword releases, some of the content in this corpus has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora, the various TDT corpora, and the AQUAINT text corpus, as well as earlier editions of Gigaword English.
New in the Third Edition
- New newswire data from January 2005 through December 2006 has been added for all five newswire sources that were represented in the previous edition.
- A new source, the Los Angeles Times/Washington Post newswire service, has been added.
- A small handful of corrections to older APW data have been made to remove a few non-English stories, clean up some character "noise", and rectify the encoding for a few non-ASCII characters.
- The CNA content introduced in Gigaword English 2nd Edition has been completely updated to repair data corruptions caused by occasional character encoding problems; as a result of the update, there may be differences in the inventory and/or ID strings of DOC elements in this portion of the corpus, relative to the previous edition. (The nature of encoding problems is explained below under "SOURCE SPECIFIC PROPERTIES".)
- Many of the files (141 out of 722) include a small number of UTF-8 "wide" characters, typically accented letters found in proper names and borrowed words (some sources also use special punctuation marks, non-breaking spaces, etc.).
Apart from the replacement/update of all CNA files, the data content of the 2nd edition has been included in the present release without modification.
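For readers who want to check whether a given file contains the UTF-8 "wide" characters mentioned in the list above, here is a minimal sketch. It assumes the text has already been read and decoded as UTF-8; the function name `wide_char_counts` and the sample string are illustrative and are not taken from the corpus.

```python
from collections import Counter


def wide_char_counts(text):
    """Count each non-ASCII character (code point above 127) in a string."""
    return Counter(ch for ch in text if ord(ch) > 127)


# Illustrative sample: an accented letter and a non-breaking space (U+00A0).
sample = "Jos\u00e9 Mar\u00eda spoke in S\u00e3o\u00a0Paulo."
for char, count in sorted(wide_char_counts(sample).items()):
    print(f"U+{ord(char):04X} {count}")
```

Printing code points rather than the characters themselves makes otherwise invisible characters, such as non-breaking spaces, easy to spot.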
For an example of the data in this corpus, please review this text file.
The New York Times newswire text archive in this corpus contains some articles in Spanish. A scan of the 149 monthly data files under "nyt_eng" yielded 2517 DOC elements with the 'type="story"' attribute where the story content was in Spanish.
The scan also disclosed 421 DOC elements with the 'type="story"' attribute where the text content was in fact not a news story.
Two additional files in the online documentation for this corpus identify those occurrences.
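One way to act on those lists is to skip the flagged DOC elements when selecting stories. The sketch below rests on several assumptions not confirmed by this page: that each DOC start tag carries its identifier in an attribute named `id` alongside `type`, that attribute values are double-quoted, and that the flagged IDs have already been loaded into a Python set (the format of the two documentation files is not described here). The function names and sample IDs are hypothetical.

```python
import re

# Assumed shape of a DOC start tag: <DOC id="..." type="...">
DOC_TAG_RE = re.compile(r"<DOC\b([^>]*)>")
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def doc_attributes(text):
    """Yield a dict of attributes for each DOC start tag found in text."""
    for match in DOC_TAG_RE.finditer(text):
        yield dict(ATTR_RE.findall(match.group(1)))


def story_ids(text, excluded_ids=frozenset()):
    """Return IDs of type="story" DOC elements that are not excluded."""
    return [
        attrs["id"]
        for attrs in doc_attributes(text)
        if attrs.get("type") == "story"
        and "id" in attrs
        and attrs["id"] not in excluded_ids
    ]


# Made-up sample markup and IDs, for illustration only.
sample = (
    '<DOC id="X_0001" type="story">...</DOC>\n'
    '<DOC id="X_0002" type="other">...</DOC>\n'
    '<DOC id="X_0003" type="story">...</DOC>\n'
)
print(story_ids(sample, excluded_ids={"X_0003"}))
```

A regular expression is used here instead of an XML parser because this page does not state whether each data file forms a single well-formed XML document.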
This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.