American National Corpus (ANC) Second Release
|Item Name:||American National Corpus (ANC) Second Release|
|Author(s):||Randi Reppen, Nancy Ide, Keith Suderman|
|LDC Catalog No.:||LDC2005T35|
|Release Date:||December 15, 2005|
|Data Source(s):||journal articles, news magazine, newswire, telephone speech, varied, web collection|
|Project(s):||American National Corpus (ANC), Talkbank|
|Application(s):||natural language processing|
American National Corpus 2nd Release - Open
American National Corpus 2nd Release - Restricted
|Online Documentation:||LDC2005T35 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Reppen, Randi, Nancy Ide, and Keith Suderman. American National Corpus (ANC) Second Release LDC2005T35. Web Download. Philadelphia: Linguistic Data Consortium, 2005.|
American National Corpus (ANC) Second Release was developed by various contributors and contains approximately 22 million words of American English text from multiple genres, with annotations such as part-of-speech (POS) tagging.
The American National Corpus (ANC) project fosters the development of a corpus of American English comparable to the British National Corpus (BNC). Corpus-analytic work has demonstrated that the BNC is inappropriate for the study of American English because of the numerous differences in how the language is used. The ANC is being developed with help from a consortium, formed in 1999, of American English dictionary publishers and companies interested in language processing. Consortium members are providing materials for inclusion in the corpus and provided initial financial support for the project.
The availability of a corpus of American English will significantly contribute to language and linguistic research, the development of language understanding computer applications (e.g., language translation and search and retrieval software), and the compilation of reference works such as dictionaries and thesauri. It will also provide a rich national resource for use in education at all levels.
In addition to the more than 10 million words added in the Second Release, this corpus contains a new, corrected and validated version of the 11-million-word ANC First Release and software for searching and retrieving multiple stand-off annotations.
ANC Second Release contains texts from the following sources (* denotes new source in the Second Release):
- Transcribed telephone speech
- The New York Times
- Berlitz Travel Guides
- Slate Magazine
- ICIC Corpus of Fundraising Texts *
- The Michigan Corpus of Academic Spoken English (MICASE) *
- Various non-fiction
- Various fiction *
- Various medical research articles *
- Anonymized posts to the Phoenix Board/Buffistas.org *
The corpus includes the data as a UTF-16 encoded file plus annotations of the documents such as automatic POS tagging with two different tagsets, automatic noun and verb phrase identification, and structural information at the paragraph and sentence level. The goal of the ANC is to ultimately contain a core corpus of at least 100 million words, including both written and spoken data (transcripts) comparable across genres to the BNC.
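As a rough illustration of how UTF-16 text and stand-off annotations fit together, the following minimal Python sketch reads a UTF-16 file and pairs each annotation with the span of text it points to. The `(start, end, label)` tuple layout, the function names, and the sample sentence are assumptions made for this example only; they are not the ANC's actual annotation format or the search software shipped with the release. The sketch also assumes that offsets count characters rather than bytes.

```python
import os
import tempfile


def read_utf16_text(path):
    """Read a UTF-16 encoded text file and return its contents as a str."""
    with open(path, encoding="utf-16") as handle:
        return handle.read()


def resolve_standoff(text, annotations):
    """Pair each (start, end, label) annotation with the text span it covers.

    The tuple layout is hypothetical: start and end are character offsets
    into the primary text, and label is an annotation value such as a POS tag.
    """
    return [(text[start:end], label) for start, end, label in annotations]


def demo():
    """Write a small UTF-16 file, read it back, and resolve sample annotations."""
    sample = "The corpus grows."
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        with open(path, "w", encoding="utf-16") as handle:
            handle.write(sample)
        text = read_utf16_text(path)
    finally:
        os.remove(path)

    # Illustrative annotations stored apart from the text they describe.
    annotations = [(0, 3, "DT"), (4, 10, "NN"), (11, 16, "VBZ"), (16, 17, ".")]
    return resolve_standoff(text, annotations)


if __name__ == "__main__":
    for span, label in demo():
        print(span, label)
```

Because the annotations live apart from the text, several annotation layers (for example, two POS tagsets) can refer to the same unmodified primary file.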
ANC Second Release contains data governed under two types of licenses, an open license and a restricted license. Both the Open License Agreement and the Restricted License Agreement need to be signed in order to receive ANC Second Release, and the data must be used in accordance with the agreement by which it is governed.
Additional documentation and information are available at the ANC web site.
None at this time.
The publication of this corpus was facilitated by funding extended by the TalkBank project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-98009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.