Arabic Treebank: Part 2 v 2.0
|Item Name:||Arabic Treebank: Part 2 v 2.0|
|Author(s):||Mohamed Maamouri, Ann Bies, Tim Buckwalter, Hubert Jin|
|LDC Catalog No.:||LDC2004T02|
|Release Date:||January 30, 2004|
|Application(s):||automatic content extraction, cross-lingual information retrieval, information detection, natural language processing|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2004T02 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Maamouri, Mohamed, et al. Arabic Treebank: Part 2 v 2.0 LDC2004T02. Web Download. Philadelphia: Linguistic Data Consortium, 2004.|
Arabic Treebank: Part 2 v 2.0 was produced by the Linguistic Data Consortium (LDC) and contains approximately 126,000 Arabic word tokens with part-of-speech (POS) annotation that includes complete vocalization with case endings, lemma IDs, and more specific POS tags for verbs and particles.
The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general. LDC was sponsored to develop an Arabic POS and Treebank of 1 million words. This corpus is part two of that project.
Treebanks are language resources that provide annotations of natural languages at various levels of structure: the word level, the phrase level, and the sentence level. Treebanks have become crucially important for both data-driven approaches to NLP and general linguistic research. This corpus is designed for those who study and use languages either professionally or academically, and who need text corpora in their work.
The Penn Arabic Treebank started in November 2001 as part of the DARPA TIDES project, with the objective of annotating a large machine-readable Arabic text corpus both manually and automatically. It is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP.
The following table gives a breakdown of the data contained in the entire Arabic Treebank project, showing the discrepancies between versions for Parts 1, 2, and 3. The fields are source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been taken out. The totals given at the bottom are calculated from the latest versions where discrepancies exist, and do not include tokens after clitic separation since that number is missing from Part 4.
|Part||Source||Stories||Total Tokens||Tokens After Clitic Separation||Arabic Word Tokens|
|1 (V 2.0)||Agence France Presse||734||140,265||168,123||N/A|
|1 (V 3.0 and 4.1)||Agence France Presse||734||145,386||166,068||123,795|
|2 (V 2.0)||Ummah Press||501||144,199||168,297||125,698|
|2 (V 3.1)||Ummah Press||501||144,199||169,319||125,709|
|3 (V 1.0 and 2.0)||An Nahar News Agency||600||340,281||400,213||293,035|
|3 (V 3.2)||An Nahar News Agency||599||339,710||402,291||292,554|
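As a quick arithmetic check on the rows listed above, the following minimal sketch sums the latest-version figures for Parts 1 to 3 only. The dictionary layout and variable names are illustrative assumptions, and because no Part 4 figures appear in the table, these sums should not be read as the project totals.

```python
# Latest-version rows copied from the table above (Parts 1-3 only).
# Tuple layout (an assumption for this sketch):
#   (stories, total tokens, Arabic word tokens)
latest = {
    "1 (V 3.0 and 4.1)": (734, 145_386, 123_795),
    "2 (V 3.1)": (501, 144_199, 125_709),
    "3 (V 3.2)": (599, 339_710, 292_554),
}

# Sum each column across the three parts.
stories = sum(row[0] for row in latest.values())
total_tokens = sum(row[1] for row in latest.values())
arabic_tokens = sum(row[2] for row in latest.values())

print(stories, total_tokens, arabic_tokens)
```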
This corpus uses 501 Ummah Arabic News Text stories. Tim Buckwalter's lexicon and morphological analyzer were used to generate a candidate list of POS tags for each word. (Please note that some words do not exist in this lexicon.) The POS annotation task was then to select the correct POS tag from that list. Of the 125,698 Arabic-only word tokens (prior to the separation of clitics), 124,740 (99.24%) were provided with an acceptable morphological analysis and POS tag by the morphological parser, and 958 (0.76%) were items that the morphological parser failed to analyze correctly.
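The coverage percentages quoted above can be reproduced from the raw counts. This is only a sketch of the arithmetic; the variable names are illustrative and do not come from the corpus documentation.

```python
# Counts quoted in the paragraph above.
arabic_word_tokens = 125_698  # Arabic-only word tokens before clitic separation
analyzed = 124_740            # tokens given an acceptable analysis and POS tag
failed = 958                  # tokens the morphological parser failed to analyze correctly

# The two groups should account for every token.
assert analyzed + failed == arabic_word_tokens

# Percentages rounded to two decimal places, as in the text.
coverage_pct = round(100 * analyzed / arabic_word_tokens, 2)
failure_pct = round(100 * failed / arabic_word_tokens, 2)

print(coverage_pct, failure_pct)
```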
This corpus has one subsequent version: Arabic Treebank: Part 2 v 3.1 (LDC2011T09).