---
license: mit
language:
- ar
tags:
- dependency-parsing
- arabic
- dialects
---
# CamelParser-Dialects
**CamelParser-Dialects** is a neural dependency parsing model for **dialectal Arabic** and Modern Standard Arabic (MSA) that produces trees in the **CATiB dependency formalism**.
It is based on the **biaffine attention parser** architecture introduced by Dozat and Manning (2017), implemented using [SuPar](https://github.com/yzhangcs/parser).
The model uses **CamelBERT-MIX**, a pretrained language model trained on a mix of MSA, dialectal, and Classical Arabic text, as its encoder.
Full details are available in our paper:
**"Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights"**
---
## 📊 Model Variants and LAS (Labeled Attachment Score) on TEST
||Checkpoint|Training Data|MSA|EGY|GLF|AVG|
|:-----:|:-----|:--------|:--------|:--------|:--------|:--------|
||`CAMeL-Lab/camelparser-dialects-MSA`|CamelTB, PATB|87.3|73.0|73.3|77.9|
||`CAMeL-Lab/camelparser-dialects-EGY`|ARZTB|79.2|83.9|68.7|77.3|
||`CAMeL-Lab/camelparser-dialects-GLF`|CamelTB-Gumar|65.4|58.7|73.8|66.0|
||`CAMeL-Lab/camelparser-dialects-MSA-EGY`|CamelTB, PATB, ARZTB|87.1|84.4|70.1|80.5|
||`CAMeL-Lab/camelparser-dialects-MSA-GLF`|CamelTB, PATB, CamelTB-Gumar|87.2|74.4|81.0|80.9|
||`CAMeL-Lab/camelparser-dialects-EGY-GLF`|ARZTB, CamelTB-Gumar|80.0|83.8|79.4|81.1|
|☑️|`CAMeL-Lab/camelparser-dialects-MSA-EGY-GLF`|CamelTB, PATB, ARZTB, CamelTB-Gumar|87.2|84.2|80.3|83.9|
The recommended checkpoint is the **all-variety model (`MSA-EGY-GLF`)**, which achieves the highest average LAS (83.9) across the MSA, EGY, and GLF test sets.
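
For readers unfamiliar with the metric, here is a minimal sketch of how LAS is conventionally computed: a token counts as correct only if both its predicted head and its predicted label match the gold tree. The function names and the toy sentence are illustrative assumptions, not the evaluation script used in the paper, which may differ in details such as punctuation handling. The `macro_average` helper reflects the assumption that AVG is the unweighted mean of the three per-variety scores.

```python
from typing import List, Tuple

# One token's analysis: (head index, dependency label); head 0 denotes the root.
Arc = Tuple[int, str]


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for two aligned token sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    # UAS: the head alone must match. LAS: head and label must both match.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    full_hits = sum(g == p for g, p in zip(gold, pred))
    n = len(gold)
    return 100.0 * head_hits / n, 100.0 * full_hits / n


def macro_average(scores: List[float]) -> float:
    """Unweighted mean of per-variety scores (assumed definition of AVG)."""
    return sum(scores) / len(scores)


# Toy four-token sentence with CATiB-style labels (illustrative, not real data).
gold = [(2, "SBJ"), (0, "---"), (2, "OBJ"), (3, "MOD")]
pred = [(2, "SBJ"), (0, "---"), (2, "MOD"), (2, "MOD")]

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")  # UAS = 75.0, LAS = 50.0
print(f"AVG = {macro_average([87.2, 84.2, 80.3]):.1f}")  # AVG = 83.9
```

In the toy example, the third token has the right head but the wrong label, so it counts toward UAS but not LAS; the fourth token has the wrong head and counts toward neither.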
---
## 🧠 Model Architecture
- **Encoder**: CamelBERT-MIX
- **Parser**: Deep biaffine attention (Dozat & Manning, 2017)
- **Framework**: [SuPar](https://github.com/yzhangcs/parser)
- **Formalism**: CATiB dependency scheme
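
To make the parser component concrete, the following is a minimal pure-Python sketch of biaffine arc scoring in the style of Dozat and Manning (2017). It is not SuPar's implementation: the vectors, the weights `U` and `u`, and the function names are toy assumptions. In the actual architecture, the dependent and head representations come from separate MLPs over the encoder output, a second biaffine classifier scores the labels, and decoding typically enforces a well-formed tree rather than taking a per-token argmax.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def matvec(m: Matrix, v: Vector) -> Vector:
    return [dot(row, v) for row in m]


def biaffine_arc_scores(dep: List[Vector], head: List[Vector],
                        U: Matrix, u: Vector) -> Matrix:
    """scores[i][j] = head[j]^T U dep[i] + head[j]^T u.

    Row i holds the scores of every candidate head j for dependent i.
    The bilinear term models head-dependent affinity; the linear term
    acts as a prior on how likely position j is to be a head at all.
    """
    scores = []
    for d in dep:
        Ud = matvec(U, d)
        scores.append([dot(h, Ud) + dot(h, u) for h in head])
    return scores


def greedy_heads(scores: Matrix) -> List[int]:
    """Pick the best-scoring head per dependent (no tree constraint)."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


# Toy setup: candidate heads are [ROOT, token 1, token 2]; dependents are
# token 1 and token 2. All numbers are made up for illustration.
head = [[1.0, 0.0], [0.2, 0.2], [0.0, 1.0]]
dep = [[0.0, 1.0], [1.0, 0.0]]
U = [[1.0, 0.0], [0.0, 2.0]]
u = [0.5, 0.0]

scores = biaffine_arc_scores(dep, head, U, u)
print(greedy_heads(scores))  # [2, 0]: token 1 attaches to token 2, token 2 to ROOT
```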
---
## 📚 Training Data
The models are trained on combinations of the following treebanks:
- **CamelTB** (MSA): [camel_treebank_1.1.zip](https://sites.google.com/nyu.edu/camel-treebank/resources)
- **PATB** (Penn Arabic Treebank): [LDC2010T13](https://catalog.ldc.upenn.edu/LDC2010T13), [LDC2011T09](https://catalog.ldc.upenn.edu/LDC2011T09), [LDC2010T08](https://catalog.ldc.upenn.edu/LDC2010T08)
- **ARZTB** (Egyptian Arabic Treebank): [LDC2018T23](https://catalog.ldc.upenn.edu/LDC2018T23)
- **CamelTB-Gumar** (Gulf Arabic): [`CamelTB-Gumar.1.0.zip`](https://forms.gle/54WSUt7Z9m9vk6p69)
---
## 🚀 Intended Use
This model is intended for:
- Dependency parsing of Arabic text
- Linguistic analysis of dialectal Arabic
---
## 🔧 Usage
For usage instructions and code, please refer to the official repository:
👉 https://github.com/CAMeL-Lab/camel_parser_dialects
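
The official repository remains the reference for actual parsing commands. As a complement, here is a small sketch for choosing which checkpoint ID to use when the input variety is known in advance. The `LAS_TEST` dictionary copies the per-variety scores from the results table, and `best_checkpoint` is a hypothetical helper, not part of the released code.

```python
from typing import Dict

# LAS on TEST, copied from the results table: checkpoint suffix -> {variety: LAS}.
LAS_TEST: Dict[str, Dict[str, float]] = {
    "MSA":         {"MSA": 87.3, "EGY": 73.0, "GLF": 73.3},
    "EGY":         {"MSA": 79.2, "EGY": 83.9, "GLF": 68.7},
    "GLF":         {"MSA": 65.4, "EGY": 58.7, "GLF": 73.8},
    "MSA-EGY":     {"MSA": 87.1, "EGY": 84.4, "GLF": 70.1},
    "MSA-GLF":     {"MSA": 87.2, "EGY": 74.4, "GLF": 81.0},
    "EGY-GLF":     {"MSA": 80.0, "EGY": 83.8, "GLF": 79.4},
    "MSA-EGY-GLF": {"MSA": 87.2, "EGY": 84.2, "GLF": 80.3},
}

HUB_PREFIX = "CAMeL-Lab/camelparser-dialects-"


def best_checkpoint(variety: str) -> str:
    """Return the Hub ID with the highest reported test LAS for one variety."""
    if variety not in ("MSA", "EGY", "GLF"):
        raise ValueError(f"unknown variety: {variety!r}")
    suffix = max(LAS_TEST, key=lambda name: LAS_TEST[name][variety])
    return HUB_PREFIX + suffix


for variety in ("MSA", "EGY", "GLF"):
    print(variety, "->", best_checkpoint(variety))
```

The per-variety winners differ from the all-variety model by at most 0.7 LAS points, so when the input variety is unknown or mixed, the recommended `MSA-EGY-GLF` checkpoint is the safer default.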
## 📖 Citation
If you use this model, please cite:
```bibtex
@inproceedings{Elshabrawy:2026:camelparser-dialects,
  title     = {{Parsing Arabic Dialects Revisited: New Benchmarks, Models, and Insights}},
  author    = {Ahmed Elshabrawy and
               Go Inoue and
               Muhammed AbuOdeh and
               Nizar Habash},
  booktitle = {Proceedings of The 7th Workshop on Open-Source Arabic Corpora and Processing Tools (OSACT)},
  year      = {2026},
  address   = {Palma, Spain}
}
```