---
license: apache-2.0
language:
- ar
- arz
- ary
base_model:
- UBC-NLP/MARBERTv2
pipeline_tag: text-classification
library_name: transformers
---
# Model Card: marbertv2-arabic-written-dialect-classifier

- **Model Name:** marbertv2-arabic-written-dialect-classifier
- **Base Model:** UBC-NLP/MARBERTv2
- **Languages:** Arabic (dialectal + MSA)
- **Author:** IbrahimAmin
## Model Description
This model is fine-tuned from MARBERTv2, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five major written Arabic dialect regions:
- MAGHREB (North African dialects)
- LEV (Levantine dialects)
- MSA (Modern Standard Arabic)
- GLF (Gulf dialects)
- EGY (Egyptian Arabic)
It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing.
## Labels
The model predicts one of the following five classes:

```json
"id2label": {
  "0": "MAGHREB",
  "1": "LEV",
  "2": "MSA",
  "3": "GLF",
  "4": "EGY"
}
```
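For quick lookups in either direction, the same mapping can be mirrored as plain Python dicts (integer keys, matching the `id2label` entry in the model's `config.json`):

```python
# Mirror of the model's id2label config entry, with integer keys.
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

# Inverse mapping, useful when preparing labeled training data.
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}

print(ID2LABEL[4])      # EGY
print(LABEL2ID["MSA"])  # 2
```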
## Training Data
The model was trained on approximately 850,000 Arabic sentences from 9 publicly available datasets, covering a wide variety of written Arabic dialects.

**Distribution by dialect:**

| Dialect | Count |
|---|---|
| GLF | 253,553 |
| LEV | 243,025 |
| MAGHREB | 140,887 |
| EGY | 105,226 |
| MSA | 83,231 |
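The distribution above is imbalanced: GLF has roughly three times as many sentences as MSA. One common way to compensate is inverse-frequency class weighting (e.g. for a weighted cross-entropy loss). The sketch below derives such weights from the counts in the table; the weighting scheme itself is illustrative, not necessarily what was used to train this model:

```python
# Sentence counts per dialect, taken from the distribution table above.
counts = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}

total = sum(counts.values())
num_classes = len(counts)

# Inverse-frequency weights, scaled so a perfectly balanced
# dataset would yield a weight of 1.0 for every class.
weights = {d: total / (num_classes * n) for d, n in counts.items()}

for dialect, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{dialect:8s} {w:.3f}")
```

Rarer classes (MSA, EGY) receive weights above 1.0, so their examples contribute more to the loss than the over-represented GLF and LEV classes.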
## Training Details

- **Architecture:** MARBERTv2 (BERT-based)
- **Task:** Text Classification (Dialect Identification)
- **Objective:** Multi-class classification with softmax over 5 dialect classes
- **Tokenizer:** UBC-NLP/MARBERTv2
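As a rough illustration of this setup, the fine-tuning could be reproduced with the Hugging Face `Trainer` API along the lines below. The hyperparameters are assumptions for illustration only (this card does not report them), and the download/training calls are left commented so the sketch stays self-contained:

```python
# Label maps matching the Labels section of this card.
id2label = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}
label2id = {v: k for k, v in id2label.items()}

# from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
#                           TrainingArguments, Trainer)
#
# tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
# model = AutoModelForSequenceClassification.from_pretrained(
#     "UBC-NLP/MARBERTv2",
#     num_labels=5,        # softmax over 5 dialect classes
#     id2label=id2label,
#     label2id=label2id,
# )
#
# args = TrainingArguments(
#     output_dir="marbertv2-dialect",
#     learning_rate=2e-5,                  # assumed, not reported
#     per_device_train_batch_size=32,      # assumed, not reported
#     num_train_epochs=3,                  # assumed, not reported
# )
#
# train_ds / eval_ds would be tokenized splits of the 9 source datasets:
# Trainer(model=model, args=args,
#         train_dataset=train_ds, eval_dataset=eval_ds).train()

print(label2id)
```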
## Datasets Used
Below is a detailed overview of the datasets used in training and/or considered during development:
| Dataset | Brief Description | Annotation Strategy | Provided Labels | Current SOTA Performance | Notes |
|---|---|---|---|---|---|
| MADAR Subtask-1 (MADAR-6) | A collection of parallel sentences (BTEC) covering the dialects of 5 cities from the Arab world and MSA in the travel domain (10,000 sentences per city) | Manual | 5 Arab cities + MSA | 92.5% Accuracy | ADDED |
| MADAR Subtask-1 (MADAR-26) | A collection of parallel sentences (BTEC) covering the dialects of 25 cities from the Arab world and MSA in the travel domain (2,000 sentences per city) | Manual | 25 Arab cities + MSA | 67.32% F1-Score | ADDED |
| DART | 25K tweets annotated via crowdsourcing, well-balanced over five main groups of Arabic dialects | Manual | 5 Arab regions | UNK | ADDED |
| ArSarcasm v1 | 10,547 tweets from the ASTD and SemEval datasets for sarcasm detection, with dialect information added in | Manual | 4 Arab regions + MSA | UNK | ADDED |
| ArSarcasm v2 | 15,548 tweets extending the original ArSarcasm dataset (ArSarcasm v1 plus portions of the DAICT corpus and some new tweets) | Manual | 4 Arab regions + MSA | UNK | ADDED |
| IADD | Built by identifying, analyzing, and filtering five publicly available corpora (AOC, DART, PADIC, SHAMI and TSAC) | ________ | 5 regions and 9 countries | UNK | ADDED |
| QADI | 540K tweets (30K per country on average) with a total of 8.8M words | Automatic | 18 Arab countries | 60.6% | ADDED |
| AOC | The Arabic Online Commentary dataset, based on reader commentary from the online versions of three Arabic newspapers: Al-Ghad (Jordan), Al-Riyadh (Saudi Arabia), and Al-Youm Al-Sabe' (Egypt) | Manual | 3 Arab regions + MSA | UNK | ADDED |
| NADI-2020 | 25,957 tweets from 100 Arab provinces and 21 Arab countries | Automatic | 100 provinces and 21 countries | 6.39% (province), 26.78% (country) | ADDED |
## Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example sentence in Egyptian Arabic.
text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة"

inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
pred = torch.argmax(outputs.logits, dim=1).item()
print(f"Predicted Dialect: {model.config.id2label[pred]}")
```
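To turn the raw logits into per-dialect probabilities rather than a single argmax, apply a softmax over the 5 classes. The pure-Python sketch below mirrors what `torch.softmax(outputs.logits, dim=1)` computes; the logit values are made up for illustration:

```python
import math

ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}

def softmax(logits):
    """Numerically stable softmax over one row of class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Hypothetical logits for one input sentence (one value per class).
logits = [-1.2, 0.3, -0.5, 0.1, 4.0]
probs = softmax(logits)

pred = max(range(len(probs)), key=probs.__getitem__)
print(ID2LABEL[pred], round(probs[pred], 3))
```

Reporting the full probability vector is often more useful than the top label alone, since short snippets can be genuinely ambiguous between neighboring dialect regions.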
## Acknowledgements
- MARBERTv2 team at UBC-NLP
- Contributors of the Arabic dialect datasets used in training