| | --- |
| | license: apache-2.0 |
| | language: |
| | - ar |
| | - arz |
| | - ary |
| | base_model: |
| | - UBC-NLP/MARBERTv2 |
| | pipeline_tag: text-classification |
| | library_name: transformers |
| | datasets: |
| | - iabufarha/ar_sarcasm |
| | - Abdelrahman-Rezk/Arabic_Dialect_Identification |
| | - asas-ai/DART |
| | - arbml/ArSarcasm_v2 |
| | - evageon/IADD |
| | --- |
| | |
| | # ✍🏻 MARBERTv2 Arabic Written Dialect Classifier |
| |
|
| | ## Model Overview |
| |
|
| | This model is a fine-tuned version of [`UBC-NLP/MARBERTv2`](https://huggingface.co/UBC-NLP/MARBERTv2) for **Arabic written dialect classification**. It classifies raw text as Modern Standard Arabic (MSA) or one of four regional Arabic dialects.
| |
|
| | This model is intended for use in tasks such as dialect identification, linguistic research, and dialect-aware natural language processing systems. |
| |
|
| | --- |
| |
|
| | ## 📌 Model Details |
| |
|
| | This model is fine-tuned from **MARBERTv2**, a transformer-based language model optimized for Arabic, on a multi-dialect classification task. It distinguishes among five written Arabic varieties, four regional dialect groups plus MSA:
| |
|
| | - **MAGHREB** (North African dialects) |
| | - **LEV** (Levantine dialects) |
| | - **MSA** (Modern Standard Arabic) |
| | - **GLF** (Gulf dialects) |
| | - **EGY** (Egyptian Arabic) |
| |
|
| | It is intended for dialect identification in short Arabic text snippets from various sources including social media, forums, and informal writing. |
| |
|
| | --- |
| |
|
| | ## 📊 Labels (`id2label`) |
| |
|
| | The model predicts one of the following five classes. MAGHREB covers Northwest Africa (Morocco, Algeria, Tunisia, etc.), LEV covers Lebanon, Syria, Jordan, and Palestine, and GLF covers the Gulf (Saudi Arabia, UAE, Kuwait, etc.):
| |
|
| | ```json
| | {
| |   "0": "MAGHREB",
| |   "1": "LEV",
| |   "2": "MSA",
| |   "3": "GLF",
| |   "4": "EGY"
| | }
| | ```
| |
|
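As a rough illustration of how the five output scores relate to these labels, the sketch below turns a vector of logits into a ranked list of dialects with probabilities. The logit values are hypothetical placeholders, and `softmax` and `rank_dialects` are illustrative helper names, not part of the model's API.

```python
import math

# Label mapping copied from the id2label listing in this card.
ID2LABEL = {0: "MAGHREB", 1: "LEV", 2: "MSA", 3: "GLF", 4: "EGY"}


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_dialects(logits):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax(logits)
    pairs = [(ID2LABEL[i], p) for i, p in enumerate(probs)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical logits for one sentence; real values come from the model.
example_logits = [0.2, 1.1, -0.5, 0.3, 4.0]
ranking = rank_dialects(example_logits)
for label, prob in ranking:
    print(f"{label}: {prob:.3f}")
```

Looking at the full ranking rather than only the top label can help flag inputs where two dialects receive similar probability.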
| | --- |
| |
|
| | ## 📚 Training Data |
| |
|
| | The model was trained on **850,000+** Arabic sentences from **9 different publicly available datasets**, covering a wide variety of written Arabic dialects.
| |
|
| | ### Distribution by Dialect: |
| |
|
| | | Dialect | Count | |
| | |-----------|----------| |
| | | GLF | 253,553 | |
| | | LEV | 243,025 | |
| | | MAGHREB | 140,887 | |
| | | EGY | 105,226 | |
| | | MSA | 83,231 | |
| |
|
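The counts in the table sum to 825,922 sentences, and the largest class (GLF) is roughly three times the size of the smallest (MSA). The sketch below computes each class's share and a set of inverse-frequency class weights. This card does not say whether any re-weighting or re-sampling was applied during training, so the weighting scheme here is an assumption shown only as one common way to handle such imbalance.

```python
# Per-class sentence counts copied from the distribution table.
COUNTS = {
    "GLF": 253_553,
    "LEV": 243_025,
    "MAGHREB": 140_887,
    "EGY": 105_226,
    "MSA": 83_231,
}


def class_shares(counts):
    """Fraction of the corpus that each class represents."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def inverse_frequency_weights(counts):
    """Weight w_c = total / (num_classes * n_c); rarer classes get larger weights."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {label: total / (num_classes * n) for label, n in counts.items()}


total_sentences = sum(COUNTS.values())
shares = class_shares(COUNTS)
weights = inverse_frequency_weights(COUNTS)

print(f"Total sentences: {total_sentences:,}")
for label in COUNTS:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

Weights of this form could be passed to a weighted cross-entropy loss, for example through the `weight` argument of `torch.nn.CrossEntropyLoss`.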
| | --- |
| |
|
| | ## ⚙️ Training Details |
| |
|
| | - **Architecture:** MARBERTv2 (BERT-based) |
| | - **Task:** Text Classification (Dialect Identification) |
| | - **Objective:** Multi-class classification with softmax over 5 dialect classes |
| | - **Tokenizer:** `UBC-NLP/MARBERTv2` |
| |
|
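The objective above is a softmax over five classes. The card does not name the loss function, but cross-entropy is the usual pairing for this setup, so the following minimal sketch assumes it. It computes the loss for a single example in plain Python; `cross_entropy` is an illustrative helper and the logits are made-up values.

```python
import math


def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    peak = max(logits)
    # log-sum-exp with the max subtracted to avoid overflow
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]


# Index 2 corresponds to MSA in this card's label mapping.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0, 0.0], target=2)
confident_loss = cross_entropy([0.0, 0.0, 8.0, 0.0, 0.0], target=2)

print(f"Loss with uniform scores:   {uniform_loss:.4f}")
print(f"Loss with confident scores: {confident_loss:.4f}")
```

With uniform scores the loss equals ln(5), the value expected from an untrained five-way classifier, and it approaches zero as the score of the correct class dominates.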
| | --- |
| |
|
| | ## 📂 Datasets Used |
| |
|
| | Below is a detailed overview of the datasets used in training and/or considered during development: |
| |
|
| | | **Dataset** | **Brief Description** | **Annotation strategy** | **Provided Labels** | **Current SOTA Performance** | |
| | | :------------------------: | :-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | :---------------------: | :-----------------------: | :--------------------------: | |
| | | MADAR Subtask-1 (MADAR-6) | A Collection of `parallel sentences (BTEC)` covering the dialects of `5 cities from the Arab World and MSA` in the travel domain `(10,000 sentences per city)` | Manual | 5 Arab Cities + MSA | 92.5% Accuracy | |
| | | MADAR Subtask-1 (MADAR-26) | A Collection of `parallel sentences (BTEC)` covering the dialects of `25 cities from the Arab World and MSA` in the travel domain `(2,000 sentences per city)` | Manual | 25 Arab Cities + MSA | 67.32% F1-Score | |
| | | DART | `25K tweets` annotated via crowdsourcing and well-balanced over five main groups of Arabic dialects | Manual | 5 Arab Regions | UNK |
| | | ArSarcasm v1 | `10,547 tweets` from `ASTD and SemEval datasets` for sarcasm detection, with dialect information added | Manual | 4 Arab Regions + MSA | UNK |
| | | ArSarcasm v2 | `15,548 tweets` extending the original ArSarcasm dataset `(consists of ArSarcasm v1 along with portions of the DAICT corpus and some new tweets)` | Manual | 4 Arab Regions + MSA | UNK |
| | | IADD | `Five publicly available corpora` were identified, analyzed, and filtered to build IADD `(AOC, DART, PADIC, SHAMI and TSAC)` | UNK | 5 Regions and 9 Countries | UNK |
| | | QADI | `540k tweets` (30k per country on average) with a total of 8.8M words | Automatic | 18 Arab Countries | 60.6% | |
| | | AOC | The Arabic Online Commentary dataset is based on reader commentary from the online versions of three Arabic newspapers: `AlGhad from JOR, Al-Riyadh from KSA, and Al-Youm Al-Sabe’ from EGY` | Manual | 3 Arab Regions + MSA | UNK |
| | | NADI-2020 | `25,957 Tweets` from 100 Arab provinces and 21 Arab countries | Automatic | 100 Prov. and 21 Coun. | 6.39% - 26.78% | |
| |
|
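The datasets above label text at different granularities (cities, countries, or regions), so combining them requires mapping every source label onto the five classes. This card does not describe how that mapping was done. The sketch below is a hypothetical illustration that uses only the country groupings named in the label descriptions earlier in this card; countries not named there are left unmapped rather than guessed. `REGION_BY_COUNTRY`, `to_region`, and `harmonize` are illustrative names.

```python
# Hypothetical country-to-region mapping, limited to countries named in this card.
REGION_BY_COUNTRY = {
    "morocco": "MAGHREB",
    "algeria": "MAGHREB",
    "tunisia": "MAGHREB",
    "lebanon": "LEV",
    "syria": "LEV",
    "jordan": "LEV",
    "palestine": "LEV",
    "saudi arabia": "GLF",
    "uae": "GLF",
    "kuwait": "GLF",
    "egypt": "EGY",
}


def to_region(country):
    """Map a country name to a region label, or None if it is not listed."""
    return REGION_BY_COUNTRY.get(country.strip().lower())


def harmonize(rows):
    """Keep (text, region) pairs whose country maps to a known region."""
    result = []
    for text, country in rows:
        region = to_region(country)
        if region is not None:
            result.append((text, region))
    return result


# Placeholder rows standing in for (sentence, country label) pairs.
sample = [("text 1", "Syria"), ("text 2", "Iraq"), ("text 3", " Egypt ")]
print(harmonize(sample))
```

Dropping unmapped rows is only one possible policy; assigning them to the nearest region is another, and the choice affects the class balance.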
| | --- |
| |
|
| | ## 💡 Usage |
| |
|
| | ```python |
| | from transformers import AutoTokenizer, AutoModelForSequenceClassification |
| | import torch |
| | |
| | model_name = "IbrahimAmin/marbertv2-arabic-written-dialect-classifier" |
| | tokenizer = AutoTokenizer.from_pretrained(model_name) |
| | model = AutoModelForSequenceClassification.from_pretrained(model_name) |
| | |
| | text = "الدنيا مش مستاهلة تجري كده، خد وقتك واستمتع بالحاجة البسيطة" |
| | inputs = tokenizer(text, return_tensors="pt") |
| | |
| | # Run inference without tracking gradients
| | with torch.inference_mode():
| |     logits = model(**inputs).logits
| | |
| | pred = torch.argmax(logits, dim=-1).item() |
| | |
| | print(f"Predicted Dialect: {model.config.id2label[pred]}") |
| | ``` |
| |
|
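For more than one sentence, a batched variant of the snippet above may be more convenient. The following is a minimal sketch, not a tuned implementation: `chunked` and `classify_batch` are illustrative helpers, and the `batch_size` and `max_length` defaults are assumptions rather than values taken from this card.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_batch(
    texts,
    model_name="IbrahimAmin/marbertv2-arabic-written-dialect-classifier",
    batch_size=32,   # assumed default, adjust to available memory
    max_length=128,  # assumed default, longer inputs are truncated
):
    """Return one (label, probability) pair per input text."""
    # Imported lazily so `chunked` can be used without these packages installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    results = []
    for batch in chunked(texts, batch_size):
        # Pad to the longest sentence in the batch and truncate long inputs.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        )
        with torch.inference_mode():
            probs = torch.softmax(model(**inputs).logits, dim=-1)
        scores, indices = probs.max(dim=-1)
        for score, index in zip(scores.tolist(), indices.tolist()):
            results.append((model.config.id2label[index], score))
    return results
```

Calling `classify_batch(["...", "..."])` with a list of Arabic sentences downloads the model on first use and returns the top label with its probability for each sentence.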
| | --- |
| |
|
| | ## ✨ Acknowledgements |
| |
|
| | - MARBERTv2 team at UBC-NLP |
| | - Contributors of the Arabic dialect datasets used in training |
| |
|
| | --- |
| |
|
| | ## 📝 Citation |
| |
|
| | If you use this model in your research or application, please cite: |
| |
|
| | ```bibtex |
| | @misc{ibrahimamin_marbertv2_arabic_written_dialect_classifier, |
| | author = {Ibrahim Amin}, |
| | title = {MARBERTv2 Arabic Written Dialect Classifier}, |
| | year = {2025}, |
| | publisher = {Hugging Face}, |
| | howpublished = {\url{https://huggingface.co/IbrahimAmin/marbertv2-arabic-written-dialect-classifier}}, |
| | } |
| | ``` |