---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT – Egyptian Arabic Language Model

MasriBERT is a domain-adapted BERT model for **Egyptian Arabic (Masri/Ammiya)**, produced by continued MLM pre-training of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on **MASRISET**, a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks, including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on conversational and call-center language.

---

## Model Details

| Attribute | Value |
|---|---|
| **Base Model** | `UBC-NLP/MARBERTv2` |
| **Architecture** | BERT (12 layers, 768 hidden size, 12 attention heads) |
| **Task** | Masked Language Modeling (MLM) |
| **Language** | Egyptian Arabic (`ar-EG`) |
| **Training Corpus** | MASRISET (1.3M+ rows) |
| **Best Eval Loss** | 4.523 (step 57,500 checkpoint) |
| **Final Eval Loss** | 4.532 (published epoch-3 weights) |
| **Final Perplexity** | 92.98 |
| **Training Epochs** | 3 |
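
Perplexity for a masked-LM is just the exponential of the cross-entropy eval loss, so the reported figures (4.523 best, 4.532 final, perplexity 92.98) can be sanity-checked directly:

```python
import math

# Masked-LM perplexity = exp(cross-entropy loss).
best_loss = 4.523   # best checkpoint (step 57,500)
final_loss = 4.532  # published epoch-3 weights

print(round(math.exp(best_loss), 2))   # ~92.11
print(round(math.exp(final_loss), 2))  # ~92.94
```

The published model's perplexity of ~92.94 matches the reported 92.98 up to rounding of the logged loss.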
---

## Training Data – MASRISET

MASRISET was assembled and cleaned specifically for this project. It combines the following sources:

**HuggingFace Datasets**
- `hard` – Egyptian Arabic sentiment/review data
- `ar_res_reviews` – Arabic restaurant reviews
- `arbml/TEAD` – Arabic tweet corpus

**Kaggle – [Two Million Rows Egyptian Datasets](https://www.kaggle.com/datasets/mostafanofal/two-million-rows-egyptian-datasets)**
- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments` – Al-Youm Al-Sabaa news comments
- `Egyptian Tweets` – Egyptian Twitter corpus
- `TaghreedT` – Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets` – Telecom Egypt customer interactions

All sources were deduplicated, cleaned, and filtered to a minimum of 5 tokens per sample.
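
That dedup-and-length filter can be sketched as follows (a minimal illustration; the function name and the exact-match dedup rule are assumptions, not the project's code):

```python
def filter_corpus(samples, min_tokens=5):
    """Drop exact duplicates and samples shorter than `min_tokens` whitespace tokens."""
    seen, kept = set(), []
    for text in samples:
        norm = " ".join(text.split())  # collapse whitespace before comparing
        if len(norm.split()) < min_tokens or norm in seen:
            continue
        seen.add(norm)
        kept.append(norm)
    return kept
```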
### Text Cleaning Pipeline

The following normalization was applied uniformly across all sources:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- **Alef normalization**: `إأآا → ا`
- **Alef maqsura**: `ى → ي`
- **Hamza variants**: `ؤ, ئ → ء`
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. `مششششي → مششي`)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning
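
In code, the normalization steps above might look like this (a sketch; the regexes and the function name are my assumptions, not the project's exact implementation):

```python
import re

TASHKEEL = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670]")  # Arabic diacritics

def clean_masri(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)     # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)              # email addresses
    text = re.sub(r"@\w+", " ", text)                      # @mentions
    text = text.replace("#", " ")                          # hashtag symbol only
    text = re.sub("[\u0625\u0623\u0622]", "\u0627", text)  # alef: إ أ آ -> ا
    text = text.replace("\u0649", "\u064A")                # alef maqsura: ى -> ي
    text = re.sub("[\u0624\u0626]", "\u0621", text)        # hamza variants: ؤ ئ -> ء
    text = TASHKEEL.sub("", text)                          # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)             # cap repeats at 2
    text = re.sub(r"[A-Za-z]", " ", text)                  # drop English letters
    return " ".join(text.split())                          # emojis pass through untouched
```

The 5-word minimum is then enforced on the cleaned text, e.g. keeping a sample only if `len(clean_masri(t).split()) >= 5`.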
---

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with 6% warmup |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |

Training was conducted on Google Colab (NVIDIA A100) for 3 full epochs (57,915 steps), with resumable checkpointing to Google Drive.
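
With the hyperparameters above, the run could be reproduced with a `Trainer` setup along these lines. This is a configuration sketch, not the exact training script: `train_ds`/`eval_ds` stand in for the tokenized MASRISET splits (sequences truncated/grouped to the 64-token block size), and the `evaluation_strategy` argument is named `eval_strategy` in newer `transformers` releases.

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# 20% whole-word masking, as in the table above
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.20)

args = TrainingArguments(
    output_dir="masribert-mlm",
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,               # linear decay with 6% warmup
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch = 32
    weight_decay=0.01,
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=2500,
    save_steps=2500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    data_collator=collator,
    train_dataset=train_ds,  # hypothetical pre-tokenized datasets
    eval_dataset=eval_ds,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```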
### Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| **57,500** | **4.523 (best)** |
| Final (57,915) | 4.532 |

---

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)

# Egyptian sarcasm example: "Thanks a lot, seriously, the shipment arrived [MASK] totally, as usual."
results = unmasker("تسلم بجد، الشحنة وصلت [MASK] خالص كالعادة.")
for r in results:
    print(r["token_str"], round(r["score"], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```

For downstream classification tasks (emotion, sentiment, sarcasm), load the encoder with `AutoModel` and attach your classification head:

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/masribert")
```
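
Such a head can be as simple as masked mean pooling over the encoder's last hidden states followed by a linear layer. The sketch below is a hypothetical example: `MasriClassifierHead`, the pooling choice, and `num_labels=3` are my assumptions, not part of this repository.

```python
import torch
import torch.nn as nn

class MasriClassifierHead(nn.Module):
    """Masked mean pooling over token states, then a linear classifier."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        mask = attention_mask.unsqueeze(-1).float()        # (batch, seq, 1)
        summed = (last_hidden_state * mask).sum(dim=1)     # ignore padding positions
        pooled = summed / mask.sum(dim=1).clamp(min=1e-9)  # mean over real tokens
        return self.classifier(self.dropout(pooled))

# With the encoder loaded as above (batch coming from the tokenizer):
#   out = encoder(**batch)
#   logits = MasriClassifierHead()(out.last_hidden_state, batch["attention_mask"])
```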
---

## Intended Downstream Tasks

This model was trained as a backbone for the following tasks in the **Kalamna** Egyptian Arabic AI pipeline:

- **Emotion Classification** – multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection** – Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** – positive / negative / neutral classification for customer interaction data

---

## Important Notes

**LayerNorm naming warning**: when loading this model you will see warnings about missing/unexpected keys (`LayerNorm.weight`/`LayerNorm.bias` vs `LayerNorm.gamma`/`LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer `transformers` versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.

**Best checkpoint**: the best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For downstream tasks, use the published weights as-is or fine-tune from them directly.

---

## Citation

If you use this model, please cite the original MARBERTv2 paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year = "2021"
}
```

---

## License

Apache 2.0, inherited from the base model. See the [MARBERTv2 model page](https://huggingface.co/UBC-NLP/MARBERTv2) for details.