---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT v2 – Egyptian Arabic Language Model

MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing the **conversational and dialogue register** – the register most relevant to customer-facing NLP applications.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks, including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer-interaction language.

## What Changed from v1

| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |

The 5.8x perplexity improvement (92.98 → 16.00) reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).

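Since reported perplexity is just the exponential of the cross-entropy eval loss, the v2 figure can be checked directly from the table (v1's 92.98 presumably comes from a less-rounded loss than the 4.523 shown):

```python
import math

# Perplexity for a masked-LM eval is exp(cross-entropy loss).
v2_perplexity = math.exp(2.773)   # ~16.0, matching the table
v1_perplexity = math.exp(4.523)   # ~92.1 (table reports 92.98)

print(round(v2_perplexity, 2), round(v1_perplexity, 2),
      round(v1_perplexity / v2_perplexity, 1))
```
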
## Training Corpus

Two sources were used, both targeting conversational Egyptian Arabic:

**faisalq/EFC-mini – Egyptian Forums Corpus**
Forum posts and comments from Egyptian Arabic internet forums: long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions – closely mirroring customer behavior.

**MBZUAI-Paris/Egyptian-SFT-Mixture – Egyptian Dialogue**
Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.

Both sources were deduplicated by MD5 hash, shuffled with seed 42, and filtered to a minimum of 5 words per sample after cleaning.

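A minimal sketch of that preprocessing step (the function name and exact order of operations are assumptions; the v2 preprocessing script is not published):

```python
import hashlib
import random

def dedup_shuffle_filter(texts, seed=42, min_words=5):
    # Exact-duplicate removal via MD5 of the raw text.
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    # Deterministic shuffle with the documented seed.
    random.Random(seed).shuffle(unique)
    # Enforce the minimum sample length post-cleaning.
    return [t for t in unique if len(t.split()) >= min_words]
```
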
After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**

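The row-to-chunk conversion is standard MLM packing: tokenize every row, concatenate the ids, and slice into fixed 64-token blocks. A sketch under that assumption (the real pipeline uses the MARBERTv2 tokenizer, elided here):

```python
def pack_into_blocks(tokenized_rows, block_size=64):
    # Concatenate token ids across rows, then slice into fixed-length
    # blocks, dropping the short remainder at the end.
    flat = [tok for row in tokenized_rows for tok in row]
    n = len(flat) // block_size
    return [flat[i * block_size:(i + 1) * block_size] for i in range(n)]
```
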
## Text Cleaning Pipeline

Same normalization as v1, applied uniformly:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إ, أ, آ → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. وووووو → وو)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
|
|
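The rules above can be sketched as a small regex pipeline (rule order and exact character classes are assumptions, not the released script):

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u065F\u0670]")  # Arabic diacritic range

def clean_text(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"@\w+", " ", text)                   # @mentions
    text = text.replace("#", " ")                       # hashtag symbol only
    text = re.sub(r"[إأآ]", "ا", text)                  # alef normalization
    text = text.replace("ى", "ي")                       # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)                   # hamza variants
    text = TASHKEEL.sub("", text)                       # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)                # drop English letters
    return re.sub(r"\s+", " ", text).strip()            # emojis pass through untouched
```
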
## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (linear-decay value at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |

Training was conducted on Kaggle (NVIDIA T4 / P100) over 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via the HuggingFace Hub.

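The resume learning rate can be reproduced from the schedule: ~1.85M train blocks at an effective batch of 128 over 2 epochs gives ~28,902 optimizer steps, and linear decay from 2e-5 reaches 6.16e-6 at step 20,000. (The total-step count is inferred from the tables, not stated explicitly.)

```python
train_blocks = 1_849_729
effective_batch = 64 * 2                              # per-device batch x grad accumulation
total_steps = (train_blocks // effective_batch) * 2   # 2 epochs -> 28,902 steps

peak_lr = 2e-5
resume_step = 20_000
# Linear decay with no warmup: lr(t) = peak_lr * (1 - t / total_steps)
resume_lr = peak_lr * (1 - resume_step / total_steps)
print(f"{resume_lr:.2e}")   # 6.16e-06
```
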
## Eval Loss Curve

| Step | Eval loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773 (best)** |

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)

results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```

For downstream classification tasks (emotion, sentiment, sarcasm):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output
# or encoder.last_hidden_state.
```

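For illustration, such a head might look like the following (a hypothetical class, assuming MARBERTv2's BERT-base hidden size of 768; not part of this release):

```python
import torch
from torch import nn

class ClassifierHead(nn.Module):
    """Toy classification head over the encoder's [CLS] vector."""
    def __init__(self, hidden_size=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state):
        cls = last_hidden_state[:, 0]   # [CLS] position
        return self.out(self.dropout(cls))
```

Feeding `encoder(**batch).last_hidden_state` through this head yields per-class logits suitable for a standard cross-entropy loss.
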
## Known Warnings

**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.

## Intended Downstream Tasks

This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:

- **Emotion Classification** – multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection** – Egyptian Arabic sarcasm, including culturally specific patterns (religious-phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** – positive / negative / neutral classification of customer-interaction data

## Model Lineage

```
UBC-NLP/MARBERTv2
└── T0KII/masribert (v1 – MLM on MASRISET, 57K steps)
    └── T0KII/MASRIBERTv2 (v2 – MLM on EFC + SFT, 21.5K steps)
```

## Citation

If you use this model, please cite the original ARBERT & MARBERT paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year = "2021"
}
```