---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT — Egyptian Arabic Language Model

MasriBERT is a domain-adapted BERT model for **Egyptian Arabic (Masri/Ammiya)**, produced by continued MLM pre-training of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on **MASRISET** — a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis — with a specific focus on conversational and call-center language.

---

## Model Details

| Attribute | Value |
|---|---|
| **Base Model** | `UBC-NLP/MARBERTv2` |
| **Architecture** | BERT (12 layers, 768 hidden, 12 heads) |
| **Task** | Masked Language Modeling (MLM) |
| **Language** | Egyptian Arabic (`ar-EG`) |
| **Training Corpus** | MASRISET — 1.3M+ rows |
| **Final Eval Loss** | 4.523 (best checkpoint) |
| **Final Perplexity** | 92.98 |
| **Training Epochs** | 3 |

---

## Training Data — MASRISET

MASRISET was assembled and cleaned specifically for this project. It combines the following sources:

**HuggingFace Datasets**
- `hard` — Egyptian Arabic sentiment/review data
- `ar_res_reviews` — Arabic restaurant reviews
- `arbml/TEAD` — Arabic tweet corpus

**Kaggle — [Two Million Rows Egyptian Datasets](https://www.kaggle.com/datasets/mostafanofal/two-million-rows-egyptian-datasets)**
- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments` — Al-Youm Al-Sabaa news comments
- `Egyptian Tweets` — Egyptian Twitter corpus
- `TaghreedT` — Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets` — Telecom Egypt customer interactions

All sources were deduplicated, cleaned, and filtered to a minimum of 5 tokens per sample.

### Text Cleaning Pipeline

The following normalization was applied uniformly across all sources:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- **Alef normalization**: `إأآا → ا`
- **Alef maqsura**: `ى → ي`
- **Hamza variants**: `ؤ, ئ → ء`
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. `مششششي → مشي`)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning

---

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |

Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
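For reference, here is a minimal sketch of how this continued pre-training stage could be set up with the Hugging Face `Trainer`, mirroring the hyperparameters in the table above. The corpus file names (`masriset_train.txt`, `masriset_val.txt`) are placeholders (MASRISET is not bundled with this repository), and the exact script used to train MasriBERT may differ.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForWholeWordMask,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Placeholder corpus files; MASRISET itself is not distributed with this card.
dataset = load_dataset(
    "text",
    data_files={"train": "masriset_train.txt", "validation": "masriset_val.txt"},
)

def tokenize(batch):
    # Block size of 64 tokens, as listed in the table above.
    return tokenizer(batch["text"], truncation=True, max_length=64)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Whole Word Masking at 20% probability.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.20)

args = TrainingArguments(
    output_dir="masribert-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size = 32
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="steps",           # `evaluation_strategy` in older transformers releases
    eval_steps=2500,
    save_steps=2500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    # Early stopping patience of 5 evaluations, as in the table above.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```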
### Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| **57,500** | **4.523 ← best** |
| Final (57,915) | 4.532 |

---

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)

# Egyptian sarcasm example
results = unmasker("تسلم ايدكم بجد، الشحنة وصلت [MASK] خالص كالعادة.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```

For downstream classification tasks (emotion, sentiment, sarcasm), load with `AutoModel` and attach your classification head:

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/masribert")
```

---

## Intended Downstream Tasks

This model was trained as a backbone for the following tasks in the **Kalamna** Egyptian Arabic AI pipeline:

- **Emotion Classification** — Multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection** — Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** — Positive / Negative / Neutral classification for customer interaction data

---

## Important Notes

**LayerNorm naming warning**: When loading this model you may see warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded — the warning is cosmetic and can be safely ignored.

**Best checkpoint**: The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.

---

## Citation

If you use this model, please cite the original MARBERT paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year = "2021"
}
```

---

## License

Apache 2.0 — inherited from the base model. See the [MARBERTv2 model card](https://huggingface.co/UBC-NLP/MARBERTv2) for details.