---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT v2 — Egyptian Arabic Language Model

MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing the **conversational and dialogue register** — the primary register of customer-facing NLP applications.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer-interaction language.

## What Changed from v1

| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows — tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows — forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |

The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).

## Training Corpus

Two sources were used, targeting conversational Egyptian Arabic:

**faisalq/EFC-mini — Egyptian Forums Corpus**
Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions — closely mirroring customer behavior.
**MBZUAI-Paris/Egyptian-SFT-Mixture — Egyptian Dialogue**
Supervised fine-tuning dialogue data in Egyptian Arabic — instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.

Both sources were deduplicated (MD5 hash), shuffled with seed 42, and filtered to a minimum of 5 words per sample after cleaning. After deduplication:

**1,946,195 rows → 1,868,414 chunks of 64 tokens**

## Text Cleaning Pipeline

Same normalization as v1, applied uniformly:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إأآا → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. هههههه → هه)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (matched to the linear-decay schedule at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |

Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via the Hugging Face Hub.
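For concreteness, the cleaning rules listed above can be sketched as a single regex pass. This is a minimal illustrative sketch — function and pattern names are mine, and the exact training-time implementation is not published with this card:

```python
import re

def clean_masri(text: str) -> str:
    """Sketch of the v1/v2 normalization rules described in this card."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)                # email addresses
    text = re.sub(r"@\w+", " ", text)                   # @mentions
    text = text.replace("#", " ")                       # hashtag symbols
    text = re.sub(r"[إأآا]", "ا", text)                  # alef normalization
    text = text.replace("ى", "ي")                       # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)                    # hamza variants → ء
    text = re.sub(r"[\u064B-\u0652]", "", text)         # strip tashkeel (diacritics)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)          # cap repeated chars at 2
    text = re.sub(r"[A-Za-z]", "", text)                # remove English characters
    return re.sub(r"\s+", " ", text).strip()

def keep_sample(text: str, min_words: int = 5) -> bool:
    """Minimum 5 words per sample, enforced post-cleaning."""
    return len(text.split()) >= min_words
```

Note that emojis deliberately pass through untouched, matching the card's decision to preserve MARBERTv2's native emoji embeddings.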
## Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773 ← best** |

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)

results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r["token_str"], round(r["score"], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```

For downstream classification tasks (emotion, sentiment, sarcasm):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output
# or encoder.last_hidden_state
```

## Known Warnings

**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded — the warning is cosmetic and can be safely ignored.
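One way to realize the "attach your classification head" note under Usage is a small PyTorch module over the encoder's `[CLS]` representation. This is an illustrative sketch, not part of the checkpoint: the hidden size of 768 assumes MARBERTv2's BERT-base geometry, and `num_labels=8` matches the 8-way emotion task this backbone targets.

```python
import torch
import torch.nn as nn

class MasriClassifier(nn.Module):
    """Illustrative classification head; pair it with the encoder's
    last_hidden_state from AutoModel.from_pretrained("T0KII/MASRIBERTv2")."""

    def __init__(self, hidden_size: int = 768, num_labels: int = 8,
                 dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        cls = last_hidden_state[:, 0]  # [CLS] token representation
        return self.classifier(self.dropout(cls))
```

Alternatively, `AutoModelForSequenceClassification.from_pretrained("T0KII/MASRIBERTv2", num_labels=...)` attaches an equivalent randomly initialized head for you.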
## Intended Downstream Tasks

This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:

- **Emotion Classification** — multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection** — Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** — positive / negative / neutral classification for customer interaction data

## Model Lineage

```
UBC-NLP/MARBERTv2
└── T0KII/masribert    (v1 — MLM on MASRISET, 57K steps)
    └── T0KII/MASRIBERTv2    (v2 — MLM on EFC + SFT, 21.5K steps)
```

## Citation

If you use this model, please cite the original MARBERT paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    year = "2021"
}
```