+---
+language:
+- ar
+license: unknown
+base_model:
+- T0KII/masribert
+- UBC-NLP/MARBERTv2
+tags:
+- arabic
+- egyptian-arabic
+- masked-language-modeling
+- bert
+- dialect
+- nlp
+pipeline_tag: fill-mask
+---
+# MasriBERT v2 — Egyptian Arabic Language Model
+MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing **conversational and dialogue register** — the primary register of customer-facing NLP applications.
+It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.
+## What Changed from v1
+| | MasriBERT v1 | MasriBERT v2 |
+|---|---|---|
+| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
+| Training corpus | MASRISET (1.3M rows — tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows — forums, dialogue) |
+| Data register | Social media / news | Conversational / instructional dialogue |
+| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
+| Final eval loss | 4.523 | **2.773** |
+| Final perplexity | 92.98 | **16.00** |
+| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |
+The roughly 5.8x drop in perplexity (92.98 to 16.00) reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
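The perplexity figures in the table are consistent with perplexity being the exponential of the evaluation loss. The snippet below is a minimal sketch of that arithmetic, assuming the loss is a mean cross-entropy in nats (the card does not state this explicitly). The helper name `perplexity` is illustrative only.

```python
import math

def perplexity(eval_loss: float) -> float:
    """Perplexity as exp(mean cross-entropy loss), assuming the loss is in nats."""
    return math.exp(eval_loss)

# Values taken from the comparison table above.
v2_ppl = perplexity(2.773)   # v2 final eval loss
v1_ppl = perplexity(4.523)   # v1 final eval loss
ratio = 92.98 / 16.00        # reported v1 perplexity / reported v2 perplexity

print(f"v2 perplexity from loss: {v2_ppl:.2f}")
print(f"v1 perplexity from loss: {v1_ppl:.2f}")
print(f"reported v1/v2 ratio:    {ratio:.2f}")
```

The v2 value comes out at about 16.0, matching the table. The v1 value comes out at about 92.1, slightly below the reported 92.98, which suggests the reported number was computed from an unrounded loss or a different evaluation pass.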
+## Training Corpus
+Two sources were used, targeting conversational Egyptian Arabic:
+**faisalq/EFC-mini — Egyptian Forums Corpus**
+Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions — closely mirroring customer behavior.
+**MBZUAI-Paris/Egyptian-SFT-Mixture — Egyptian Dialogue**
+Supervised fine-tuning dialogue data in Egyptian Arabic — instruction/response pairs curated specifically for Egyptian dialect LLM training. Chat formatting was stripped to raw text before training.
+Both sources were deduplicated (MD5 hash), shuffled with seed 42, and filtered to samples of at least 5 words after cleaning.
+After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
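The deduplicate, shuffle, filter, and chunk steps described above could be sketched roughly as follows. This is not the original training script: the function names are hypothetical, the MD5 digest is assumed to be taken over the UTF-8 text of each row, and chunking is assumed to follow the common concatenate-and-split pattern that drops the final short block.

```python
import hashlib
import random

def dedup_shuffle_filter(rows, seed=42, min_words=5):
    """Drop exact duplicates by MD5 digest, shuffle reproducibly, keep rows with >= min_words words."""
    seen = set()
    unique = []
    for text in rows:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    # A dedicated Random instance keeps the shuffle independent of global state.
    random.Random(seed).shuffle(unique)
    return [t for t in unique if len(t.split()) >= min_words]

def chunk_token_ids(token_ids, block_size=64):
    """Split one long list of token ids into full blocks, dropping the short tail."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]

# Toy data: one duplicate row and one row that is too short.
rows = ["a b c d e", "a b c d e", "too short", "one two three four five six"]
kept = dedup_shuffle_filter(rows)
blocks = chunk_token_ids(list(range(150)))
print(len(kept), len(blocks))
```

In a real run the token ids would come from the model tokenizer applied to the cleaned rows; the integer range here only stands in for them.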
+## Text Cleaning Pipeline
+Same normalization as v1, applied uniformly:
+- Removed URLs, email addresses, @mentions, and hashtag symbols
+- Alef normalization: إأآا → ا
+- Alef maqsura: ى → ي
+- Hamza variants: ؤ, ئ → ء
+- Removed all Arabic tashkeel (diacritics)
+- Capped repeated characters at 2 (e.g. هههههه → هه)
+- Removed English characters
+- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
+- Minimum 5 words per sample enforced post-cleaning
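The normalization rules listed above can be approximated with a few regular expressions. The sketch below is an assumption about how such a pipeline might look, not the code used for training: the regex patterns, the order of the steps, the helper names `clean_text` and `keep_sample`, and the choice to cap repeats for letters only (so that numbers such as 1000 are left alone) are all illustrative.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
TASHKEEL_RE = re.compile(r"[\u064B-\u0652]")   # fathatan through sukun
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")    # a letter repeated 3 or more times
LATIN_RE = re.compile(r"[A-Za-z]+")

# Character-level normalization, written with escapes to avoid look-alike glyphs.
CHAR_MAP = str.maketrans({
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0622": "\u0627",  # alef with madda -> bare alef
    "\u0649": "\u064A",  # alef maqsura -> yeh
    "\u0624": "\u0621",  # waw with hamza -> hamza
    "\u0626": "\u0621",  # yeh with hamza -> hamza
})

def clean_text(text: str) -> str:
    """Apply the listed cleaning steps and collapse whitespace. Emojis are left untouched."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)      # before mentions, since emails contain '@'
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")       # drop the hashtag symbol, keep the word
    text = text.translate(CHAR_MAP)
    text = TASHKEEL_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    text = LATIN_RE.sub(" ", text)
    return " ".join(text.split())

def keep_sample(text: str, min_words: int = 5) -> bool:
    """Minimum-length filter, applied after cleaning."""
    return len(text.split()) >= min_words

print(clean_text("\u0647" * 6))
```

Running the filter after `clean_text` matters because removing URLs, mentions, and English characters can push a sample below the 5-word threshold.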
+## Training Configuration
+| Hyperparameter | Value |
+|---|---|
+| Block size | 64 tokens |
+| MLM probability | 0.20 (20%) |
+| Masking strategy | Token-level (whole word masking disabled — tokenizer incompatibility) |
+| Peak learning rate | 2e-5 |
+| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
+| LR schedule | Linear decay, no warmup on resume |
+| Batch size | 64 per device |
+| Gradient accumulation | 2 steps (effective batch = 128) |
+| Weight decay | 0.01 |
+| Precision | FP16 |
+| Eval / Save interval | Every 500 steps |
+| Early stopping patience | 3 evaluations |
+| Train blocks | 1,849,729 |
+| Eval blocks | 18,685 |
+Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via HuggingFace Hub.
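The resume learning rate in the table can be reproduced from the other hyperparameters. The sketch below assumes a single device, a schedule planned for the full 2 epochs, and a linear decay from the peak rate to zero with no warmup; none of these is stated outright in the card, but together they reproduce the listed value.

```python
import math

# Values from the Training Configuration table.
train_blocks = 1_849_729
eval_blocks = 18_685
per_device_batch = 64
grad_accum = 2
epochs = 2
peak_lr = 2e-5
resume_step = 20_000

# Optimizer steps per epoch: batches per epoch, grouped by gradient accumulation.
batches_per_epoch = math.ceil(train_blocks / per_device_batch)
steps_per_epoch = batches_per_epoch // grad_accum
total_steps = steps_per_epoch * epochs

# Linear decay to zero without warmup: lr(t) = peak * (1 - t / total).
resume_lr = peak_lr * (total_steps - resume_step) / total_steps

# Share of all 64-token blocks held out for evaluation.
eval_fraction = eval_blocks / (train_blocks + eval_blocks)

print(total_steps, f"{resume_lr:.3g}", f"{eval_fraction:.2%}")
```

Under these assumptions the planned schedule is 28,902 optimizer steps, the rate at step 20,000 is about 6.16e-6, and the evaluation split is about 1% of the 1,868,414 chunks.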
+## Eval Loss Curve
+| Step | Eval Loss |
+|---|---|
+| 500 | 3.830 |
+| 1,000 | 3.599 |
+| 2,000 | 3.336 |
+| 5,000 | 3.066 |
+| 8,500 | 2.945 |
+| 20,500 | 2.773 |
+| 21,000 | 2.783 |
+| **21,500** | **2.773 ← best** |
+## Usage
+```python
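+from transformers import pipeline
+# Build a fill-mask pipeline that returns the 3 most likely tokens for [MASK].
+unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)
+# Input means roughly: "I'm not happy with this service [MASK], really."
+results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
+for r in results:
+    # Each result carries the predicted token and its probability.
+    print(r["token_str"], round(r["score"], 4))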
+```
+```python
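+from transformers import AutoTokenizer, AutoModelForMaskedLM
+# Load the tokenizer and the masked-LM model directly, e.g. for custom inference or further pre-training.
+tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
+model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")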
+```
+For downstream classification tasks (emotion, sentiment, sarcasm):
+```python
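+from transformers import AutoModel
+# Load only the encoder; the MLM head is not used for classification.
+encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
+# Attach your classification head on top of the encoder outputs, e.g. outputs.last_hidden_state[:, 0] (the [CLS] vector) or outputs.pooler_output.
+# Note: if the checkpoint was saved from a masked-LM model, the pooler may be freshly initialized, so fine-tune it before relying on pooler_output.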
+```
+## Known Warnings
+**LayerNorm naming:** Loading this model may produce warnings about missing or unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming mismatch between the older checkpoint convention inherited from MARBERTv2 and newer Transformers versions. The weights load correctly; the warning is cosmetic and can be safely ignored.
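The warning concerns parameter names only. Purely as an illustration of the mapping between the two conventions, the sketch below renames legacy keys in a plain dictionary. The helper `rename_layernorm_keys` is hypothetical, the values are placeholder strings rather than tensors, and this is not a description of how Transformers handles the keys internally. You do not need to run anything like this to use the model.

```python
def rename_layernorm_keys(state_dict):
    """Return a copy with legacy LayerNorm.gamma/beta keys renamed to weight/bias."""
    renamed = {}
    for key, value in state_dict.items():
        if key.endswith("LayerNorm.gamma"):
            key = key[: -len("gamma")] + "weight"
        elif key.endswith("LayerNorm.beta"):
            key = key[: -len("beta")] + "bias"
        renamed[key] = value
    return renamed

# Placeholder state dict with two legacy keys and one key that needs no change.
legacy = {
    "bert.embeddings.LayerNorm.gamma": "scale-tensor",
    "bert.embeddings.LayerNorm.beta": "shift-tensor",
    "bert.embeddings.word_embeddings.weight": "embedding-tensor",
}
converted = rename_layernorm_keys(legacy)
print(sorted(converted))
```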
+## Intended Downstream Tasks
+This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:
+- **Emotion Classification** — Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
+- **Sarcasm Detection** — Egyptian Arabic sarcasm including culturally-specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
+- **Sentiment Analysis** — Positive / Negative / Neutral classification for customer interaction data
+## Model Lineage
+```
+UBC-NLP/MARBERTv2
+    └── T0KII/masribert  (v1 — MLM on MASRISET, 57K steps)
+            └── T0KII/MASRIBERTv2  (v2 — MLM on EFC + SFT, 21.5K steps)
+```
+## Citation
+If you use this model, please cite the original MARBERTv2 paper:
+```bibtex
+@inproceedings{abdul-mageed-etal-2021-arbert,
+    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
+    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
+    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
+    year = "2021"
+}
+```