---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT — Egyptian Arabic Language Model

MasriBERT is a domain-adapted BERT model for **Egyptian Arabic (Masri/Ammiya)**, produced by continued MLM pre-training of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on **MASRISET** — a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis — with a specific focus on conversational and call-center language.

---

## Model Details

| Attribute | Value |
|---|---|
| **Base Model** | `UBC-NLP/MARBERTv2` |
| **Architecture** | BERT (12 layers, 768 hidden, 12 heads) |
| **Task** | Masked Language Modeling (MLM) |
| **Language** | Egyptian Arabic (`ar-EG`) |
| **Training Corpus** | MASRISET — 1.3M+ rows |
| **Final Eval Loss** | 4.523 (best checkpoint) |
| **Final Perplexity** | 92.98 |
| **Training Epochs** | 3 |

---

## Training Data — MASRISET

MASRISET was assembled and cleaned specifically for this project. It combines the following sources:

**HuggingFace Datasets**
- `hard` — Egyptian Arabic sentiment/review data
- `ar_res_reviews` — Arabic restaurant reviews
- `arbml/TEAD` — Arabic tweet corpus

**Kaggle — [Two Million Rows Egyptian Datasets](https://www.kaggle.com/datasets/mostafanofal/two-million-rows-egyptian-datasets)**
- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments` — Al-Youm Al-Sabaa news comments
- `Egyptian Tweets` — Egyptian Twitter corpus
- `TaghreedT` — Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets` — Telecom Egypt customer interactions

All sources were deduplicated, cleaned, and filtered to a minimum of 5 tokens per sample.

### Text Cleaning Pipeline

The following normalization was applied uniformly across all sources:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- **Alef normalization**: `إأآا → ا`
- **Alef maqsura**: `ى → ي`
- **Hamza variants**: `ؤ, ئ → ء`
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. `مششششي → مشي`)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning

---

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |

Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
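For reference, here is a minimal sketch of how this continued pre-training stage could be set up with the Hugging Face `Trainer`, mirroring the hyperparameters in the table above. The corpus file names (`masriset_train.txt`, `masriset_val.txt`) are placeholders (MASRISET is not bundled with this repository), and the exact script used to train MasriBERT may differ.

```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForWholeWordMask,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Placeholder corpus files; MASRISET itself is not distributed with this card.
dataset = load_dataset(
    "text",
    data_files={"train": "masriset_train.txt", "validation": "masriset_val.txt"},
)

def tokenize(batch):
    # Block size of 64 tokens, as listed in the table above.
    return tokenizer(batch["text"], truncation=True, max_length=64)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Whole Word Masking at 20% probability.
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.20)

args = TrainingArguments(
    output_dir="masribert-mlm",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,   # effective batch size = 32
    learning_rate=2e-5,
    warmup_ratio=0.06,
    weight_decay=0.01,
    fp16=True,
    eval_strategy="steps",           # `evaluation_strategy` in older transformers releases
    eval_steps=2500,
    save_steps=2500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    # Early stopping patience of 5 evaluations, as in the table above.
    callbacks=[EarlyStoppingCallback(early_stopping_patience=5)],
)
trainer.train()
```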
### Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| **57,500** | **4.523 ← best** |
| Final (57,915) | 4.532 |

---

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)

# Egyptian sarcasm example
results = unmasker("تسلم ايدكم بجد، الشحنة وصلت [MASK] خالص كالعادة.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```

For downstream classification tasks (emotion, sentiment, sarcasm), load with `AutoModel` and attach your classification head:

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/masribert")
```

---

## Intended Downstream Tasks

This model was trained as a backbone for the following tasks in the **Kalamna** Egyptian Arabic AI pipeline:

- **Emotion Classification** — Multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection** — Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** — Positive / Negative / Neutral classification for customer interaction data

---

## Important Notes

**LayerNorm naming warning**: When loading this model you may see warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded — the warning is cosmetic and can be safely ignored.

**Best checkpoint**: The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.

---

## Citation

If you use this model, please cite the original MARBERT paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year = "2021"
}
```

---

## License

Apache 2.0 — inherited from the base model. See the [MARBERTv2 model card](https://huggingface.co/UBC-NLP/MARBERTv2) for details.