---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT v2: Egyptian Arabic Language Model

MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing the **conversational and dialogue register**, the primary register of customer-facing NLP applications.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.

## What Changed from v1

| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |

The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
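
The perplexity figures follow directly from the eval loss, since perplexity is the exponential of the cross-entropy loss. A quick sanity check on the v2 number:

```python
import math

# Perplexity of a masked LM is exp(cross-entropy eval loss)
v2_eval_loss = 2.773
v2_perplexity = math.exp(v2_eval_loss)
print(round(v2_perplexity, 2))  # ~16.0, matching the reported value
```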

## Training Corpus

Two sources were used, targeting conversational Egyptian Arabic:

**faisalq/EFC-mini (Egyptian Forums Corpus)**
Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.

**MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue)**
Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.

Both sources were deduplicated by MD5 hash and shuffled with seed 42, and a minimum of 5 words per sample was enforced after cleaning.

After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
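
The dedup-and-filter step can be sketched roughly as follows (a minimal illustration using only the facts above; the function name and exact rule order are assumptions, not the actual training script):

```python
import hashlib
import random

def dedup_and_filter(texts, min_words=5, seed=42):
    """MD5-deduplicate, enforce a minimum word count, then shuffle with a fixed seed."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: skip
        seen.add(digest)
        if len(text.split()) >= min_words:
            kept.append(text)
    random.Random(seed).shuffle(kept)
    return kept
```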

## Text Cleaning Pipeline

Same normalization as v1, applied uniformly:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إأآا → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. هههههه → هه)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
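
A minimal regex sketch of these normalization rules (illustrative only; the actual cleaning script may differ in rule order and edge-case handling):

```python
import re

def clean_text(text: str) -> str:
    # Strip URLs, email addresses, @mentions, and hashtag symbols
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    # Alef, alef-maqsura, and hamza normalization
    text = re.sub(r"[إأآ]", "ا", text)
    text = text.replace("ى", "ي")
    text = re.sub(r"[ؤئ]", "ء", text)
    # Remove Arabic tashkeel (diacritics, U+064B..U+0652)
    text = re.sub(r"[\u064B-\u0652]", "", text)
    # Cap repeated characters at 2 (هههههه -> هه)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Remove English characters (emojis are left untouched)
    text = re.sub(r"[A-Za-z]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```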

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |

Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via HuggingFace Hub.
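
The resume learning rate in the table falls out of the linear-decay schedule. With ~1.85M training blocks, an effective batch of 128, and 2 epochs, the schedule spans roughly 28,900 optimizer steps, so the LR remaining at step 20,000 is about 2e-5 × (1 − 20,000/28,902) ≈ 6.16e-6. This is an approximate reconstruction (the exact total depends on dataloader rounding):

```python
train_blocks = 1_849_729
effective_batch = 64 * 2        # per-device batch x gradient accumulation
epochs = 2
peak_lr = 2e-5

steps_per_epoch = train_blocks // effective_batch  # 14,451
total_steps = steps_per_epoch * epochs             # 28,902

# Linear decay to zero: LR remaining at the resume step
resume_step = 20_000
resume_lr = peak_lr * (1 - resume_step / total_steps)
print(f"{resume_lr:.2e}")  # ~6.16e-06
```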

## Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773** (best) |

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)

results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```

For downstream classification tasks (emotion, sentiment, sarcasm):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
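
One way to attach such a head (a PyTorch sketch, not part of this repo; the class name is illustrative, `num_labels=8` matches the emotion task below, and pooling the [CLS] hidden state is one common choice among several):

```python
import types

import torch
import torch.nn as nn

class MasriBertClassifier(nn.Module):
    """Encoder plus a linear head; `encoder` is any BERT-style AutoModel."""

    def __init__(self, encoder, num_labels=8, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(self.dropout(cls))
```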

## Known Warnings

**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.

## Intended Downstream Tasks

This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:

- **Emotion Classification**: multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection**: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis**: positive / negative / neutral classification for customer interaction data

## Model Lineage

```
UBC-NLP/MARBERTv2
    └── T0KII/masribert  (v1: MLM on MASRISET, 57K steps)
            └── T0KII/MASRIBERTv2  (v2: MLM on EFC + SFT, 21.5K steps)
```

## Citation

If you use this model, please cite the original MARBERTv2 paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    year = "2021"
}
```