MasriBERT v2: Egyptian Arabic Language Model
MasriBERT v2 is a continued MLM pre-training of MasriBERT v1 (itself built on UBC-NLP/MARBERTv2) on a new, higher-quality Egyptian Arabic corpus emphasizing the conversational and dialogue register, the primary register of customer-facing NLP applications.
It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.
What Changed from v1
| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | 2.773 |
| Final perplexity | 92.98 | 16.00 |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |
The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
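The reported perplexities follow directly from the eval losses, since masked-LM perplexity is the exponential of the cross-entropy loss. A quick sanity check:

```python
import math

# Masked-LM perplexity is exp(cross-entropy eval loss).
def perplexity(eval_loss: float) -> float:
    return math.exp(eval_loss)

print(round(perplexity(2.773), 2))  # v2 final eval loss -> ~16.0
```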
Training Corpus
Two sources were used, targeting conversational Egyptian Arabic:
- faisalq/EFC-mini (Egyptian Forums Corpus): forum posts and comments from Egyptian Arabic internet forums. This long-form conversational text captures how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.
- MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue): supervised fine-tuning dialogue data in Egyptian Arabic, i.e. instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.
Both sources were deduplicated (exact-match MD5 hashing) and shuffled with seed 42, and a minimum length of 5 words per sample was enforced after cleaning.
After deduplication: 1,946,195 rows → 1,868,414 chunks of 64 tokens.
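As a sketch of the preprocessing described above (helper names are illustrative; the actual pipeline code is not published), the dedup/shuffle/filter pass and the 64-token chunking could look like:

```python
import hashlib
import random

def dedupe_and_filter(rows, min_words=5, seed=42):
    """Exact MD5 deduplication, then seeded shuffle and minimum-length filter."""
    seen, unique = set(), []
    for text in rows:
        h = hashlib.md5(text.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(text)
    random.Random(seed).shuffle(unique)
    return [t for t in unique if len(t.split()) >= min_words]

def chunk_token_ids(token_ids, block_size=64):
    """Concatenated corpus token ids -> fixed 64-token blocks (remainder dropped)."""
    n = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, n, block_size)]
```

Dropping the trailing remainder is why the chunk count (1,868,414) is smaller than a naive rows-times-length estimate would suggest.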
Text Cleaning Pipeline
Same normalization as v1, applied uniformly:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إأآا → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. a letter repeated six times is collapsed to two)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
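A hypothetical re-implementation of these rules with plain regexes (the patterns and their ordering are assumptions, not the released pipeline; emojis survive because nothing below targets them):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
TASHKEEL_RE = re.compile(r"[\u064B-\u0652]")  # Arabic diacritics

def clean(text: str) -> str:
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")               # keep hashtag word, drop symbol
    text = re.sub(r"[إأآ]", "ا", text)           # alef normalization
    text = text.replace("ى", "ي")               # alef maqsura -> yaa
    text = re.sub(r"[ؤئ]", "ء", text)            # hamza variants
    text = TASHKEEL_RE.sub("", text)             # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)   # cap repeated chars at 2
    text = re.sub(r"[A-Za-z]", "", text)         # remove English characters
    return re.sub(r"\s+", " ", text).strip()
```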
Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |
Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via HuggingFace Hub.
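Token-level masking at 20% can be sketched as follows, assuming the standard BERT 80/10/10 replacement scheme used by Hugging Face's `DataCollatorForLanguageModeling` (the mask id below is a placeholder, not MARBERTv2's actual [MASK] id):

```python
import random

MASK_ID = 103  # illustrative [MASK] id; MARBERTv2's real id differs

def mask_tokens(input_ids, vocab_size, mlm_prob=0.20, rng=random.Random(0)):
    """Token-level masking: each position is independently selected with
    p=0.20; of selected tokens, 80% -> [MASK], 10% -> random id, 10% kept."""
    masked = list(input_ids)
    labels = [-100] * len(input_ids)  # -100 positions are ignored by the loss
    for i, tok in enumerate(input_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok
            roll = rng.random()
            if roll < 0.8:
                masked[i] = MASK_ID
            elif roll < 0.9:
                masked[i] = rng.randrange(vocab_size)
            # else: keep the original token (model must still predict it)
    return masked, labels
```

Because selection is per token rather than per whole word, subword pieces of the same word can be masked independently, which is exactly what the disabled whole-word masking would have prevented.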
Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| 21,500 | 2.773 (best) |
Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)
results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")

for r in results:
    print(r["token_str"], round(r["score"], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```
For downstream classification tasks (emotion, sentiment, sarcasm):
```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output
# or encoder.last_hidden_state
```
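One way to wire that up (the class and sizes are illustrative; hidden size 768 assumes the BERT-base geometry MARBERTv2 uses):

```python
import torch.nn as nn

class MasriClassifier(nn.Module):
    """Illustrative wrapper: a linear head over the encoder's pooled [CLS] output."""
    def __init__(self, encoder, num_labels, hidden_size=768, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        return self.classifier(self.dropout(out.pooler_output))
```

For example, `MasriClassifier(encoder, num_labels=8)` would fit the 8-class emotion task listed below.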
Known Warnings
LayerNorm naming: Loading this model produces warnings about missing/unexpected keys (LayerNorm.weight / LayerNorm.bias vs LayerNorm.gamma / LayerNorm.beta). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded; the warning is cosmetic and can be safely ignored.
Intended Downstream Tasks
This model is the backbone for the following tasks in the Kalamna Egyptian Arabic AI call-center pipeline:
- Emotion Classification: multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- Sarcasm Detection: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- Sentiment Analysis: positive / negative / neutral classification for customer interaction data
Model Lineage
```
UBC-NLP/MARBERTv2
└── T0KII/masribert (v1: MLM on MASRISET, 57K steps)
    └── T0KII/MASRIBERTv2 (v2: MLM on EFC + SFT, 21.5K steps)
```
Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title     = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author    = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year      = "2021"
}
```