---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT: Egyptian Arabic Language Model

MasriBERT is a domain-adapted BERT model for Egyptian Arabic (Masri/Ammiya), produced by continued MLM pre-training of UBC-NLP/MARBERTv2 on MASRISET, a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks, including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on conversational and call-center language.
## Model Details
| Attribute | Value |
|---|---|
| Base Model | UBC-NLP/MARBERTv2 |
| Architecture | BERT (12 layers, 768 hidden, 12 heads) |
| Task | Masked Language Modeling (MLM) |
| Language | Egyptian Arabic (ar-EG) |
| Training Corpus | MASRISET (1.3M+ rows) |
| Final Eval Loss | 4.523 (best checkpoint) |
| Final Perplexity | 92.98 |
| Training Epochs | 3 |
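The reported perplexity is simply the exponential of the eval cross-entropy loss, so the two metrics in the table can be cross-checked directly:

```python
import math

# Perplexity for a masked language model is exp(mean cross-entropy loss).
def perplexity(eval_loss: float) -> float:
    return math.exp(eval_loss)

print(perplexity(4.523))  # best-checkpoint loss
print(perplexity(4.532))  # final-epoch loss, close to the reported 92.98
```

The small gap between exp(4.532) and the reported 92.98 comes from rounding of the published loss values.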
## Training Data: MASRISET
MASRISET was assembled and cleaned specifically for this project. It combines the following sources:
### HuggingFace Datasets

- `hard`: Egyptian Arabic sentiment/review data
- `ar_res_reviews`: Arabic restaurant reviews
- `arbml/TEAD`: Arabic tweet corpus
### Kaggle: Two Million Rows Egyptian Datasets

- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments`: Al-Youm Al-Sabaa news comments
- `Egyptian Tweets`: Egyptian Twitter corpus
- `TaghreedT`: Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets`: Telecom Egypt customer interactions
All sources were deduplicated, cleaned, and filtered to a minimum of 5 tokens per sample.
## Text Cleaning Pipeline
The following normalization was applied uniformly across all sources:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إ / أ / آ → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. شششش → شش)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning
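The steps above can be sketched as a single cleaning function. The exact regexes used for MASRISET are not published, so the patterns here are illustrative:

```python
import re

# Illustrative re-implementation of the cleaning pipeline described above.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
MENTION_RE = re.compile(r"@\w+")
TASHKEEL_RE = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670]")  # Arabic diacritics
LATIN_RE = re.compile(r"[A-Za-z]+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # 3+ repeats of any character

NORMALIZE = str.maketrans({"إ": "ا", "أ": "ا", "آ": "ا",   # alef variants
                           "ى": "ي",                        # alef maqsura
                           "ؤ": "ء", "ئ": "ء"})             # hamza variants

def clean(text):
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")          # drop hashtag symbol, keep the word
    text = text.translate(NORMALIZE)
    text = TASHKEEL_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1\1", text)   # cap repeated characters at 2
    text = LATIN_RE.sub(" ", text)        # remove English characters
    text = " ".join(text.split())
    # Enforce the minimum-length filter (5 words post-cleaning).
    return text if len(text.split()) >= 5 else None
```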
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |
Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
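Whole Word Masking groups WordPiece continuations (pieces prefixed with `##`) with their parent piece, so an entire word is masked or kept as a unit. A toy illustration of the grouping logic (not the actual HuggingFace collator, which additionally applies the standard 80/10/10 mask/random/keep split):

```python
import random

def whole_word_mask(tokens, mlm_probability=0.20, mask_token="[MASK]", seed=0):
    """Toy Whole Word Masking: '##' continuation pieces are grouped with
    the preceding piece, and each whole group is masked together."""
    rng = random.Random(seed)
    # Group subword indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for group in words:
        if rng.random() < mlm_probability:  # mask the whole word at once
            for i in group:
                masked[i] = mask_token
    return masked
```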
## Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| 57,500 | 4.523 (best) |
| Final (57,915) | 4.532 |
## Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)

# Egyptian sarcasm example (roughly: "thanks a bunch, seriously,
# the shipment arrived [MASK] at all, as usual")
results = unmasker("تسلم الدنيا بجد، الشحنة وصلت [MASK] خالص كالعادة.")

for r in results:
    print(r["token_str"], round(r["score"], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```
For downstream classification tasks (emotion, sentiment, sarcasm), load with `AutoModel` and attach your classification head:

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/masribert")
```
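A sketch of that pattern, assuming a PyTorch head over the [CLS] representation (the class name and head layout here are illustrative, not the actual Kalamna ensemble architecture):

```python
import torch
import torch.nn as nn

class MasriBertClassifier(nn.Module):
    """Hypothetical example: MasriBERT encoder plus a linear classification head."""

    def __init__(self, encoder, num_labels: int, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] token representation
        return self.head(self.dropout(cls))        # logits, shape (batch, num_labels)
```

Fine-tune the whole module end-to-end, or freeze the encoder and train only the head for a cheaper baseline.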
## Intended Downstream Tasks
This model was trained as a backbone for the following tasks in the Kalamna Egyptian Arabic AI pipeline:
- **Emotion Classification**: multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection**: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis**: positive / negative / neutral classification for customer interaction data
## Important Notes
**LayerNorm naming warning:** When loading this model you will see warnings about missing/unexpected keys (`LayerNorm.weight`/`LayerNorm.bias` vs `LayerNorm.gamma`/`LayerNorm.beta`). This is a known HuggingFace naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.
**Best checkpoint:** The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year = "2021"
}
```
## License
Apache 2.0, inherited from the base model. See the MARBERTv2 license for details.