---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT: Egyptian Arabic Language Model

MasriBERT is a domain-adapted BERT model for Egyptian Arabic (Masri/Ammiya), produced by continued MLM pre-training of UBC-NLP/MARBERTv2 on MASRISET, a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks, including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on conversational and call-center language.
## Model Details
| Attribute | Value |
|---|---|
| Base Model | UBC-NLP/MARBERTv2 |
| Architecture | BERT (12 layers, 768 hidden, 12 heads) |
| Task | Masked Language Modeling (MLM) |
| Language | Egyptian Arabic (ar-EG) |
| Training Corpus | MASRISET (1.3M+ rows) |
| Final Eval Loss | 4.523 (best checkpoint) |
| Final Perplexity | 92.98 |
| Training Epochs | 3 |
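The reported perplexity is simply the exponential of the eval cross-entropy loss, so the two metrics in the table can be cross-checked directly:

```python
import math

# Perplexity for a masked language model is exp(mean cross-entropy loss).
def perplexity(eval_loss: float) -> float:
    return math.exp(eval_loss)

print(perplexity(4.523))  # best-checkpoint loss
print(perplexity(4.532))  # final-epoch loss, close to the reported 92.98
```

The small gap between exp(4.532) and the reported 92.98 comes from rounding of the published loss values.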
## Training Data: MASRISET
MASRISET was assembled and cleaned specifically for this project. It combines the following sources:
### HuggingFace Datasets

- `hard`: Egyptian Arabic sentiment/review data
- `ar_res_reviews`: Arabic restaurant reviews
- `arbml/TEAD`: Arabic tweet corpus
### Kaggle: Two Million Rows Egyptian Datasets

- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments`: Al-Youm Al-Sabaa news comments
- `Egyptian Tweets`: Egyptian Twitter corpus
- `TaghreedT`: Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets`: Telecom Egypt customer interactions
All sources were deduplicated, cleaned, and filtered to a minimum of 5 tokens per sample.
## Text Cleaning Pipeline
The following normalization was applied uniformly across all sources:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إ / أ / آ → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. شششش → شش)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning
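The steps above can be sketched as a single cleaning function. The exact regexes used for MASRISET are not published, so the patterns here are illustrative:

```python
import re

# Illustrative re-implementation of the cleaning pipeline described above.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+")
MENTION_RE = re.compile(r"@\w+")
TASHKEEL_RE = re.compile(r"[\u0610-\u061A\u064B-\u065F\u0670]")  # Arabic diacritics
LATIN_RE = re.compile(r"[A-Za-z]+")
REPEAT_RE = re.compile(r"(.)\1{2,}")  # 3+ repeats of any character

NORMALIZE = str.maketrans({"إ": "ا", "أ": "ا", "آ": "ا",   # alef variants
                           "ى": "ي",                        # alef maqsura
                           "ؤ": "ء", "ئ": "ء"})             # hamza variants

def clean(text):
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")          # drop hashtag symbol, keep the word
    text = text.translate(NORMALIZE)
    text = TASHKEEL_RE.sub("", text)
    text = REPEAT_RE.sub(r"\1\1", text)   # cap repeated characters at 2
    text = LATIN_RE.sub(" ", text)        # remove English characters
    text = " ".join(text.split())
    # Enforce the minimum-length filter (5 words post-cleaning).
    return text if len(text.split()) >= 5 else None
```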
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |
Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
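Whole Word Masking groups WordPiece continuations (pieces prefixed with `##`) with their parent piece, so an entire word is masked or kept as a unit. A toy illustration of the grouping logic (not the actual HuggingFace collator, which additionally applies the standard 80/10/10 mask/random/keep split):

```python
import random

def whole_word_mask(tokens, mlm_probability=0.20, mask_token="[MASK]", seed=0):
    """Toy Whole Word Masking: '##' continuation pieces are grouped with
    the preceding piece, and each whole group is masked together."""
    rng = random.Random(seed)
    # Group subword indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for group in words:
        if rng.random() < mlm_probability:  # mask the whole word at once
            for i in group:
                masked[i] = mask_token
    return masked
```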
## Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| 57,500 | 4.523 (best) |
| Final (57,915) | 4.532 |
## Usage
```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)

# Egyptian sarcasm example (roughly: "thanks a bunch, seriously,
# the shipment arrived [MASK] at all, as usual")
results = unmasker("تسلم الدنيا بجد، الشحنة وصلت [MASK] خالص كالعادة.")

for r in results:
    print(r["token_str"], round(r["score"], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```
For downstream classification tasks (emotion, sentiment, sarcasm), load with `AutoModel` and attach your classification head:

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/masribert")
```
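A sketch of that pattern, assuming a PyTorch head over the [CLS] representation (the class name and head layout here are illustrative, not the actual Kalamna ensemble architecture):

```python
import torch
import torch.nn as nn

class MasriBertClassifier(nn.Module):
    """Hypothetical example: MasriBERT encoder plus a linear classification head."""

    def __init__(self, encoder, num_labels: int, hidden_size: int = 768):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]          # [CLS] token representation
        return self.head(self.dropout(cls))        # logits, shape (batch, num_labels)
```

Fine-tune the whole module end-to-end, or freeze the encoder and train only the head for a cheaper baseline.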
## Intended Downstream Tasks
This model was trained as a backbone for the following tasks in the Kalamna Egyptian Arabic AI pipeline:
- **Emotion Classification**: multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection**: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis**: positive / negative / neutral classification for customer interaction data
## Important Notes
**LayerNorm naming warning:** When loading this model you will see warnings about missing/unexpected keys (`LayerNorm.weight`/`LayerNorm.bias` vs `LayerNorm.gamma`/`LayerNorm.beta`). This is a known HuggingFace naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.
**Best checkpoint:** The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
  title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
  author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
  year = "2021"
}
```
## License
Apache 2.0, inherited from the base model. See the MARBERTv2 license for details.