MasriBERT v2: Egyptian Arabic Language Model

MasriBERT v2 is a continued MLM pre-training of MasriBERT v1 (itself built on UBC-NLP/MARBERTv2) on a new, higher-quality Egyptian Arabic corpus emphasizing the conversational and dialogue register, the primary register of customer-facing NLP applications.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.

What Changed from v1

|                   | MasriBERT v1 | MasriBERT v2 |
|-------------------|--------------|--------------|
| Base model        | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus   | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register     | Social media / news | Conversational / instructional dialogue |
| Training steps    | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss   | 4.523 | 2.773 |
| Final perplexity  | 92.98 | 16.00 |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |

The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).

Training Corpus

Two sources were used, targeting conversational Egyptian Arabic:

faisalq/EFC-mini (Egyptian Forums Corpus): Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.

MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue): Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.
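A minimal sketch of that stripping step. The `messages`/`content` field names are an assumption about a typical SFT chat schema, not confirmed from the dataset card:

```python
def strip_chat_format(example):
    """Flatten an SFT chat example into raw text for MLM pre-training.
    The 'messages'/'content' field names are schema assumptions."""
    return " ".join(turn["content"] for turn in example["messages"])

# Example: a two-turn dialogue collapses into one raw-text sample.
raw = strip_chat_format({
    "messages": [
        {"role": "user", "content": "ازيك؟"},
        {"role": "assistant", "content": "تمام الحمد لله"},
    ]
})
```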

Both sources were deduplicated via MD5 hashing, shuffled with seed 42, and filtered to a minimum of 5 words per sample after cleaning.

After deduplication: 1,946,195 rows → 1,868,414 chunks of 64 tokens
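The preparation steps above can be sketched as follows. Function names are illustrative, not the actual training scripts:

```python
import hashlib
import random

def dedup_shuffle_filter(samples, seed=42, min_words=5):
    """MD5-based exact deduplication, seeded shuffle, minimum-length filter."""
    seen, unique = set(), []
    for text in samples:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    random.Random(seed).shuffle(unique)
    return [t for t in unique if len(t.split()) >= min_words]

def chunk_tokens(token_ids, block_size=64):
    """Group a flat token stream into fixed 64-token blocks for MLM,
    dropping the trailing remainder (hence rows -> slightly fewer chunks)."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```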

Text Cleaning Pipeline

Same normalization as v1, applied uniformly:

  • Removed URLs, email addresses, @mentions, and hashtag symbols
  • Alef normalization: إأآا → ا
  • Alef maqsura: ى → ي
  • Hamza variants: ؤ, ئ → ء
  • Removed all Arabic tashkeel (diacritics)
  • Capped repeated characters at 2 (e.g. هههههه → هه)
  • Removed English characters
  • Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
  • Minimum 5 words per sample enforced post-cleaning
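The steps above can be approximated like this. The regexes are illustrative, not the exact training pipeline, and step order matters (e.g. repeat-capping here runs after alef normalization):

```python
import re

# Arabic diacritics (tashkeel) range; an approximation of the removed classes.
TASHKEEL = re.compile(r"[\u064B-\u065F\u0670]")

def clean_egyptian(text):
    """Illustrative approximation of the v1/v2 cleaning steps."""
    text = re.sub(r"https?://\S+", " ", text)      # URLs
    text = re.sub(r"\S+@\S+", " ", text)           # email addresses
    text = re.sub(r"@\w+", " ", text)              # @mentions
    text = text.replace("#", " ")                  # hashtag symbol only, keep the word
    text = re.sub(r"[إأآ]", "ا", text)             # alef normalization
    text = text.replace("ى", "ي")                  # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)              # hamza variants
    text = TASHKEEL.sub("", text)                  # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)           # drop English characters
    return re.sub(r"\s+", " ", text).strip()
```

Emojis pass through untouched, matching the preservation rule above.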

Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |

Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via the Hugging Face Hub.
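A hedged sketch of how this configuration maps onto `transformers.TrainingArguments`; the `output_dir` and resume mechanics are illustrative, not the actual training script:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Sketch of the table above. MLM probability (0.20) and block size (64)
# live in the data collator / chunking step, not in TrainingArguments.
args = TrainingArguments(
    output_dir="masribert-v2",            # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,        # effective batch = 128
    learning_rate=6.16e-6,                # resume LR: linear-decay value at step 20,000
    lr_scheduler_type="linear",
    warmup_steps=0,                       # no warmup on resume
    weight_decay=0.01,
    fp16=True,
    eval_strategy="steps",                # `evaluation_strategy` in older releases
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```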

Eval Loss Curve

| Step | Eval Loss |
|------|-----------|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| 21,500 | 2.773 ← best |
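The perplexity figures follow directly from the eval loss, since MLM perplexity is reported as exp(loss); a quick check:

```python
import math

# Perplexity = exp(cross-entropy eval loss); reproduces the headline numbers.
v2_ppl = math.exp(2.773)       # ~16.0, the reported v2 perplexity
improvement = 92.98 / 16.00    # reported v1 / v2 perplexity ratio, ~5.81x
print(round(v2_ppl, 2), round(improvement, 2))
```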

Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)

results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

To load the tokenizer and model directly:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```

For downstream classification tasks (emotion, sentiment, sarcasm):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
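One way to attach such a head, sketched in plain PyTorch. `hidden_size=768` is MARBERTv2's BERT-base width and `num_labels=8` matches the emotion label set listed below; both are assumptions of this sketch, not a prescribed architecture:

```python
import torch.nn as nn

class MasriClassifier(nn.Module):
    """Hypothetical head for emotion / sentiment / sarcasm fine-tuning."""

    def __init__(self, encoder, hidden_size=768, num_labels=8):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] representation
        return self.classifier(self.dropout(cls))

# Usage with the encoder loaded above:
# model = MasriClassifier(encoder)
```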

Known Warnings

LayerNorm naming: Loading this model produces warnings about missing/unexpected keys (LayerNorm.weight / LayerNorm.bias vs LayerNorm.gamma / LayerNorm.beta). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded; the warning is cosmetic and can be safely ignored.

Intended Downstream Tasks

This model is the backbone for the following tasks in the Kalamna Egyptian Arabic AI call-center pipeline:

  • Emotion Classification: Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
  • Sarcasm Detection: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
  • Sentiment Analysis: Positive / Negative / Neutral classification for customer interaction data

Model Lineage

UBC-NLP/MARBERTv2
    └── T0KII/masribert  (v1: MLM on MASRISET, 57K steps)
            └── T0KII/MASRIBERTv2  (v2: MLM on EFC + SFT, 21.5K steps)

Citation

If you use this model, please cite the original MARBERTv2 paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    year = "2021"
}
```