MasriBERT v2: Egyptian Arabic Language Model

MasriBERT v2 is a continued MLM pre-training of MasriBERT v1 (itself built on UBC-NLP/MARBERTv2) on a new, higher-quality Egyptian Arabic corpus emphasizing the conversational and dialogue register, the primary register of customer-facing NLP applications.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.

What Changed from v1

|                   | MasriBERT v1 | MasriBERT v2 |
|-------------------|--------------|--------------|
| Base model        | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus   | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register     | Social media / news | Conversational / instructional dialogue |
| Training steps    | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss   | 4.523 | 2.773 |
| Final perplexity  | 92.98 | 16.00 |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |

The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).

Training Corpus

Two sources were used, targeting conversational Egyptian Arabic:

faisalq/EFC-mini (Egyptian Forums Corpus): Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.

MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue): Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.
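A minimal sketch of that stripping step. The `messages`/`content` field names are an assumption about a typical SFT chat schema, not confirmed from the dataset card:

```python
def strip_chat_format(example):
    """Flatten an SFT chat example into raw text for MLM pre-training.
    The 'messages'/'content' field names are schema assumptions."""
    return " ".join(turn["content"] for turn in example["messages"])

# Example: a two-turn dialogue collapses into one raw-text sample.
raw = strip_chat_format({
    "messages": [
        {"role": "user", "content": "ازيك؟"},
        {"role": "assistant", "content": "تمام الحمد لله"},
    ]
})
```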

Both sources were deduplicated via MD5 hashing, shuffled with seed 42, and filtered to a minimum of 5 words per sample after cleaning.

After deduplication: 1,946,195 rows → 1,868,414 chunks of 64 tokens
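The preparation steps above can be sketched as follows. Function names are illustrative, not the actual training scripts:

```python
import hashlib
import random

def dedup_shuffle_filter(samples, seed=42, min_words=5):
    """MD5-based exact deduplication, seeded shuffle, minimum-length filter."""
    seen, unique = set(), []
    for text in samples:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    random.Random(seed).shuffle(unique)
    return [t for t in unique if len(t.split()) >= min_words]

def chunk_tokens(token_ids, block_size=64):
    """Group a flat token stream into fixed 64-token blocks for MLM,
    dropping the trailing remainder (hence rows -> slightly fewer chunks)."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```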

Text Cleaning Pipeline

Same normalization as v1, applied uniformly:

  • Removed URLs, email addresses, @mentions, and hashtag symbols
  • Alef normalization: إأآا → ا
  • Alef maqsura: ى → ي
  • Hamza variants: ؤ, ئ → ء
  • Removed all Arabic tashkeel (diacritics)
  • Capped repeated characters at 2 (e.g. هههههه → هه)
  • Removed English characters
  • Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
  • Minimum 5 words per sample enforced post-cleaning
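The steps above can be approximated like this. The regexes are illustrative, not the exact training pipeline, and step order matters (e.g. repeat-capping here runs after alef normalization):

```python
import re

# Arabic diacritics (tashkeel) range; an approximation of the removed classes.
TASHKEEL = re.compile(r"[\u064B-\u065F\u0670]")

def clean_egyptian(text):
    """Illustrative approximation of the v1/v2 cleaning steps."""
    text = re.sub(r"https?://\S+", " ", text)      # URLs
    text = re.sub(r"\S+@\S+", " ", text)           # email addresses
    text = re.sub(r"@\w+", " ", text)              # @mentions
    text = text.replace("#", " ")                  # hashtag symbol only, keep the word
    text = re.sub(r"[إأآ]", "ا", text)             # alef normalization
    text = text.replace("ى", "ي")                  # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)              # hamza variants
    text = TASHKEEL.sub("", text)                  # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)     # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)           # drop English characters
    return re.sub(r"\s+", " ", text).strip()
```

Emojis pass through untouched, matching the preservation rule above.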

Training Configuration

| Hyperparameter | Value |
|----------------|-------|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |

Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via the Hugging Face Hub.
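A hedged sketch of how this configuration maps onto `transformers.TrainingArguments`; the `output_dir` and resume mechanics are illustrative, not the actual training script:

```python
from transformers import TrainingArguments, EarlyStoppingCallback

# Sketch of the table above. MLM probability (0.20) and block size (64)
# live in the data collator / chunking step, not in TrainingArguments.
args = TrainingArguments(
    output_dir="masribert-v2",            # illustrative path
    num_train_epochs=2,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,        # effective batch = 128
    learning_rate=6.16e-6,                # resume LR: linear-decay value at step 20,000
    lr_scheduler_type="linear",
    warmup_steps=0,                       # no warmup on resume
    weight_decay=0.01,
    fp16=True,
    eval_strategy="steps",                # `evaluation_strategy` in older releases
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
early_stop = EarlyStoppingCallback(early_stopping_patience=3)
```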

Eval Loss Curve

| Step | Eval Loss |
|------|-----------|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| 21,500 | 2.773 ← best |
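The perplexity figures follow directly from the eval loss, since MLM perplexity is reported as exp(loss); a quick check:

```python
import math

# Perplexity = exp(cross-entropy eval loss); reproduces the headline numbers.
v2_ppl = math.exp(2.773)       # ~16.0, the reported v2 perplexity
improvement = 92.98 / 16.00    # reported v1 / v2 perplexity ratio, ~5.81x
print(round(v2_ppl, 2), round(improvement, 2))
```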

Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)

results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

To load the tokenizer and model directly:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```

For downstream classification tasks (emotion, sentiment, sarcasm):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
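One way to attach such a head, sketched in plain PyTorch. `hidden_size=768` is MARBERTv2's BERT-base width and `num_labels=8` matches the emotion label set listed below; both are assumptions of this sketch, not a prescribed architecture:

```python
import torch.nn as nn

class MasriClassifier(nn.Module):
    """Hypothetical head for emotion / sentiment / sarcasm fine-tuning."""

    def __init__(self, encoder, hidden_size=768, num_labels=8):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]           # [CLS] representation
        return self.classifier(self.dropout(cls))

# Usage with the encoder loaded above:
# model = MasriClassifier(encoder)
```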

Known Warnings

LayerNorm naming: Loading this model produces warnings about missing/unexpected keys (LayerNorm.weight / LayerNorm.bias vs LayerNorm.gamma / LayerNorm.beta). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded; the warning is cosmetic and can be safely ignored.

Intended Downstream Tasks

This model is the backbone for the following tasks in the Kalamna Egyptian Arabic AI call-center pipeline:

  • Emotion Classification: Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
  • Sarcasm Detection: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
  • Sentiment Analysis: Positive / Negative / Neutral classification for customer interaction data

Model Lineage

UBC-NLP/MARBERTv2
    └── T0KII/masribert  (v1: MLM on MASRISET, 57K steps)
            └── T0KII/MASRIBERTv2  (v2: MLM on EFC + SFT, 21.5K steps)

Citation

If you use this model, please cite the original MARBERTv2 paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    year = "2021"
}
```