

---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT – Egyptian Arabic Language Model

MasriBERT is a domain-adapted BERT model for Egyptian Arabic (Masri/Ammiya), produced by continued MLM pre-training of UBC-NLP/MARBERTv2 on MASRISET, a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on conversational and call-center language.


## Model Details

| Attribute | Value |
|---|---|
| Base Model | UBC-NLP/MARBERTv2 |
| Architecture | BERT (12 layers, 768 hidden, 12 heads) |
| Task | Masked Language Modeling (MLM) |
| Language | Egyptian Arabic (ar-EG) |
| Training Corpus | MASRISET (1.3M+ rows) |
| Final Eval Loss | 4.523 (best checkpoint) |
| Final Perplexity | 92.98 |
| Training Epochs | 3 |

## Training Data – MASRISET

MASRISET was assembled and cleaned specifically for this project. It combines the following sources:

### HuggingFace Datasets

- `hard` – Egyptian Arabic sentiment/review data
- `ar_res_reviews` – Arabic restaurant reviews
- `arbml/TEAD` – Arabic tweet corpus

### Kaggle – Two Million Rows Egyptian Datasets

- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments` – Al-Youm Al-Sabaa news comments
- Egyptian Tweets – Egyptian Twitter corpus
- `TaghreedT` – Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets` – Telecom Egypt customer interactions

All sources were deduplicated, cleaned, and filtered to a minimum of 5 tokens per sample.
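The deduplication and minimum-length filtering can be sketched as follows. This is a minimal illustration; `dedup_and_filter` is a hypothetical name, since the actual preprocessing code is not published:

```python
def dedup_and_filter(samples, min_tokens=5):
    """Drop exact duplicates and samples shorter than min_tokens whitespace tokens."""
    seen = set()
    kept = []
    for text in samples:
        text = text.strip()
        if text in seen:          # exact-duplicate removal
            continue
        seen.add(text)
        if len(text.split()) >= min_tokens:  # enforce minimum length
            kept.append(text)
    return kept

corpus = [
    "one two three four five",   # 5 tokens, kept
    "one two three four five",   # duplicate, dropped
    "too short",                 # < 5 tokens, dropped
]
print(dedup_and_filter(corpus))  # → ['one two three four five']
```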

## Text Cleaning Pipeline

The following normalization was applied uniformly across all sources:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إأآا → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. مششششي → مششي)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning
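A minimal re-implementation of these normalization rules using standard `re` patterns (the released pipeline may differ in rule ordering and edge cases):

```python
import re

# Arabic diacritics (tashkeel)
TASHKEEL = re.compile(r"[\u0617-\u061A\u064B-\u0652]")

def clean(text):
    text = re.sub(r"https?://\S+|\S+@\S+|@\w+", " ", text)  # URLs, emails, mentions
    text = text.replace("#", " ")                            # hashtag symbol only
    text = re.sub(r"[إأآا]", "ا", text)                      # alef normalization
    text = text.replace("ى", "ي")                            # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)                        # hamza variants
    text = TASHKEEL.sub("", text)                            # strip diacritics
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)               # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)                     # drop English letters
    return re.sub(r"\s+", " ", text).strip()

print(clean("أهلاااا"))  # → اهلاا
```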

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |

Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
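This configuration corresponds roughly to the following Trainer setup. This is a sketch against the Hugging Face `transformers` API, not the actual training script; argument names vary slightly across library versions, and the dataset plumbing is omitted:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForWholeWordMask,
    EarlyStoppingCallback,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERTv2")
model = AutoModelForMaskedLM.from_pretrained("UBC-NLP/MARBERTv2")

# Whole Word Masking at the 20% rate listed above
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.20)

args = TrainingArguments(
    output_dir="masribert",
    num_train_epochs=3,
    learning_rate=2e-5,
    warmup_ratio=0.06,                # 6% linear warmup, then linear decay
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,    # effective batch size 32
    weight_decay=0.01,
    fp16=True,
    evaluation_strategy="steps",
    eval_steps=2500,
    save_steps=2500,
    load_best_model_at_end=True,
)

# Early stopping after 5 evaluations without improvement:
# Trainer(model=model, args=args, data_collator=collator, ...,
#         callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
```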

### Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| 57,500 | 4.523 ← best |
| Final (57,915) | 4.532 |
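Perplexity is just the exponential of the cross-entropy eval loss, which gives a quick sanity check on the reported numbers: exp(4.532) ≈ 92.9 approximately matches the reported final perplexity of 92.98, while the best checkpoint's loss of 4.523 corresponds to ≈ 92.1:

```python
import math

# perplexity = exp(cross-entropy loss)
best_ppl = math.exp(4.523)    # best checkpoint (step 57,500)
final_ppl = math.exp(4.532)   # final epoch-3 weights (step 57,915)
print(round(best_ppl, 2), round(final_ppl, 2))  # → 92.11 92.94
```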

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)

# Egyptian sarcasm example
results = unmasker("تسلم ايدكم بجد، الشحنة وصلت [MASK] خالص كالعادة.")
for r in results:
    print(r["token_str"], round(r["score"], 4))
```

To load the tokenizer and MLM model directly:

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```

For downstream classification tasks (emotion, sentiment, sarcasm), load the encoder with `AutoModel` and attach your own classification head:

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/masribert")
```
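For example, a mask-aware mean-pooling head on top of the encoder outputs might look like this. `MeanPoolClassifier` and its sizes are illustrative, not the released Kalamna design; it is shown with dummy tensors so it runs without downloading weights:

```python
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    """Hypothetical classification head over MasriBERT's last hidden states."""

    def __init__(self, hidden_size=768, num_labels=3):
        super().__init__()
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state, attention_mask):
        # Mask-aware mean pooling over token embeddings
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.classifier(self.dropout(pooled))

# Shape check with dummy encoder outputs (batch=2, seq=64, hidden=768)
head = MeanPoolClassifier()
logits = head(torch.randn(2, 64, 768), torch.ones(2, 64))
print(logits.shape)  # → torch.Size([2, 3])
```

In real use, `last_hidden_state` would come from `encoder(**tokenizer(texts, return_tensors="pt", padding=True)).last_hidden_state`.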

## Intended Downstream Tasks

This model was trained as a backbone for the following tasks in the Kalamna Egyptian Arabic AI pipeline:

- **Emotion Classification** – multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection** – Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** – Positive / Negative / Neutral classification for customer interaction data

## Important Notes

**LayerNorm naming warning:** When loading this model you will see warnings about missing/unexpected keys (`LayerNorm.weight`/`LayerNorm.bias` vs `LayerNorm.gamma`/`LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.

**Best checkpoint:** The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.


## Citation

If you use this model, please cite the original MARBERTv2 paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    year = "2021"
}
```

## License

Apache 2.0, inherited from the base model. See the MARBERTv2 license for details.
