---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT: Egyptian Arabic Language Model
MasriBERT is a domain-adapted BERT model for **Egyptian Arabic (Masri/Ammiya)**, produced by continued MLM pre-training of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on **MASRISET**, a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.
It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on conversational and call-center language.
---
## Model Details
| Attribute | Value |
|---|---|
| **Base Model** | `UBC-NLP/MARBERTv2` |
| **Architecture** | BERT (12 layers, 768 hidden, 12 heads) |
| **Task** | Masked Language Modeling (MLM) |
| **Language** | Egyptian Arabic (`ar-EG`) |
| **Training Corpus** | MASRISET (1.3M+ rows) |
| **Final Eval Loss** | 4.523 (best checkpoint) |
| **Final Perplexity** | 92.98 |
| **Training Epochs** | 3 |
---
## Training Data: MASRISET
MASRISET was assembled and cleaned specifically for this project. It combines the following sources:
**HuggingFace Datasets**
- `hard` – Egyptian Arabic sentiment/review data
- `ar_res_reviews` – Arabic restaurant reviews
- `arbml/TEAD` – Arabic tweet corpus

**Kaggle – [Two Million Rows Egyptian Datasets](https://www.kaggle.com/datasets/mostafanofal/two-million-rows-egyptian-datasets)**
- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments` – Al-Youm Al-Sabaa news comments
- `Egyptian Tweets` – Egyptian Twitter corpus
- `TaghreedT` – Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets` – Telecom Egypt customer interactions
All sources were deduplicated, cleaned, and filtered to a minimum of 5 words per sample.
### Text Cleaning Pipeline
The following normalization was applied uniformly across all sources:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- **Alef normalization**: `إأآا → ا`
- **Alef maqsura**: `ى → ي`
- **Hamza variants**: `ؤ, ئ → ء`
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. `مششششي → مششي`)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning
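A minimal sketch of the normalization steps above (the regex patterns, their ordering, and the function name are assumptions for illustration; the exact MASRISET pipeline is not published here):

```python
import re

def clean_masri(text, min_words=5):
    """Approximate the MASRISET cleaning steps; returns None if the result is too short."""
    # Remove URLs, email addresses, and @mentions; drop hashtag symbols but keep the text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    # Alef normalization and alef maqsura
    text = re.sub(r"[إأآ]", "ا", text)
    text = text.replace("ى", "ي")
    # Hamza variants
    text = re.sub(r"[ؤئ]", "ء", text)
    # Strip tashkeel (Arabic diacritics, U+064B–U+0652)
    text = re.sub(r"[\u064B-\u0652]", "", text)
    # Cap runs of a repeated character at 2
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Remove English characters
    text = re.sub(r"[A-Za-z]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Enforce the minimum word count post-cleaning
    return text if len(text.split()) >= min_words else None
```

Note that emojis pass through untouched, matching the decision to preserve them for MARBERTv2's emoji-aware vocabulary.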
---
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |
Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
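The table above maps onto Hugging Face `TrainingArguments`-style keyword arguments roughly as follows (a sketch, not the exact training script; the MLM probability and whole-word masking belong to the data collator, e.g. `DataCollatorForWholeWordMask`, rather than to these arguments):

```python
# Hyperparameters from the table, expressed as TrainingArguments-style kwargs.
training_kwargs = {
    "learning_rate": 2e-5,               # peak LR
    "lr_scheduler_type": "linear",       # linear decay
    "warmup_ratio": 0.06,                # 6% warmup
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "weight_decay": 0.01,
    "fp16": True,
    "num_train_epochs": 3,
    "eval_steps": 2500,
    "save_steps": 2500,
}

# Effective batch size = per-device batch * accumulation steps
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])

# 6% warmup over the reported 57,915 total steps
total_steps = 57915
warmup_steps = int(training_kwargs["warmup_ratio"] * total_steps)
```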
### Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| **57,500** | **4.523** (best) |
| Final (57,915) | 4.532 |
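Since perplexity is the exponential of the cross-entropy eval loss, the reported numbers can be sanity-checked directly:

```python
import math

# Perplexity = exp(eval cross-entropy loss)
best_ppl = math.exp(4.523)   # best checkpoint (step 57,500), ~92.1
final_ppl = math.exp(4.532)  # final weights, ~92.9 (the reported 92.98 reflects the unrounded loss)

print(round(best_ppl, 2), round(final_ppl, 2))
```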
---
## Usage
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)
# Egyptian sarcasm example: "Bless your hands, really, the shipment arrived [MASK] totally, as usual."
results = unmasker("تسلم ايدكم بجد، الشحنة وصلت [MASK] خالص كالعادة.")
for r in results:
    print(r["token_str"], round(r["score"], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```
For downstream classification tasks (emotion, sentiment, sarcasm), load with `AutoModel` and attach your classification head:
```python
from transformers import AutoModel
encoder = AutoModel.from_pretrained("T0KII/masribert")
```
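For example, a minimal classification head over the encoder's hidden states might look like the sketch below (the head architecture, mean pooling, dropout rate, and label count are illustrative assumptions, not the Kalamna pipeline itself):

```python
import torch
import torch.nn as nn

class MasriClassifier(nn.Module):
    """Hypothetical head: mean-pool encoder states -> dropout -> linear logits."""
    def __init__(self, encoder=None, hidden_size=768, num_labels=3):
        super().__init__()
        # e.g. encoder = AutoModel.from_pretrained("T0KII/masribert")
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, num_labels)

    def pool(self, hidden_states, attention_mask):
        # Mean-pool over non-padding tokens only
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, input_ids=None, attention_mask=None, hidden_states=None):
        # Accept precomputed hidden states, or run the encoder if one is attached
        if hidden_states is None:
            hidden_states = self.encoder(
                input_ids, attention_mask=attention_mask
            ).last_hidden_state
        return self.head(self.dropout(self.pool(hidden_states, attention_mask)))
```

Freezing or partially unfreezing the encoder during fine-tuning is a separate design choice that depends on the downstream dataset size.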
---
## Intended Downstream Tasks
This model was trained as a backbone for the following tasks in the **Kalamna** Egyptian Arabic AI pipeline:
- **Emotion Classification** – Multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection** – Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** – Positive / Negative / Neutral classification for customer interaction data
---
## Important Notes
**LayerNorm naming warning**: When loading this model you may see warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.
**Best checkpoint**: The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.
---
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title     = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author    = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    year      = "2021"
}
```
---
## License
Apache 2.0, inherited from the base model. See the [MARBERTv2 license](https://huggingface.co/UBC-NLP/MARBERTv2) for details.