---
language:
- ar
license: apache-2.0
base_model: UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT: Egyptian Arabic Language Model
MasriBERT is a domain-adapted BERT model for **Egyptian Arabic (Masri/Ammiya)**, produced by continued MLM pre-training of [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2) on **MASRISET**, a curated corpus of 1.3M+ Egyptian Arabic text samples drawn from social media, customer reviews, and news commentary.
It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on conversational and call-center language.
---
## Model Details
| Attribute | Value |
|---|---|
| **Base Model** | `UBC-NLP/MARBERTv2` |
| **Architecture** | BERT (12 layers, 768 hidden, 12 heads) |
| **Task** | Masked Language Modeling (MLM) |
| **Language** | Egyptian Arabic (`ar-EG`) |
| **Training Corpus** | MASRISET (1.3M+ rows) |
| **Final Eval Loss** | 4.523 (best checkpoint) |
| **Final Perplexity** | 92.98 |
| **Training Epochs** | 3 |
---
## Training Data: MASRISET
MASRISET was assembled and cleaned specifically for this project. It combines the following sources:
**HuggingFace Datasets**
- `hard` – Egyptian Arabic sentiment/review data
- `ar_res_reviews` – Arabic restaurant reviews
- `arbml/TEAD` – Arabic tweet corpus

**Kaggle – [Two Million Rows Egyptian Datasets](https://www.kaggle.com/datasets/mostafanofal/two-million-rows-egyptian-datasets)**
- `AOC_youm7_comments` + `RestOf_AOC_youm7_comments` – Al-Youm Al-Sabaa news comments
- `Egyptian Tweets` – Egyptian Twitter corpus
- `TaghreedT` – Egyptian tweet collection
- `TE_Telecom` + `TE_Tweets` – Telecom Egypt customer interactions
All sources were deduplicated, cleaned, and filtered to a minimum of 5 words per sample.
### Text Cleaning Pipeline
The following normalization was applied uniformly across all sources:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- **Alef normalization**: `إأآا → ا`
- **Alef maqsura**: `ى → ي`
- **Hamza variants**: `ؤ, ئ → ء`
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. `مششششي → مششي`)
- Removed English characters
- Preserved emojis (MARBERTv2 was pretrained on tweets and has native emoji embeddings)
- Minimum 5 words per sample enforced post-cleaning
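A minimal sketch of the normalization steps above (the regex patterns, their ordering, and the function name are assumptions for illustration; the exact MASRISET pipeline is not published here):

```python
import re

def clean_masri(text, min_words=5):
    """Approximate the MASRISET cleaning steps; returns None if the result is too short."""
    # Remove URLs, email addresses, and @mentions; drop hashtag symbols but keep the text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    # Alef normalization and alef maqsura
    text = re.sub(r"[إأآ]", "ا", text)
    text = text.replace("ى", "ي")
    # Hamza variants
    text = re.sub(r"[ؤئ]", "ء", text)
    # Strip tashkeel (Arabic diacritics, U+064B–U+0652)
    text = re.sub(r"[\u064B-\u0652]", "", text)
    # Cap runs of a repeated character at 2
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Remove English characters
    text = re.sub(r"[A-Za-z]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    # Enforce the minimum word count post-cleaning
    return text if len(text.split()) >= min_words else None
```

Note that emojis pass through untouched, matching the decision to preserve them for MARBERTv2's emoji-aware vocabulary.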
---
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Whole Word Masking |
| Peak learning rate | 2e-5 |
| LR schedule | Linear decay with warmup (6%) |
| Batch size | 16 per device |
| Gradient accumulation | 2 steps (effective batch = 32) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 2,500 steps |
| Early stopping patience | 5 evaluations |
Training was conducted on Google Colab (NVIDIA A100) across 3 full epochs over 57,915 steps, with resumable checkpointing to Google Drive.
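The table above maps onto Hugging Face `TrainingArguments`-style keyword arguments roughly as follows (a sketch, not the exact training script; the MLM probability and whole-word masking belong to the data collator, e.g. `DataCollatorForWholeWordMask`, rather than to these arguments):

```python
# Hyperparameters from the table, expressed as TrainingArguments-style kwargs.
training_kwargs = {
    "learning_rate": 2e-5,               # peak LR
    "lr_scheduler_type": "linear",       # linear decay
    "warmup_ratio": 0.06,                # 6% warmup
    "per_device_train_batch_size": 16,
    "gradient_accumulation_steps": 2,
    "weight_decay": 0.01,
    "fp16": True,
    "num_train_epochs": 3,
    "eval_steps": 2500,
    "save_steps": 2500,
}

# Effective batch size = per-device batch * accumulation steps
effective_batch = (training_kwargs["per_device_train_batch_size"]
                   * training_kwargs["gradient_accumulation_steps"])

# 6% warmup over the reported 57,915 total steps
total_steps = 57915
warmup_steps = int(training_kwargs["warmup_ratio"] * total_steps)
```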
### Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 30,000 | 4.645 |
| 32,500 | 4.633 |
| 35,000 | 4.614 |
| 40,000 | 4.588 |
| 42,500 | 4.567 |
| 47,500 | 4.540 |
| **57,500** | **4.523** (best) |
| Final (57,915) | 4.532 |
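Since perplexity is the exponential of the cross-entropy eval loss, the reported numbers can be sanity-checked directly:

```python
import math

# Perplexity = exp(eval cross-entropy loss)
best_ppl = math.exp(4.523)   # best checkpoint (step 57,500), ~92.1
final_ppl = math.exp(4.532)  # final weights, ~92.9 (the reported 92.98 reflects the unrounded loss)

print(round(best_ppl, 2), round(final_ppl, 2))
```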
---
## Usage
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="T0KII/masribert", top_k=3)
# Egyptian sarcasm example: "Bless your hands, really, the shipment arrived [MASK] totally, as usual."
results = unmasker("تسلم ايدكم بجد، الشحنة وصلت [MASK] خالص كالعادة.")
for r in results:
    print(r["token_str"], round(r["score"], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")
```
For downstream classification tasks (emotion, sentiment, sarcasm), load with `AutoModel` and attach your classification head:
```python
from transformers import AutoModel
encoder = AutoModel.from_pretrained("T0KII/masribert")
```
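For example, a minimal classification head over the encoder's hidden states might look like the sketch below (the head architecture, mean pooling, dropout rate, and label count are illustrative assumptions, not the Kalamna pipeline itself):

```python
import torch
import torch.nn as nn

class MasriClassifier(nn.Module):
    """Hypothetical head: mean-pool encoder states -> dropout -> linear logits."""
    def __init__(self, encoder=None, hidden_size=768, num_labels=3):
        super().__init__()
        # e.g. encoder = AutoModel.from_pretrained("T0KII/masribert")
        self.encoder = encoder
        self.dropout = nn.Dropout(0.1)
        self.head = nn.Linear(hidden_size, num_labels)

    def pool(self, hidden_states, attention_mask):
        # Mean-pool over non-padding tokens only
        mask = attention_mask.unsqueeze(-1).float()
        return (hidden_states * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

    def forward(self, input_ids=None, attention_mask=None, hidden_states=None):
        # Accept precomputed hidden states, or run the encoder if one is attached
        if hidden_states is None:
            hidden_states = self.encoder(
                input_ids, attention_mask=attention_mask
            ).last_hidden_state
        return self.head(self.dropout(self.pool(hidden_states, attention_mask)))
```

Freezing or partially unfreezing the encoder during fine-tuning is a separate design choice that depends on the downstream dataset size.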
---
## Intended Downstream Tasks
This model was trained as a backbone for the following tasks in the **Kalamna** Egyptian Arabic AI pipeline:
- **Emotion Classification** – Multi-class emotion detection (anger, joy, frustration, sarcasm, etc.) using a stacked ensemble of Bi-LSTM + Bi-GRU + Random Forest on top of MasriBERT embeddings
- **Sarcasm Detection** – Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** – Positive / Negative / Neutral classification for customer interaction data
---
## Important Notes
**LayerNorm naming warning**: When loading this model you may see warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming-compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.
**Best checkpoint**: The best eval loss (4.523) was recorded at step 57,500. The saved model corresponds to the epoch-3 final weights (eval loss 4.532). For maximum performance on downstream tasks, use the model as-is or fine-tune from it directly.
---
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title     = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author    = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
    year      = "2021"
}
```
---
## License
Apache 2.0, inherited from the base model. See the [MARBERTv2 license](https://huggingface.co/UBC-NLP/MARBERTv2) for details.