---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT v2: Egyptian Arabic Language Model
MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing **conversational and dialogue register**, the primary register of customer-facing NLP applications.
It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.
## What Changed from v1
| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |
The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
## Training Corpus
Two sources were used, targeting conversational Egyptian Arabic:
**faisalq/EFC-mini (Egyptian Forums Corpus)**
Forum posts and comments from Egyptian Arabic internet forums: long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.
**MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue)**
Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.
Both sources were deduplicated (MD5 hash) and shuffled with seed 42, and a minimum of 5 words per sample was enforced after cleaning.
After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
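The deduplication and chunking steps can be sketched as follows. This is illustrative only: `md5_dedup` and `chunk_ids` are hypothetical helper names, and the real pipeline operates on the MARBERTv2 tokenizer's ids rather than raw integers.

```python
import hashlib

def md5_dedup(rows):
    """Drop exact duplicates by MD5 hash of the raw text."""
    seen, unique = set(), []
    for text in rows:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def chunk_ids(token_ids, block_size=64):
    """Concatenate token ids and split into fixed-size blocks,
    dropping the trailing remainder shorter than block_size."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```

Dropping the remainder is why the chunk count (1,868,414) is lower than the row count would suggest at 64 tokens per block.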
## Text Cleaning Pipeline
Same normalization as v1, applied uniformly:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إ / أ / آ → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. هههههه → هه)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
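A minimal sketch of these cleaning rules, assuming simple regex-based normalization (the exact patterns used in training are not published, so treat this as illustrative):

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")   # Arabic diacritics block
REPEATS = re.compile(r"(.)\1{2,}")          # runs of 3+ of the same char

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # email addresses
    text = re.sub(r"@\w+", " ", text)                   # @mentions
    text = text.replace("#", " ")                       # hashtag symbol only
    text = re.sub(r"[إأآ]", "ا", text)                  # alef normalization
    text = text.replace("ى", "ي")                       # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)                   # hamza variants
    text = TASHKEEL.sub("", text)                       # strip tashkeel
    text = REPEATS.sub(r"\1\1", text)                   # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)                # drop English letters
    return re.sub(r"\s+", " ", text).strip()            # emojis pass through

def keep(text: str) -> bool:
    """Post-cleaning length filter."""
    return len(text.split()) >= 5
```

Note that nothing here touches emoji codepoints, matching the decision to preserve them for MARBERTv2's native emoji embeddings.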
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole word masking disabled โ€” tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |
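Token-level masking at p=0.20 can be sketched as below. The 80/10/10 split ([MASK] / random token / unchanged) is the standard BERT-style behavior of Hugging Face's `DataCollatorForLanguageModeling` and is an assumption here; the card only states token-level masking at 0.20.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, p=0.20, rng=None):
    """Token-level MLM masking: each token is selected with prob p;
    selected tokens become [MASK] 80% of the time, a random id 10% of
    the time, and stay unchanged 10% of the time. Unselected positions
    get label -100 so the loss ignores them."""
    rng = rng or random.Random()
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < p:
            labels[i] = tok                            # predict original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return inputs, labels
```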
Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via HuggingFace Hub.
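The corrected resume learning rate is consistent with a linear-decay schedule. A quick check, assuming the schedule was planned for roughly 28,900 total steps; that number is inferred from 2e-5 × (1 − 20,000/28,900) ≈ 6.16e-6 and is not stated on the card.

```python
def linear_decay_lr(step, peak_lr=2e-5, total_steps=28_900):
    """Linear decay to zero with no warmup, as used on resume."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)
```

At step 20,000 this gives ~6.16e-6, matching the resume learning rate in the table.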
## Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773** ← best |
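Perplexity is the exponential of the cross-entropy eval loss, so the reported v2 number can be verified directly (a quick sanity check, not taken from the training logs):

```python
import math

def perplexity(eval_loss: float) -> float:
    """MLM perplexity is exp(eval cross-entropy loss)."""
    return math.exp(eval_loss)
```

`perplexity(2.773)` ≈ 16.0, matching the reported v2 value; applying the same formula to the v1 loss of 4.523 lands close to (though not exactly on) the reported 92.98.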
## Usage
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)
results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```
For downstream classification tasks (emotion, sentiment, sarcasm):
```python
from transformers import AutoModel
encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
## Known Warnings
**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded โ€” the warning is cosmetic and can be safely ignored.
## Intended Downstream Tasks
This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:
- **Emotion Classification** โ€” Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection** โ€” Egyptian Arabic sarcasm including culturally-specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** โ€” Positive / Negative / Neutral classification for customer interaction data
## Model Lineage
```
UBC-NLP/MARBERTv2
└── T0KII/masribert (v1: MLM on MASRISET, 57K steps)
    └── T0KII/MASRIBERTv2 (v2: MLM on EFC + SFT, 21.5K steps)
```
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
year = "2021"
}
```