---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT v2: Egyptian Arabic Language Model
MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing **conversational and dialogue register**, the primary register of customer-facing NLP applications.
It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.
## What Changed from v1
| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |
The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
## Training Corpus
Two sources were used, targeting conversational Egyptian Arabic:
**faisalq/EFC-mini (Egyptian Forums Corpus)**
Forum posts and comments from Egyptian Arabic internet forums: long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.
**MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue)**
Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.
Both sources were deduplicated (MD5 hash) and shuffled with seed 42, and a minimum of 5 words per sample was enforced after cleaning.
After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
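The deduplication and chunking steps can be sketched as follows. This is illustrative only: `md5_dedup` and `chunk_ids` are hypothetical helper names, and the real pipeline operates on the MARBERTv2 tokenizer's ids rather than raw integers.

```python
import hashlib

def md5_dedup(rows):
    """Drop exact duplicates by MD5 hash of the raw text."""
    seen, unique = set(), []
    for text in rows:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

def chunk_ids(token_ids, block_size=64):
    """Concatenate token ids and split into fixed-size blocks,
    dropping the trailing remainder shorter than block_size."""
    usable = (len(token_ids) // block_size) * block_size
    return [token_ids[i:i + block_size] for i in range(0, usable, block_size)]
```

Dropping the remainder is why the chunk count (1,868,414) is lower than the row count would suggest at 64 tokens per block.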
## Text Cleaning Pipeline
Same normalization as v1, applied uniformly:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إ / أ / آ → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. هههههه → هه)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
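A minimal sketch of these cleaning rules, assuming simple regex-based normalization (the exact patterns used in training are not published, so treat this as illustrative):

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")   # Arabic diacritics block
REPEATS = re.compile(r"(.)\1{2,}")          # runs of 3+ of the same char

def clean(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # email addresses
    text = re.sub(r"@\w+", " ", text)                   # @mentions
    text = text.replace("#", " ")                       # hashtag symbol only
    text = re.sub(r"[إأآ]", "ا", text)                  # alef normalization
    text = text.replace("ى", "ي")                       # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)                   # hamza variants
    text = TASHKEEL.sub("", text)                       # strip tashkeel
    text = REPEATS.sub(r"\1\1", text)                   # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)                # drop English letters
    return re.sub(r"\s+", " ", text).strip()            # emojis pass through

def keep(text: str) -> bool:
    """Post-cleaning length filter."""
    return len(text.split()) >= 5
```

Note that nothing here touches emoji codepoints, matching the decision to preserve them for MARBERTv2's native emoji embeddings.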
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole word masking disabled โ€” tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |
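Token-level masking at p=0.20 can be sketched as below. The 80/10/10 split ([MASK] / random token / unchanged) is the standard BERT-style behavior of Hugging Face's `DataCollatorForLanguageModeling` and is an assumption here; the card only states token-level masking at 0.20.

```python
import random

def mask_tokens(ids, mask_id, vocab_size, p=0.20, rng=None):
    """Token-level MLM masking: each token is selected with prob p;
    selected tokens become [MASK] 80% of the time, a random id 10% of
    the time, and stay unchanged 10% of the time. Unselected positions
    get label -100 so the loss ignores them."""
    rng = rng or random.Random()
    inputs, labels = list(ids), [-100] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < p:
            labels[i] = tok                            # predict original token
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id                    # 80%: [MASK]
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)  # 10%: random token
            # else 10%: keep the original token
    return inputs, labels
```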
Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via HuggingFace Hub.
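The corrected resume learning rate is consistent with a linear-decay schedule. A quick check, assuming the schedule was planned for roughly 28,900 total steps; that number is inferred from 2e-5 × (1 − 20,000/28,900) ≈ 6.16e-6 and is not stated on the card.

```python
def linear_decay_lr(step, peak_lr=2e-5, total_steps=28_900):
    """Linear decay to zero with no warmup, as used on resume."""
    return peak_lr * max(0.0, 1.0 - step / total_steps)
```

At step 20,000 this gives ~6.16e-6, matching the resume learning rate in the table.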
## Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773** ← best |
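Perplexity is the exponential of the cross-entropy eval loss, so the reported v2 number can be verified directly (a quick sanity check, not taken from the training logs):

```python
import math

def perplexity(eval_loss: float) -> float:
    """MLM perplexity is exp(eval cross-entropy loss)."""
    return math.exp(eval_loss)
```

`perplexity(2.773)` ≈ 16.0, matching the reported v2 value; applying the same formula to the v1 loss of 4.523 lands close to (though not exactly on) the reported 92.98.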
## Usage
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)
results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```
For downstream classification tasks (emotion, sentiment, sarcasm):
```python
from transformers import AutoModel
encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
## Known Warnings
**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded โ€” the warning is cosmetic and can be safely ignored.
## Intended Downstream Tasks
This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:
- **Emotion Classification** โ€” Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection** โ€” Egyptian Arabic sarcasm including culturally-specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** โ€” Positive / Negative / Neutral classification for customer interaction data
## Model Lineage
```
UBC-NLP/MARBERTv2
└── T0KII/masribert (v1: MLM on MASRISET, 57K steps)
    └── T0KII/MASRIBERTv2 (v2: MLM on EFC + SFT, 21.5K steps)
```
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
year = "2021"
}
```