---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---
# MasriBERT v2 – Egyptian Arabic Language Model
MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing **conversational and dialogue register**, the primary register of customer-facing NLP applications.
It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.
## What Changed from v1
| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |
The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
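Perplexity here is simply the exponential of the eval cross-entropy loss. A quick sanity check on the numbers above (exponentiating the rounded v1 loss lands slightly below the reported 92.98, which was presumably computed from the unrounded value):

```python
import math

# Final eval losses from the table above.
v1_loss, v2_loss = 4.523, 2.773

v1_ppl = math.exp(v1_loss)   # ≈ 92.1 (reported: 92.98, from the unrounded loss)
v2_ppl = math.exp(v2_loss)   # ≈ 16.0
print(round(v2_ppl, 2), round(v1_ppl / v2_ppl, 1))
```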
## Training Corpus
Two sources were used, targeting conversational Egyptian Arabic:
**faisalq/EFC-mini – Egyptian Forums Corpus**
Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.
**MBZUAI-Paris/Egyptian-SFT-Mixture – Egyptian Dialogue**
Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian dialect LLM training. Chat formatting was stripped to raw text before training.
Both sources were deduplicated (MD5 hash) and shuffled with seed 42, and a minimum of 5 words per sample was enforced post-cleaning.
After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
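The preparation steps above (MD5 deduplication, seed-42 shuffle, 5-word minimum) can be sketched as follows; `prepare_corpus` is an illustrative helper, not the actual training script:

```python
import hashlib
import random

def prepare_corpus(texts, seed=42, min_words=5):
    """Deduplicate by MD5 of the text, shuffle with a fixed seed, drop short samples."""
    seen, unique = set(), []
    for t in texts:
        h = hashlib.md5(t.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            unique.append(t)
    random.Random(seed).shuffle(unique)
    return [t for t in unique if len(t.split()) >= min_words]
```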
## Text Cleaning Pipeline
Same normalization as v1, applied uniformly:
- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إأآا → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. هههههه → هه)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
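A minimal sketch of these rules with Python regexes; the function name and exact patterns are illustrative, not the original cleaning script (note that emojis pass through untouched):

```python
import re

TASHKEEL = re.compile(r"[\u064B-\u0652]")   # Arabic diacritics
REPEAT = re.compile(r"(.)\1{2,}")           # any character repeated 3+ times

def clean(text):
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # URLs
    text = re.sub(r"\S+@\S+", " ", text)                 # email addresses
    text = re.sub(r"@\w+", " ", text)                    # @mentions
    text = text.replace("#", " ")                        # hashtag symbol only
    text = re.sub(r"[إأآ]", "ا", text)                   # alef normalization
    text = text.replace("ى", "ي")                        # alef maqsura
    text = re.sub(r"[ؤئ]", "ء", text)                    # hamza variants
    text = TASHKEEL.sub("", text)                        # strip diacritics
    text = REPEAT.sub(r"\1\1", text)                     # cap repeats at 2
    text = re.sub(r"[A-Za-z]", "", text)                 # English characters
    return re.sub(r"\s+", " ", text).strip()
```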
## Training Configuration
| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |
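The resume learning rate is consistent with plain linear decay over the full run. A quick check, inferring the total step count from the block and batch numbers above (1,849,729 train blocks / effective batch 128 × 2 epochs ≈ 28,902 steps):

```python
# Inferred from the table: 1,849,729 train blocks, effective batch 128, 2 epochs.
steps_per_epoch = 1_849_729 // 128          # 14,451
total_steps = 2 * steps_per_epoch           # 28,902

peak_lr = 2e-5
resume_step = 20_000
resume_lr = peak_lr * (1 - resume_step / total_steps)   # linear decay, no warmup
print(f"{resume_lr:.2e}")   # 6.16e-06, matching the table
```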
Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via the Hugging Face Hub.
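The table above maps onto a `transformers` Trainer configuration roughly like the following. This is a hedged sketch reconstructed from the listed hyperparameters, not the published training script; `train_blocks` / `eval_blocks` are placeholder dataset names:

```python
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    EarlyStoppingCallback,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("T0KII/masribert")
model = AutoModelForMaskedLM.from_pretrained("T0KII/masribert")

# Token-level masking at 20% (whole-word masking disabled, as noted above).
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.20
)

args = TrainingArguments(
    output_dir="masribert-v2",
    per_device_train_batch_size=64,
    gradient_accumulation_steps=2,       # effective batch = 128
    learning_rate=2e-5,
    lr_scheduler_type="linear",
    weight_decay=0.01,
    num_train_epochs=2,
    fp16=True,
    eval_strategy="steps",               # `evaluation_strategy` on older releases
    eval_steps=500,
    save_steps=500,
    load_best_model_at_end=True,         # required by EarlyStoppingCallback
    metric_for_best_model="eval_loss",
)

# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_blocks, eval_dataset=eval_blocks,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```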
## Eval Loss Curve
| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773 (best)** |
## Usage
```python
from transformers import pipeline
unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)
results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```
For downstream classification tasks (emotion, sentiment, sarcasm):
```python
from transformers import AutoModel
encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
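As one concrete way to follow that suggestion, here is a mean-pooling classifier head over `last_hidden_state`; the class name and defaults are illustrative (768 is BERT-base's hidden size, which MARBERTv2 uses), not part of this repo:

```python
import torch
import torch.nn as nn

class ClassifierHead(nn.Module):
    """Mean-pool the encoder's last_hidden_state, then project to class logits."""
    def __init__(self, hidden_size=768, num_labels=3, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size, num_labels)

    def forward(self, last_hidden_state, attention_mask):
        # Mask out padding positions before averaging over the sequence axis.
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
        return self.out(self.dropout(pooled))

# Shape check with dummy tensors (batch=2, seq=8, hidden=768):
head = ClassifierHead()
hidden = torch.randn(2, 8, 768)
mask = torch.ones(2, 8, dtype=torch.long)
print(head(hidden, mask).shape)   # torch.Size([2, 3])
```

In practice you would feed `encoder(**inputs).last_hidden_state` and the tokenizer's `attention_mask` into this head and train both jointly.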
## Known Warnings
**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are correctly loaded; the warning is cosmetic and can be safely ignored.
## Intended Downstream Tasks
This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:
- **Emotion Classification** – Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection** – Egyptian Arabic sarcasm including culturally-specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis** – Positive / Negative / Neutral classification for customer interaction data
## Model Lineage
```
UBC-NLP/MARBERTv2
└── T0KII/masribert (v1 – MLM on MASRISET, 57K steps)
    └── T0KII/MASRIBERTv2 (v2 – MLM on EFC + SFT, 21.5K steps)
```
## Citation
If you use this model, please cite the original MARBERTv2 paper:
```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
year = "2021"
}
``` |