---
language:
- ar
license: unknown
base_model:
- T0KII/masribert
- UBC-NLP/MARBERTv2
tags:
- arabic
- egyptian-arabic
- masked-language-modeling
- bert
- dialect
- nlp
pipeline_tag: fill-mask
---

# MasriBERT v2: Egyptian Arabic Language Model

MasriBERT v2 is a continued MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus emphasizing the **conversational and dialogue register**, the primary register of customer-facing NLP applications.

It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer interaction language.

## What Changed from v1

| | MasriBERT v1 | MasriBERT v2 |
|---|---|---|
| Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
| Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
| Data register | Social media / news | Conversational / instructional dialogue |
| Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
| Final eval loss | 4.523 | **2.773** |
| Final perplexity | 92.98 | **16.00** |
| Training platform | Google Colab (A100) | Kaggle (T4 / P100) |

The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
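
The perplexity figures follow directly from the eval loss, since perplexity is the exponential of the cross-entropy loss. A quick sanity check on the v2 number:

```python
import math

# Perplexity of a masked LM is exp(cross-entropy eval loss)
v2_eval_loss = 2.773
v2_perplexity = math.exp(v2_eval_loss)
print(round(v2_perplexity, 2))  # ~16.0, matching the reported value
```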

## Training Corpus

Two sources were used, targeting conversational Egyptian Arabic:

**faisalq/EFC-mini (Egyptian Forums Corpus)**
Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, closely mirroring customer behavior.

**MBZUAI-Paris/Egyptian-SFT-Mixture (Egyptian Dialogue)**
Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian-dialect LLM training. Chat formatting was stripped to raw text before training.

Both sources were deduplicated by MD5 hash and shuffled with seed 42, and a minimum of 5 words per sample was enforced after cleaning.

After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
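
The dedup-and-filter step can be sketched roughly as follows (a minimal illustration using only the facts above; the function name and exact rule order are assumptions, not the actual training script):

```python
import hashlib
import random

def dedup_and_filter(texts, min_words=5, seed=42):
    """MD5-deduplicate, enforce a minimum word count, then shuffle with a fixed seed."""
    seen = set()
    kept = []
    for text in texts:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate: skip
        seen.add(digest)
        if len(text.split()) >= min_words:
            kept.append(text)
    random.Random(seed).shuffle(kept)
    return kept
```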

## Text Cleaning Pipeline

Same normalization as v1, applied uniformly:

- Removed URLs, email addresses, @mentions, and hashtag symbols
- Alef normalization: إأآا → ا
- Alef maqsura: ى → ي
- Hamza variants: ؤ, ئ → ء
- Removed all Arabic tashkeel (diacritics)
- Capped repeated characters at 2 (e.g. هههههه → هه)
- Removed English characters
- Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
- Minimum 5 words per sample enforced post-cleaning
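
A minimal regex sketch of these normalization rules (illustrative only; the actual cleaning script may differ in rule order and edge-case handling):

```python
import re

def clean_text(text: str) -> str:
    # Strip URLs, email addresses, @mentions, and hashtag symbols
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"\S+@\S+\.\S+", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = text.replace("#", " ")
    # Alef, alef-maqsura, and hamza normalization
    text = re.sub(r"[إأآ]", "ا", text)
    text = text.replace("ى", "ي")
    text = re.sub(r"[ؤئ]", "ء", text)
    # Remove Arabic tashkeel (diacritics, U+064B..U+0652)
    text = re.sub(r"[\u064B-\u0652]", "", text)
    # Cap repeated characters at 2 (هههههه -> هه)
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # Remove English characters (emojis are left untouched)
    text = re.sub(r"[A-Za-z]", "", text)
    return re.sub(r"\s+", " ", text).strip()
```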

## Training Configuration

| Hyperparameter | Value |
|---|---|
| Block size | 64 tokens |
| MLM probability | 0.20 (20%) |
| Masking strategy | Token-level (whole-word masking disabled due to tokenizer incompatibility) |
| Peak learning rate | 2e-5 |
| Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
| LR schedule | Linear decay, no warmup on resume |
| Batch size | 64 per device |
| Gradient accumulation | 2 steps (effective batch = 128) |
| Weight decay | 0.01 |
| Precision | FP16 |
| Eval / Save interval | Every 500 steps |
| Early stopping patience | 3 evaluations |
| Train blocks | 1,849,729 |
| Eval blocks | 18,685 |

Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Due to Kaggle's 12-hour session limit, training was split across two sessions with checkpoint resumption via HuggingFace Hub.
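
The resume learning rate in the table falls out of the linear-decay schedule. With ~1.85M training blocks, an effective batch of 128, and 2 epochs, the schedule spans roughly 28,900 optimizer steps, so the LR remaining at step 20,000 is about 2e-5 × (1 − 20,000/28,902) ≈ 6.16e-6. This is an approximate reconstruction (the exact total depends on dataloader rounding):

```python
train_blocks = 1_849_729
effective_batch = 64 * 2        # per-device batch x gradient accumulation
epochs = 2
peak_lr = 2e-5

steps_per_epoch = train_blocks // effective_batch  # 14,451
total_steps = steps_per_epoch * epochs             # 28,902

# Linear decay to zero: LR remaining at the resume step
resume_step = 20_000
resume_lr = peak_lr * (1 - resume_step / total_steps)
print(f"{resume_lr:.2e}")  # ~6.16e-06
```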

## Eval Loss Curve

| Step | Eval Loss |
|---|---|
| 500 | 3.830 |
| 1,000 | 3.599 |
| 2,000 | 3.336 |
| 5,000 | 3.066 |
| 8,500 | 2.945 |
| 20,500 | 2.773 |
| 21,000 | 2.783 |
| **21,500** | **2.773** (best) |

## Usage

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)

results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
for r in results:
    print(r['token_str'], round(r['score'], 4))
```

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
```

For downstream classification tasks (emotion, sentiment, sarcasm):

```python
from transformers import AutoModel

encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
# Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
```
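
One way to attach such a head (a PyTorch sketch, not part of this repo; the class name is illustrative, `num_labels=8` matches the emotion task below, and pooling the [CLS] hidden state is one common choice among several):

```python
import types

import torch
import torch.nn as nn

class MasriBertClassifier(nn.Module):
    """Encoder plus a linear head; `encoder` is any BERT-style AutoModel."""

    def __init__(self, encoder, num_labels=8, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # [CLS] token representation
        return self.head(self.dropout(cls))
```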

## Known Warnings

**LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.

## Intended Downstream Tasks

This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:

- **Emotion Classification**: multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
- **Sarcasm Detection**: Egyptian Arabic sarcasm, including culturally specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
- **Sentiment Analysis**: positive / negative / neutral classification for customer interaction data

## Model Lineage

```
UBC-NLP/MARBERTv2
    └── T0KII/masribert  (v1: MLM on MASRISET, 57K steps)
            └── T0KII/MASRIBERTv2  (v2: MLM on EFC + SFT, 21.5K steps)
```

## Citation

If you use this model, please cite the original MARBERTv2 paper:

```bibtex
@inproceedings{abdul-mageed-etal-2021-arbert,
    title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
    author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    year = "2021"
}
```