T0KII committed on
Commit a4eedc4 · verified · 1 Parent(s): 0b182e2

Create README.md

Files changed (1)
  1. README.md +159 -0
README.md ADDED
@@ -0,0 +1,159 @@
+ ---
+ language:
+ - ar
+ license: unknown
+ base_model:
+ - T0KII/masribert
+ - UBC-NLP/MARBERTv2
+ tags:
+ - arabic
+ - egyptian-arabic
+ - masked-language-modeling
+ - bert
+ - dialect
+ - nlp
+ pipeline_tag: fill-mask
+ ---
+
+ # MasriBERT v2: Egyptian Arabic Language Model
+
+ MasriBERT v2 continues the MLM pre-training of [MasriBERT v1](https://huggingface.co/T0KII/masribert) (itself built on [UBC-NLP/MARBERTv2](https://huggingface.co/UBC-NLP/MARBERTv2)) on a new, higher-quality Egyptian Arabic corpus that emphasizes the **conversational and dialogue register**, the primary register of customer-facing NLP applications.
+
+ It is purpose-built as a backbone for downstream Egyptian Arabic NLP tasks, including emotion classification, sarcasm detection, and sentiment analysis, with a specific focus on call-center and customer-interaction language.
+
+ ## What Changed from v1
+
+ | | MasriBERT v1 | MasriBERT v2 |
+ |---|---|---|
+ | Base model | UBC-NLP/MARBERTv2 | T0KII/masribert (v1) |
+ | Training corpus | MASRISET (1.3M rows: tweets, reviews, news comments) | EFC + SFT Mixture (1.95M rows: forums, dialogue) |
+ | Data register | Social media / news | Conversational / instructional dialogue |
+ | Training steps | ~57,915 | ~21,500 (resumed from step 20,000) |
+ | Final eval loss | 4.523 | **2.773** |
+ | Final perplexity | 92.98 | **16.00** |
+ | Training platform | Google Colab (A100) | Kaggle (T4 / P100) |
+
+ The 5.8x perplexity improvement reflects both the richer training signal from conversational data and the cumulative MLM adaptation across all three training stages (MARBERTv2 → v1 → v2).
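The relationship between the loss and perplexity rows can be checked with a few lines of arithmetic. This is a minimal sketch, assuming perplexity is the exponential of the mean cross-entropy loss in nats; the variable and function names are illustrative and not taken from the training code. The v1 perplexity is used as reported rather than recomputed from its rounded loss.

```python
import math

# Final eval losses and perplexities as reported in the comparison table.
reported = {
    "v1": {"eval_loss": 4.523, "perplexity": 92.98},
    "v2": {"eval_loss": 2.773, "perplexity": 16.00},
}

def perplexity_from_loss(loss: float) -> float:
    """Perplexity of a model whose mean cross-entropy (in nats) is `loss`."""
    return math.exp(loss)

# Recompute the v2 perplexity from its eval loss.
v2_ppl = perplexity_from_loss(reported["v2"]["eval_loss"])

# Ratio of the two reported perplexities (the "5.8x" figure).
improvement = reported["v1"]["perplexity"] / reported["v2"]["perplexity"]

print(f"v2 perplexity from loss: {v2_ppl:.2f}")
print(f"v1 / v2 perplexity ratio: {improvement:.1f}x")
```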
+
+ ## Training Corpus
+
+ Two sources were used, both targeting conversational Egyptian Arabic:
+
+ **faisalq/EFC-mini: Egyptian Forums Corpus**
+ Forum posts and comments from Egyptian Arabic internet forums. Long-form conversational text capturing how Egyptians write when explaining problems, complaining, and asking questions, which closely mirrors customer behavior.
+
+ **MBZUAI-Paris/Egyptian-SFT-Mixture: Egyptian Dialogue**
+ Supervised fine-tuning dialogue data in Egyptian Arabic: instruction/response pairs curated specifically for Egyptian dialect LLM training. Chat formatting was stripped to raw text before training.
+
+ Both sources were deduplicated by MD5 hash and shuffled with seed 42, and samples with fewer than 5 words after cleaning were dropped.
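The deduplication, length filter, and shuffle described above could look roughly like the following. This is a sketch under stated assumptions: the hash is taken over the exact cleaned text, the filter runs before deduplication, and `dedup_filter_shuffle` is a hypothetical helper name. The original preprocessing script is not shown in this README, so the order of operations may differ.

```python
import hashlib
import random

MIN_WORDS = 5   # minimum words per sample, from the text above
SEED = 42       # shuffle seed, from the text above

def dedup_filter_shuffle(samples):
    """Drop short samples, drop exact duplicates by MD5 of the text, then shuffle."""
    seen = set()
    kept = []
    for text in samples:
        # Enforce the minimum length on whitespace-separated words.
        if len(text.split()) < MIN_WORDS:
            continue
        # Exact-match deduplication keyed on the MD5 digest of the UTF-8 bytes.
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(text)
    # A seeded generator keeps the shuffle reproducible.
    random.Random(SEED).shuffle(kept)
    return kept

rows = [
    "one two three four five",
    "one two three four five",          # exact duplicate, dropped
    "too short",                        # fewer than 5 words, dropped
    "six seven eight nine ten eleven",
]
cleaned = dedup_filter_shuffle(rows)
print(len(cleaned))
```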
+
+ After deduplication: **1,946,195 rows → 1,868,414 chunks of 64 tokens**
+
+ ## Text Cleaning Pipeline
+
+ Same normalization as v1, applied uniformly:
+
+ - Removed URLs, email addresses, @mentions, and hashtag symbols
+ - Alef normalization: إأآا → ا
+ - Alef maqsura: ى → ي
+ - Hamza variants: ؤ, ئ → ء
+ - Removed all Arabic tashkeel (diacritics)
+ - Capped repeated characters at 2 (e.g. هههههه → هه)
+ - Removed English characters
+ - Preserved emojis (MARBERTv2 has native emoji embeddings from tweet pretraining)
+ - Minimum 5 words per sample enforced post-cleaning
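The normalization steps listed above can be sketched with the standard `re` module. The regular expressions, the order of the steps, and the names `clean_text` and `keep_sample` are assumptions made for illustration, not the author's actual code. In particular, this sketch reads "English characters" as Latin letters and caps repeats only for letters, so digits and emojis are left untouched.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
MENTION_RE = re.compile(r"@\w+")
TASHKEEL_RE = re.compile("[\u064B-\u0652\u0670]")   # Arabic diacritics
LATIN_RE = re.compile(r"[A-Za-z]+")
REPEAT_RE = re.compile(r"([^\W\d_])\1{2,}")          # a letter repeated 3+ times

def clean_text(text):
    """Apply the listed normalization steps to one sample."""
    text = URL_RE.sub(" ", text)
    text = EMAIL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", "")                             # drop the symbol, keep the word
    text = re.sub("[\u0625\u0623\u0622]", "\u0627", text)    # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")                  # alef maqsura -> ya
    text = re.sub("[\u0624\u0626]", "\u0621", text)          # hamza on waw / ya -> hamza
    text = TASHKEEL_RE.sub("", text)
    text = LATIN_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)                      # cap letter runs at 2
    return " ".join(text.split())                            # collapse whitespace

def keep_sample(text, min_words=5):
    """Minimum-length filter applied after cleaning."""
    return len(text.split()) >= min_words

print(clean_text("\u0647" * 6))
```

Emojis pass through unchanged because none of the patterns match them, which is consistent with the list's note about preserving them.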
+
+ ## Training Configuration
+
+ | Hyperparameter | Value |
+ |---|---|
+ | Block size | 64 tokens |
+ | MLM probability | 0.20 (20%) |
+ | Masking strategy | Token-level (whole word masking disabled due to tokenizer incompatibility) |
+ | Peak learning rate | 2e-5 |
+ | Resume learning rate | 6.16e-6 (corrected for linear decay at step 20,000) |
+ | LR schedule | Linear decay, no warmup on resume |
+ | Batch size | 64 per device |
+ | Gradient accumulation | 2 steps (effective batch = 128) |
+ | Weight decay | 0.01 |
+ | Precision | FP16 |
+ | Eval / Save interval | Every 500 steps |
+ | Early stopping patience | 3 evaluations |
+ | Train blocks | 1,849,729 |
+ | Eval blocks | 18,685 |
+
+ Training was conducted on Kaggle (NVIDIA T4 / P100) across 2 epochs. Because of Kaggle's 12-hour session limit, training was split across two sessions, with checkpoints resumed via the Hugging Face Hub.
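The resume learning rate in the table is consistent with a plain linear decay evaluated at step 20,000. The sketch below reproduces that arithmetic from the table values. It assumes a single device, no warmup, and that any leftover partial accumulation step at the end of an epoch is dropped; the constant and function names are illustrative.

```python
import math

TRAIN_BLOCKS = 1_849_729
PER_DEVICE_BATCH = 64
GRAD_ACCUM = 2
EPOCHS = 2
PEAK_LR = 2e-5
RESUME_STEP = 20_000

# Assumption: one device; leftover partial accumulation steps are dropped.
batches_per_epoch = math.ceil(TRAIN_BLOCKS / PER_DEVICE_BATCH)
updates_per_epoch = batches_per_epoch // GRAD_ACCUM
total_steps = updates_per_epoch * EPOCHS

def linear_decay_lr(step, peak_lr=PEAK_LR, total=total_steps):
    """Learning rate under linear decay from peak_lr to 0 over `total` steps."""
    return peak_lr * max(0.0, (total - step) / total)

resume_lr = linear_decay_lr(RESUME_STEP)
print(f"total optimizer steps: {total_steps}")
print(f"learning rate at step {RESUME_STEP}: {resume_lr:.3g}")
```

Under these assumptions the full schedule is about 28,900 optimizer steps, so the run ending near step 21,500 is consistent with early stopping before the second epoch finished.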
+
+ ## Eval Loss Curve
+
+ | Step | Eval Loss |
+ |---|---|
+ | 500 | 3.830 |
+ | 1,000 | 3.599 |
+ | 2,000 | 3.336 |
+ | 5,000 | 3.066 |
+ | 8,500 | 2.945 |
+ | 20,500 | 2.773 |
+ | 21,000 | 2.783 |
+ | **21,500** | **2.773 ← best** |
+
+ ## Usage
+
+ ```python
+ from transformers import pipeline
+
+ # Fill-mask pipeline returning the 3 most likely tokens for [MASK]
+ unmasker = pipeline("fill-mask", model="T0KII/MASRIBERTv2", top_k=3)
+
+ results = unmasker("انا مش راضي عن الخدمة دي [MASK] بجد.")
+ for r in results:
+     print(r['token_str'], round(r['score'], 4))
+ ```
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForMaskedLM
+
+ tokenizer = AutoTokenizer.from_pretrained("T0KII/MASRIBERTv2")
+ model = AutoModelForMaskedLM.from_pretrained("T0KII/MASRIBERTv2")
+ ```
+
+ For downstream classification tasks (emotion, sentiment, sarcasm):
+
+ ```python
+ from transformers import AutoModel
+
+ encoder = AutoModel.from_pretrained("T0KII/MASRIBERTv2")
+ # Attach your classification head on top of encoder.pooler_output or encoder.last_hidden_state
+ ```
+
+ ## Known Warnings
+
+ **LayerNorm naming:** Loading this model produces warnings about missing/unexpected keys (`LayerNorm.weight` / `LayerNorm.bias` vs `LayerNorm.gamma` / `LayerNorm.beta`). This is a known naming compatibility issue between older MARBERTv2 checkpoint conventions and newer Transformers versions. The weights are loaded correctly; the warning is cosmetic and can be safely ignored.
+
+ ## Intended Downstream Tasks
+
+ This model is the backbone for the following tasks in the **Kalamna** Egyptian Arabic AI call-center pipeline:
+
+ - **Emotion Classification**: Multi-class emotion detection (anger, joy, sadness, fear, surprise, love, sympathy, neutral)
+ - **Sarcasm Detection**: Egyptian Arabic sarcasm including culturally-specific patterns (religious phrase inversion, hyperbolic complaint, dialectal irony)
+ - **Sentiment Analysis**: Positive / Negative / Neutral classification for customer interaction data
+
+ ## Model Lineage
+
+ ```
+ UBC-NLP/MARBERTv2
+ └── T0KII/masribert (v1: MLM on MASRISET, 57K steps)
+     └── T0KII/MASRIBERTv2 (v2: MLM on EFC + SFT, 21.5K steps)
+ ```
+
+ ## Citation
+
+ If you use this model, please cite the original MARBERTv2 paper:
+
+ ```bibtex
+ @inproceedings{abdul-mageed-etal-2021-arbert,
+     title = "{ARBERT} {\&} {MARBERT}: Deep Bidirectional Transformers for {A}rabic",
+     author = "Abdul-Mageed, Muhammad and Elmadany, AbdelRahim and Nagoudi, El Moatez Billah",
+     booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics",
+     year = "2021"
+ }
+ ```