AdhamAshraf committed
Commit 586842a · verified · 1 Parent(s): 9046c7a

Update README.md

Files changed (1)
  1. README.md +315 -0
@@ -1,3 +1,318 @@
---
language:
- ar
license: mit
base_model: aubmindlab/aragpt2-medium
tags:
- arabic
- egyptian
- dialect
- slang
- translation
- gpt-2
- aragpt
- seq2seq
- causal-lm
datasets:
- AdhamAshraf/egyptian-2-arabic
- AdhamAshraf/slanggpt-feedback-dataset
metrics:
- chrF
- BLEU
- perplexity
pipeline_tag: text-generation
library_name: transformers
---

# SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)

**SlangGPT** is a fine-tuned **AraGPT-2-medium** model that translates **Egyptian Arabic slang/dialect** into **Modern Standard Arabic (MSA)**.

It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.

---

# 📄 Project Resources

- **Paper:** https://github.com/adhamashraf7788/SlangGPT/blob/main/report/main.pdf

- **Main Dataset:** https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic

- **Feedback Dataset:** https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

- **GitHub Repository:** https://github.com/adhamashraf7788/SlangGPT

- **Interactive Demo (Hugging Face Space):** https://huggingface.co/spaces/AdhamAshraf/SlangGPT

---

# 🧠 Model Description

SlangGPT is a **decoder-only causal language model** built on top of the base model `aubmindlab/aragpt2-medium`.

The model was fine-tuned on parallel Egyptian Arabic ↔ MSA text using conditional autoregressive training: the dialect sentence is given as the condition, and the MSA sentence is generated as its continuation.

## Prompt Format

```text
dialect: {input} ↔ msa:
```

Given this prompt, the model generates the Modern Standard Arabic translation autoregressively, as the text that follows `msa:`.
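
The template above can be wrapped in a pair of small helpers so that prompts are always built and parsed the same way. This is a minimal sketch: the names `build_prompt` and `extract_translation` are hypothetical and are not part of the released code.

```python
def build_prompt(dialect_text: str) -> str:
    """Wrap Egyptian Arabic text in the SlangGPT prompt template.

    Illustrative helper; the name is an assumption, not the project's API.
    """
    # Strip surrounding whitespace so the template spacing stays consistent.
    return f"dialect: {dialect_text.strip()} ↔ msa:"


def extract_translation(generated_text: str) -> str:
    """Return the text after the last 'msa:' marker, or the input if it is absent."""
    if "msa:" in generated_text:
        return generated_text.split("msa:")[-1].strip()
    return generated_text


prompt = build_prompt("  عايز اكل ")
print(prompt)
print(extract_translation(prompt + " أريد الطعام"))
```

Keeping the template in one place avoids subtle mismatches (for example, a missing space before `msa:`) between the prompt seen at fine-tuning time and the one used at inference.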

---

# ✨ Key Features

- **Input:** Egyptian Arabic slang/dialect
- **Output:** Modern Standard Arabic (MSA)
- **Architecture:** GPT-2 style decoder-only transformer
- **Tokenizer:** BPE tokenizer with 64k vocabulary
- **Context length:** 1024 tokens
- **Language:** Arabic

---

# ⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Batch size | 8 (effective 32) |
| Learning rate | 5e-5 |
| Scheduler | Cosine |
| Warmup | 10% |
| Gradient clipping | 1.0 |
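
To make the schedule concrete, the sketch below computes the learning rate at a given step for a cosine schedule with 10% warmup, and shows one way a batch of 8 can reach an effective batch of 32. Several details are assumptions not stated in the table: a linear warmup shape, decay to zero, 4 gradient-accumulation steps on a single device, and the made-up value `total_steps = 1000`.

```python
import math

PEAK_LR = 5e-5        # from the table
WARMUP_RATIO = 0.10   # from the table
PER_DEVICE_BATCH = 8  # from the table
GRAD_ACCUM_STEPS = 4  # assumption: 8 * 4 = 32 effective samples per update


def lr_at_step(step: int, total_steps: int) -> float:
    """Learning rate with linear warmup followed by cosine decay to zero."""
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
total_steps = 1000  # hypothetical training length
print("effective batch:", effective_batch)
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The rate peaks at 5e-5 when warmup ends, then falls along a half cosine, passing through half the peak midway through the decay phase.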

---

# 🎛️ Inference Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-k | 50 |
| Top-p | 0.92 |
| Repetition penalty | 1.3 |
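
The sketch below illustrates, on a toy distribution, how temperature, top-k, and top-p (nucleus) filtering narrow the candidate set before a token is sampled. It is a simplified pure-Python illustration and not the `transformers` implementation. The repetition penalty is omitted, and the toy logits are invented.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.92):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # A temperature below 1 sharpens the distribution before the softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for numerical stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalise the surviving probabilities so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 1.0, 0.5, -2.0]  # invented values
print(filter_distribution(toy_logits))
```

With these toy logits only the two most likely tokens survive the 0.92 nucleus cut, which shows how the settings in the table trade diversity for more conservative output.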

---

# 📊 Quantitative Performance

| Metric | Base AraGPT-2 | SlangGPT |
|---|---|---|
| chrF | 10.62 | **29.08** |
| BLEU | 0.02 | **6.63** |
| chrF improvement | n/a | **+18.46 (+173.8%)** |

### Metric Notes

- **chrF** is an F-score over the character n-grams shared by the output and the reference.
- **BLEU** measures word n-gram precision against the reference.
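
As a rough illustration of what chrF rewards, the sketch below computes a simplified character n-gram F-score. It assumes n-gram orders 1 to 6 and β = 2, and it skips the whitespace and edge-case handling of real implementations such as sacreBLEU, so its scores will not reproduce the table. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale: averaged n-gram precision and recall, combined as F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision.
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("أريد الطعام", "أريد طعاما"))
```

Because matching happens at the character level, a hypothesis that shares word stems with the reference still earns partial credit. This is one reason chrF is often preferred over BLEU for morphologically rich languages such as Arabic.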

---

# 🚀 Usage

## 1. Install Dependencies

```bash
pip install transformers torch accelerate
```

`accelerate` is needed because the model is loaded with `device_map="auto"`.

---

## 2. Load Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 style tokenizers may ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Decoder-only models are left-padded so generation continues from the prompt.
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision, intended for GPU inference
    device_map="auto",          # requires the `accelerate` package
)

model.eval()
```

---

## 3. Translation Function

```python
def translate(egyptian_text):
    """Translate Egyptian Arabic text to MSA using the SlangGPT prompt format."""
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

    # Truncation cuts from the right by default, so very long inputs
    # can lose the trailing "msa:" cue.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64,
    )

    # Move the input tensors to the same device as the model.
    inputs = {
        k: v.to(model.device)
        for k, v in inputs.items()
    }

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    full = tokenizer.decode(
        outputs[0],
        skip_special_tokens=True,
    )

    # The decoded text includes the prompt; keep only what follows the last "msa:".
    if "msa:" in full:
        return full.split("msa:")[-1].strip()

    return full
```
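
One caveat with the function above: the whole prompt is truncated to `max_length=64` tokens, and truncation cuts from the right by default, so a long input can lose the trailing `msa:` cue. A possible workaround, sketched here under assumptions, is to shorten only the source text before building the prompt. The helper names and the 48-token budget are hypothetical. `encode` and `decode` stand for any tokenizer's encode/decode callables, for example `lambda s: tokenizer.encode(s, add_special_tokens=False)` and `tokenizer.decode`.

```python
from typing import Callable, List


def truncate_source(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    max_source_tokens: int = 48,  # hypothetical budget, leaving room for the template
) -> str:
    """Shorten `text` to at most `max_source_tokens` tokens; short inputs are returned unchanged."""
    token_ids = encode(text)
    if len(token_ids) <= max_source_tokens:
        return text
    # Keep the leading tokens and convert them back to text.
    return decode(token_ids[:max_source_tokens])


def build_safe_prompt(text, encode, decode, max_source_tokens=48):
    """Build the prompt from a pre-truncated source so the 'msa:' cue always survives."""
    source = truncate_source(text.strip(), encode, decode, max_source_tokens)
    return f"dialect: {source} ↔ msa:"


# Toy whitespace "tokenizer", used only to demonstrate the helpers.
_words: List[str] = []


def toy_encode(s: str) -> List[int]:
    ids = []
    for word in s.split():
        if word not in _words:
            _words.append(word)
        ids.append(_words.index(word))
    return ids


def toy_decode(ids: List[int]) -> str:
    return " ".join(_words[i] for i in ids)


print(build_safe_prompt("a b c d e", toy_encode, toy_decode, max_source_tokens=3))
# prints: dialect: a b c ↔ msa:
```

With a real subword tokenizer the decode step may not reproduce the original spacing exactly, so this should be treated as a starting point rather than a drop-in fix.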

---

## 4. Example Usage

Because `translate` samples with `do_sample=True`, outputs can vary between runs. The comments show example outputs.

```python
print(translate("يلا فين؟"))
# هيا، أين أنت؟
# (English: "Come on, where are you?")

print(translate("إنت رايح فين؟"))
# أين أنت ذاهب؟
# (English: "Where are you going?")

print(translate("عايز اكل"))
# أريد الطعام
# (English: "I want food.")
```

---

# 🌐 Interactive Web App

Try the live demo here:

https://huggingface.co/spaces/AdhamAshraf/SlangGPT

The Space allows users to:

- Translate Egyptian Arabic to MSA
- Submit feedback
- Rate translation quality
- Help improve future versions of SlangGPT

---

# 📊 Training Dataset

SlangGPT was fine-tuned using:

## AdhamAshraf/egyptian-2-arabic

Dataset statistics:

| Property | Value |
|---|---|
| Total samples | 18,250 |
| Format | Parallel Egyptian ↔ MSA |
| Train split | 80% |
| Validation split | 10% |
| Test split | 10% |

### Preprocessing Steps

- Diacritic removal
- Punctuation normalization
- English text filtering

The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.
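
The preprocessing steps listed above could look roughly like the following. This is a sketch under assumptions: the exact rules used to build the dataset are not given here. The diacritic range, the punctuation mapping, and the choice to drop any sample containing Latin letters are therefore illustrative guesses, and the function name is hypothetical.

```python
import re
from typing import Optional

# Arabic short-vowel and related marks (tashkeel), plus the superscript alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
LATIN = re.compile(r"[A-Za-z]")
# Assumed mapping: Latin punctuation to its Arabic counterpart.
PUNCT_MAP = str.maketrans({"?": "\u061F", ",": "\u060C", ";": "\u061B"})


def preprocess(text: str) -> Optional[str]:
    """Return the cleaned text, or None if the sample should be filtered out."""
    if LATIN.search(text):
        return None  # assumed rule: drop samples containing English letters
    text = DIACRITICS.sub("", text)   # diacritic removal
    text = text.translate(PUNCT_MAP)  # punctuation normalization
    # Collapse repeated punctuation marks and runs of whitespace.
    text = re.sub(r"([\u061F\u060C\u061B!.])\1+", r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


print(preprocess("مَرْحَبًا"))
print(preprocess("hello"))
```

Returning `None` for filtered samples lets the caller drop them with a simple comprehension, for example `[c for c in map(preprocess, lines) if c is not None]`.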

---

# 🧪 Evaluation & Feedback

The model was evaluated using:

- chrF
- BLEU

User feedback collected through the Gradio Space is publicly stored in:

https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

This feedback dataset supports:

- RLHF research
- Translation verification
- Reward model training
- Error analysis

---

# 📜 License

This project is released under the MIT License.

Free for academic and commercial use with attribution.

---

# 🙏 Acknowledgements

- AraGPT-2 by Antoun et al. (2021)
- Stanford CS224N framework and educational materials
- The Arabic NLP open-source community

---

# 📚 Citation

```bibtex
@software{slanggpt2026,
  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  title  = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
  year   = {2026},
  url    = {https://github.com/adhamashraf7788/SlangGPT}
}

@dataset{egyptian_2_arabic,
  author    = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
  title     = {Egyptian Arabic Slang to Formal Arabic Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}
```

---

# ❓ Questions & Issues

For bugs, issues, or feature requests:

https://github.com/adhamashraf7788/SlangGPT/issues