---
language:
- ar
license: mit
base_model: aubmindlab/aragpt2-medium
tags:
- arabic
- egyptian
- dialect
- slang
- translation
- gpt-2
- aragpt
- seq2seq
- causal-lm
datasets:
- AdhamAshraf/egyptian-2-arabic
- AdhamAshraf/slanggpt-feedback-dataset
metrics:
- chrF
- BLEU
- perplexity
pipeline_tag: text-generation
library_name: transformers
---
# SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)
**SlangGPT** is a fine-tuned **AraGPT-2-medium** model that translates **Egyptian Arabic slang/dialect** into **Modern Standard Arabic (MSA)**.
It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.
---
# 📄 Project Resources
- **Paper:**
https://github.com/adhamashraf7788/SlangGPT/blob/main/report/SlangGPT_report.pdf
- **Main Dataset:**
https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic
- **Feedback Dataset:**
https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset
- **GitHub Repository:**
https://github.com/adhamashraf7788/SlangGPT
- **Interactive Demo (Hugging Face Space):**
https://huggingface.co/spaces/AdhamAshraf/SlangGPT
---
# 🧠 Model Description
SlangGPT is a **decoder-only causal language model** built on top of:
- **Base model:** `aubmindlab/aragpt2-medium`
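The model was fine-tuned on Egyptian Arabic ↔ MSA parallel text with conditional autoregressive training: each dialect sentence is placed in a fixed prompt, and the model learns to continue that prompt with the MSA translation.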
## Prompt Format
```text
dialect: {input} ↔ msa:
```
The model generates the Modern Standard Arabic translation autoregressively.
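
The prompt format above can be wrapped in two small helpers, one to build the prompt and one to separate the translation from the decoded text. This is a minimal sketch: the names `build_prompt` and `extract_translation` are hypothetical and are not taken from the SlangGPT repository.

```python
PROMPT_TEMPLATE = "dialect: {input} ↔ msa:"


def build_prompt(egyptian_text: str) -> str:
    """Wrap dialect text in the prompt format shown above."""
    return PROMPT_TEMPLATE.format(input=egyptian_text.strip())


def extract_translation(decoded: str) -> str:
    """Return the text after the last 'msa:' marker, or the whole string if absent."""
    marker = "msa:"
    if marker in decoded:
        return decoded.split(marker)[-1].strip()
    return decoded.strip()


print(build_prompt("عايز اكل"))
```

Keeping the template in one constant makes it harder for training-time and inference-time prompts to drift apart.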
---
# ✨ Key Features
- **Input:** Egyptian Arabic slang/dialect
- **Output:** Modern Standard Arabic (MSA)
- **Architecture:** GPT-2 style decoder-only transformer
- **Tokenizer:** BPE tokenizer with 64k vocabulary
- **Context length:** 1024 tokens
- **Language:** Arabic
---
# ⚙️ Training Configuration
| Parameter | Value |
|---|---|
| Batch size | 8 (effective 32) |
| Learning rate | 5e-5 |
| Scheduler | Cosine |
| Warmup | 10% |
| Gradient clipping | 1.0 |
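
To make the table concrete, the sketch below derives the gradient-accumulation factor implied by "8 (effective 32)" and evaluates a cosine schedule with 10% warmup. It rests on assumptions the table does not state: a single device, linear warmup, and decay to zero. The step count in the example is hypothetical.

```python
import math

# Values taken from the table above.
PER_DEVICE_BATCH = 8
EFFECTIVE_BATCH = 32
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.10

# Implied accumulation factor, assuming training on a single device.
GRAD_ACCUM_STEPS = EFFECTIVE_BATCH // PER_DEVICE_BATCH


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, total):.2e}")
```

The rate climbs to 5e-5 over the first tenth of training and then falls along a half cosine, which is the usual reading of "Cosine" with "Warmup 10%".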
---
# 🎛️ Inference Configuration
| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-k | 50 |
| Top-p | 0.92 |
| Repetition penalty | 1.3 |
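
As a rough illustration of what these four settings do to a next-token distribution, here is a pure-Python sketch over toy logits. It follows the usual definitions of each technique. The order of operations and edge-case handling inside `transformers` may differ, and the token names are made up.

```python
import math


def filter_distribution(logits, generated, temperature=0.7, top_k=50,
                        top_p=0.92, repetition_penalty=1.3):
    """Return {token: probability} after applying the four settings to toy logits."""
    scores = {}
    for tok, score in logits.items():
        # 1. Repetition penalty: push already-generated tokens down.
        if tok in generated:
            score = score / repetition_penalty if score > 0 else score * repetition_penalty
        # 2. Temperature: values below 1 sharpen the distribution.
        scores[tok] = score / temperature

    # 3. Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the survivors (subtract the max for numerical stability).
    peak = ranked[0][1]
    exps = [(tok, math.exp(s - peak)) for tok, s in ranked]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]

    # 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalise so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.5, "d": -1.0}
print(filter_distribution(toy_logits, generated={"a"}))
```

The next token is then sampled from the returned distribution, which is why repeated calls with the same input can give different translations.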
---
# 📊 Quantitative Performance
| Metric | Base AraGPT-2 | SlangGPT |
|---|---|---|
| chrF | 10.62 | **29.08** |
| BLEU | 0.02 | **6.63** |
| chrF Improvement | n/a | **+18.46 (+174%)** |
### Metric Notes
- **chrF** measures character n-gram overlap.
- **BLEU** measures word n-gram precision.
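
To show what "character n-gram overlap" means in practice, the sketch below computes a simplified chrF-style score. The source does not say which implementation produced the numbers above, so this is not that implementation, and its scores will not match a standard tool such as sacreBLEU exactly. The defaults `max_n=6` and `beta=2` follow the common chrF convention.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("أريد الطعام", "أريد طعاما"))
```

Because it works on characters rather than whole words, a score of this kind gives partial credit for shared stems and affixes, which matters for a morphologically rich language like Arabic.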
---
# 🚀 Usage
## 1. Install Dependencies
```bash
pip install transformers torch accelerate
```
---
## 2. Load Model and Tokenizer
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

tokenizer = AutoTokenizer.from_pretrained(model_name)
# GPT-2 style tokenizers may ship without a pad token; reuse EOS if so.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
# Left padding keeps each prompt adjacent to its generated tokens.
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the `accelerate` package
)
model.eval()
```
---
## 3. Translation Function
```python
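def translate(egyptian_text):
    """Translate Egyptian Arabic text to MSA with the loaded model."""
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"
    # Prompts longer than 64 tokens are truncated from the right by default,
    # which would cut off the "msa:" marker; keep inputs short.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64,
    )
    # Move the input tensors to the device the model was placed on.
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    full = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The decoded text includes the prompt; keep only what follows "msa:".
    if "msa:" in full:
        return full.split("msa:")[-1].strip()
    return full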
```
---
## 4. Example Usage
```python
# Sampling is enabled, so outputs can differ between runs; these are illustrative.
print(translate("يلا فين؟"))
# هيا، أين أنت؟
print(translate("إنت رايح فين؟"))
# أين أنت ذاهب؟
print(translate("عايز اكل"))
# أريد الطعام
```
---
# 🌐 Interactive Web App
Try the live demo here:
https://huggingface.co/spaces/AdhamAshraf/SlangGPT
The Space allows users to:
- Translate Egyptian Arabic to MSA
- Submit feedback
- Rate translation quality
- Help improve future versions of SlangGPT
---
# 📊 Training Dataset
SlangGPT was fine-tuned using:
## AdhamAshraf/egyptian-2-arabic
Dataset statistics:
| Property | Value |
|---|---|
| Total samples | 18,250 |
| Format | Parallel Egyptian ↔ MSA |
| Train split | 80% |
| Validation split | 10% |
| Test split | 10% |
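
The split percentages above translate into concrete sample counts as sketched below. The source does not describe how the split was made, so the shuffling and the seed value here are assumptions, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples: int, train: float = 0.8, val: float = 0.1, seed: int = 42):
    """Shuffle sample indices and cut them into train/validation/test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # the seed is a hypothetical choice
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


train_idx, val_idx, test_idx = split_indices(18_250)
print(len(train_idx), len(val_idx), len(test_idx))  # 14600 1825 1825
```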
### Preprocessing Steps
- Diacritic removal
- Punctuation normalization
- English text filtering
The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.
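
The three preprocessing steps listed above could look roughly like the following. This is a sketch under assumptions: the exact diacritic set, the punctuation rule (collapsing repeated marks), and the Latin-letter filter are illustrative choices, since the source does not spell out its rules.

```python
import re

# Arabic diacritics (tashkeel) plus the superscript alef; the exact set is an assumption.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Hypothetical normalization rule: collapse runs of the same punctuation mark.
REPEATED_PUNCT = re.compile(r"([!?.,،؛؟])\1+")
LATIN = re.compile(r"[A-Za-z]")


def clean(text: str) -> str:
    """Strip diacritics, collapse repeated punctuation, and tidy whitespace."""
    text = DIACRITICS.sub("", text)
    text = REPEATED_PUNCT.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


def keep_pair(dialect: str, msa: str) -> bool:
    """Drop pairs where either side still contains Latin letters."""
    return not (LATIN.search(dialect) or LATIN.search(msa))


pairs = [("إنت رايح فين؟؟؟", "أين أنت ذاهب؟"), ("ok يلا", "هيا")]
cleaned = [(clean(d), clean(m)) for d, m in pairs if keep_pair(d, m)]
print(cleaned)
```

Applying the same `clean` function to user input at inference time keeps it consistent with the text the model saw during fine-tuning.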
---
# 🧪 Evaluation & Feedback
The model was evaluated using:
- chrF
- BLEU
User feedback collected through the Gradio Space is publicly stored in:
https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset
This feedback dataset supports:
- RLHF research
- Translation verification
- Reward model training
- Error analysis
---
# 📜 License
This project is released under the MIT License.
Free for academic and commercial use with attribution.
---
# 🙏 Acknowledgements
- AraGPT-2 by Antoun et al. (2021)
- Stanford CS224N framework and educational materials
- The Arabic NLP open-source community
---
# 📚 Citation
```bibtex
@software{slanggpt2026,
author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
title = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
year = {2026},
url = {https://github.com/adhamashraf7788/SlangGPT}
}
@dataset{egyptian_2_arabic,
author = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
title = {Egyptian Arabic Slang to Formal Arabic Dataset},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}
```
---
# ❓ Questions & Issues
For bugs, issues, or feature requests:
https://github.com/adhamashraf7788/SlangGPT/issues