---
language:
- ar
license: mit
base_model: aubmindlab/aragpt2-medium
tags:
- arabic
- egyptian
- dialect
- slang
- translation
- gpt-2
- aragpt
- seq2seq
- causal-lm
datasets:
- AdhamAshraf/egyptian-2-arabic
- AdhamAshraf/slanggpt-feedback-dataset
metrics:
- chrF
- BLEU
- perplexity
pipeline_tag: text-generation
library_name: transformers
---

# SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)

**SlangGPT** is a fine-tuned **AraGPT-2-medium** model that translates **Egyptian Arabic slang/dialect** into **Modern Standard Arabic (MSA)**.

It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.

---

# 📄 Project Resources

- **Paper:**  
  https://github.com/adhamashraf7788/SlangGPT/blob/main/report/SlangGPT_report.pdf

- **Main Dataset:**  
  https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic

- **Feedback Dataset:**  
  https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

- **GitHub Repository:**  
  https://github.com/adhamashraf7788/SlangGPT

- **Interactive Demo (Hugging Face Space):**  
  https://huggingface.co/spaces/AdhamAshraf/SlangGPT

---

# 🧠 Model Description

SlangGPT is a **decoder-only causal language model** built on top of:

- **Base model:** `aubmindlab/aragpt2-medium`

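The model was fine-tuned on parallel Egyptian Arabic ↔ MSA sentence pairs with a conditional autoregressive objective: the dialect sentence is placed in the prompt format below, and the model learns to continue it with the MSA translation.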

## Prompt Format

```text
dialect: {input} ↔ msa:
```

The model generates the Modern Standard Arabic translation autoregressively.
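
As a minimal sketch, the prompt format above can be wrapped in a small helper. The names `build_prompt` and `build_training_example` are hypothetical, and the training layout (target text followed by an end-of-sequence token, here assumed to be `<|endoftext|>`) is an assumption about how such pairs are commonly serialized, not a confirmed detail of the SlangGPT training code.

```python
PROMPT_TEMPLATE = "dialect: {input} ↔ msa:"


def build_prompt(dialect_text: str) -> str:
    """Wrap an Egyptian Arabic sentence in the documented prompt format."""
    return PROMPT_TEMPLATE.format(input=dialect_text.strip())


def build_training_example(
    dialect_text: str,
    msa_text: str,
    eos_token: str = "<|endoftext|>",  # ASSUMPTION: GPT-2 style EOS token
) -> str:
    """Serialize one parallel pair as prompt + target + EOS.

    ASSUMPTION: this layout is illustrative; the actual training
    serialization may differ.
    """
    return f"{build_prompt(dialect_text)} {msa_text.strip()}{eos_token}"
```

At inference time only `build_prompt` is needed; the model is expected to produce the part that follows `msa:`.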

---

# ✨ Key Features

- **Input:** Egyptian Arabic slang/dialect
- **Output:** Modern Standard Arabic (MSA)
- **Architecture:** GPT-2 style decoder-only transformer
- **Tokenizer:** BPE tokenizer with 64k vocabulary
- **Context length:** 1024 tokens
- **Language:** Arabic

---

# ⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Batch size | 8 (effective 32) |
| Learning rate | 5e-5 |
| Scheduler | Cosine |
| Warmup | 10% |
| Gradient clipping | 1.0 |
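
The table implies a few derived quantities, sketched below under stated assumptions: that the effective batch size of 32 comes from gradient accumulation on a single device (it could equally come from multiple devices), that warmup is linear, and that the cosine schedule decays to zero. The function name `lr_at` is hypothetical.

```python
import math

PER_STEP_BATCH = 8
EFFECTIVE_BATCH = 32
PEAK_LR = 5e-5
WARMUP_FRACTION = 0.10

# ASSUMPTION: single device, so the gap is closed by gradient accumulation.
GRAD_ACCUM_STEPS = EFFECTIVE_BATCH // PER_STEP_BATCH  # 4


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay."""
    warmup_steps = int(WARMUP_FRACTION * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Half-cosine decay from the peak down to 0.
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000 total steps, for example, the rate would peak at step 100 and fall to half the peak at step 550.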

---

# 🎛️ Inference Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-k | 50 |
| Top-p | 0.92 |
| Repetition penalty | 1.3 |
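
To illustrate how these settings shape the next-token distribution, here is a toy sketch in plain Python. It applies temperature, then top-k, then top-p to a list of logits; the repetition penalty is left out for brevity. The function name `filter_distribution` is hypothetical, and this is a simplified illustration rather than the library's implementation.

```python
import math


def filter_distribution(logits, temperature=0.7, top_k=50, top_p=0.92):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature below 1 sharpens the distribution; above 1 flattens it.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    order = order[:top_k]

    # Softmax over the surviving tokens (max subtracted for stability).
    peak = scaled[order[0]]
    exps = [math.exp(scaled[i] - peak) for i in order]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}
```

A token is then sampled from the returned distribution, which is why repeated calls with the same input can yield different translations.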

---

# 📊 Quantitative Performance

| Metric | Base AraGPT-2 | SlangGPT |
|---|---|---|
| chrF | 10.62 | **29.08** |
| BLEU | 0.02 | **6.63** |
| chrF Improvement | n/a | **+18.46 (+173.8%)** |
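
The improvement row can be rechecked from the two chrF scores in the table. The variable names below are illustrative only.

```python
base_chrf = 10.62    # Base AraGPT-2
tuned_chrf = 29.08   # SlangGPT

# Absolute gain in chrF points.
absolute_gain = tuned_chrf - base_chrf

# Relative gain as a percentage of the base score.
relative_gain_pct = absolute_gain / base_chrf * 100

print(f"+{absolute_gain:.2f} chrF points, +{relative_gain_pct:.1f}% relative")
```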

### Metric Notes

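- **chrF** is an F-score over character n-grams, so it gives partial credit when a word is close to the reference but not identical.
- **BLEU** measures word n-gram precision, combined with a brevity penalty for short outputs.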
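
For intuition, a character n-gram F-score can be sketched in a few lines. This is a simplified, single-order version: it uses one n-gram size, drops spaces, and weights recall with `beta=2`, whereas full chrF averages over several n-gram orders. It is not the scorer used to produce the numbers above, and the function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces (a simplification)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def char_ngram_fscore(hypothesis: str, reference: str, n: int = 3, beta: float = 2.0) -> float:
    """F-beta score over character n-grams of a single order."""
    hyp = char_ngrams(hypothesis, n)
    ref = char_ngrams(reference, n)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((hyp & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Because the score is built from character overlap, a translation that differs from the reference only in a suffix or a clitic still earns credit, which suits morphologically rich languages such as Arabic.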

---

# 🚀 Usage

## 1. Install Dependencies

```bash
pip install transformers torch accelerate
```

---

## 2. Load Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

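tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 style tokenizers may ship without a pad token; fall back to EOS.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Left padding keeps each prompt adjacent to its generated tokens
# when several prompts are batched for a decoder-only model.
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # half precision, intended for GPU inference
    device_map="auto"           # requires the `accelerate` package
)

# Switch to inference mode (disables dropout).
model.eval()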
```

---

## 3. Translation Function

```python
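def translate(egyptian_text):
    # Wrap the input in the prompt format used during fine-tuning.
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

    # Truncation cuts from the right, so inputs longer than 64 tokens
    # lose the trailing "↔ msa:" marker. Keep inputs short.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64
    )

    # Move the input tensors to the same device as the model.
    inputs = {
        k: v.to(model.device)
        for k, v in inputs.items()
    }

    # Sampling is enabled, so repeated calls can return different outputs.
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    # The decoded sequence contains the prompt followed by the generation.
    full = tokenizer.decode(
        outputs[0],
        skip_special_tokens=True
    )

    # Keep only the text after the last "msa:" marker.
    if "msa:" in full:
        return full.split("msa:")[-1].strip()

    return full.strip()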
```

---

## 4. Example Usage

```python
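# The outputs shown in comments are illustrative; sampling is enabled,
# so results can vary between runs.
print(translate("يلا فين؟"))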
# هيا، أين أنت؟

print(translate("إنت رايح فين؟"))
# أين أنت ذاهب؟

print(translate("عايز اكل"))
# أريد الطعام
```

---

# 🌐 Interactive Web App

Try the live demo here:

https://huggingface.co/spaces/AdhamAshraf/SlangGPT

The Space allows users to:

- Translate Egyptian Arabic to MSA
- Submit feedback
- Rate translation quality
- Help improve future versions of SlangGPT

---

# 📊 Training Dataset

SlangGPT was fine-tuned using:

## AdhamAshraf/egyptian-2-arabic

Dataset statistics:

| Property | Value |
|---|---|
| Total samples | 18,250 |
| Format | Parallel Egyptian ↔ MSA |
| Train split | 80% |
| Validation split | 10% |
| Test split | 10% |
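
With 18,250 pairs, an 80/10/10 split gives 14,600 training, 1,825 validation, and 1,825 test samples. The sketch below shows one way to produce such a split reproducibly; the shuffle, the seed value, and the function name `split_indices` are assumptions, since the source does not describe how the split was drawn.

```python
import random


def split_indices(n_samples: int, train_frac: float = 0.8,
                  val_frac: float = 0.1, seed: int = 42):
    """Shuffle sample indices and cut them into train/val/test lists."""
    indices = list(range(n_samples))
    # ASSUMPTION: a seeded shuffle; the seed value is arbitrary.
    random.Random(seed).shuffle(indices)

    n_train = round(n_samples * train_frac)
    n_val = round(n_samples * val_frac)

    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train_idx, val_idx, test_idx = split_indices(18_250)
print(len(train_idx), len(val_idx), len(test_idx))
```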

### Preprocessing Steps

- Diacritic removal
- Punctuation normalization
- English text filtering
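
The steps above can be sketched roughly as follows. The exact rules used to build the dataset are not specified, so the character ranges, the choice to collapse repeated punctuation, and the decision to drop any pair containing Latin letters are all assumptions; the function names are hypothetical.

```python
import re

# Arabic diacritics (tashkeel), U+064B to U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# Runs of one repeated mark, including Arabic ? (U+061F), comma (U+060C)
# and semicolon (U+061B).
REPEATED_PUNCT = re.compile(r"([!?.,\u061F\u060C\u061B])\1+")
LATIN = re.compile(r"[A-Za-z]")
SPACES = re.compile(r"\s+")


def normalize(text: str) -> str:
    """Strip diacritics, collapse repeated punctuation and whitespace."""
    text = DIACRITICS.sub("", text)
    text = REPEATED_PUNCT.sub(r"\1", text)
    return SPACES.sub(" ", text).strip()


def contains_english(text: str) -> bool:
    """True if the text contains any Latin letter."""
    return LATIN.search(text) is not None


def preprocess_pair(dialect: str, msa: str):
    """Return the cleaned pair, or None if either side contains Latin text."""
    if contains_english(dialect) or contains_english(msa):
        return None
    return normalize(dialect), normalize(msa)
```

Applying the same `normalize` step to user input at inference time would keep it consistent with the training data, assuming the dataset was cleaned in a comparable way.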

The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.

---

# 🧪 Evaluation & Feedback

The model was evaluated using:

- chrF
- BLEU

User feedback collected through the Gradio Space is publicly stored in:

https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

This feedback dataset supports:

- RLHF research
- Translation verification
- Reward model training
- Error analysis

---

# 📜 License

This project is released under the MIT License.

Free for academic and commercial use, provided the copyright and license notice are retained.

---

# 🙏 Acknowledgements

- AraGPT-2 by Antoun et al. (2021)
- Stanford CS224N framework and educational materials
- The Arabic NLP open-source community

---

# 📚 Citation

```bibtex
@software{slanggpt2026,
  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  title = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
  year = {2026},
  url = {https://github.com/adhamashraf7788/SlangGPT}
}

@dataset{egyptian_2_arabic,
  author = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
  title = {Egyptian Arabic Slang to Formal Arabic Dataset},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}
```

---

# ❓ Questions & Issues

For bugs, issues, or feature requests:

https://github.com/adhamashraf7788/SlangGPT/issues