	---
	language:
	- ar
	license: mit
	base_model: aubmindlab/aragpt2-medium
	tags:
	- arabic
	- egyptian
	- dialect
	- slang
	- translation
	- gpt-2
	- aragpt
	- seq2seq
	- causal-lm
	datasets:
	- AdhamAshraf/egyptian-2-arabic
	- AdhamAshraf/slanggpt-feedback-dataset
	metrics:
	- chrF
	- BLEU
	- perplexity
	pipeline_tag: text-generation
	library_name: transformers
	---

	# SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)

	SlangGPT is a fine-tuned AraGPT-2-medium model that translates Egyptian Arabic slang/dialect into Modern Standard Arabic (MSA).

	It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.

	---

	# 📄 Project Resources

	- Paper:
	https://github.com/adhamashraf7788/SlangGPT/blob/main/report/SlangGPT_report.pdf

	- Main Dataset:
	https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic

	- Feedback Dataset:
	https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

	- GitHub Repository:
	https://github.com/adhamashraf7788/SlangGPT

	- Interactive Demo (Hugging Face Space):
	https://huggingface.co/spaces/AdhamAshraf/SlangGPT

	---

	# 🧠 Model Description

	SlangGPT is a decoder-only causal language model built on top of:

	- Base model: `aubmindlab/aragpt2-medium`

	The model was fine-tuned on Egyptian Arabic ↔ MSA parallel text using conditional autoregressive training.

	## Prompt Format

	```text
	dialect: {input} ↔ msa:
	```

	The model generates the Modern Standard Arabic translation autoregressively.
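
The card does not spell out how training examples were assembled, so the following is only a minimal sketch of one common way to do conditional autoregressive fine-tuning with this prompt format. The function `build_example`, the `encode` callable, and the choice to mask the prompt tokens out of the loss are all assumptions made for illustration, not the authors' confirmed recipe.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_example(dialect, msa, encode, eos_id):
    """Concatenate prompt and target; compute loss only on the target (assumed setup)."""
    prompt_ids = encode(f"dialect: {dialect.strip()} ↔ msa:")
    # The target ends with EOS so the model learns where the translation stops.
    target_ids = encode(" " + msa.strip()) + [eos_id]
    input_ids = prompt_ids + target_ids
    # Prompt positions are masked so they do not contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids
    return {"input_ids": input_ids, "labels": labels}


# Toy character-level encoder, a stand-in for the real BPE tokenizer.
toy_encode = lambda text: [ord(ch) for ch in text]
example = build_example("عايز اكل", "أريد الطعام", toy_encode, eos_id=0)
print(len(example["input_ids"]), len(example["labels"]))
```

At inference time only the prompt part is supplied, and the model continues it with the MSA text until it emits EOS or reaches the token limit.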

	---

	# ✨ Key Features

	- Input: Egyptian Arabic slang/dialect
	- Output: Modern Standard Arabic (MSA)
	- Architecture: GPT-2 style decoder-only transformer
	- Tokenizer: BPE tokenizer with 64k vocabulary
	- Context length: 1024 tokens
	- Language: Arabic

	---

	# ⚙️ Training Configuration

	| Parameter | Value |
	|---|---|
	| Batch size | 8 (effective 32) |
	| Learning rate | 5e-5 |
	| Scheduler | Cosine |
	| Warmup | 10% |
	| Gradient clipping | 1.0 |
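
To make the schedule concrete, here is a small sketch of a cosine learning-rate curve with 10% linear warmup and a peak of 5e-5. The function name `lr_at` and the total step count are hypothetical, and reading "effective 32" as 4 gradient-accumulation steps is an assumption (it could equally come from multiple devices).

```python
import math

PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 4  # assumption: 8 x 4 = 32 effective batch size
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS


def lr_at(step, total_steps, peak_lr=5e-5, warmup_ratio=0.10):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total))
```

The learning rate climbs linearly during the first 10% of steps, reaches 5e-5, and then decays smoothly toward zero by the final step.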

	---

	# 🎛️ Inference Configuration

	| Parameter | Value |
	|---|---|
	| Temperature | 0.7 |
	| Top-k | 50 |
	| Top-p | 0.92 |
	| Repetition penalty | 1.3 |
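
For readers unfamiliar with these knobs, the sketch below shows roughly how temperature, top-k, and top-p narrow the next-token distribution before sampling. It is a simplified pure-Python illustration with a hypothetical function name (`filter_probs`), not the Transformers implementation, and it leaves out the repetition penalty.

```python
import math


def filter_probs(logits, temperature=0.7, top_k=50, top_p=0.92):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature below 1 sharpens the distribution; softmax is computed stably.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-k: keep only the k most probable tokens.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


print(filter_probs([2.0, 1.5, 0.3, -1.0, -4.0]))
```

Lower temperature and smaller top-k or top-p make the output more conservative; higher values allow more varied translations at the cost of occasional drift.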

	---

	# 📊 Quantitative Performance

	| Metric | Base AraGPT-2 | SlangGPT |
	|---|---|---|
	| chrF | 10.62 | 29.08 |
	| BLEU | 0.02 | 6.63 |
	| chrF Improvement | n/a | +18.46 (+173.8%) |

	## Metric Notes

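	- chrF is a character n-gram F-score, so it gives partial credit for near-matching word forms, which suits morphologically rich Arabic.
	- BLEU measures word n-gram precision with a brevity penalty, so it is stricter on short outputs that differ slightly in wording.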
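
To illustrate what "character n-gram overlap" means, here is a simplified sentence-level chrF-style score. This is a sketch only: `simple_chrf` is a hypothetical helper, it ignores whitespace and uses character n-grams up to length 6 with beta = 2, and it will not exactly reproduce scores from standard evaluation tools or the numbers in the table above.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall, combined into an F-beta score (0 to 100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


print(simple_chrf("أريد طعاما", "أريد الطعام"))
```

Because partial character matches still count, a translation that picks a slightly different word form scores above zero here, whereas word-level BLEU would treat the whole word as a miss.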

	---

	# 🚀 Usage

	## 1. Install Dependencies

	```bash
	pip install transformers torch accelerate
	```

	---

	## 2. Load Model and Tokenizer

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM
	import torch

	model_name = "AdhamAshraf/SlangGPT"

	tokenizer = AutoTokenizer.from_pretrained(model_name)

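	# GPT-2 style tokenizers may not define a pad token; fall back to EOS.
	if tokenizer.pad_token is None:
	    tokenizer.pad_token = tokenizer.eos_token

	# Left padding keeps the prompt adjacent to the generated tokens.
	tokenizer.padding_side = "left"

	# device_map="auto" relies on the accelerate package being installed.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.float16,
	    device_map="auto"
	)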

	model.eval()
	```

	---

	## 3. Translation Function

	```python
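	def translate(egyptian_text):
	    """Translate Egyptian Arabic text to MSA using the SlangGPT prompt format."""
	    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

	    # Note: prompts longer than 64 tokens are truncated from the right,
	    # which would drop the "↔ msa:" suffix, so keep inputs short.
	    inputs = tokenizer(
	        prompt,
	        return_tensors="pt",
	        truncation=True,
	        max_length=64
	    )

	    # Move input tensors to the same device as the model.
	    inputs = {
	        k: v.to(model.device)
	        for k, v in inputs.items()
	    }

	    with torch.no_grad():
	        outputs = model.generate(
	            **inputs,
	            max_new_tokens=64,
	            do_sample=True,
	            temperature=0.7,
	            top_k=50,
	            top_p=0.92,
	            repetition_penalty=1.3,
	            pad_token_id=tokenizer.pad_token_id,
	            eos_token_id=tokenizer.eos_token_id,
	        )

	    # The decoded text contains the prompt followed by the translation.
	    full = tokenizer.decode(
	        outputs[0],
	        skip_special_tokens=True
	    )

	    # Keep only the text after the final "msa:" marker.
	    if "msa:" in full:
	        return full.split("msa:")[-1].strip()

	    return full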
	```

	---

	## 4. Example Usage

	```python
	# Sampling is enabled, so exact outputs can vary between runs.
	print(translate("يلا فين؟"))
	# هيا، أين أنت؟

	print(translate("إنت رايح فين؟"))
	# أين أنت ذاهب؟

	print(translate("عايز اكل"))
	# أريد الطعام
	```

	---

	# 🌐 Interactive Web App

	Try the live demo here:

	https://huggingface.co/spaces/AdhamAshraf/SlangGPT

	The Space allows users to:

	- Translate Egyptian Arabic to MSA
	- Submit feedback
	- Rate translation quality
	- Help improve future versions of SlangGPT

	---

	# 📊 Training Dataset

	SlangGPT was fine-tuned using:

	## AdhamAshraf/egyptian-2-arabic

	Dataset statistics:

	| Property | Value |
	|---|---|
	| Total samples | 18,250 |
	| Format | Parallel Egyptian ↔ MSA |
	| Train split | 80% |
	| Validation split | 10% |
	| Test split | 10% |
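
As a rough illustration of the split sizes these percentages imply, the sketch below shuffles sample indices and slices them 80/10/10. The function `split_indices`, the shuffling, and the seed value are assumptions; the card does not say how the actual split was produced.

```python
import random


def split_indices(n_samples, train_pct=80, val_pct=10, seed=42):
    """Shuffle indices deterministically and slice into train/validation/test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seed is a hypothetical choice
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


train, val, test = split_indices(18_250)
print(len(train), len(val), len(test))
```

With 18,250 samples this gives 14,600 training pairs and 1,825 pairs each for validation and test.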

	### Preprocessing Steps

	- Diacritic removal
	- Punctuation normalization
	- English text filtering
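
The exact preprocessing rules are not given here, so the following is a minimal sketch of what the three steps could look like. The helper names (`clean`, `keep_pair`), the Unicode range treated as diacritics, and the direction of punctuation normalization (Latin marks mapped to their Arabic counterparts) are all assumptions.

```python
import re

# Tanween, short vowels, shadda, sukun (U+064B to U+0652) and dagger alif (U+0670).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
LATIN = re.compile(r"[A-Za-z]")
# Assumed normalization: map Latin punctuation to Arabic punctuation.
PUNCT_MAP = str.maketrans({"?": "\u061F", ",": "\u060C", ";": "\u061B"})


def clean(text):
    """Strip diacritics, normalize punctuation, and collapse whitespace."""
    text = DIACRITICS.sub("", text)
    text = text.translate(PUNCT_MAP)
    return re.sub(r"\s+", " ", text).strip()


def keep_pair(dialect, msa):
    """Drop pairs where either side contains Latin letters (English text filtering)."""
    return not (LATIN.search(dialect) or LATIN.search(msa))


pairs = [("إنتَ رايح فين?", "أين أنت ذاهب؟"), ("ok يلا", "حسنا هيا")]
cleaned = [(clean(d), clean(m)) for d, m in pairs if keep_pair(d, m)]
print(cleaned)
```

Applying the same cleaning to user input at inference time keeps prompts consistent with what the model saw during fine-tuning.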

	The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.

	---

	# 🧪 Evaluation & Feedback

	The model was evaluated using:

	- chrF
	- BLEU

	User feedback collected through the Gradio Space is publicly stored in:

	https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

	This feedback dataset supports:

	- RLHF research
	- Translation verification
	- Reward model training
	- Error analysis

	---

	# 📜 License

	This project is released under the MIT License.

	Free for academic and commercial use with attribution.

	---

	# 🙏 Acknowledgements

	- AraGPT-2 by Antoun et al. (2021)
	- Stanford CS224N framework and educational materials
	- The Arabic NLP open-source community

	---

	# 📚 Citation

	```bibtex
	@software{slanggpt2026,
	  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
	  title = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
	  year = {2026},
	  url = {https://github.com/adhamashraf7788/SlangGPT}
	}

	@dataset{egyptian_2_arabic,
	  author = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
	  title = {Egyptian Arabic Slang to Formal Arabic Dataset},
	  year = {2026},
	  publisher = {Hugging Face},
	  url = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
	}
	```

	---

	# ❓ Questions & Issues

	For bugs, issues, or feature requests:

	https://github.com/adhamashraf7788/SlangGPT/issues