---
language:
- ar
license: mit
base_model: aubmindlab/aragpt2-medium
tags:
- arabic
- egyptian
- dialect
- slang
- translation
- gpt-2
- aragpt
- seq2seq
- causal-lm
datasets:
- AdhamAshraf/egyptian-2-arabic
- AdhamAshraf/slanggpt-feedback-dataset
metrics:
- chrF
- BLEU
- perplexity
pipeline_tag: text-generation
library_name: transformers
---

# SlangGPT: Egyptian Arabic → Modern Standard Arabic (MSA)

**SlangGPT** is a fine-tuned **AraGPT-2-medium** model that translates **Egyptian Arabic slang/dialect** into **Modern Standard Arabic (MSA)**.

It is part of the broader SlangGPT project, an end-to-end Arabic NLP system for dialect translation and translation verification.

---

# 📄 Project Resources

- **Paper:** https://github.com/adhamashraf7788/SlangGPT/blob/main/report/SlangGPT_report.pdf
- **Main Dataset:** https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic
- **Feedback Dataset:** https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset
- **GitHub Repository:** https://github.com/adhamashraf7788/SlangGPT
- **Interactive Demo (Hugging Face Space):** https://huggingface.co/spaces/AdhamAshraf/SlangGPT

---

# 🧠 Model Description

SlangGPT is a **decoder-only causal language model** built on top of:

- **Base model:** `aubmindlab/aragpt2-medium`

The model was fine-tuned on Egyptian Arabic ↔ MSA parallel text using conditional autoregressive training.

## Prompt Format

```text
dialect: {input} ↔ msa:
```

The model generates the Modern Standard Arabic translation autoregressively.

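As a rough illustration of what conditional autoregressive training on this prompt format can look like, the sketch below builds one training sequence from a dialect/MSA pair and excludes the prompt positions from the loss. The helper names (`build_prompt`, `build_example`), the toy token ids, and the decision to mask the prompt with `-100` (the label value that PyTorch's cross-entropy loss ignores by default) are assumptions for illustration; this card does not state whether prompt tokens were masked during fine-tuning.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_prompt(dialect_text: str) -> str:
    """Format an input following the prompt template shown in this card."""
    return f"dialect: {dialect_text.strip()} ↔ msa:"


def build_example(
    prompt_ids: List[int], target_ids: List[int], eos_id: int
) -> Dict[str, List[int]]:
    """Concatenate prompt and target ids; supervise only the target and EOS.

    Masking the prompt is an assumed setup, not a confirmed detail of SlangGPT.
    """
    input_ids = prompt_ids + target_ids + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + target_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


# Toy ids stand in for real tokenizer output (hypothetical values).
prompt = build_prompt("  يلا فين؟ ")
example = build_example(prompt_ids=[11, 12, 13], target_ids=[21, 22], eos_id=0)
```

In a real pipeline the ids would come from the tokenizer applied to the prompt and to the MSA reference separately, so that the boundary between the two is known when the labels are built.
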

---

# ✨ Key Features

- **Input:** Egyptian Arabic slang/dialect
- **Output:** Modern Standard Arabic (MSA)
- **Architecture:** GPT-2 style decoder-only transformer
- **Tokenizer:** BPE tokenizer with 64k vocabulary
- **Context length:** 1024 tokens
- **Language:** Arabic

---

# ⚙️ Training Configuration

| Parameter | Value |
|---|---|
| Batch size | 8 (effective 32) |
| Learning rate | 5e-5 |
| Scheduler | Cosine |
| Warmup | 10% |
| Gradient clipping | 1.0 |

---

# 🎛️ Inference Configuration

| Parameter | Value |
|---|---|
| Temperature | 0.7 |
| Top-k | 50 |
| Top-p | 0.92 |
| Repetition penalty | 1.3 |

---

# 📊 Quantitative Performance

| Metric | Base AraGPT-2 | SlangGPT |
|---|---|---|
| chrF | 10.62 | **29.08** |
| BLEU | 0.02 | **6.63** |
| chrF Improvement | n/a | **+18.46 (+174%)** |

### Metric Notes

- **chrF** is an F-score over character n-gram overlap between the output and the reference.
- **BLEU** measures modified word n-gram precision, combined with a brevity penalty.

---

# 🚀 Usage

## 1. Install Dependencies

```bash
pip install transformers torch accelerate
```

`accelerate` is required because the loading code in the next step passes `device_map="auto"`.

---

## 2. Load Model and Tokenizer

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "AdhamAshraf/SlangGPT"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 style tokenizers may ship without a pad token; reuse EOS for padding.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Left padding keeps each prompt adjacent to the tokens generated after it.
tokenizer.padding_side = "left"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",  # needs the `accelerate` package
)

model.eval()
```

---

## 3. Translation Function

```python
def translate(egyptian_text):
    # Build the prompt in the format the model was fine-tuned on.
    prompt = f"dialect: {egyptian_text.strip()} ↔ msa:"

    # Truncation cuts from the right by default, so an over-long input
    # would lose the trailing "msa:" marker; keep inputs short.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=64,
    )

    # Move the input tensors to the device the model lives on.
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=64,
            do_sample=True,
            temperature=0.7,
            top_k=50,
            top_p=0.92,
            repetition_penalty=1.3,
            pad_token_id=tokenizer.pad_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )

    full = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # The decoded text includes the prompt; keep only what follows "msa:".
    if "msa:" in full:
        return full.split("msa:")[-1].strip()

    return full
```

---

## 4. Example Usage

```python
# Sampling is enabled, so exact outputs can differ between runs.
print(translate("يلا فين؟"))
# هيا، أين أنت؟  (Come on, where are you?)

print(translate("إنت رايح فين؟"))
# أين أنت ذاهب؟  (Where are you going?)

print(translate("عايز اكل"))
# أريد الطعام  (I want food)
```

---

# 🌐 Interactive Web App

Try the live demo here:

https://huggingface.co/spaces/AdhamAshraf/SlangGPT

The Space allows users to:

- Translate Egyptian Arabic to MSA
- Submit feedback
- Rate translation quality
- Help improve future versions of SlangGPT

---

# 📊 Training Dataset

SlangGPT was fine-tuned using:

## AdhamAshraf/egyptian-2-arabic

Dataset statistics:

| Property | Value |
|---|---|
| Total samples | 18,250 |
| Format | Parallel Egyptian ↔ MSA |
| Train split | 80% |
| Validation split | 10% |
| Test split | 10% |

### Preprocessing Steps

- Diacritic removal
- Punctuation normalization
- English text filtering

The dataset was derived from the original Egyptian-English corpus by Abdalrahmankamel, with English translations replaced by curated MSA equivalents.

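The preprocessing steps and split ratios listed above could be approximated along the following lines. This is a minimal sketch, not the project's actual pipeline: the function names, the exact diacritic range, the reading of "punctuation normalization" as collapsing repeated marks, the rule of dropping any pair that contains Latin letters, and the shuffle seed are all assumptions.

```python
import random
import re
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

# Arabic tanween, short vowels, shadda, and sukun, plus superscript alef (assumed set).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")
# A run of one repeated punctuation mark, Latin or Arabic (comma, semicolon, question mark).
REPEATED_PUNCT = re.compile(r"([!?.,;:\u060C\u061B\u061F])\1+")
LATIN = re.compile(r"[A-Za-z]")


def clean(text: str) -> str:
    """Strip diacritics, collapse repeated punctuation, and normalize whitespace."""
    text = DIACRITICS.sub("", text)
    text = REPEATED_PUNCT.sub(r"\1", text)
    return " ".join(text.split())


def preprocess(pairs: Sequence[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Clean both sides; drop pairs that contain Latin letters or become empty."""
    cleaned = [(clean(d), clean(m)) for d, m in pairs]
    return [(d, m) for d, m in cleaned if d and m and not LATIN.search(d + m)]


def split(items: Sequence[T], seed: int = 42) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle deterministically and cut into 80% / 10% / 10%."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10  # integer arithmetic avoids float rounding
    n_val = len(shuffled) // 10
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# With 18,250 items, an exact 80/10/10 cut gives these sizes.
train, val, test = split(range(18250))
print(len(train), len(val), len(test))  # 14600 1825 1825
```

Seeding the shuffle keeps the three splits reproducible across runs, which matters when chrF and BLEU are compared between model versions on the same test set.
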

---

# 🧪 Evaluation & Feedback

The model was evaluated using:

- chrF
- BLEU

User feedback collected through the Gradio Space is publicly stored in:

https://huggingface.co/datasets/AdhamAshraf/slanggpt-feedback-dataset

This feedback dataset supports:

- RLHF research
- Translation verification
- Reward model training
- Error analysis

---

# 📜 License

This project is released under the MIT License.

Free for academic and commercial use with attribution.

---

# 🙏 Acknowledgements

- AraGPT-2 by Antoun et al. (2021)
- Stanford CS224N framework and educational materials
- The Arabic NLP open-source community

---

# 📚 Citation

```bibtex
@software{slanggpt2026,
  author = {Abdelrahman Ahmed and Adham Ashraf and Ahmed Fekry},
  title  = {SlangGPT: Fine-tuning AraGPT-2 for Egyptian Arabic Dialect-to-MSA Translation},
  year   = {2026},
  url    = {https://github.com/adhamashraf7788/SlangGPT}
}

@dataset{egyptian_2_arabic,
  author    = {Adham Ashraf and Abdelrahman Ahmed and Ahmed Fekry},
  title     = {Egyptian Arabic Slang to Formal Arabic Dataset},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/AdhamAshraf/egyptian-2-arabic}
}
```

---

# ❓ Questions & Issues

For bugs, issues, or feature requests:

https://github.com/adhamashraf7788/SlangGPT/issues