# SaudiSpell-AraT5: Comprehensive Saudi Dialect Spelling Correction
SaudiSpell-AraT5 is a state-of-the-art sequence-to-sequence spelling correction model fine-tuned on a massive, balanced corpus of Saudi Dialectal Arabic. Unlike generic Arabic correctors, this model is engineered to handle the specific orthographic and phonetic nuances of Najdi, Hijazi, and Standard Saudi dialects alongside Modern Standard Arabic (MSA).
## 📊 Model Performance & Training Insights

The model converged to a final training loss of 0.0729. Downstream correction quality on held-out data is reported in the Test Set & Results section below.
| Metric | Value | Description |
|---|---|---|
| Final Training Loss | 0.0729 | Cross-entropy loss at the final checkpoint. |
| Training Steps | 45,000 | One full epoch over ~3 million samples (batch size 64). |
| Base Architecture | AraT5v2-Base (UBC-NLP/AraT5v2-base-1024) | T5 model pre-trained on Arabic text. |
*Figure: training loss curve.*
## 🛠️ Training Configuration & Hyperparameters

Transparency is key for reproducibility. The model was fine-tuned on NVIDIA A100 GPUs with the following configuration:
- Batch Size: 64
- Learning Rate: 5e-5
- Optimizer: AdamW
- Precision: bf16 (Bfloat16)
- Max Sequence Length: 128 tokens
- Num Epochs: 1 (Full pass over 3M balanced rows)
- Checkpoint Strategy: Saved at step 45,000.
## 📚 Dataset & Methodology

The training data consists of 3 million sentences, strictly balanced to prevent the "majority dialect problem."
### 1. The "Balanced 33%" Strategy
Real-world data is often dominated by Najdi dialect. We employed an aggressive oversampling strategy (50x) for rare dialects to ensure the model treats all regions equally.
- 🛡️ Najdi: 33% (Central Region)
- 🕋 Hijazi: 33% (Western Region)
- 🇸🇦 Standard Saudi: 33% (Formal/Media)
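The balancing step described above can be sketched as follows. This is a hypothetical helper (`balance_by_oversampling` and its inputs are illustrative, not the project's actual pipeline code): each under-represented dialect is oversampled up to the size of the largest one, with the replication factor capped at 50x.

```python
import random
from collections import defaultdict

def balance_by_oversampling(samples, max_factor=50, seed=0):
    """Oversample under-represented dialects so each dialect ends up with
    roughly the same number of rows. `samples` is a list of
    (text, dialect) pairs; the factor is capped at `max_factor` (50x)."""
    rng = random.Random(seed)
    by_dialect = defaultdict(list)
    for text, dialect in samples:
        by_dialect[dialect].append(text)
    # Target every dialect at the size of the largest one
    target = max(len(texts) for texts in by_dialect.values())
    balanced = []
    for dialect, texts in by_dialect.items():
        factor = min(max_factor, -(-target // len(texts)))  # ceil division, capped
        pool = (texts * factor)[:target]
        balanced += [(t, dialect) for t in pool]
    rng.shuffle(balanced)
    return balanced
```

With the 50x cap, a dialect rarer than 1/50th of the majority would still end up slightly under-represented; the corpus described above apparently stays within that bound.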
### 2. Synthetic Error Injection Protocol
To mimic real-world typing behavior, we developed a "Saudi-Calibrated" error injector. Every sample has a 20% probability of containing one of the following error types:
- ⌨️ Keyboard Typos: Adjacency errors based on the specific Arabic keyboard layout (e.g., hitting 'ث' instead of 'ص').
- 🗣️ Phonetic Substitutions: Dialectal sound-alikes (e.g., 'ق' vs 'غ' or 'ذ' vs 'ز').
- 👁️ Visual Substitutions: Shape-based errors (e.g., 'ه' vs 'ة').
- ❌ Deletions & Insertions: Simulating fast typing/missed keystrokes.
- ✂️ Space Stripping: Merged words (e.g., "ياخي" instead of "يا اخي").
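The injection protocol above can be sketched as follows. The substitution tables here are tiny illustrative samples, not the project's actual "Saudi-Calibrated" tables, and the helper name is hypothetical:

```python
import random

# Illustrative sample tables (assumptions, not the real calibration data)
ADJACENT = {"ص": "ث", "ت": "ن"}   # keyboard-neighbour swaps
PHONETIC = {"ق": "غ", "ذ": "ز"}   # dialectal sound-alikes
VISUAL = {"ة": "ه"}               # shape-based confusions

def inject_error(sentence, p_error=0.20, rng=random):
    """With probability p_error, apply one randomly chosen error type."""
    if rng.random() >= p_error:
        return sentence  # 80% of samples stay clean
    kind = rng.choice(["keyboard", "phonetic", "visual", "deletion", "space_strip"])
    if kind == "space_strip":
        return sentence.replace(" ", "", 1)  # merge one word pair
    if kind == "deletion" and len(sentence) > 1:
        i = rng.randrange(len(sentence))
        return sentence[:i] + sentence[i + 1:]
    table = {"keyboard": ADJACENT, "phonetic": PHONETIC, "visual": VISUAL}[kind]
    for src, dst in table.items():
        if src in sentence:
            return sentence.replace(src, dst, 1)
    return sentence  # no applicable character found
```

A real injector would also cover insertions and use a full keyboard-adjacency map for the Arabic layout.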
## 🚀 Usage

You can use this model directly with the Hugging Face `transformers` library.

### Installation

```bash
pip install transformers torch
```
### Inference Code

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model and tokenizer
model_name = "NAMAA-Space/SaudiSpell-AraT5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_text(text):
    # IMPORTANT: the model was trained with the prefix "correct: "
    input_text = "correct: " + text
    inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)
    # Generate the correction
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=5,  # beam search for higher-quality output
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# --- Examples ---

# 1. Hijazi example (phonetic substitution)
print(correct_text("يا واد اشبك مستعجل كدا"))
# Expected: "يا واد اشبك مستعجل كذا"

# 2. Najdi example (space stripping)
print(correct_text("ياخي وراك مارديت"))
# Expected: "يا أخي وراك ما رديت"
```
## Test Set & Results

### Test Set Summary

A test set of 500 short Saudi sentences was created (Najdi / Hijazi / White Dialect), with an even distribution of the synthetic error types (20% each): Keyboard, Phonetic, Visual, Substitution, and Space Strip.
### Sample Test Cases

| Error Type | Clean Sentence | Noisy Sentence | Description |
|---|---|---|---|
| Keyboard | السلام عليكم | السلام عليكن | Character swapped with a keyboard-adjacent key |
| Phonetic | هذا قريب من البيت | زا غريب من البيت | Dialectal phonetic shift (ذ→ز, ق→غ) |
| Visual | المدرسة قريبة | المدرسه قريبة | Visual similarity (ة ↔ ه) |
| Substitution | وش مسوي اليوم | وش مسودي اليوم | Random non-phonetic character substitution |
| Space Strip | يا هلا والله | ياهلاوالله | Spaces removed between words |
### Evaluation Results
| Metric | Value |
|---|---|
| Total Samples | 500 |
| Exact Match | 40.0% |
| Average CER | 0.0572 |
| Average WER | 0.1685 |
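For reference, the CER and WER figures above are normalized Levenshtein (edit) distances at the character and word level, respectively. A minimal self-contained sketch (not the project's actual evaluation script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over characters, normalized."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: edit distance over whitespace tokens, normalized."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

Production evaluations often use a library such as `jiwer` instead, but the definitions are the same.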
### Results Interpretation

- The Exact Match score (40%) reflects the difficulty of the test set, especially for non-standard dialectal and phonetic errors.
- The low CER (5.7%) shows that most corrections were very close to the reference text, even when they did not match it exactly.
- The WER (16.8%) indicates that most remaining errors occur at the level of a single character or word, not a full breakdown of the sentence structure.

Overall, these results are a positive indicator of the model's ability to correct realistic Saudi text, with room to improve exact-match performance on complex dialectal cases.
## ⚠️ Limitations
- Context Length: Optimized for sentences up to 128 tokens (tweets, chat messages). Longer texts should be split.
- Arabizi: While it has some capability to handle Latin-script Arabic (Arabizi), it is primarily optimized for Arabic script.
- Ultra-Local Slang: Extremely niche slang words not present in the training corpus may be normalized to MSA or Standard Saudi.
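For the context-length limitation above, a rough word-count splitter works as a first pass before calling the corrector on each chunk. The `max_words` threshold is an assumption (roughly two tokens per Arabic word) and should be validated against the actual tokenizer:

```python
def split_for_correction(text, max_words=60):
    """Split long input into chunks short enough for the 128-token limit.

    Word count is only a proxy for token count; for exact limits,
    measure each chunk with the model's tokenizer instead.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Splitting on sentence boundaries (e.g. at punctuation) rather than fixed word counts would better preserve the context each correction depends on.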
## 📜 Citation

If you use this model in your research, please cite:

```bibtex
@misc{SaudiSpell2025,
  author       = {NAMAA Community},
  title        = {SaudiSpell-AraT5: A Balanced Saudi Dialect Error Correction Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/SaudiSpell-AraT5}}
}
```
## Base Model

Fine-tuned from UBC-NLP/AraT5v2-base-1024.