SaudiSpell-AraT5: Comprehensive Saudi Dialect Spelling Correction

SaudiSpell-AraT5 is a sequence-to-sequence spelling correction model fine-tuned from AraT5-Base on a balanced corpus of about 3 million Saudi Dialectal Arabic sentences. Unlike generic Arabic correctors, it targets the orthographic and phonetic patterns of the Najdi, Hijazi, and Standard Saudi dialects alongside Modern Standard Arabic (MSA).

📊 Model Performance & Training Insights

Training converged to a final loss of 0.0729 after 45,000 steps, which indicates a close fit to the training pairs. Held-out accuracy is reported under Test Set & Results.

| Metric | Value | Description |
|---|---|---|
| Final Training Loss | 0.0729 | Low loss on the synthetic training pairs. |
| Training Steps | 45,000 | Converged over ~3 million samples. |
| Base Architecture | AraT5-Base | T5 model pre-trained on Arabic. |

Training Loss Curve

[Figure: training loss curve]

🛠️ Training Configuration & Hyperparameters

Transparency is key for reproducibility. The model was fine-tuned on NVIDIA A100 GPUs with the following configuration:

  • Batch Size: 64
  • Learning Rate: 5e-5
  • Optimizer: AdamW
  • Precision: bf16 (Bfloat16)
  • Max Sequence Length: 128 tokens
  • Num Epochs: 1 (Full pass over 3M balanced rows)
  • Checkpoint Strategy: Saved at step 45,000.
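
As a quick sanity check, the step count and batch size listed above can be related to the corpus size. The sketch below assumes that 64 is the effective global batch size (no gradient accumulation or multi-GPU scaling, which the card does not specify); the variable names are illustrative.

```python
# Figures taken from the configuration above.
batch_size = 64          # assumed to be the effective (global) batch size
checkpoint_step = 45_000
corpus_rows = 3_000_000  # "3M balanced rows", an approximate figure

# Samples processed by the time the checkpoint was saved.
samples_seen = batch_size * checkpoint_step

# Steps a complete pass over the corpus would take at this batch size.
steps_per_epoch = -(-corpus_rows // batch_size)  # ceiling division

print(f"samples seen at checkpoint: {samples_seen:,}")
print(f"steps for one full epoch:   {steps_per_epoch:,}")
print(f"fraction of corpus covered: {samples_seen / corpus_rows:.1%}")
```

Under these assumptions the checkpoint at step 45,000 corresponds to about 2.88 million samples, which is consistent with the "~3 million" figure reported above.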

📚 Dataset & Methodology

The training data consists of 3 million sentences, balanced across dialects to prevent the "majority dialect problem."

1. The "Balanced 33%" Strategy

Real-world data is often dominated by the Najdi dialect. We oversampled the rarer dialects aggressively (50x) so that each dialect group makes up roughly a third of the training data.

  • 🛡️ Najdi: 33% (Central Region)
  • 🕋 Hijazi: 33% (Western Region)
  • 🇸🇦 Standard Saudi: 33% (Formal/Media)
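
One way to picture the balancing step is oversampling with replacement until every dialect group matches the largest one. The sketch below is an assumed scheme for illustration only: the function name `balance_by_dialect`, the dialect labels, and the toy corpus are hypothetical, and the card does not describe the authors' exact procedure.

```python
import random
from collections import Counter

def balance_by_dialect(rows, seed=0):
    """Resample (sentence, dialect) rows so every dialect has equal count.

    Smaller groups are oversampled with replacement up to the size of the
    largest group. This is an assumed scheme, not the authors' exact method.
    """
    rng = random.Random(seed)
    groups = {}
    for sentence, dialect in rows:
        groups.setdefault(dialect, []).append(sentence)

    target = max(len(sentences) for sentences in groups.values())
    balanced = []
    for dialect, sentences in groups.items():
        # Keep every original sentence, then draw extras with replacement.
        extras = rng.choices(sentences, k=target - len(sentences))
        balanced.extend((s, dialect) for s in sentences + extras)
    rng.shuffle(balanced)
    return balanced

# Toy corpus: Najdi dominates, as the text describes.
toy_rows = (
    [(f"najdi sentence {i}", "najdi") for i in range(100)]
    + [(f"hijazi sentence {i}", "hijazi") for i in range(4)]
    + [(f"standard sentence {i}", "standard_saudi") for i in range(2)]
)
balanced_rows = balance_by_dialect(toy_rows)
print(Counter(dialect for _, dialect in balanced_rows))
```

In this toy corpus the smallest group grows from 2 to 100 rows, a 50x oversampling factor. Heavy repetition of a few sentences is the trade-off of this approach, which is one reason to combine it with randomized error injection.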

2. Synthetic Error Injection Protocol

To mimic real-world typing behavior, we developed a "Saudi-Calibrated" error injector. Every sample has a 20% probability of containing one of the following error types:

  1. ⌨️ Keyboard Typos: Adjacency errors based on the specific Arabic keyboard layout (e.g., hitting 'ث' instead of 'ص').
  2. 🗣️ Phonetic Substitutions: Dialectal sound-alikes (e.g., 'ق' vs 'غ' or 'ذ' vs 'ز').
  3. 👁️ Visual Substitutions: Shape-based errors (e.g., 'ه' vs 'ة').
  4. ❌ Deletions & Insertions: Simulating fast typing/missed keystrokes.
  5. ✂️ Space Stripping: Merged words (e.g., "ياخي" instead of "يا اخي").

🚀 Usage

You can use this model directly with the Hugging Face transformers library.

Installation

pip install transformers torch

Inference Code

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load Model
model_name = "NAMAA-Space/SaudiSpell-AraT5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_text(text):
    """Return the model's corrected version of a single sentence."""
    # IMPORTANT: The model was trained with the prefix "correct: "
    input_text = "correct: " + text

    # Tokenize, truncating to the 128-token training length
    inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)

    # Generate correction
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=5,        # Beam search usually improves output quality
        early_stopping=True
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# --- Examples ---

# 1. Hijazi Example (Phonetic & Typos)
# Input is in Arabic script (roughly "ya wad ishbk mst3jl kda" in Arabizi)
print(correct_text("يا واد اشبك مستعجل كدا")) 
# Expected: "يا واد اشبك مستعجل كذا"

# 2. Najdi Example (Space Stripping)
# Input: "ياخي وراك مارديت"
print(correct_text("ياخي وراك مارديت")) 
# Expected: "يا أخي وراك ما رديت"

Test Set & Results

Test Set Summary

A test set of 500 short Saudi sentences (Najdi / Hijazi / neutral "white" dialect) was created, with the synthetic error types distributed equally (20% each):
Keyboard, Phonetic, Visual, Substitution, Space Strip.


Sample Test Cases

| Error Type | Clean Sentence | Noisy Sentence | Description |
|---|---|---|---|
| Keyboard | السلام عليكم | السلام عليكن | Letter replaced by an adjacent key on the keyboard |
| Phonetic | هذا قريب من البيت | زا غريب من البيت | Dialectal sound shift (ذ→ز, ق→غ) |
| Visual | المدرسة قريبة | المدرسه قريبة | Visual similarity (ة ↔ ه) |
| Substitution | وش مسوي اليوم | وش مسودي اليوم | Random, non-phonetic substitution |
| Space Strip | يا هلا والله | ياهلاوالله | Spaces between words removed |

Evaluation Results

| Metric | Value |
|---|---|
| Total Samples | 500 |
| Exact Match | 40.0% |
| Average CER | 0.0572 |
| Average WER | 0.1685 |
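
For readers unfamiliar with these metrics, the sketch below shows one common way to compute exact match, CER, and WER from Levenshtein edit distance. The card does not state how its scores were computed (for example, per-sentence averaging versus corpus-level totals, or any text normalization), so the per-sentence averaging and the helper names here are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def cer(reference, hypothesis):
    """Character error rate: character edits / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)

def wer(reference, hypothesis):
    """Word error rate: word edits / number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)

def evaluate(pairs):
    """Average exact match, CER and WER over (reference, hypothesis) pairs."""
    n = len(pairs)
    return {
        "exact_match": sum(r == h for r, h in pairs) / n,
        "cer": sum(cer(r, h) for r, h in pairs) / n,
        "wer": sum(wer(r, h) for r, h in pairs) / n,
    }

# Toy check: the noisy Keyboard sample stands in for an imperfect model
# output, and the second pair is a perfect correction.
pairs = [
    ("السلام عليكم", "السلام عليكن"),
    ("يا هلا والله", "يا هلا والله"),
]
print(evaluate(pairs))
```

The first pair shows why CER and WER can diverge: one wrong character is a small fraction of the characters but makes one of the two words wrong.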

Results Interpretation

  • Exact Match (40%) reflects the difficulty of the set, especially for non-standard dialectal and phonetic errors.
  • The low CER (5.7%) shows that most corrections were very close to the reference text even when they were not an exact match.
  • WER (16.85%) shows that most remaining errors occur at the level of a character or a single word, rather than a complete breakdown of sentence structure.

Overall, these results are a positive indicator of the model's ability to correct real-world Saudi text, with room to improve exact match on complex dialectal cases.


⚠️ Limitations

  • Context Length: Optimized for sentences up to 128 tokens (tweets, chat messages). Longer texts should be split.
  • Arabizi: While it has some capability to handle Latin-script Arabic (Arabizi), it is primarily optimized for Arabic script.
  • Ultra-Local Slang: Extremely niche slang words not present in the training corpus may be normalized to MSA or Standard Saudi.
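
Since longer texts should be split before correction, a simple word-based chunker is sketched below. Word count is only a rough proxy for the 128-token limit, and the `max_words=40` default and both function names are illustrative assumptions; a token-based split using the model's tokenizer would be more precise.

```python
def split_into_chunks(text, max_words=40):
    """Split text into chunks of at most `max_words` whitespace-separated words.

    Word count is only a rough proxy for the 128-token limit; the value 40
    is an illustrative assumption, not a figure from the model card.
    """
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

def correct_long_text(text, correct_fn, max_words=40):
    """Correct each chunk with `correct_fn` and rejoin the results."""
    return " ".join(correct_fn(chunk) for chunk in split_into_chunks(text, max_words))

# Demonstration with placeholder words; no model is needed to see the split.
long_text = " ".join(f"w{i}" for i in range(100))
chunks = split_into_chunks(long_text, max_words=40)
print([len(c.split()) for c in chunks])
```

In practice, `correct_fn` would be the `correct_text` function from the Usage section. Splitting on sentence boundaries where they exist would preserve more context per chunk than a fixed word count.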

📜 Citation

If you use this model in your research, please cite:

@misc{SaudiSpell2025,
  author = {NAMAA Community},
  title = {SaudiSpell-AraT5: A Balanced Saudi Dialect Error Correction Model},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/SaudiSpell-AraT5}}
}