# SaudiSpell-AraT5: Comprehensive Saudi Dialect Spelling Correction
SaudiSpell-AraT5 is a state-of-the-art sequence-to-sequence spelling correction model fine-tuned on a massive, balanced corpus of Saudi Dialectal Arabic. Unlike generic Arabic correctors, this model is engineered to handle the specific orthographic and phonetic nuances of Najdi, Hijazi, and Standard Saudi dialects alongside Modern Standard Arabic (MSA).
## 📊 Model Performance & Training Insights

The model converged to a final training loss of 0.0729. Downstream correction quality on held-out data is reported in the Test Set & Results section below.
| Metric | Value | Description |
|---|---|---|
| Final Training Loss | 0.0729 | Cross-entropy loss at the final checkpoint. |
| Training Steps | 45,000 | One full epoch over ~3 million samples (batch size 64). |
| Base Architecture | AraT5v2-Base (UBC-NLP/AraT5v2-base-1024) | T5 model pre-trained on Arabic text. |
*Figure: training loss curve.*
## 🛠️ Training Configuration & Hyperparameters

Transparency is key for reproducibility. The model was fine-tuned on NVIDIA A100 GPUs with the following configuration:
- Batch Size: 64
- Learning Rate: 5e-5
- Optimizer: AdamW
- Precision: bf16 (Bfloat16)
- Max Sequence Length: 128 tokens
- Num Epochs: 1 (Full pass over 3M balanced rows)
- Checkpoint Strategy: Saved at step 45,000.
## 📚 Dataset & Methodology

The training data consists of 3 million sentences, strictly balanced to prevent the "majority dialect problem."
### 1. The "Balanced 33%" Strategy
Real-world data is often dominated by Najdi dialect. We employed an aggressive oversampling strategy (50x) for rare dialects to ensure the model treats all regions equally.
- 🛡️ Najdi: 33% (Central Region)
- 🕋 Hijazi: 33% (Western Region)
- 🇸🇦 Standard Saudi: 33% (Formal/Media)
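The balancing step described above can be sketched as follows. This is a hypothetical helper (`balance_by_oversampling` and its inputs are illustrative, not the project's actual pipeline code): each under-represented dialect is oversampled up to the size of the largest one, with the replication factor capped at 50x.

```python
import random
from collections import defaultdict

def balance_by_oversampling(samples, max_factor=50, seed=0):
    """Oversample under-represented dialects so each dialect ends up with
    roughly the same number of rows. `samples` is a list of
    (text, dialect) pairs; the factor is capped at `max_factor` (50x)."""
    rng = random.Random(seed)
    by_dialect = defaultdict(list)
    for text, dialect in samples:
        by_dialect[dialect].append(text)
    # Target every dialect at the size of the largest one
    target = max(len(texts) for texts in by_dialect.values())
    balanced = []
    for dialect, texts in by_dialect.items():
        factor = min(max_factor, -(-target // len(texts)))  # ceil division, capped
        pool = (texts * factor)[:target]
        balanced += [(t, dialect) for t in pool]
    rng.shuffle(balanced)
    return balanced
```

With the 50x cap, a dialect rarer than 1/50th of the majority would still end up slightly under-represented; the corpus described above apparently stays within that bound.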
### 2. Synthetic Error Injection Protocol
To mimic real-world typing behavior, we developed a "Saudi-Calibrated" error injector. Every sample has a 20% probability of containing one of the following error types:
- ⌨️ Keyboard Typos: Adjacency errors based on the specific Arabic keyboard layout (e.g., hitting 'ث' instead of 'ص').
- 🗣️ Phonetic Substitutions: Dialectal sound-alikes (e.g., 'ق' vs 'غ' or 'ذ' vs 'ز').
- 👁️ Visual Substitutions: Shape-based errors (e.g., 'ه' vs 'ة').
- ❌ Deletions & Insertions: Simulating fast typing/missed keystrokes.
- ✂️ Space Stripping: Merged words (e.g., "ياخي" instead of "يا اخي").
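The injection protocol above can be sketched as follows. The substitution tables here are tiny illustrative samples, not the project's actual "Saudi-Calibrated" tables, and the helper name is hypothetical:

```python
import random

# Illustrative sample tables (assumptions, not the real calibration data)
ADJACENT = {"ص": "ث", "ت": "ن"}   # keyboard-neighbour swaps
PHONETIC = {"ق": "غ", "ذ": "ز"}   # dialectal sound-alikes
VISUAL = {"ة": "ه"}               # shape-based confusions

def inject_error(sentence, p_error=0.20, rng=random):
    """With probability p_error, apply one randomly chosen error type."""
    if rng.random() >= p_error:
        return sentence  # 80% of samples stay clean
    kind = rng.choice(["keyboard", "phonetic", "visual", "deletion", "space_strip"])
    if kind == "space_strip":
        return sentence.replace(" ", "", 1)  # merge one word pair
    if kind == "deletion" and len(sentence) > 1:
        i = rng.randrange(len(sentence))
        return sentence[:i] + sentence[i + 1:]
    table = {"keyboard": ADJACENT, "phonetic": PHONETIC, "visual": VISUAL}[kind]
    for src, dst in table.items():
        if src in sentence:
            return sentence.replace(src, dst, 1)
    return sentence  # no applicable character found
```

A real injector would also cover insertions and use a full keyboard-adjacency map for the Arabic layout.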
## 🚀 Usage

You can use this model directly with the Hugging Face `transformers` library.

### Installation

```bash
pip install transformers torch
```
### Inference Code

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the model and tokenizer
model_name = "NAMAA-Space/SaudiSpell-AraT5"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def correct_text(text):
    # IMPORTANT: the model was trained with the prefix "correct: "
    input_text = "correct: " + text
    inputs = tokenizer(input_text, return_tensors="pt", max_length=128, truncation=True)
    # Generate the correction
    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        num_beams=5,  # beam search for higher-quality output
        early_stopping=True,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# --- Examples ---

# 1. Hijazi example (phonetic substitution)
print(correct_text("يا واد اشبك مستعجل كدا"))
# Expected: "يا واد اشبك مستعجل كذا"

# 2. Najdi example (space stripping)
print(correct_text("ياخي وراك مارديت"))
# Expected: "يا أخي وراك ما رديت"
```
## Test Set & Results

### Test Set Summary

A test set of 500 short Saudi sentences was created (Najdi / Hijazi / White Dialect), with an even distribution of the synthetic error types (20% each): Keyboard, Phonetic, Visual, Substitution, and Space Strip.
### Sample Test Cases

| Error Type | Clean Sentence | Noisy Sentence | Description |
|---|---|---|---|
| Keyboard | السلام عليكم | السلام عليكن | Character swapped with a keyboard-adjacent key |
| Phonetic | هذا قريب من البيت | زا غريب من البيت | Dialectal phonetic shift (ذ→ز, ق→غ) |
| Visual | المدرسة قريبة | المدرسه قريبة | Visual similarity (ة ↔ ه) |
| Substitution | وش مسوي اليوم | وش مسودي اليوم | Random non-phonetic character substitution |
| Space Strip | يا هلا والله | ياهلاوالله | Spaces removed between words |
### Evaluation Results
| Metric | Value |
|---|---|
| Total Samples | 500 |
| Exact Match | 40.0% |
| Average CER | 0.0572 |
| Average WER | 0.1685 |
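For reference, the CER and WER figures above are normalized Levenshtein (edit) distances at the character and word level, respectively. A minimal self-contained sketch (not the project's actual evaluation script):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via dynamic programming over two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref, hyp):
    """Character error rate: edit distance over characters, normalized."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref, hyp):
    """Word error rate: edit distance over whitespace tokens, normalized."""
    ref_words, hyp_words = ref.split(), hyp.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)
```

Production evaluations often use a library such as `jiwer` instead, but the definitions are the same.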
### Results Interpretation

- The Exact Match score (40%) reflects the difficulty of the test set, especially for non-standard dialectal and phonetic errors.
- The low CER (5.7%) shows that most corrections were very close to the reference text, even when they did not match it exactly.
- The WER (16.8%) indicates that most remaining errors occur at the level of a single character or word, not a full breakdown of the sentence structure.

Overall, these results are a positive indicator of the model's ability to correct realistic Saudi text, with room to improve exact-match performance on complex dialectal cases.
## ⚠️ Limitations
- Context Length: Optimized for sentences up to 128 tokens (tweets, chat messages). Longer texts should be split.
- Arabizi: While it has some capability to handle Latin-script Arabic (Arabizi), it is primarily optimized for Arabic script.
- Ultra-Local Slang: Extremely niche slang words not present in the training corpus may be normalized to MSA or Standard Saudi.
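For the context-length limitation above, a rough word-count splitter works as a first pass before calling the corrector on each chunk. The `max_words` threshold is an assumption (roughly two tokens per Arabic word) and should be validated against the actual tokenizer:

```python
def split_for_correction(text, max_words=60):
    """Split long input into chunks short enough for the 128-token limit.

    Word count is only a proxy for token count; for exact limits,
    measure each chunk with the model's tokenizer instead.
    """
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]
```

Splitting on sentence boundaries (e.g. at punctuation) rather than fixed word counts would better preserve the context each correction depends on.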
## 📜 Citation

If you use this model in your research, please cite:

```bibtex
@misc{SaudiSpell2025,
  author       = {NAMAA Community},
  title        = {SaudiSpell-AraT5: A Balanced Saudi Dialect Error Correction Model},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/NAMAA-Space/SaudiSpell-AraT5}}
}
```
## Base Model

Fine-tuned from UBC-NLP/AraT5v2-base-1024.