# NAMAA Saudi Dialect Hub Collection

A unified hub for Saudi Arabic dialect datasets, models, and benchmarks produced by the NAMAA Community.
NAMAA-T5-Saudi2English is a transformer-based Arabic-to-English translation model fine-tuned on Saudi dialectal data (Najdi, Hijazi, Southern, Northern, and Eastern varieties).
It is built on the T5 architecture with embeddings initialized from mBERT, and aims to improve translation quality for the informal, region-specific Arabic commonly used across Saudi Arabia.
| Property | Value |
|---|---|
| Model Type | T5 Encoder–Decoder |
| Base Architecture | T5ForConditionalGeneration (initialized with mBERT embeddings) |
| Languages | Arabic (Saudi dialects) → English |
| Training Data | 120K sentence pairs from Najdi & Hijazi dialects |
| Framework | 🤗 Transformers v4.57.1 |
| License | Apache-2.0 |
| Pipeline Tag | translation |
| Library | transformers |
| Tokenizer | T5Tokenizer |
| Vocabulary Size | 110,208 tokens |
The model was evaluated on 12 Saudi sub-dialects within the NAMAA MT leaderboard.
| Dialect Group | # Examples | BLEU | CHRF | METEOR | BERTScore F1 | Adequacy | Faithfulness | Fluency | Overall |
|---|---|---|---|---|---|---|---|---|---|
| eastern_urban | 54 | 28.36 | 51.66 | 62.10 | 82.75 | 66.57 | 62.31 | 84.81 | 68.15 |
| hijazi_jeddah | 51 | 8.13 | 29.07 | 33.17 | 69.81 | 65.30 | 60.70 | 77.00 | 66.62 |
| hijazi_makkah | 49 | 32.41 | 53.95 | 61.45 | 84.57 | 82.72 | 79.89 | 87.83 | 82.78 |
| hijazi_urban | 50 | 19.75 | 44.65 | 50.85 | 78.21 | 72.24 | 67.04 | 68.98 | 69.12 |
| hijazi_urban_jeddah | 51 | 14.73 | 41.40 | 45.32 | 78.42 | 79.39 | 75.51 | 77.65 | 76.76 |
| najdi_qasim | 51 | 13.76 | 38.00 | 45.94 | 77.25 | 72.40 | 68.60 | 78.80 | 72.46 |
| najdi_riyadh | 51 | 18.44 | 40.08 | 43.73 | 75.24 | 64.12 | 61.27 | 79.12 | 66.59 |
| najdi_urban | 52 | 10.98 | 32.15 | 36.38 | 73.46 | 58.46 | 55.38 | 80.96 | 61.48 |
| northern_hail | 50 | 13.27 | 36.97 | 43.60 | 73.58 | 64.50 | 60.10 | 82.20 | 66.38 |
| southern_asiri | 50 | 13.28 | 38.97 | 29.53 | 68.87 | 42.02 | 39.04 | 60.21 | 43.13 |
| southern_jazan | 50 | 29.78 | 47.67 | 57.03 | 79.62 | 53.09 | 51.06 | 86.17 | 58.45 |
| southern_qahtan | 53 | 19.43 | 42.95 | 50.11 | 77.13 | 59.81 | 55.47 | 77.64 | 61.57 |
Average BLEU: ≈ 19.8 | Average BERTScore F1: ≈ 77.8 | Average Overall: ≈ 68.4
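For reference, an equal-weight macro average over the table rows can be recomputed directly. Note that corpus-level scores over the pooled examples generally differ from the mean of per-group scores, so the leaderboard aggregates need not match this figure exactly:

```python
# Equal-weight macro average of the per-dialect BLEU scores from the table above.
bleu_by_group = {
    "eastern_urban": 28.36, "hijazi_jeddah": 8.13, "hijazi_makkah": 32.41,
    "hijazi_urban": 19.75, "hijazi_urban_jeddah": 14.73, "najdi_qasim": 13.76,
    "najdi_riyadh": 18.44, "najdi_urban": 10.98, "northern_hail": 13.27,
    "southern_asiri": 13.28, "southern_jazan": 29.78, "southern_qahtan": 19.43,
}
macro_bleu = sum(bleu_by_group.values()) / len(bleu_by_group)
print(f"Macro-average BLEU over 12 groups: {macro_bleu:.1f}")
```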
Model configuration (`config.json`):

```json
{
  "architectures": ["T5ForConditionalGeneration"],
  "d_model": 768,
  "num_layers": 12,
  "num_heads": 12,
  "d_ff": 2048,
  "dropout_rate": 0.1,
  "feed_forward_proj": "gated-gelu",
  "tie_word_embeddings": false,
  "vocab_size": 110208,
  "transformers_version": "4.57.1"
}
```
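As a quick back-of-envelope sketch (assuming standard T5 parameter shapes, not figures stated by the model card): because `tie_word_embeddings` is `false`, the input embedding table and the output projection each hold `vocab_size × d_model` parameters:

```python
# Back-of-envelope parameter counts implied by the config above.
vocab_size, d_model = 110208, 768

embedding_params = vocab_size * d_model  # input embedding table
lm_head_params = vocab_size * d_model    # separate output projection (tie_word_embeddings: false)

print(f"Embedding table: {embedding_params:,} parameters")
print(f"LM head:         {lm_head_params:,} parameters")
# → each is 84,639,744 parameters
```

With untied embeddings, the large 110,208-token vocabulary alone accounts for roughly 169M parameters across the two tables.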
Usage with 🤗 Transformers:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/NAMAA-T5-Saudi2English")
model = AutoModelForSeq2SeqLM.from_pretrained("NAMAA-Space/NAMAA-T5-Saudi2English")

text = "وش صار اليوم؟"  # Najdi dialect
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → "What happened today?"
```
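Dialectal input often varies in spelling. As an illustrative preprocessing step (a common cleanup, not a documented requirement of this model), tatweel and short-vowel diacritics can be stripped before tokenization:

```python
import re

# Illustrative normalization for dialectal Arabic input: strip tatweel
# (U+0640) and the harakat diacritics (U+064B-U+0652). This is an
# assumption about useful preprocessing, not part of the model's spec.
_STRIP = re.compile(r"[\u064B-\u0652\u0640]")

def normalize_arabic(text: str) -> str:
    return _STRIP.sub("", text)

print(normalize_arabic("وَش صَار اليَوم؟"))  # → "وش صار اليوم؟"
```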
Citation:

```bibtex
@misc{namaa2025saudi2eng,
  title   = {NAMAA-T5-Saudi2English: Dialect-aware Arabic--English Translation Model},
  author  = {NAMAA Community},
  year    = {2025},
  url     = {https://huggingface.co/NAMAA-Space/NAMAA-T5-Saudi2English},
  license = {Apache-2.0}
}
```