NAMAA-T5-Saudi2English

NAMAA-T5-Saudi2English is a transformer-based Arabic-to-English translation model fine-tuned on Saudi dialectal data (Najdi, Hijazi, Southern, Northern, and Eastern varieties).

It is built on the T5 encoder–decoder architecture with embeddings initialized from multilingual BERT (mBERT), and aims to improve translation quality for the informal, region-specific Arabic commonly used across Saudi Arabia.

Details and Model Config

| Property | Value |
| --- | --- |
| Model Type | T5 Encoder–Decoder |
| Base Architecture | T5ForConditionalGeneration (initialized with mBERT embeddings) |
| Languages | Arabic (Saudi dialects) → English |
| Training Data | 120K sentence pairs from Najdi & Hijazi dialects |
| Framework | 🤗 Transformers v4.57.1 |
| License | Apache-2.0 |
| Pipeline Tag | translation |
| Library | transformers |
| Tokenizer | T5Tokenizer |
| Vocabulary Size | 110,208 tokens |

📊 Evaluation Summary

The model was evaluated on 12 Saudi sub-dialects within the NAMAA MT leaderboard.

| Dialect Group | # Examples | BLEU | chrF | METEOR | BERTScore F1 | Adequacy | Faithfulness | Fluency | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eastern_urban | 54 | 28.36 | 51.66 | 62.10 | 82.75 | 66.57 | 62.31 | 84.81 | 68.15 |
| hijazi_jeddah | 51 | 8.13 | 29.07 | 33.17 | 69.81 | 65.30 | 60.70 | 77.00 | 66.62 |
| hijazi_makkah | 49 | 32.41 | 53.95 | 61.45 | 84.57 | 82.72 | 79.89 | 87.83 | 82.78 |
| hijazi_urban | 50 | 19.75 | 44.65 | 50.85 | 78.21 | 72.24 | 67.04 | 68.98 | 69.12 |
| hijazi_urban_jeddah | 51 | 14.73 | 41.40 | 45.32 | 78.42 | 79.39 | 75.51 | 77.65 | 76.76 |
| najdi_qasim | 51 | 13.76 | 38.00 | 45.94 | 77.25 | 72.40 | 68.60 | 78.80 | 72.46 |
| najdi_riyadh | 51 | 18.44 | 40.08 | 43.73 | 75.24 | 64.12 | 61.27 | 79.12 | 66.59 |
| najdi_urban | 52 | 10.98 | 32.15 | 36.38 | 73.46 | 58.46 | 55.38 | 80.96 | 61.48 |
| northern_hail | 50 | 13.27 | 36.97 | 43.60 | 73.58 | 64.50 | 60.10 | 82.20 | 66.38 |
| southern_asiri | 50 | 13.28 | 38.97 | 29.53 | 68.87 | 42.02 | 39.04 | 60.21 | 43.13 |
| southern_jazan | 50 | 29.78 | 47.67 | 57.03 | 79.62 | 53.09 | 51.06 | 86.17 | 58.45 |
| southern_qahtan | 53 | 19.43 | 42.95 | 50.11 | 77.13 | 59.81 | 55.47 | 77.64 | 61.57 |

Average BLEU: ≈ 18.5 | Average BERTScore F1: ≈ 76.6 | Average Overall: ≈ 66.1 (unweighted macro-averages over the 12 dialect groups)
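As a sanity check, the per-dialect scores can be averaged directly from the table; a minimal recomputation (unweighted macro-average over the 12 groups) in plain Python:

```python
# (BLEU, BERTScore F1, Overall) per dialect group, copied from the table above.
scores = {
    "eastern_urban":       (28.36, 82.75, 68.15),
    "hijazi_jeddah":       ( 8.13, 69.81, 66.62),
    "hijazi_makkah":       (32.41, 84.57, 82.78),
    "hijazi_urban":        (19.75, 78.21, 69.12),
    "hijazi_urban_jeddah": (14.73, 78.42, 76.76),
    "najdi_qasim":         (13.76, 77.25, 72.46),
    "najdi_riyadh":        (18.44, 75.24, 66.59),
    "najdi_urban":         (10.98, 73.46, 61.48),
    "northern_hail":       (13.27, 73.58, 66.38),
    "southern_asiri":      (13.28, 68.87, 43.13),
    "southern_jazan":      (29.78, 79.62, 58.45),
    "southern_qahtan":     (19.43, 77.13, 61.57),
}

# Unweighted macro-average of each metric column.
bleu, bert_f1, overall = (
    round(sum(vals[i] for vals in scores.values()) / len(scores), 2)
    for i in range(3)
)
print(bleu, bert_f1, overall)  # 18.53 76.58 66.12
```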


🧩 Model Configuration

```json
{
  "architectures": ["T5ForConditionalGeneration"],
  "d_model": 768,
  "num_layers": 12,
  "num_heads": 12,
  "d_ff": 2048,
  "dropout_rate": 0.1,
  "feed_forward_proj": "gated-gelu",
  "tie_word_embeddings": false,
  "vocab_size": 110208,
  "transformers_version": "4.57.1"
}
```
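The listed 0.4B parameter count is consistent with this config. A rough estimate from the hyperparameters above (a sketch that ignores layer norms and T5's relative-position bias tables, so it slightly undercounts):

```python
# Hyperparameters from the config above.
d_model, d_ff, n_layers, vocab = 768, 2048, 12, 110_208

embed = vocab * d_model            # input embedding table
lm_head = vocab * d_model          # separate output head ("tie_word_embeddings": false)
attn = 4 * d_model * d_model       # Q, K, V, O projections per attention block
ffn = 3 * d_model * d_ff           # gated-gelu feed-forward: wi_0, wi_1, wo

enc = n_layers * (attn + ffn)              # encoder: self-attention + FFN
dec = n_layers * (2 * attn + ffn)          # decoder: self-attention + cross-attention + FFN
total = embed + lm_head + enc + dec
print(f"~{total / 1e9:.2f}B parameters")   # ~0.37B, consistent with the listed 0.4B
```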

Training Details

  • Base model: T5 encoder–decoder initialized with multilingual BERT (mBERT) embeddings
  • Objective: sequence-to-sequence translation (cross-entropy loss)
  • Optimizer: AdamW (learning rate 3e-4, weight decay 0.01)
  • Batch size: 32
  • Epochs: 10
  • Hardware: 4× NVIDIA A100 (40 GB)
  • Mixed precision: FP16
  • Early stopping: based on validation BLEU
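From these settings, a back-of-the-envelope training schedule. This assumes the batch size of 32 is per device and replicated across the 4 GPUs; the card does not state whether it is per-device or global:

```python
# Dataset and training figures taken from this card.
pairs = 120_000
train = int(pairs * 0.80)                   # 80/10/10 split -> 96,000 training pairs

per_device_batch, gpus = 32, 4              # assumption: 32 is per device
effective_batch = per_device_batch * gpus   # 128 examples per optimizer step

steps_per_epoch = train // effective_batch  # 750 steps per epoch
total_steps = steps_per_epoch * 10          # 10 epochs -> 7,500 optimizer steps
print(steps_per_epoch, total_steps)         # 750 7500
```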

📚 Dataset Description

  • Size: ≈ 120K sentence pairs
  • Source: locally collected Saudi dialect corpora (Najdi and Hijazi)
  • Domains: conversational, cultural, social media, and spoken language data
  • Data cleaning: automatic normalization + manual review
  • Split: 80/10/10 (train/validation/test)
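The "automatic normalization" step is not specified. Below is a hypothetical sketch of the kind of normalization commonly applied to dialectal Arabic (diacritic stripping, tatweel removal, alef unification); the exact rules NAMAA used are not documented:

```python
import re

# Strip Arabic short-vowel marks (fathatan .. sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize_arabic(text: str) -> str:
    """Illustrative normalization; not the card's actual pipeline."""
    text = DIACRITICS.sub("", text)           # remove diacritics
    text = text.replace("\u0640", "")         # remove tatweel (kashida)
    for alef in ("\u0622", "\u0623", "\u0625"):
        text = text.replace(alef, "\u0627")   # unify alef variants to bare alef
    return text

print(normalize_arabic("أَهْلاً"))  # -> اهلا
```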

🚀 Intended Use

Primary:

  • Translate Saudi Arabic text (including dialectal social media, spoken data, and local phrases) into English.

Secondary:

  • Support downstream NLP tasks such as summarization, cross-lingual retrieval, and alignment evaluation.

Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/NAMAA-T5-Saudi2English")
model = AutoModelForSeq2SeqLM.from_pretrained("NAMAA-Space/NAMAA-T5-Saudi2English")

text = "وش صار اليوم؟"  # Najdi dialect
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → "What happened today?"
```

Citation

```bibtex
@misc{namaa2025saudi2eng,
  title   = {NAMAA-T5-Saudi2English: Dialect-aware Arabic→English Translation Model},
  author  = {NAMAA Community},
  year    = {2025},
  url     = {https://huggingface.co/NAMAA-Space/NAMAA-T5-Saudi2English},
  license = {Apache-2.0}
}
```
Model size: 0.4B parameters · F32 · Safetensors