NAMAA-T5-Saudi2English

NAMAA-T5-Saudi2English is a transformer-based Arabic-to-English translation model fine-tuned on Saudi dialectal data (Najdi, Hijazi, Southern, Northern, and Eastern varieties).

It is built on the T5 encoder–decoder architecture with embeddings initialized from multilingual BERT (mBERT), and aims to improve translation quality for the informal, region-specific Arabic commonly used across Saudi Arabia.

Details and Model Config

| Property | Value |
| --- | --- |
| Model Type | T5 Encoder–Decoder |
| Base Architecture | T5ForConditionalGeneration (initialized with mBERT embeddings) |
| Languages | Arabic (Saudi dialects) → English |
| Training Data | 120K sentence pairs from Najdi & Hijazi dialects |
| Framework | 🤗 Transformers v4.57.1 |
| License | Apache-2.0 |
| Pipeline Tag | translation |
| Library | transformers |
| Tokenizer | T5Tokenizer |
| Vocabulary Size | 110,208 tokens |

📊 Evaluation Summary

The model was evaluated on 12 Saudi sub-dialects within the NAMAA MT leaderboard.

| Dialect Group | # Examples | BLEU | chrF | METEOR | BERTScore F1 | Adequacy | Faithfulness | Fluency | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| eastern_urban | 54 | 28.36 | 51.66 | 62.10 | 82.75 | 66.57 | 62.31 | 84.81 | 68.15 |
| hijazi_jeddah | 51 | 8.13 | 29.07 | 33.17 | 69.81 | 65.30 | 60.70 | 77.00 | 66.62 |
| hijazi_makkah | 49 | 32.41 | 53.95 | 61.45 | 84.57 | 82.72 | 79.89 | 87.83 | 82.78 |
| hijazi_urban | 50 | 19.75 | 44.65 | 50.85 | 78.21 | 72.24 | 67.04 | 68.98 | 69.12 |
| hijazi_urban_jeddah | 51 | 14.73 | 41.40 | 45.32 | 78.42 | 79.39 | 75.51 | 77.65 | 76.76 |
| najdi_qasim | 51 | 13.76 | 38.00 | 45.94 | 77.25 | 72.40 | 68.60 | 78.80 | 72.46 |
| najdi_riyadh | 51 | 18.44 | 40.08 | 43.73 | 75.24 | 64.12 | 61.27 | 79.12 | 66.59 |
| najdi_urban | 52 | 10.98 | 32.15 | 36.38 | 73.46 | 58.46 | 55.38 | 80.96 | 61.48 |
| northern_hail | 50 | 13.27 | 36.97 | 43.60 | 73.58 | 64.50 | 60.10 | 82.20 | 66.38 |
| southern_asiri | 50 | 13.28 | 38.97 | 29.53 | 68.87 | 42.02 | 39.04 | 60.21 | 43.13 |
| southern_jazan | 50 | 29.78 | 47.67 | 57.03 | 79.62 | 53.09 | 51.06 | 86.17 | 58.45 |
| southern_qahtan | 53 | 19.43 | 42.95 | 50.11 | 77.13 | 59.81 | 55.47 | 77.64 | 61.57 |

Average BLEU: ≈ 18.5 | Average BERTScore F1: ≈ 76.6 | Average Overall: ≈ 66.1 (unweighted macro-averages over the 12 dialect groups)
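As a sanity check, the per-dialect scores can be averaged directly from the table; a minimal recomputation (unweighted macro-average over the 12 groups) in plain Python:

```python
# (BLEU, BERTScore F1, Overall) per dialect group, copied from the table above.
scores = {
    "eastern_urban":       (28.36, 82.75, 68.15),
    "hijazi_jeddah":       ( 8.13, 69.81, 66.62),
    "hijazi_makkah":       (32.41, 84.57, 82.78),
    "hijazi_urban":        (19.75, 78.21, 69.12),
    "hijazi_urban_jeddah": (14.73, 78.42, 76.76),
    "najdi_qasim":         (13.76, 77.25, 72.46),
    "najdi_riyadh":        (18.44, 75.24, 66.59),
    "najdi_urban":         (10.98, 73.46, 61.48),
    "northern_hail":       (13.27, 73.58, 66.38),
    "southern_asiri":      (13.28, 68.87, 43.13),
    "southern_jazan":      (29.78, 79.62, 58.45),
    "southern_qahtan":     (19.43, 77.13, 61.57),
}

# Unweighted macro-average of each metric column.
bleu, bert_f1, overall = (
    round(sum(vals[i] for vals in scores.values()) / len(scores), 2)
    for i in range(3)
)
print(bleu, bert_f1, overall)  # 18.53 76.58 66.12
```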


🧩 Model Configuration

```json
{
  "architectures": ["T5ForConditionalGeneration"],
  "d_model": 768,
  "num_layers": 12,
  "num_heads": 12,
  "d_ff": 2048,
  "dropout_rate": 0.1,
  "feed_forward_proj": "gated-gelu",
  "tie_word_embeddings": false,
  "vocab_size": 110208,
  "transformers_version": "4.57.1"
}
```
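The listed 0.4B parameter count is consistent with this config. A rough estimate from the hyperparameters above (a sketch that ignores layer norms and T5's relative-position bias tables, so it slightly undercounts):

```python
# Hyperparameters from the config above.
d_model, d_ff, n_layers, vocab = 768, 2048, 12, 110_208

embed = vocab * d_model            # input embedding table
lm_head = vocab * d_model          # separate output head ("tie_word_embeddings": false)
attn = 4 * d_model * d_model       # Q, K, V, O projections per attention block
ffn = 3 * d_model * d_ff           # gated-gelu feed-forward: wi_0, wi_1, wo

enc = n_layers * (attn + ffn)              # encoder: self-attention + FFN
dec = n_layers * (2 * attn + ffn)          # decoder: self-attention + cross-attention + FFN
total = embed + lm_head + enc + dec
print(f"~{total / 1e9:.2f}B parameters")   # ~0.37B, consistent with the listed 0.4B
```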

Training Details

  • Base model: T5 encoder–decoder initialized with multilingual BERT (mBERT) embeddings
  • Objective: sequence-to-sequence translation (cross-entropy loss)
  • Optimizer: AdamW (learning rate 3e-4, weight decay 0.01)
  • Batch size: 32
  • Epochs: 10
  • Hardware: 4× NVIDIA A100 (40 GB)
  • Mixed precision: FP16
  • Early stopping: based on validation BLEU
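From these settings, a back-of-the-envelope training schedule. This assumes the batch size of 32 is per device and replicated across the 4 GPUs; the card does not state whether it is per-device or global:

```python
# Dataset and training figures taken from this card.
pairs = 120_000
train = int(pairs * 0.80)                   # 80/10/10 split -> 96,000 training pairs

per_device_batch, gpus = 32, 4              # assumption: 32 is per device
effective_batch = per_device_batch * gpus   # 128 examples per optimizer step

steps_per_epoch = train // effective_batch  # 750 steps per epoch
total_steps = steps_per_epoch * 10          # 10 epochs -> 7,500 optimizer steps
print(steps_per_epoch, total_steps)         # 750 7500
```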

📚 Dataset Description

  • Size: ≈ 120K sentence pairs
  • Source: locally collected Saudi dialect corpora (Najdi and Hijazi)
  • Domains: conversational, cultural, social media, and spoken language data
  • Data cleaning: automatic normalization + manual review
  • Split: 80/10/10 (train/validation/test)
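The "automatic normalization" step is not specified. Below is a hypothetical sketch of the kind of normalization commonly applied to dialectal Arabic (diacritic stripping, tatweel removal, alef unification); the exact rules NAMAA used are not documented:

```python
import re

# Strip Arabic short-vowel marks (fathatan .. sukun).
DIACRITICS = re.compile(r"[\u064B-\u0652]")

def normalize_arabic(text: str) -> str:
    """Illustrative normalization; not the card's actual pipeline."""
    text = DIACRITICS.sub("", text)           # remove diacritics
    text = text.replace("\u0640", "")         # remove tatweel (kashida)
    for alef in ("\u0622", "\u0623", "\u0625"):
        text = text.replace(alef, "\u0627")   # unify alef variants to bare alef
    return text

print(normalize_arabic("أَهْلاً"))  # -> اهلا
```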

🚀 Intended Use

Primary:

  • Translate Saudi Arabic text (including dialectal social media, spoken data, and local phrases) into English.

Secondary:

  • Support downstream NLP tasks such as summarization, cross-lingual retrieval, and alignment evaluation.

Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("NAMAA-Space/NAMAA-T5-Saudi2English")
model = AutoModelForSeq2SeqLM.from_pretrained("NAMAA-Space/NAMAA-T5-Saudi2English")

text = "وش صار اليوم؟"  # Najdi dialect
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# → "What happened today?"
```

Citation

```bibtex
@misc{namaa2025saudi2eng,
  title   = {NAMAA-T5-Saudi2English: Dialect-aware Arabic→English Translation Model},
  author  = {NAMAA Community},
  year    = {2025},
  url     = {https://huggingface.co/NAMAA-Space/NAMAA-T5-Saudi2English},
  license = {Apache-2.0}
}
```
Model size: 0.4B parameters · F32 · Safetensors