SHAMI-MT: A Machine Translation Model from MSA to Syrian Dialect

This model is based on the paper SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System.


Model Description

SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic.

Model Details

  • Model Type: Sequence-to-Sequence Translation
  • Base Model: UBC-NLP/AraT5v2-base-1024
  • Language: Arabic (MSA → Syrian Dialect)
  • Parameters: ~0.4B (F32)
  • License: Apache 2.0
  • Library: Transformers

Dataset

The model was trained on the Nâbra dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.


Nâbra Dataset Details

Citation:

Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023).
Nâbra: Syrian Arabic dialects with morphological annotations.
arXiv preprint arXiv:2310.17315.

Key Statistics:

  • Size: ~60,000 tokens
  • Dialects Covered: Multiple Syrian regional dialects including:
    • Aleppo
    • Damascus
    • Deir-ezzur
    • Hama
    • Homs
    • Huran
    • Latakia
    • Mardin
    • Raqqah
    • Suwayda

Data Sources:

  • Social media posts
  • Movie and TV series scripts
  • Song lyrics
  • Local proverbs

Training Details

The model was fine-tuned on the AraT5v2-base-1024 architecture with the following training metrics:

  • Total Training Steps: 10,384
  • Epochs: 22
  • Final Training Loss: 1.396
  • Final Evaluation Loss: 0.771
  • Learning Rate: Cosine schedule starting at 5e-5
  • Batch Size: 256
  • Total FLOPs: 1.58e+17
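The cosine learning-rate schedule listed above can be sketched as a small function (a simplified form without the warmup phase that training frameworks often add; the step counts and base rate come from the metrics above):

```python
import math

def cosine_lr(step, total_steps=10_384, base_lr=5e-5):
    # Cosine decay: base_lr at step 0, tapering toward 0 at the final step.
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

print(cosine_lr(0))       # full rate at the start of training
print(cosine_lr(10_384))  # near zero at the end
```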

Training Progress

The model showed consistent improvement throughout training:

  • Initial loss: 12.93 → Final loss: 1.40
  • Evaluation loss steadily decreased from 1.44 to 0.77
  • Gradient norms remained stable throughout training

Usage

Installation

pip install transformers torch

Inference Code

from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")

# Example usage
ar_prompt = "مرحبا بك هنا"  # MSA input ("Welcome here")
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Input (MSA):", ar_prompt)
print("Tokenized input:", tokenizer.tokenize(ar_prompt))
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
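The tokenize-generate-decode steps above can be wrapped in a small reusable helper. This is an illustrative sketch (the helper name translate_msa and the dependency-injected tokenizer/model arguments are our own, not part of the released code); passing generation options through keyword arguments keeps it usable with the parameters shown in the next section:

```python
def translate_msa(text, tokenizer, model, **generate_kwargs):
    """Translate one MSA sentence to Syrian dialect with the given tokenizer/model."""
    # Tokenize the MSA input to tensor ids.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    # Generate the dialectal output, forwarding any generation options.
    outputs = model.generate(input_ids, **generate_kwargs)
    # Decode the first (best) sequence, dropping special tokens.
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

For example: translate_msa(ar_prompt, tokenizer, model, max_length=128, num_beams=4).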

Generation Parameters

For better output quality, you can adjust the generation parameters. Note that combining num_beams with do_sample=True performs beam-sample decoding; drop do_sample (and temperature) if you want deterministic beam search:

outputs = model.generate(
    input_ids,
    max_length=128,
    num_beams=4,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id
)

Evaluation Results

  • Test Set: 1,500 unseen sentences
  • Evaluation Method: GPT-4.1 as automated judge
  • Average Score: 4.01/5.0
  • Evaluation Criteria: Translation quality, dialectal accuracy, and semantic preservation

The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt:

"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
Please return a rating from 0 to 5 and a short comment.

MSA Input: [input sentence]
Model Prediction (Shami dialect): [model output]
Ground Truth (Shami dialect): [reference translation]

Respond in this format:
Score: <number from 0 to 5>
Comment: <brief explanation of the score>"

Score Distribution Analysis:

  • Excellent (5.0): High-quality translations with perfect dialectal conversion
  • Good (4.0-4.9): Minor dialectal variations or stylistic differences
  • Average (3.0-3.9): Acceptable translations with some dialectal inconsistencies
  • Below Average (2.0-2.9): Noticeable errors in dialect or meaning
  • Poor (0-1.9): Significant translation errors or loss of meaning
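Judge replies in the "Score: … / Comment: …" format above can be reduced to this score distribution with a small parser. The evaluation script itself is not published, so this is an illustrative sketch (function names parse_judge_reply and score_bucket are our own):

```python
import re

def parse_judge_reply(reply):
    """Extract the numeric score and comment from a judge reply."""
    score = float(re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply).group(1))
    comment_match = re.search(r"Comment:\s*(.+)", reply)
    comment = comment_match.group(1).strip() if comment_match else ""
    return score, comment

def score_bucket(score):
    """Map a 0-5 score to the bands used in the distribution analysis."""
    if score == 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"
```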

Performance Highlights

  • Strong Dialectal Conversion: Successfully transforms MSA into authentic Syrian dialect
  • Semantic Preservation: Maintains original meaning while adapting linguistic style
  • Regional Adaptability: Handles various Syrian sub-dialects effectively
  • Consistent Quality: Stable performance across different text types and domains

Applications

This model is particularly useful for:

  • Content Localization: Adapting MSA content for Syrian audiences
  • Cultural Preservation: Maintaining and promoting Syrian dialectal variations
  • Educational Tools: Teaching differences between MSA and Syrian dialect
  • Research: Syrian Arabic NLP and dialectology studies

Regional Coverage

The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:

๐Ÿ›๏ธ Urban Centers: Damascus, Aleppo
๐Ÿ”๏ธ Northern Regions: Latakia, Mardin
๐Ÿœ๏ธ Eastern Areas: Deir-ezzur, Raqqah
๐ŸŒ„ Central/Southern: Hama, Homs, Huran, Suwayda

Limitations

  • Trained specifically on Syrian dialect variations
  • Performance may vary for other Arabic dialects
  • Limited to text-based translation (no speech support)
  • Dataset size constraints may affect handling of very rare dialectal expressions

Citation

If you use this model in your research, please cite:

@misc{shami-mt-2024,
  title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
  author={Omartificial Intelligence Space},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
}

@article{nayouf2023nabra,
  title={N\^abra: Syrian Arabic dialects with morphological annotations},
  author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
  journal={arXiv preprint arXiv:2310.17315},
  year={2023}
}

@misc{onajar2025shamiMT,
  title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
  author={Sibaee, Serry and Nacar, Omer},
  year={2025}
}

Contact & Support

For questions, issues, or contributions, please visit the model repository or contact the development team.
