---
base_model:
- UBC-NLP/AraT5v2-base-1024
language:
- ar
library_name: transformers
license: apache-2.0
metrics:
- bleu
pipeline_tag: translation
tags:
- Syrian
- Shami
- MT
- MSA
- Dialect
- ArabicNLP
---

# SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect

This model is based on the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268).

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/eyHzopOleQcVFz9LkO6Nv.png)

## Model Description

SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic.

## Model Details

- **Model Type**: Sequence-to-Sequence Translation
- **Base Model**: UBC-NLP/AraT5v2-base-1024
- **Language**: Arabic (MSA → Syrian Dialect)
- **License**: Apache 2.0
- **Library**: Transformers

## Dataset

The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/AaN6gPticioHBTXdPsroy.png)

### Nâbra Dataset Details

**Citation:**

```
Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023). Nâbra: Syrian Arabic dialects with morphological annotations. arXiv preprint arXiv:2310.17315.
```

**Key Statistics:**

- **Size**: ~60,000 words
- **Dialects Covered**: Multiple Syrian regional dialects, including:
  - Aleppo
  - Damascus
  - Deir-ezzur
  - Hama
  - Homs
  - Huran
  - Latakia
  - Mardin
  - Raqqah
  - Suwayda

**Data Sources:**

- Social media posts
- Movie and TV series scripts
- Song lyrics
- Local proverbs

## Training Details

The model was fine-tuned from AraT5v2-base-1024 with the following training metrics:

- **Total Training Steps**: 10,384
- **Epochs**: 22
- **Final Training Loss**: 1.396
- **Final Evaluation Loss**: 0.771
- **Learning Rate**: Cosine schedule starting at 5e-5
- **Batch Size**: 256
- **Total FLOPs**: 1.58e+17

### Training Progress

The model showed consistent improvement throughout training:

- Training loss fell from an initial 12.93 to a final 1.40
- Evaluation loss steadily decreased from 1.44 to 0.77
- Gradient norms remained stable throughout training

## Usage

### Installation

```bash
pip install transformers torch
```

### Inference Code

```python
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")

# Example usage
ar_prompt = "مرحبا بك هنا"  # MSA input
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Input (MSA):", ar_prompt)
print("Tokenized input:", tokenizer.tokenize(ar_prompt))
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### Generation Parameters

For better results, you can adjust the generation parameters:

```python
outputs = model.generate(
    input_ids,
    max_length=128,
    num_beams=4,
    temperature=0.7,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
```

### Evaluation Results

- **Test Set**: 1,500 unseen sentences
- **Evaluation Method**: GPT-4.1 as automated judge
- **Average Score**: 
**4.01/5.0** ⭐
- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation

The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt:

```
"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference. Please return a rating from 0 to 5 and a short comment.

MSA Input: [input sentence]
Model Prediction (Shami dialect): [model output]
Ground Truth (Shami dialect): [reference translation]

Respond in this format:
Score:
Comment: "
```

**Score Distribution Analysis:**

- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion
- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences
- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies
- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning
- **Poor (0-1.9)**: Significant translation errors or loss of meaning

### Performance Highlights

- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect
- **Semantic Preservation**: Maintains the original meaning while adapting the linguistic style
- **Regional Adaptability**: Handles various Syrian sub-dialects effectively
- **Consistent Quality**: Stable performance across different text types and domains

## Applications

This model is particularly useful for:

- **Content Localization**: Adapting MSA content for Syrian audiences
- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations
- **Educational Tools**: Teaching the differences between MSA and Syrian dialect
- **Research**: Syrian Arabic NLP and dialectology studies

## Regional Coverage

The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:

- 🏛️ **Urban Centers**: Damascus, Aleppo
- 🏔️ **Northern Regions**: Latakia, Mardin
- 🏜️ **Eastern Areas**: Deir-ezzur, Raqqah
- 🌄 **Central/Southern**: Hama, Homs, Huran, Suwayda

## Limitations

- Trained specifically on Syrian dialect variations
- Performance may vary for other Arabic dialects
- Limited to text-based translation (no speech support)
- Dataset size constraints may affect handling of very rare dialectal expressions

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{shami-mt-2024,
  title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
  author={Omartificial Intelligence Space},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
}

@article{nayouf2023nabra,
  title={Nâbra: Syrian Arabic dialects with morphological annotations},
  author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
  journal={arXiv preprint arXiv:2310.17315},
  year={2023}
}

@misc{onajar2025shamiMT,
  title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
  author={Sibaee, Serry and Nacar, Omer},
  year={2025}
}
```

## Contact & Support

For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.
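## Appendix: Parsing Judge Replies (Illustrative)

The GPT-4.1 judge described in the Evaluation Results section replies in a `Score: / Comment:` format, and the score distribution analysis buckets those scores into bands. A minimal sketch of how such replies could be parsed and bucketed follows; the `parse_judge_reply` and `score_band` helpers are hypothetical illustrations, not part of the released evaluation code.

```python
import re
from statistics import mean

def parse_judge_reply(reply: str) -> float:
    """Extract the numeric rating from a 'Score: X' line in a judge reply."""
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    return float(match.group(1))

def score_band(score: float) -> str:
    """Map a 0-5 judge score to the bands used in the score distribution analysis."""
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"

# Example replies in the judge's response format (contents are made up)
replies = [
    "Score: 4.5\nComment: natural Shami phrasing, minor lexical drift.",
    "Score: 3.0\nComment: meaning preserved but dialect inconsistent.",
]
scores = [parse_judge_reply(r) for r in replies]
print(mean(scores))                      # 3.75
print([score_band(s) for s in scores])   # ['Good', 'Average']
```

Averaging parsed scores this way yields the kind of aggregate figure reported above (4.01/5.0 over 1,500 test sentences).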