---
base_model:
- UBC-NLP/AraT5v2-base-1024
language:
- ar
library_name: transformers
license: apache-2.0
metrics:
- bleu
pipeline_tag: translation
tags:
- Syrian
- Shami
- MT
- MSA
- Dialect
- ArabicNLP
---
# SHAMI-MT : A Machine Translation Model From MSA to Syrian Dialect
This model is based on the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268).
![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/eyHzopOleQcVFz9LkO6Nv.png)
## Model Description
SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic.
## Model Details
- **Model Type**: Sequence-to-Sequence Translation
- **Base Model**: UBC-NLP/AraT5v2-base-1024
- **Language**: Arabic (MSA → Syrian Dialect)
- **License**: Apache 2.0
- **Library**: Transformers
## Dataset
The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/AaN6gPticioHBTXdPsroy.png)
### Nâbra Dataset Details
**Citation:**
```
Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023).
Nâbra: Syrian Arabic dialects with morphological annotations.
arXiv preprint arXiv:2310.17315.
```
**Key Statistics:**
- **Size**: ~60,000 words
- **Dialects Covered**: Multiple Syrian regional dialects including:
- Aleppo
- Damascus
- Deir-ezzur
- Hama
- Homs
- Huran
- Latakia
- Mardin
- Raqqah
- Suwayda
**Data Sources:**
- Social media posts
- Movie and TV series scripts
- Song lyrics
- Local proverbs
## Training Details
The model was fine-tuned on the AraT5v2-base-1024 architecture with the following training metrics:
- **Total Training Steps**: 10,384
- **Epochs**: 22
- **Final Training Loss**: 1.396
- **Final Evaluation Loss**: 0.771
- **Learning Rate**: Cosine schedule starting at 5e-5
- **Batch Size**: 256
- **Total FLOPs**: 1.58e+17
### Training Progress
The model showed consistent improvement throughout training:
- Initial loss: 12.93 → Final loss: 1.40
- Evaluation loss steadily decreased from 1.44 to 0.77
- Gradient norms remained stable throughout training
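The cosine schedule noted above can be sketched in a few lines. This is an illustrative formula only (it assumes decay to zero with no warmup, which the card does not specify), not the exact trainer configuration:

```python
import math

def cosine_lr(step: int, total_steps: int = 10_384, base_lr: float = 5e-5) -> float:
    """Cosine-decayed learning rate: starts at base_lr, decays to 0 at total_steps."""
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

print(cosine_lr(0))       # peak rate at the start (5e-5)
print(cosine_lr(5_192))   # half the peak at the midpoint
print(cosine_lr(10_384))  # decays to ~0 at the final step
```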
## Usage
### Installation
```bash
pip install transformers torch
```
### Inference Code
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")

# Example usage
ar_prompt = "مرحبا بك هنا"  # MSA input
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)

print("Input (MSA):", ar_prompt)
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
### Generation Parameters
For optimal results, you can adjust generation parameters:
```python
outputs = model.generate(
    input_ids,
    max_length=128,
    num_beams=4,          # beam search for higher-quality candidates
    temperature=0.7,      # only takes effect when do_sample=True
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
```
### Evaluation Results
- **Test Set**: 1,500 unseen sentences
- **Evaluation Method**: GPT-4.1 as automated judge
- **Average Score**: **4.01/5.0**
- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation
The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt:
```
"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
Please return a rating from 0 to 5 and a short comment.
MSA Input: [input sentence]
Model Prediction (Shami dialect): [model output]
Ground Truth (Shami dialect): [reference translation]
Respond in this format:
Score: <number from 0 to 5>
Comment: <brief explanation of the score>"
```
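Replies in this format can be parsed mechanically. The helper below is a hypothetical sketch (not part of any released evaluation code) that extracts the score and comment with a regular expression:

```python
import re

def parse_judge_reply(reply: str):
    """Extract (score, comment) from a 'Score: <n> / Comment: <text>' reply.
    Returns None when the reply does not contain a parsable score."""
    score_m = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
    comment_m = re.search(r"Comment:\s*(.+)", reply)
    if not score_m:
        return None  # malformed reply; caller can skip or retry
    comment = comment_m.group(1).strip() if comment_m else ""
    return float(score_m.group(1)), comment

reply = "Score: 4.5\nComment: Natural Shami phrasing, minor lexical drift."
score, comment = parse_judge_reply(reply)
```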
**Score Distribution Analysis:**
- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion
- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences
- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies
- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning
- **Poor (0-1.9)**: Significant translation errors or loss of meaning
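The banding above reduces to a simple threshold function over parsed scores. The values in this sketch are illustrative placeholders, not actual evaluation data:

```python
from collections import Counter

def bin_score(score: float) -> str:
    """Map a 0-5 judge score to the quality bands defined above."""
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"

scores = [5.0, 4.2, 3.8, 4.7, 2.5]  # illustrative values only
dist = Counter(bin_score(s) for s in scores)
avg = sum(scores) / len(scores)
```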
### Performance Highlights
- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect
- **Semantic Preservation**: Maintains original meaning while adapting linguistic style
- **Regional Adaptability**: Handles various Syrian sub-dialects effectively
- **Consistent Quality**: Stable performance across different text types and domains
## Applications
This model is particularly useful for:
- **Content Localization**: Adapting MSA content for Syrian audiences
- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations
- **Educational Tools**: Teaching differences between MSA and Syrian dialect
- **Research**: Syrian Arabic NLP and dialectology studies
## Regional Coverage
The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:
๐Ÿ›๏ธ **Urban Centers**: Damascus, Aleppo
๐Ÿ”๏ธ **Northern Regions**: Latakia, Mardin
๐Ÿœ๏ธ **Eastern Areas**: Deir-ezzur, Raqqah
๐ŸŒ„ **Central/Southern**: Hama, Homs, Huran, Suwayda
## Limitations
- Trained specifically on Syrian dialect variations
- Performance may vary for other Arabic dialects
- Limited to text-based translation (no speech support)
- Dataset size constraints may affect handling of very rare dialectal expressions
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{shami-mt-2024,
title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
author={Omartificial Intelligence Space},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
}
@article{nayouf2023nabra,
title={Nâbra: Syrian Arabic dialects with morphological annotations},
author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
journal={arXiv preprint arXiv:2310.17315},
year={2023}
}
@misc{onajar2025shamiMT,
title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
author={Sibaee, Serry and Nacar, Omer},
year={2025}
}
```
## Contact & Support
For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.