|
|
--- |
|
|
base_model: |
|
|
- UBC-NLP/AraT5v2-base-1024 |
|
|
language: |
|
|
- ar |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
metrics: |
|
|
- bleu |
|
|
pipeline_tag: translation |
|
|
tags: |
|
|
- Syrian |
|
|
- Shami |
|
|
- MT |
|
|
- MSA |
|
|
- Dialect |
|
|
- ArabicNLP |
|
|
--- |
|
|
|
|
|
# SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect
|
|
|
|
|
This model is based on the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268). |
|
|
|
|
|
 |
|
|
|
|
|
## Model Description |
|
|
|
|
|
SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type**: Sequence-to-Sequence Translation |
|
|
- **Base Model**: UBC-NLP/AraT5v2-base-1024 |
|
|
- **Language**: Arabic (MSA → Syrian Dialect)
|
|
- **License**: Apache 2.0 |
|
|
- **Library**: Transformers |
|
|
|
|
|
## Dataset |
|
|
|
|
|
The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
### Nâbra Dataset Details
|
|
|
|
|
**Citation:** |
|
|
``` |
|
|
Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023). |
|
|
Nâbra: Syrian Arabic dialects with morphological annotations.
|
|
arXiv preprint arXiv:2310.17315. |
|
|
``` |
|
|
|
|
|
**Key Statistics:** |
|
|
- **Size**: ~60,000 words
|
|
- **Dialects Covered**: Multiple Syrian regional dialects including: |
|
|
- Aleppo |
|
|
- Damascus |
|
|
- Deir-ezzur |
|
|
- Hama |
|
|
- Homs |
|
|
- Huran |
|
|
- Latakia |
|
|
- Mardin |
|
|
- Raqqah |
|
|
- Suwayda |
|
|
|
|
|
**Data Sources:** |
|
|
- Social media posts |
|
|
- Movie and TV series scripts |
|
|
- Song lyrics |
|
|
- Local proverbs |
|
|
|
|
|
## Training Details |
|
|
|
|
|
The model was fine-tuned from the AraT5v2-base-1024 checkpoint with the following training metrics (a configuration sketch follows the list):
|
|
|
|
|
- **Total Training Steps**: 10,384 |
|
|
- **Epochs**: 22 |
|
|
- **Final Training Loss**: 1.396 |
|
|
- **Final Evaluation Loss**: 0.771 |
|
|
- **Learning Rate**: Cosine schedule starting at 5e-5 |
|
|
- **Batch Size**: 256 |
|
|
- **Total FLOPs**: 1.58e+17 |
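The hyperparameters above roughly correspond to a `Seq2SeqTrainingArguments` configuration like the sketch below. This is a minimal sketch, not the released training script: the dataset object, output directory, and any warmup or weight-decay settings are assumptions.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Mirrors the reported setup: 22 epochs, cosine schedule from 5e-5,
# batch size 256 (possibly reached via gradient accumulation in practice).
args = Seq2SeqTrainingArguments(
    output_dir="shami-mt",
    num_train_epochs=22,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=256,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical tokenized MSA -> Shami pairs
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```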
|
|
|
|
|
### Training Progress |
|
|
|
|
|
The model showed consistent improvement throughout training: |
|
|
- Initial loss: 12.93 → Final loss: 1.40
|
|
- Evaluation loss steadily decreased from 1.44 to 0.77 |
|
|
- Gradient norms remained stable throughout training |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch sentencepiece
|
|
``` |
|
|
|
|
|
### Inference Code |
|
|
|
|
|
```python |
|
|
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM |
|
|
|
|
|
# Load model and tokenizer |
|
|
tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT") |
|
|
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT") |
|
|
|
|
|
# Example usage |
|
|
ar_prompt = "مرحبا بك هنا"  # MSA input
|
|
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids |
|
|
outputs = model.generate(input_ids) |
|
|
|
|
|
print("Input (MSA):", ar_prompt) |
|
|
print("Tokenized input:", tokenizer.tokenize(ar_prompt)) |
|
|
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
### Generation Parameters |
|
|
|
|
|
For optimal results, you can adjust generation parameters: |
|
|
|
|
|
```python |
|
|
outputs = model.generate( |
|
|
input_ids, |
|
|
max_length=128, |
|
|
num_beams=4, |
|
|
temperature=0.7, |
|
|
do_sample=True, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id |
|
|
) |
|
|
``` |
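For batch use, the model and tokenizer loaded earlier can be wrapped in a small helper. This is a convenience sketch: the function name `translate_to_syrian` and its defaults are illustrative, not part of the released API.

```python
def translate_to_syrian(texts, max_length=128, num_beams=4):
    """Translate a list of MSA sentences into Syrian dialect."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example: "How are you today?" in MSA
print(translate_to_syrian(["كيف حالك اليوم؟"]))
```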
|
|
### Evaluation Results |
|
|
- **Test Set**: 1,500 unseen sentences |
|
|
- **Evaluation Method**: GPT-4.1 as an automated judge
|
|
- **Average Score**: **4.01/5.0** ⭐
|
|
- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation |
|
|
|
|
|
The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt: |
|
|
|
|
|
``` |
|
|
"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference. |
|
|
Please return a rating from 0 to 5 and a short comment. |
|
|
|
|
|
MSA Input: [input sentence] |
|
|
Model Prediction (Shami dialect): [model output] |
|
|
Ground Truth (Shami dialect): [reference translation] |
|
|
|
|
|
Respond in this format: |
|
|
Score: <number from 0 to 5> |
|
|
Comment: <brief explanation of the score>" |
|
|
``` |
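A minimal scoring loop built around this prompt might look like the sketch below. It assumes the OpenAI Python client (v1+) and access to a `gpt-4.1` model; the `test_triples` variable and the `Score:` parsing are assumptions based on the template above, not released evaluation code.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TEMPLATE = (
    "You are a language evaluation assistant. Compare the predicted Shami "
    "sentence to the reference.\n"
    "Please return a rating from 0 to 5 and a short comment.\n\n"
    "MSA Input: {msa}\n"
    "Model Prediction (Shami dialect): {pred}\n"
    "Ground Truth (Shami dialect): {ref}\n\n"
    "Respond in this format:\n"
    "Score: <number from 0 to 5>\n"
    "Comment: <brief explanation of the score>"
)

def judge(msa, pred, ref):
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": TEMPLATE.format(msa=msa, pred=pred, ref=ref)}],
    ).choices[0].message.content
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None

# test_triples: hypothetical list of (msa, prediction, reference) tuples
scores = [s for s in (judge(m, p, r) for m, p, r in test_triples) if s is not None]
print("Average score:", sum(scores) / len(scores))
```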
|
|
|
|
|
**Score Distribution Analysis:** |
|
|
- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion |
|
|
- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences |
|
|
- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies |
|
|
- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning |
|
|
- **Poor (0-1.9)**: Significant translation errors or loss of meaning |
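As an illustration only, per-sentence scores from a loop like the one above can be tallied into these bands with a few lines:

```python
from collections import Counter

def band(score):
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"

print(Counter(band(s) for s in [5.0, 4.5, 3.2, 1.0]))
# Counter({'Excellent': 1, 'Good': 1, 'Average': 1, 'Poor': 1})
```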
|
|
|
|
|
### Performance Highlights |
|
|
- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect |
|
|
- **Semantic Preservation**: Maintains original meaning while adapting linguistic style |
|
|
- **Regional Adaptability**: Handles various Syrian sub-dialects effectively |
|
|
- **Consistent Quality**: Stable performance across different text types and domains |
|
|
|
|
|
## Applications |
|
|
|
|
|
This model is particularly useful for: |
|
|
- **Content Localization**: Adapting MSA content for Syrian audiences |
|
|
- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations |
|
|
- **Educational Tools**: Teaching differences between MSA and Syrian dialect |
|
|
- **Research**: Syrian Arabic NLP and dialectology studies |
|
|
|
|
|
## Regional Coverage |
|
|
|
|
|
The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria: |
|
|
|
|
|
- **Urban Centers**: Damascus, Aleppo
- **Northern Regions**: Latakia, Mardin
- **Eastern Areas**: Deir-ezzur, Raqqah
- **Central/Southern**: Hama, Homs, Huran, Suwayda
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained specifically on Syrian dialect variations |
|
|
- Performance may vary for other Arabic dialects |
|
|
- Limited to text-based translation (no speech support) |
|
|
- Dataset size constraints may affect handling of very rare dialectal expressions |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{shami-mt-2024, |
|
|
title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect}, |
|
|
author={Omartificial Intelligence Space}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT} |
|
|
} |
|
|
|
|
|
@article{nayouf2023nabra, |
|
|
title={Nâbra: Syrian Arabic dialects with morphological annotations},
|
|
author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam}, |
|
|
journal={arXiv preprint arXiv:2310.17315}, |
|
|
year={2023} |
|
|
} |
|
|
|
|
|
@misc{onajar2025shamiMT, |
|
|
title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
|
|
author={Sibaee, Serry and Nacar, Omer}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact & Support |
|
|
|
|
|
For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team. |