Omartificial-Intelligence-Space/Shami-MT

 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/8FzZPY8o9cqrMVHb4ubD4.png)
+## Model Description
+SHAMI-MT is a machine translation model that converts Modern Standard Arabic (MSA) into Syrian dialect. It is fine-tuned from the UBC-NLP/AraT5v2-base-1024 checkpoint and targets the gap between formal Arabic and the regional varieties of Syrian Arabic.
+## Model Details
+- **Model Type**: Sequence-to-Sequence Translation
+- **Base Model**: UBC-NLP/AraT5v2-base-1024
+- **Language**: Arabic (MSA → Syrian Dialect)
+- **License**: Apache 2.0
+- **Library**: Transformers
+## Dataset
+The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.
+![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/AaN6gPticioHBTXdPsroy.png)
+### Nâbra Dataset Details
+**Citation:**
+```
+Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023).
+Nâbra: Syrian Arabic dialects with morphological annotations.
+arXiv preprint arXiv:2310.17315.
+```
+**Key Statistics:**
+- **Size**: ~60,000 words
+- **Dialects Covered**: Multiple Syrian regional dialects including:
+  - Aleppo
+  - Damascus
+  - Deir-ezzur
+  - Hama
+  - Homs
+  - Huran
+  - Latakia
+  - Mardin
+  - Raqqah
+  - Suwayda
+**Data Sources:**
+- Social media posts
+- Movie and TV series scripts
+- Song lyrics
+- Local proverbs
+## Training Details
+The model was fine-tuned from the AraT5v2-base-1024 checkpoint. The training run reported the following metrics:
+- **Total Training Steps**: 10,384
+- **Epochs**: 22
+- **Final Training Loss**: 1.396
+- **Final Evaluation Loss**: 0.771
+- **Learning Rate**: Cosine schedule starting at 5e-5
+- **Batch Size**: 256
+- **Total FLOPs**: 1.58e+17
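To make the learning-rate entry concrete, here is a minimal sketch of a cosine schedule using the step count and peak rate listed above. The card does not say whether warmup was used or what the final learning rate was, so the sketch assumes no warmup and a decay to zero; `cosine_lr` is a hypothetical helper, not the training code.

```python
import math

PEAK_LR = 5e-5        # from the card: cosine schedule starting at 5e-5
TOTAL_STEPS = 10_384  # from the card: total training steps
EPOCHS = 22           # from the card: number of epochs


def cosine_lr(step: int, peak_lr: float = PEAK_LR, total_steps: int = TOTAL_STEPS) -> float:
    """Cosine decay from peak_lr down to 0 (assumption: no warmup, floor of 0)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# 10,384 steps over 22 epochs works out to 472 optimizer steps per epoch.
steps_per_epoch = TOTAL_STEPS // EPOCHS

if __name__ == "__main__":
    for epoch in (0, 11, 22):
        step = epoch * steps_per_epoch
        print(f"epoch {epoch:2d} (step {step:5d}): lr = {cosine_lr(step):.2e}")
```

Under these assumptions the rate is halved at the midpoint of training and reaches zero at the last step.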
+### Training Progress
+The model showed consistent improvement throughout training:
+- Initial loss: 12.93 → Final loss: 1.40
+- Evaluation loss steadily decreased from 1.44 to 0.77
+- Gradient norms remained stable throughout training
+## Usage
+### Installation
+```bash
+pip install transformers torch sentencepiece
+```
+### Inference Code
+```python
+from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
+# Load the fine-tuned model and its SentencePiece tokenizer from the Hub
+model_id = "Omartificial-Intelligence-Space/Shami-MT"
+tokenizer = T5Tokenizer.from_pretrained(model_id)
+model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
+# Example usage
+ar_prompt = "مرحبا بك هنا"  # MSA input
+inputs = tokenizer(ar_prompt, return_tensors="pt")
+# Pass the attention mask together with the input IDs, and set an explicit
+# length budget so longer sentences are not cut off by the default limit
+outputs = model.generate(**inputs, max_new_tokens=128)
+print("Input (MSA):", ar_prompt)
+print("Tokenized input:", tokenizer.tokenize(ar_prompt))
+print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
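For more than one sentence, the same calls can be wrapped in a small batching helper. The following is a sketch only: `chunks` and `translate_batch` are hypothetical names, and the batch size, beam count, and length budget are illustrative assumptions rather than settings recommended by the authors. The helper expects the `tokenizer` and `model` objects loaded as in the snippet above.

```python
from typing import Iterator, List, Sequence


def chunks(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def translate_batch(
    sentences: Sequence[str],
    tokenizer,
    model,
    batch_size: int = 8,        # assumption: adjust to available memory
    max_new_tokens: int = 128,  # assumption: generous budget for one sentence
    num_beams: int = 4,         # assumption: deterministic beam search
) -> List[str]:
    """Translate MSA sentences to Syrian dialect, one padded batch at a time."""
    translations: List[str] = []
    for batch in chunks(sentences, batch_size):
        # padding=True pads every sentence to the longest one in the batch
        inputs = tokenizer(batch, return_tensors="pt", padding=True)
        outputs = model.generate(
            **inputs, max_new_tokens=max_new_tokens, num_beams=num_beams
        )
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

A call would look like `translate_batch(["مرحبا بك هنا"], tokenizer, model)`, which returns one output string per input sentence in the original order.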
+### Generation Parameters
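+Decoding behaviour can be adjusted through the arguments of `generate()`. The example below combines beam search with sampling; `temperature` only has an effect because `do_sample=True`, and setting `do_sample=False` gives deterministic beam search instead: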
+```python
+outputs = model.generate(
+    input_ids,
+    max_length=128,
+    num_beams=4,
+    temperature=0.7,
+    do_sample=True,
+    pad_token_id=tokenizer.pad_token_id,
+    eos_token_id=tokenizer.eos_token_id
+)
+```
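Because sampling makes the output vary between runs, it can help to keep a deterministic preset next to a sampling one. The sketch below defines two such presets as plain dictionaries. All values are illustrative assumptions, not tuned settings from the authors, and `generation_kwargs` is a hypothetical helper.

```python
from typing import Any, Dict

# Deterministic beam search: the same input always yields the same output.
BEAM_SEARCH_KWARGS: Dict[str, Any] = {
    "max_new_tokens": 128,
    "num_beams": 4,
    "do_sample": False,
    "early_stopping": True,
}

# Pure sampling: more varied phrasing, different on every call.
SAMPLING_KWARGS: Dict[str, Any] = {
    "max_new_tokens": 128,
    "num_beams": 1,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
}


def generation_kwargs(deterministic: bool = True) -> Dict[str, Any]:
    """Return a copy of the chosen preset so callers can modify it safely."""
    return dict(BEAM_SEARCH_KWARGS if deterministic else SAMPLING_KWARGS)
```

The chosen preset can then be unpacked into the call, for example `model.generate(input_ids, **generation_kwargs())`.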
+## Evaluation Results
+- **Test Set**: 1,500 unseen sentences
+- **Evaluation Method**: GPT-4.1 as automated judge
+- **Average Score**: **4.01/5.0** ⭐
+- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation
+The judge received the following structured prompt for each test sentence:
+```
+"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
+Please return a rating from 0 to 5 and a short comment.
+MSA Input: [input sentence]
+Model Prediction (Shami dialect): [model output]
+Ground Truth (Shami dialect): [reference translation]
+Respond in this format:
+Score: <number from 0 to 5>
+Comment: <brief explanation of the score>"
+```
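As an illustration of how this prompt could be used programmatically, the sketch below fills the template and parses a reply in the requested `Score:` / `Comment:` format. The call to the judge model is deliberately left out, since the card does not describe the client code that was used. `build_judge_prompt` and `parse_judge_reply` are hypothetical helpers, and rejecting out-of-range scores is an assumption.

```python
import re
from typing import Optional, Tuple

# Template text taken from the prompt above; the three placeholders are filled per example.
JUDGE_TEMPLATE = (
    "You are a language evaluation assistant. "
    "Compare the predicted Shami sentence to the reference.\n"
    "Please return a rating from 0 to 5 and a short comment.\n"
    "MSA Input: {msa}\n"
    "Model Prediction (Shami dialect): {prediction}\n"
    "Ground Truth (Shami dialect): {reference}\n"
    "Respond in this format:\n"
    "Score: <number from 0 to 5>\n"
    "Comment: <brief explanation of the score>"
)

_SCORE_RE = re.compile(r"Score:\s*([0-9]+(?:\.[0-9]+)?)")
_COMMENT_RE = re.compile(r"Comment:\s*(.*)", re.DOTALL)


def build_judge_prompt(msa: str, prediction: str, reference: str) -> str:
    """Fill the judge template for a single test example."""
    return JUDGE_TEMPLATE.format(msa=msa, prediction=prediction, reference=reference)


def parse_judge_reply(reply: str) -> Optional[Tuple[float, str]]:
    """Return (score, comment), or None if no score in the 0-5 range is found."""
    score_match = _SCORE_RE.search(reply)
    if score_match is None:
        return None
    score = float(score_match.group(1))
    if not 0.0 <= score <= 5.0:
        return None
    comment_match = _COMMENT_RE.search(reply)
    comment = comment_match.group(1).strip() if comment_match else ""
    return score, comment
```

Returning `None` for malformed replies lets the caller decide whether to retry the judge or drop the example from the average.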
+**Score Bands:**
+- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion
+- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences
+- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies
+- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning
+- **Poor (0-1.9)**: Significant translation errors or loss of meaning
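The bands above can be applied to a list of judge scores with a few lines of code. This is a sketch with hypothetical helper names (`score_band`, `summarize`); any scores passed to it in examples are made up for illustration and are not the model's actual results.

```python
from collections import Counter
from statistics import mean
from typing import Dict, Sequence, Tuple


def score_band(score: float) -> str:
    """Map a judge score in [0, 5] to the band names listed above."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score out of range: {score}")
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"


def summarize(scores: Sequence[float]) -> Tuple[float, Dict[str, int]]:
    """Return the average score and the number of examples in each band."""
    if not scores:
        raise ValueError("no scores to summarize")
    counts = Counter(score_band(s) for s in scores)
    return mean(scores), dict(counts)
```

For instance, the reported average of 4.01 falls in the "Good" band under this mapping.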
+### Performance Highlights
+- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect
+- **Semantic Preservation**: Maintains original meaning while adapting linguistic style
+- **Regional Adaptability**: Handles various Syrian sub-dialects effectively
+- **Consistent Quality**: Stable performance across different text types and domains
+## Applications
+This model is particularly useful for:
+- **Content Localization**: Adapting MSA content for Syrian audiences
+- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations
+- **Educational Tools**: Teaching differences between MSA and Syrian dialect
+- **Research**: Syrian Arabic NLP and dialectology studies
+## Regional Coverage
+The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:
+- 🏛️ **Urban Centers**: Damascus, Aleppo
+- 🏔️ **Northern Regions**: Latakia, Mardin
+- 🏜️ **Eastern Areas**: Deir-ezzur, Raqqah
+- 🌄 **Central/Southern**: Hama, Homs, Huran, Suwayda
+## Limitations
+- Trained specifically on Syrian dialect variations
+- Performance may vary for other Arabic dialects
+- Limited to text-based translation (no speech support)
+- Dataset size constraints may affect handling of very rare dialectal expressions
+## Citation
+If you use this model in your research, please cite:
+```bibtex
+@misc{shami-mt-2024,
+  title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
+  author={Omartificial Intelligence Space},
+  year={2024},
+  publisher={Hugging Face},
+  url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
+}
+@article{nayouf2023nabra,
+  title={Nâbra: Syrian Arabic dialects with morphological annotations},
+  author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
+  journal={arXiv preprint arXiv:2310.17315},
+  year={2023}
+}
+```
+## Contact & Support
+For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.