|
|
--- |
|
|
base_model: |
|
|
- UBC-NLP/AraT5v2-base-1024 |
|
|
language: |
|
|
- ar |
|
|
library_name: transformers |
|
|
license: apache-2.0 |
|
|
metrics: |
|
|
- bleu |
|
|
pipeline_tag: translation |
|
|
tags: |
|
|
- Syrian |
|
|
- Shami |
|
|
- MT |
|
|
- MSA |
|
|
- Dialect |
|
|
- ArabicNLP |
|
|
--- |
|
|
|
|
|
# SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect
|
|
|
|
|
This model is based on the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268). |
|
|
|
|
|
 |
|
|
|
|
|
## Model Description |
|
|
|
|
|
SHAMI-MT is a specialized machine translation model designed to translate from Modern Standard Arabic (MSA) to Syrian dialect. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type**: Sequence-to-Sequence Translation |
|
|
- **Base Model**: UBC-NLP/AraT5v2-base-1024 |
|
|
- **Language**: Arabic (MSA → Syrian Dialect)
|
|
- **License**: Apache 2.0 |
|
|
- **Library**: Transformers |
|
|
|
|
|
## Dataset |
|
|
|
|
|
The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.
|
|
|
|
|
|
|
|
 |
|
|
|
|
|
### Nâbra Dataset Details
|
|
|
|
|
**Citation:** |
|
|
``` |
|
|
Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023). |
|
|
Nâbra: Syrian Arabic dialects with morphological annotations.
|
|
arXiv preprint arXiv:2310.17315. |
|
|
``` |
|
|
|
|
|
**Key Statistics:** |
|
|
- **Size**: ~60,000 words
|
|
- **Dialects Covered**: Multiple Syrian regional dialects including: |
|
|
- Aleppo |
|
|
- Damascus |
|
|
- Deir-ezzur |
|
|
- Hama |
|
|
- Homs |
|
|
- Huran |
|
|
- Latakia |
|
|
- Mardin |
|
|
- Raqqah |
|
|
- Suwayda |
|
|
|
|
|
**Data Sources:** |
|
|
- Social media posts |
|
|
- Movie and TV series scripts |
|
|
- Song lyrics |
|
|
- Local proverbs |
|
|
|
|
|
## Training Details |
|
|
|
|
|
The model was fine-tuned from the AraT5v2-base-1024 checkpoint with the following training metrics (a configuration sketch follows the list):
|
|
|
|
|
- **Total Training Steps**: 10,384 |
|
|
- **Epochs**: 22 |
|
|
- **Final Training Loss**: 1.396 |
|
|
- **Final Evaluation Loss**: 0.771 |
|
|
- **Learning Rate**: Cosine schedule starting at 5e-5 |
|
|
- **Batch Size**: 256 |
|
|
- **Total FLOPs**: 1.58e+17 |
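The hyperparameters above roughly correspond to a `Seq2SeqTrainingArguments` configuration like the sketch below. This is a minimal sketch, not the released training script: the dataset object, output directory, and any warmup or weight-decay settings are assumptions.

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "UBC-NLP/AraT5v2-base-1024"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Mirrors the reported setup: 22 epochs, cosine schedule from 5e-5,
# batch size 256 (possibly reached via gradient accumulation in practice).
args = Seq2SeqTrainingArguments(
    output_dir="shami-mt",
    num_train_epochs=22,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=256,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # hypothetical tokenized MSA -> Shami pairs
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```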
|
|
|
|
|
### Training Progress |
|
|
|
|
|
The model showed consistent improvement throughout training: |
|
|
- Initial loss: 12.93 → Final loss: 1.40
|
|
- Evaluation loss steadily decreased from 1.44 to 0.77 |
|
|
- Gradient norms remained stable throughout training |
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
|
|
|
```bash |
|
|
pip install transformers torch sentencepiece
|
|
``` |
|
|
|
|
|
### Inference Code |
|
|
|
|
|
```python |
|
|
from transformers import T5Tokenizer, AutoModelForSeq2SeqLM |
|
|
|
|
|
# Load model and tokenizer |
|
|
tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT") |
|
|
model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT") |
|
|
|
|
|
# Example usage |
|
|
ar_prompt = "مرحبا بك هنا"  # MSA input
|
|
input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids |
|
|
outputs = model.generate(input_ids) |
|
|
|
|
|
print("Input (MSA):", ar_prompt) |
|
|
print("Tokenized input:", tokenizer.tokenize(ar_prompt)) |
|
|
print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
|
``` |
|
|
|
|
|
### Generation Parameters |
|
|
|
|
|
For optimal results, you can adjust generation parameters: |
|
|
|
|
|
```python |
|
|
outputs = model.generate( |
|
|
input_ids, |
|
|
max_length=128, |
|
|
num_beams=4, |
|
|
temperature=0.7, |
|
|
do_sample=True, |
|
|
pad_token_id=tokenizer.pad_token_id, |
|
|
eos_token_id=tokenizer.eos_token_id |
|
|
) |
|
|
``` |
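For batch use, the model and tokenizer loaded earlier can be wrapped in a small helper. This is a convenience sketch: the function name `translate_to_syrian` and its defaults are illustrative, not part of the released API.

```python
def translate_to_syrian(texts, max_length=128, num_beams=4):
    """Translate a list of MSA sentences into Syrian dialect."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

# Example: "How are you today?" in MSA
print(translate_to_syrian(["كيف حالك اليوم؟"]))
```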
|
|
### Evaluation Results |
|
|
- **Test Set**: 1,500 unseen sentences |
|
|
- **Evaluation Method**: GPT-4.1 as an automated judge
|
|
- **Average Score**: **4.01/5.0** ⭐
|
|
- **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation |
|
|
|
|
|
The model was evaluated using GPT-4.1 as an automated judge with the following structured prompt: |
|
|
|
|
|
``` |
|
|
"You are a language evaluation assistant. Compare the predicted Shami sentence to the reference. |
|
|
Please return a rating from 0 to 5 and a short comment. |
|
|
|
|
|
MSA Input: [input sentence] |
|
|
Model Prediction (Shami dialect): [model output] |
|
|
Ground Truth (Shami dialect): [reference translation] |
|
|
|
|
|
Respond in this format: |
|
|
Score: <number from 0 to 5> |
|
|
Comment: <brief explanation of the score>" |
|
|
``` |
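A minimal scoring loop built around this prompt might look like the sketch below. It assumes the OpenAI Python client (v1+) and access to a `gpt-4.1` model; the `test_triples` variable and the `Score:` parsing are assumptions based on the template above, not released evaluation code.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

TEMPLATE = (
    "You are a language evaluation assistant. Compare the predicted Shami "
    "sentence to the reference.\n"
    "Please return a rating from 0 to 5 and a short comment.\n\n"
    "MSA Input: {msa}\n"
    "Model Prediction (Shami dialect): {pred}\n"
    "Ground Truth (Shami dialect): {ref}\n\n"
    "Respond in this format:\n"
    "Score: <number from 0 to 5>\n"
    "Comment: <brief explanation of the score>"
)

def judge(msa, pred, ref):
    reply = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": TEMPLATE.format(msa=msa, pred=pred, ref=ref)}],
    ).choices[0].message.content
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None

# test_triples: hypothetical list of (msa, prediction, reference) tuples
scores = [s for s in (judge(m, p, r) for m, p, r in test_triples) if s is not None]
print("Average score:", sum(scores) / len(scores))
```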
|
|
|
|
|
**Score Distribution Analysis:** |
|
|
- **Excellent (5.0)**: High-quality translations with perfect dialectal conversion |
|
|
- **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences |
|
|
- **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies |
|
|
- **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning |
|
|
- **Poor (0-1.9)**: Significant translation errors or loss of meaning |
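As an illustration only, per-sentence scores from a loop like the one above can be tallied into these bands with a few lines:

```python
from collections import Counter

def band(score):
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"

print(Counter(band(s) for s in [5.0, 4.5, 3.2, 1.0]))
# Counter({'Excellent': 1, 'Good': 1, 'Average': 1, 'Poor': 1})
```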
|
|
|
|
|
### Performance Highlights |
|
|
- **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect |
|
|
- **Semantic Preservation**: Maintains original meaning while adapting linguistic style |
|
|
- **Regional Adaptability**: Handles various Syrian sub-dialects effectively |
|
|
- **Consistent Quality**: Stable performance across different text types and domains |
|
|
|
|
|
## Applications |
|
|
|
|
|
This model is particularly useful for: |
|
|
- **Content Localization**: Adapting MSA content for Syrian audiences |
|
|
- **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations |
|
|
- **Educational Tools**: Teaching differences between MSA and Syrian dialect |
|
|
- **Research**: Syrian Arabic NLP and dialectology studies |
|
|
|
|
|
## Regional Coverage |
|
|
|
|
|
The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria: |
|
|
|
|
|
- **Urban Centers**: Damascus, Aleppo
- **Northern Regions**: Latakia, Mardin
- **Eastern Areas**: Deir-ezzur, Raqqah
- **Central/Southern**: Hama, Homs, Huran, Suwayda
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Trained specifically on Syrian dialect variations |
|
|
- Performance may vary for other Arabic dialects |
|
|
- Limited to text-based translation (no speech support) |
|
|
- Dataset size constraints may affect handling of very rare dialectal expressions |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model in your research, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{shami-mt-2024, |
|
|
title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect}, |
|
|
author={Omartificial Intelligence Space}, |
|
|
year={2024}, |
|
|
publisher={Hugging Face}, |
|
|
url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT} |
|
|
} |
|
|
|
|
|
@article{nayouf2023nabra, |
|
|
title={Nâbra: Syrian Arabic dialects with morphological annotations},
|
|
author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam}, |
|
|
journal={arXiv preprint arXiv:2310.17315}, |
|
|
year={2023} |
|
|
} |
|
|
|
|
|
@misc{onajar2025shamiMT, |
|
|
title={Shami-MT-2MSA: A Machine Translation from Syrian Dialect to MSA},
|
|
author={Sibaee, Serry and Nacar, Omer}, |
|
|
year={2025} |
|
|
} |
|
|
``` |
|
|
|
|
|
## Contact & Support |
|
|
|
|
|
For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team. |