Improve model card: Add pipeline tag, paper info, usage, and refine description
This PR significantly enhances the model card for `Omartificial-Intelligence-Space/Shami-MT-2MSA` by:
- Adding `pipeline_tag: translation` to the metadata, so that the model is listed and discoverable as a machine translation model on the Hugging Face Hub.
- Correcting a typo in the main model card title ("ton" to "to").
- Including a direct link to the accompanying paper, [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268), at the top of the card.
- Adding the paper's abstract to provide comprehensive context about the model's design, training, and evaluation.
- Clarifying the model description to explicitly state that `SHAMI-MT-2MSA` is the Syrian dialect to MSA component of the broader `SHAMI-MT` bidirectional system.
- Adding a Python sample usage snippet, demonstrating how to load and use the model with the `transformers` library, which improves immediate usability.
- Refining the citation section to include a proper BibTeX entry for the main paper and removing redundant details.
- Removing the "Model Details" section as its content is either covered by the metadata or the main description.
These improvements make the model card more informative, user-friendly, and aligned with best practices on the Hugging Face Hub.
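The metadata changes listed above can be sanity-checked locally. The following is a minimal sketch, not part of this PR: `parse_front_matter`, `missing_fields`, and `REQUIRED_FIELDS` are hypothetical helper names, the parser only handles the flat scalars and lists that appear in this card's front matter (it is not a general YAML parser), and `SAMPLE_CARD` is an abridged copy of the new front matter containing only the tags visible in the diff.

```python
# Hypothetical helpers for checking a model card's front matter.
# Assumption: the front matter only contains "key: value" scalars and
# flat "- item" lists, as in the card edited by this PR.

REQUIRED_FIELDS = ("pipeline_tag", "license", "base_model")

SAMPLE_CARD = """---
base_model:
- UBC-NLP/AraT5v2-base-1024
language:
- ar
library_name: transformers
license: apache-2.0
metrics:
- bleu
pipeline_tag: translation
tags:
- Syrian
- Shami
- Dialect
- ArabicNLP
---

# SHAMI-MT-2MSA : A Machine Translation Model From Syrian Dialect to MSA
"""


def parse_front_matter(card_text: str) -> dict:
    """Return the front matter between the two '---' delimiters as a dict."""
    lines = card_text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter present
    metadata = {}
    current_list_key = None  # key whose list items we are currently collecting
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":  # closing delimiter ends the front matter
            break
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and YAML comments
        if stripped.startswith("- ") and current_list_key is not None:
            metadata[current_list_key].append(stripped[2:].strip())
        elif ":" in stripped:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                metadata[key] = value  # scalar entry
                current_list_key = None
            else:
                metadata[key] = []  # list entry, items follow
                current_list_key = key
    return metadata


def missing_fields(metadata: dict) -> list:
    """List the required metadata fields that are absent from the card."""
    return [field for field in REQUIRED_FIELDS if field not in metadata]


if __name__ == "__main__":
    meta = parse_front_matter(SAMPLE_CARD)
    print("pipeline_tag:", meta.get("pipeline_tag"))
    print("missing fields:", missing_fields(meta))
```

Running the same check against the card before this PR would flag `pipeline_tag` as missing, which is the gap the first bullet addresses.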
|
@@ -1,12 +1,13 @@
|
|
| 1 |
---
|
| 2 |
-
|
|
|
|
| 3 |
language:
|
| 4 |
- ar
|
| 5 |
library_name: transformers
|
|
|
|
| 6 |
metrics:
|
| 7 |
- bleu
|
| 8 |
-
|
| 9 |
-
- UBC-NLP/AraT5v2-base-1024
|
| 10 |
tags:
|
| 11 |
- Syrian
|
| 12 |
- Shami
|
|
@@ -15,35 +16,56 @@ tags:
|
|
| 15 |
- Dialect
|
| 16 |
- ArabicNLP
|
| 17 |
---
|
| 18 |
-
# SHAMI-MT-2MSA : A Machine Translation Model From Syrian Dialect ton MSA
|
| 19 |
|
| 20 |
|
| 21 |

|
| 22 |
|
| 23 |
## Model Description
|
| 24 |
|
| 25 |
-
SHAMI-MT is
|
|
|
|
|
|
|
| 26 |
|
| 27 |
-
|
| 28 |
|
| 29 |
-
|
| 30 |
-
|
| 31 |
-
- **Language**: Arabic (Syrian Dialect → MSA)
|
| 32 |
-
- **License**: Apache 2.0
|
| 33 |
-
- **Library**: Transformers
|
| 34 |
|
| 35 |
|
| 36 |
## Citation
|
| 37 |
|
| 38 |
-
If you use this model in your research, please cite:
|
| 39 |
|
| 40 |
```bibtex
|
| 41 |
-
@
|
| 42 |
-
title={SHAMI-MT: A
|
| 43 |
-
author={
|
| 44 |
-
year={
|
| 45 |
-
|
| 46 |
-
url={https://huggingface.co/
|
| 47 |
}
|
| 48 |
|
| 49 |
@article{nayouf2023nabra,
|
|
@@ -52,15 +74,8 @@ If you use this model in your research, please cite:
|
|
| 52 |
journal={arXiv preprint arXiv:2310.17315},
|
| 53 |
year={2023}
|
| 54 |
}
|
| 55 |
-
|
| 56 |
-
@misc{onajar2025shamiMT,
|
| 57 |
-
title={Shami-MT : A Machine Translation from MSA to Syrian Dialect},
|
| 58 |
-
author={Sibaee, Serry and Nacar, Omer},
|
| 59 |
-
year={2025}
|
| 60 |
-
}
|
| 61 |
```
|
| 62 |
|
| 63 |
## Contact & Support
|
| 64 |
|
| 65 |
-
For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.
|
| 66 |
-
|
|
|
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- UBC-NLP/AraT5v2-base-1024
|
| 4 |
language:
|
| 5 |
- ar
|
| 6 |
library_name: transformers
|
| 7 |
+
license: apache-2.0
|
| 8 |
metrics:
|
| 9 |
- bleu
|
| 10 |
+
pipeline_tag: translation
|
|
|
|
| 11 |
tags:
|
| 12 |
- Syrian
|
| 13 |
- Shami
|
|
|
|
| 16 |
- Dialect
|
| 17 |
- ArabicNLP
|
| 18 |
---
|
|
|
|
| 19 |
|
| 20 |
+
# SHAMI-MT-2MSA : A Machine Translation Model From Syrian Dialect to MSA
|
| 21 |
+
|
| 22 |
+
This model is part of the work presented in the paper [SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System](https://huggingface.co/papers/2508.02268).
|
| 23 |
+
|
| 24 |
+
## Paper Abstract
|
| 25 |
+
|
| 26 |
+
The rich linguistic landscape of the Arab world is characterized by a significant gap between Modern Standard Arabic (MSA), the language of formal communication, and the diverse regional dialects used in everyday life. This diglossia presents a formidable challenge for natural language processing, particularly machine translation. This paper introduces **SHAMI-MT**, a bidirectional machine translation system specifically engineered to bridge the communication gap between MSA and the Syrian dialect. We present two specialized models, one for MSA-to-Shami and another for Shami-to-MSA translation, both built upon the state-of-the-art AraT5v2-base-1024 architecture. The models were fine-tuned on the comprehensive Nabra dataset and rigorously evaluated on unseen data from the MADAR corpus. Our MSA-to-Shami model achieved an outstanding average quality score of **4.01 out of 5.0** when judged by OPENAI model GPT-4.1, demonstrating its ability to produce translations that are not only accurate but also dialectally authentic. This work provides a crucial, high-fidelity tool for a previously underserved language pair, advancing the field of dialectal Arabic translation and offering significant applications in content localization, cultural heritage, and intercultural communication.
|
| 27 |
|
| 28 |

|
| 29 |
|
| 30 |
## Model Description
|
| 31 |
|
| 32 |
+
SHAMI-MT-2MSA is one of two specialized models that constitute the **SHAMI-MT** bidirectional machine translation system. This particular model is designed to translate from **Syrian dialect to Modern Standard Arabic (MSA)**. Built on the robust AraT5v2-base-1024 architecture, this model bridges the gap between formal Arabic and the rich dialectal variations of Syrian Arabic.
|
| 33 |
+
|
| 34 |
+
## Usage
|
| 35 |
|
| 36 |
+
This model can be used directly with the Hugging Face `transformers` library.
|
| 37 |
|
| 38 |
+
```python
|
| 39 |
+
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
|
| 40 |
|
| 41 |
+
# Load tokenizer and model
|
| 42 |
+
model_id = "Omartificial-Intelligence-Space/Shami-MT-2MSA"
|
| 43 |
+
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
| 44 |
+
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
|
| 45 |
+
|
| 46 |
+
# Example input: Syrian Arabic dialect
|
| 47 |
+
input_text = "كيفك اليوم؟"  # "How are you today?" in Syrian dialect
|
| 48 |
+
inputs = tokenizer(input_text, return_tensors="pt")
|
| 49 |
+
|
| 50 |
+
# Generate translation
|
| 51 |
+
outputs = model.generate(**inputs, max_new_tokens=128)  # max_new_tokens caps the length of the generated translation
|
| 52 |
+
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
|
| 53 |
+
|
| 54 |
+
print(f"Syrian Dialect: {input_text}")
|
| 55 |
+
print(f"Modern Standard Arabic: {translated_text}")
|
| 56 |
+
```
|
| 57 |
|
| 58 |
## Citation
|
| 59 |
|
| 60 |
+
If you use this model in your research, please cite the main paper and the dataset paper:
|
| 61 |
|
| 62 |
```bibtex
|
| 63 |
+
@article{sibaee2025shamimt,
|
| 64 |
+
title={SHAMI-MT: A Syrian Arabic Dialect to Modern Standard Arabic Bidirectional Machine Translation System},
|
| 65 |
+
author={Sibaee, Serry and Nacar, Omer},
|
| 66 |
+
year={2025},
|
| 67 |
+
journal={Hugging Face Papers},
|
| 68 |
+
url={https://huggingface.co/papers/2508.02268}
|
| 69 |
}
|
| 70 |
|
| 71 |
@article{nayouf2023nabra,
|
|
|
|
| 74 |
journal={arXiv preprint arXiv:2310.17315},
|
| 75 |
year={2023}
|
| 76 |
}
|
| 77 |
```
|
| 78 |
|
| 79 |
## Contact & Support
|
| 80 |
|
| 81 |
+
For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT-2MSA) or contact the development team.
|
|
|
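The usage snippet added in this PR translates a single sentence. For several sentences at once, a batched variant along the following lines may be convenient. This is a sketch and not part of the PR: `translate_batch` and its `batch_size` parameter are hypothetical, and the function assumes that `tokenizer` and `model` were loaded with `AutoTokenizer.from_pretrained` and `AutoModelForSeq2SeqLM.from_pretrained` as in the card's snippet.

```python
from typing import Any, List


def translate_batch(
    texts: List[str],
    tokenizer: Any,
    model: Any,
    max_new_tokens: int = 128,
    batch_size: int = 8,
) -> List[str]:
    """Translate Syrian-dialect sentences to MSA in fixed-size batches.

    `tokenizer` and `model` are assumed to be the objects loaded in the
    card's usage snippet; `batch_size` is an illustrative default.
    """
    translations: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Pad to the longest sentence in the batch and truncate overlong inputs.
        inputs = tokenizer(batch, return_tensors="pt", padding=True, truncation=True)
        # max_new_tokens caps the length of each generated translation.
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        translations.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return translations
```

With the tokenizer and model from the card loaded, a call such as `translate_batch(["كيفك اليوم؟"], tokenizer, model)` returns a list with one MSA string per input sentence, in the same order as the inputs.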