Omartificial-Intelligence-Space committed on
Commit 42c411e · verified · 1 Parent(s): f3cfaf6

Update README.md

Files changed (1)
  1. README.md +193 -0
README.md CHANGED
@@ -19,3 +19,196 @@ tags:
19
 
20
 
21
  ![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/8FzZPY8o9cqrMVHb4ubD4.png)
22
+
23
+
24
+ ## Model Description
25
+
26
+ SHAMI-MT is a machine translation model that translates Modern Standard Arabic (MSA) into Syrian dialect. It is fine-tuned from the AraT5v2-base-1024 checkpoint and targets the gap between formal Arabic and the regional varieties of Syrian Arabic.
27
+
28
+ ## Model Details
29
+
30
+ - **Model Type**: Sequence-to-Sequence Translation
31
+ - **Base Model**: UBC-NLP/AraT5v2-base-1024
32
+ - **Language**: Arabic (MSA → Syrian Dialect)
33
+ - **License**: Apache 2.0
34
+ - **Library**: Transformers
35
+
36
+ ## Dataset
37
+
38
+ The model was trained on the **Nâbra** dataset, a comprehensive corpus of Syrian Arabic dialects with morphological annotations.
39
+
40
+
41
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/AaN6gPticioHBTXdPsroy.png)
42
+
43
+ ### Nâbra Dataset Details
44
+
45
+ **Citation:**
46
+ ```
47
+ Nayouf, A., Hammouda, T., Jarrar, M., Zaraket, F., & Kurdy, M. B. (2023).
48
+ Nâbra: Syrian Arabic dialects with morphological annotations.
49
+ arXiv preprint arXiv:2310.17315.
50
+ ```
51
+
52
+ **Key Statistics:**
53
+ - **Size**: ~60,000 words
54
+ - **Dialects Covered**: Multiple Syrian regional dialects including:
55
+   - Aleppo
56
+   - Damascus
57
+   - Deir-ezzur
58
+   - Hama
59
+   - Homs
60
+   - Huran
61
+   - Latakia
62
+   - Mardin
63
+   - Raqqah
64
+   - Suwayda
65
+
66
+ **Data Sources:**
67
+ - Social media posts
68
+ - Movie and TV series scripts
69
+ - Song lyrics
70
+ - Local proverbs
71
+
72
+ ## Training Details
73
+
74
+ The model was fine-tuned from the AraT5v2-base-1024 checkpoint. The training configuration and resulting metrics were:
75
+
76
+ - **Total Training Steps**: 10,384
77
+ - **Epochs**: 22
78
+ - **Final Training Loss**: 1.396
79
+ - **Final Evaluation Loss**: 0.771
80
+ - **Learning Rate**: Cosine schedule starting at 5e-5
81
+ - **Batch Size**: 256
82
+ - **Total FLOPs**: 1.58e+17
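
As a rough sanity check on the figures above, the step count divides evenly by the number of epochs. The sketch below works out the steps per epoch and an upper bound on the examples seen per epoch. It assumes the reported batch size of 256 is the effective batch size per optimizer step; the README does not say whether gradient accumulation or multiple devices were involved, so treat the second number as an estimate only.

```python
# Figures taken from the training details above.
total_steps = 10_384
epochs = 22
batch_size = 256  # assumption: effective batch size per optimizer step

# Steps per epoch, plus the remainder to confirm the division is exact.
steps_per_epoch, remainder = divmod(total_steps, epochs)

# Upper bound: the last batch of an epoch may be smaller than batch_size.
max_examples_per_epoch = steps_per_epoch * batch_size

print(steps_per_epoch, remainder)   # 472 0
print(max_examples_per_epoch)       # 120832
```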
83
+
84
+ ### Training Progress
85
+
86
+ The model showed consistent improvement throughout training:
87
+ - Training loss fell from 12.93 at the start to 1.40 at the end
88
+ - Evaluation loss steadily decreased from 1.44 to 0.77
89
+ - Gradient norms remained stable throughout training
90
+
91
+ ## Usage
92
+
93
+ ### Installation
94
+
95
+ ```bash
96
+ pip install transformers torch sentencepiece
97
+ ```
98
+
99
+ ### Inference Code
100
+
101
+ ```python
102
+ from transformers import T5Tokenizer, AutoModelForSeq2SeqLM
103
+
104
+ # Load model and tokenizer
105
+ tokenizer = T5Tokenizer.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
106
+ model = AutoModelForSeq2SeqLM.from_pretrained("Omartificial-Intelligence-Space/Shami-MT")
107
+
108
+ # Example usage
109
+ ar_prompt = "مرحبا بك هنا" # MSA input
110
+ input_ids = tokenizer(ar_prompt, return_tensors="pt").input_ids
111
+ outputs = model.generate(input_ids)
112
+
113
+ print("Input (MSA):", ar_prompt)
114
+ print("Tokenized input:", tokenizer.tokenize(ar_prompt))
115
+ print("Output (Syrian Dialect):", tokenizer.decode(outputs[0], skip_special_tokens=True))
116
+ ```
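
For more than one sentence, the same calls can be wrapped in a small batch helper. This is a minimal sketch, not part of the model's published API: the names `load_shami_mt` and `translate_batch` and the default values are hypothetical, and the helper simply pads a list of MSA sentences, runs `generate`, and decodes the results.

```python
def load_shami_mt(model_id="Omartificial-Intelligence-Space/Shami-MT"):
    """Load tokenizer and model (hypothetical helper; downloads weights on first use)."""
    # Imported lazily so the rest of this sketch can be read without transformers installed.
    from transformers import T5Tokenizer, AutoModelForSeq2SeqLM

    tokenizer = T5Tokenizer.from_pretrained(model_id)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
    return tokenizer, model


def translate_batch(sentences, tokenizer, model, max_length=128, num_beams=4):
    """Translate a list of MSA sentences; returns one decoded string per input."""
    # Pad to the longest sentence in the batch and truncate overly long inputs.
    encoded = tokenizer(
        sentences,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    # Deterministic beam search; the attention mask is passed along with input_ids.
    outputs = model.generate(**encoded, max_length=max_length, num_beams=num_beams)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Usage (requires network access to fetch the model):
#   tokenizer, model = load_shami_mt()
#   print(translate_batch(["مرحبا بك هنا"], tokenizer, model))
```

Passing the tokenizer and model in as arguments keeps the helper independent of how they were loaded, so they are only loaded once per session.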
117
+
118
+ ### Generation Parameters
119
+
120
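+ You can adjust the generation parameters. The example below combines beam search with sampling (`do_sample=True`), so outputs can differ between runs; set `do_sample=False` for deterministic beam search: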
121
+
122
+ ```python
123
+ outputs = model.generate(
124
+     input_ids,
125
+     max_length=128,
126
+     num_beams=4,
127
+     temperature=0.7,
128
+     do_sample=True,
129
+     pad_token_id=tokenizer.pad_token_id,
130
+     eos_token_id=tokenizer.eos_token_id
131
+ )
132
+ ```
133
+ ### Evaluation Results
134
+ - **Test Set**: 1,500 unseen sentences
135
+ - **Evaluation Method**: GPT-4.1 as automated judge
136
+ - **Average Score**: **4.01/5.0** ⭐
137
+ - **Evaluation Criteria**: Translation quality, dialectal accuracy, and semantic preservation
138
+
139
+ The GPT-4.1 judge was given the following structured prompt:
140
+
141
+ ```
142
+ "You are a language evaluation assistant. Compare the predicted Shami sentence to the reference.
143
+ Please return a rating from 0 to 5 and a short comment.
144
+
145
+ MSA Input: [input sentence]
146
+ Model Prediction (Shami dialect): [model output]
147
+ Ground Truth (Shami dialect): [reference translation]
148
+
149
+ Respond in this format:
150
+ Score: <number from 0 to 5>
151
+ Comment: <brief explanation of the score>"
152
+ ```
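
The prompt above can be filled in and its reply parsed with a few lines of Python. This is a sketch under assumptions: the function names are hypothetical, the call to the judge model itself is left out, and the parser assumes the judge follows the requested `Score:` / `Comment:` format.

```python
import re

# Template text copied from the evaluation prompt; placeholders are filled per example.
JUDGE_TEMPLATE = (
    "You are a language evaluation assistant. "
    "Compare the predicted Shami sentence to the reference.\n"
    "Please return a rating from 0 to 5 and a short comment.\n\n"
    "MSA Input: {msa}\n"
    "Model Prediction (Shami dialect): {prediction}\n"
    "Ground Truth (Shami dialect): {reference}\n\n"
    "Respond in this format:\n"
    "Score: <number from 0 to 5>\n"
    "Comment: <brief explanation of the score>"
)


def build_judge_prompt(msa, prediction, reference):
    """Fill the template for one test sentence."""
    return JUDGE_TEMPLATE.format(msa=msa, prediction=prediction, reference=reference)


def parse_judge_reply(reply):
    """Return (score, comment), or None if no valid 0-5 score is found."""
    score_match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if score_match is None:
        return None
    score = float(score_match.group(1))
    if not 0.0 <= score <= 5.0:
        return None
    comment_match = re.search(r"Comment:\s*(.*)", reply, flags=re.DOTALL)
    comment = comment_match.group(1).strip() if comment_match else ""
    return score, comment


print(parse_judge_reply("Score: 4\nComment: Minor stylistic differences."))
```

Returning `None` for malformed replies lets the caller decide whether to retry or exclude that sentence from the average.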
153
+
154
+ **Score Interpretation:**
155
+ - **Excellent (5.0)**: High-quality translations with perfect dialectal conversion
156
+ - **Good (4.0-4.9)**: Minor dialectal variations or stylistic differences
157
+ - **Average (3.0-3.9)**: Acceptable translations with some dialectal inconsistencies
158
+ - **Below Average (2.0-2.9)**: Noticeable errors in dialect or meaning
159
+ - **Poor (0-1.9)**: Significant translation errors or loss of meaning
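
The bands above can be expressed as a small lookup. This is an illustrative sketch; the function name `score_band` is hypothetical, and scores that fall between the listed ranges (for example 1.95) are assigned to the lower band, which is an assumption rather than something the README specifies.

```python
def score_band(score: float) -> str:
    """Map a judge score in [0, 5] to the band names listed above."""
    if not 0.0 <= score <= 5.0:
        raise ValueError(f"score must be between 0 and 5, got {score}")
    if score >= 5.0:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Average"
    if score >= 2.0:
        return "Below Average"
    return "Poor"


# The reported average of 4.01 falls in the "Good" band.
print(score_band(4.01))  # Good
```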
160
+
161
+ ### Performance Highlights
162
+ - **Strong Dialectal Conversion**: Successfully transforms MSA into authentic Syrian dialect
163
+ - **Semantic Preservation**: Maintains original meaning while adapting linguistic style
164
+ - **Regional Adaptability**: Handles various Syrian sub-dialects effectively
165
+ - **Consistent Quality**: Stable performance across different text types and domains
166
+
167
+ ## Applications
168
+
169
+ This model is particularly useful for:
170
+ - **Content Localization**: Adapting MSA content for Syrian audiences
171
+ - **Cultural Preservation**: Maintaining and promoting Syrian dialectal variations
172
+ - **Educational Tools**: Teaching differences between MSA and Syrian dialect
173
+ - **Research**: Syrian Arabic NLP and dialectology studies
174
+
175
+ ## Regional Coverage
176
+
177
+ The model handles multiple Syrian sub-dialects, making it versatile for different regions within Syria:
178
+
179
+ 🏛️ **Urban Centers**: Damascus, Aleppo
180
+ 🏔️ **Northern Regions**: Latakia, Mardin
181
+ 🏜️ **Eastern Areas**: Deir-ezzur, Raqqah
182
+ 🌄 **Central/Southern**: Hama, Homs, Huran, Suwayda
183
+
184
+ ## Limitations
185
+
186
+ - Trained specifically on Syrian dialect variations
187
+ - Performance may vary for other Arabic dialects
188
+ - Limited to text-based translation (no speech support)
189
+ - Dataset size constraints may affect handling of very rare dialectal expressions
190
+
191
+ ## Citation
192
+
193
+ If you use this model in your research, please cite:
194
+
195
+ ```bibtex
196
+ @misc{shami-mt-2024,
197
+   title={SHAMI-MT: A Machine Translation Model From MSA to Syrian Dialect},
198
+   author={{Omartificial Intelligence Space}},
199
+   year={2024},
200
+   publisher={Hugging Face},
201
+   url={https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT}
202
+ }
203
+
204
+ @article{nayouf2023nabra,
205
+   title={Nâbra: Syrian Arabic dialects with morphological annotations},
206
+   author={Nayouf, Amal and Hammouda, Tymaa Hasanain and Jarrar, Mustafa and Zaraket, Fadi A and Kurdy, Mohamad-Bassam},
207
+   journal={arXiv preprint arXiv:2310.17315},
208
+   year={2023}
209
+ }
210
+ ```
211
+
212
+ ## Contact & Support
213
+
214
+ For questions, issues, or contributions, please visit the [model repository](https://huggingface.co/Omartificial-Intelligence-Space/Shami-MT) or contact the development team.