kalixlouiis committed on
Commit 4d0c805 · verified · 1 Parent(s): 1308b79

Update README.md

Files changed (1)
  1. README.md +134 -1
README.md CHANGED
@@ -61,4 +61,137 @@ model-index:
61
  value: 0.9693
62
  name: BERTScore F1
63
 
64
- ---
+ ---
65
+
66
+ # 📝 myX-TransStyle-W2S: A Transformer-based Style Transfer Model for Myanmar Written (ရေးဟန်) to Spoken (ပြောဟန်) Text
67
+
68
+ **myX-TransStyle-W2S** is a Sequence-to-Sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It transforms formal **Written Burmese (ရေးဟန်)** into its natural, colloquial **Spoken Burmese (ပြောဟန်)** counterpart, so that formal documents or news can be converted into fluid, human-like dialogue while closely preserving the original meaning (BERTScore F1 of 0.9693 on the test set).
69
+
70
+ ## Model Details
71
+
72
+ - **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
73
+ - **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
74
+ - **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
75
+ - **Language:** Burmese (Myanmar)
76
+ - **Task:** Text Style Transfer (Written → Spoken)
77
+ - **License:** MIT
78
+ - **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)
79
+
80
+ ---
81
+
82
+ ## Linguistic Context: The Diglossia Challenge
83
+
84
+ Burmese is a **diglossic language**, featuring a major linguistic gap between two functional registers:
85
+
86
+ * **Written Style (ရေးဟန်):** Used in news, law, textbooks, and official documents. It relies on formal grammatical markers such as **"သည်"**, **"၏"**, and **"၍"**.
87
+ * **Spoken Style (ပြောဟန်):** Used in daily life, verbal communication, and social media. It uses colloquial markers like **"တယ်"** (tense), **"ရဲ့"** (possessive), and **"နဲ့"** (conjunction).
88
+
89
+ **myX-TransStyle-W2S** addresses the stiff, "robotic" tone that formal-register text gives AI output by converting it into the natural, warm tone native speakers use every day.
90
+
91
+ ---
92
+
93
+ ## Training Methodology
94
+
95
+ The model was trained with a parameter-efficient adaptation strategy suited to the structural shifts between the written and spoken registers of Burmese.
96
+
97
+ ### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))
98
+ The model was trained on **5,555 high-quality, unique parallel text pairs**. This dataset provides a direct mapping from formal literary structures to their informal colloquial equivalents, filtered to ensure maximum diversity.
99
+
100
+ ### 2. Parameter-Efficient Fine-Tuning (PEFT)
101
+ To capture nuanced stylistic shifts without overwriting the base model's linguistic depth, we utilized **Low-Rank Adaptation (LoRA)**:
102
+ * **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
103
+ * **Rank (R):** 32 | **Alpha:** 64.
104
+ * **Learning Rate:** 8e-5 with a Cosine scheduler.
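
To give a feel for what these settings imply, here is a minimal sketch in plain Python. The rank, alpha, and target modules come from the list above, and they correspond to the `r`, `lora_alpha`, and `target_modules` fields of PEFT's `LoraConfig`. The base-model shape (hidden size 1024, 12 encoder and 12 decoder layers) is an assumption about `nllb-200-distilled-600M` that this card does not state, so the parameter count is an estimate rather than a reported figure.

```python
# Hyperparameters as listed in the model card.
LORA_RANK = 32
LORA_ALPHA = 64
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "out_proj"]

# ASSUMPTION: base-model shape, not stated in this card.
D_MODEL = 1024
ENCODER_LAYERS = 12
DECODER_LAYERS = 12


def lora_scaling(rank: int, alpha: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / rank


def lora_params_per_projection(d_in: int, d_out: int, rank: int) -> int:
    """A has shape (rank, d_in) and B has shape (d_out, rank)."""
    return rank * d_in + d_out * rank


def total_lora_params() -> int:
    """Estimated number of trainable LoRA parameters under the assumed shape."""
    # One self-attention block per encoder layer; self-attention plus
    # cross-attention per decoder layer.
    attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
    projections = attention_blocks * len(TARGET_MODULES)
    return projections * lora_params_per_projection(D_MODEL, D_MODEL, LORA_RANK)


print("scaling (alpha / r):", lora_scaling(LORA_RANK, LORA_ALPHA))
print("estimated trainable LoRA parameters:", total_lora_params())
```

Under these assumptions the adapters train only a small fraction of the base model's weights, which is the point of using LoRA on a corpus of a few thousand pairs.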
105
+
106
+ ### 3. Merging Strategy
107
+ The LoRA adapters were merged into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. The resulting standalone **2.8 GB** model can be loaded directly with `transformers`, with no dependency on the PEFT library and no extra adapter computation at inference time.
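
The idea behind merging can be illustrated with a toy example. The sketch below is not the actual merge script; it uses small hand-written matrices (hypothetical values, toy rank of 2) and only the `alpha / r` ratio from this card to show that folding the scaled low-rank update into the weight matrix gives the same output as running the base weight and the adapter side by side.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


SCALING = 64 / 32  # alpha / r, using the values from the model card

# Toy values for illustration only.
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]  # frozen base weight (3 x 3)
A = [[1, 2, 0], [0, 1, 1]]             # LoRA down-projection (2 x 3)
B = [[1, 0], [0, 1], [1, 1]]           # LoRA up-projection (3 x 2)
x = [[1], [2], [3]]                    # input column vector

# Adapter path: base output plus the scaled low-rank correction.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), SCALING))

# Merged path: fold the correction into the weight once, then a single matmul.
W_merged = add(W, scale(matmul(B, A), SCALING))
y_merged = matmul(W_merged, x)

print(y_adapter == y_merged)
```

Because the merged weight has the same shape as the original, the merged checkpoint loads like any ordinary Seq2Seq model.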
108
+
109
+ ---
110
+
111
+ ## Evaluation Results
112
+
113
+ The model was evaluated on **100 held-out test sentences**, where it outperformed its spoken-to-written (S2W) sibling model.
114
+
115
+ ### Performance Metrics
116
+ | Metric | Score | Interpretation |
117
+ |---|---|---|
118
+ | **BERTScore F1** | **0.9693** | High semantic similarity to the references, indicating that meaning is well preserved during style transfer. |
119
+ | **chrF** | **78.40** | High character n-gram overlap with the references, particularly in the conversion of formal suffixes. |
120
+ | **BLEU** | **19.64** | Higher than S2W, reflecting a more consistent conversion pattern into spoken style. |
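
For readers unfamiliar with chrF, the following is a simplified sketch of the metric, not the evaluation script used for the scores above (the card does not say which implementation was used). It averages character n-gram precision and recall for n = 1 to 6 and combines them with an F-score weighted towards recall (beta = 2). Whitespace handling and other details may differ from standard tools, so its numbers should not be compared with the table.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# The written-style input and spoken-style output from this card's usage example.
written = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
spoken = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။"

print(simple_chrf(spoken, spoken))   # identical strings
print(simple_chrf(written, spoken))  # unconverted input scored against the spoken form
```

Since written and spoken Burmese share most content words and differ mainly in particles and suffixes, even unconverted text overlaps heavily with the reference at the character level, which is worth keeping in mind when reading chrF scores for this task.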
121
+
122
+ ### Qualitative Analysis
123
+ Manual review by native speakers confirms the model's ability to not only swap particles but also adjust vocabulary (e.g., converting *“အလွန်ပင်”* to *“သိပ်”* or *“အကယ်ပင်”* to *“တကယ်လို့တောင်”*) in a way that feels authentic and human.
124
+
125
+ ---
126
+
127
+ ## How to Use
128
+
129
+ ```python
130
+ from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
131
+
132
+ # 1. Load the Merged Model
133
+ model_id = "DatarrX/myX-TransStyle-W2S"
134
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
135
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
136
+
137
+ # 2. Prepare Input
138
+ prefix = "Rewrite Burmese formal written sentence into spoken Burmese: "
139
+ written_text = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
140
+ input_text = prefix + written_text
141
+
142
+ # 3. Generate Spoken Style
143
+ inputs = tokenizer(input_text, return_tensors="pt")
144
+ outputs = model.generate(
145
+     **inputs,
146
+     forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
147
+     max_length=160,
148
+     num_beams=5
149
+ )
150
+
151
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
152
+ # Output: ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။
153
+ ```
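
The example above converts a single sentence. For longer passages, one possible approach is to split the text on the Burmese sentence-final mark "။" and prefix each sentence separately before passing it to the tokenizer. This is a sketch under the assumption that sentence-level input suits the model best (the training pairs and test set are described as sentences); the helper names `split_sentences` and `build_prompts` are hypothetical and not part of the released model.

```python
PREFIX = "Rewrite Burmese formal written sentence into spoken Burmese: "
SENTENCE_END = "။"  # Burmese sentence-final punctuation


def split_sentences(text: str) -> list:
    """Split a passage into sentences, keeping the closing mark on each one."""
    parts = text.split(SENTENCE_END)
    sentences = [p.strip() + SENTENCE_END for p in parts[:-1] if p.strip()]
    tail = parts[-1].strip()
    if tail:
        # Trailing text without a closing mark is kept unchanged.
        sentences.append(tail)
    return sentences


def build_prompts(text: str) -> list:
    """Attach the task prefix used in the usage example to every sentence."""
    return [PREFIX + sentence for sentence in split_sentences(text)]


sentence = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
passage = sentence + " " + sentence

prompts = build_prompts(passage)
for prompt in prompts:
    print(prompt)
```

Each resulting prompt can then be tokenized and passed to `model.generate` in the same way as `input_text` in the example above.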
154
+
155
+ ---
156
+
157
+ ## Intended Use & Limitations
158
+
159
+ ### Use Cases
160
+ - **Natural AI Personalities:** Converting formal bot responses into natural-sounding speech.
161
+ - **Content Localization:** Making formal news or articles more accessible for audio/podcasts.
162
+ - **Creative Writing:** Assisting authors in converting narrative descriptions into natural character dialogue.
163
+
164
+ ### Limitations
165
+ - **Dialectal Focus:** Primarily focuses on the standard Yangon/Mandalay dialect; regional slang may be less represented.
166
+ - **Contextual Nuance:** While meaning is preserved, the "warmth" of the spoken style may vary depending on the complexity of the input.
167
+
168
+ ## Citation
169
+
170
+ ### BibTeX
171
+ ```BibTeX
172
+ @misc{myx_transstyle_w2s_2026,
173
+ author = {Khant Sint Heinn (Kalix Louis)},
174
+ title = {myX-TransStyle-W2S: A Written to Spoken Burmese Style Transfer Model},
175
+ year = {2026},
176
+ publisher = {Hugging Face},
177
+ organization = {DatarrX},
178
+ howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-W2S}
179
+ }
180
+ ```
181
+ ---
182
+
183
+ ## About the Author
184
+
185
+ **Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
186
+
187
+ He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
188
+
189
+ Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
190
+
191
+ His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
192
+
193
+ **Connect with the Author:**
194
+ [GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)
195
+
196
+ ---
197
+ *Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*