---
license: mit

datasets:
  - DatarrX/Myanmar-Written-Spoken-Parallel-Corpus

language:
  - my

metrics:
  - bleu
  - chrf
  - ter
  - bertscore

base_model:
  - facebook/nllb-200-distilled-600M

pipeline_tag: text-generation

library_name: transformers

tags:
  - burmese
  - myanmar
  - myanmar-language
  - burmese-nlp
  - style-transfer
  - text-rewriting
  - formal-to-informal
  - written-to-spoken
  - seq2seq
  - nllb
  - lora
  - peft
  - low-resource-language
  - text-generation

model-index:
  - name: myX-TransStyle-W2S
    results:
      - task:
          type: text-generation
          name: Burmese Style Transfer (Written to Spoken)
        dataset:
          name: Custom External Test Set
          type: csv
          config: default
          split: test
        metrics:
          - type: bleu
            value: 19.6381
            name: BLEU
          - type: chrf
            value: 78.3975
            name: chrF
          - type: ter
            value: 50.7353
            name: TER
          - type: bertscore
            value: 0.9693
            name: BERTScore F1

---

# 📝 myX-TransStyle-W2S: A Transformer-based Style Transfer for Myanmar Written (ရေးဟန်) to Spoken (ပြောဟန်)

**myX-TransStyle-W2S** is a sequence-to-sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It converts formal **Written Burmese (ရေးဟန်)** into its colloquial **Spoken Burmese (ပြောဟန်)** counterpart, so that formal documents or news can be rendered as fluid, natural-sounding dialogue while closely preserving meaning (BERTScore F1 of 0.9693 on the test set).

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Written → Spoken)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)

---

## Linguistic Context: The Diglossia Challenge

Burmese is a **diglossic language**, featuring a major linguistic gap between two functional registers:

* **Written Style (ရေးဟန်):** Used in news, law, textbooks, and officialdom. It relies on formal grammatical markers such as **"သည်"**, **"၏"**, and **"၍"**.
* **Spoken Style (ပြောဟန်):** Used in daily life, verbal communication, and social media. It uses colloquial markers like **"တယ်"** (tense), **"ရဲ့"** (possessive), and **"နဲ့"** (conjunction).

**myX-TransStyle-W2S** targets the stiff, "robotic" tone that results when AI systems answer in the written register, rewriting formal text into the natural, warm tone that native speakers use every day.
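
As a rough illustration of how the two registers differ on the surface, the markers listed above can be counted to guess which register a sentence is in. This is only a naive sketch: the function names are hypothetical, the marker lists contain just the six examples quoted in this card, and it is not how the model or the myX-StyleClassifier works.

```python
# Markers quoted in this card; a real classifier would need far more evidence.
WRITTEN_MARKERS = ("သည်", "၏", "၍")
SPOKEN_MARKERS = ("တယ်", "ရဲ့", "နဲ့")

# Example pair taken from the "How to Use" section of this card.
WRITTEN_EXAMPLE = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
SPOKEN_EXAMPLE = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။"


def count_markers(text: str, markers: tuple) -> int:
    """Count non-overlapping occurrences of every marker in the text."""
    return sum(text.count(marker) for marker in markers)


def guess_register(text: str) -> str:
    """Return 'written', 'spoken', or 'unknown' from marker counts alone."""
    written = count_markers(text, WRITTEN_MARKERS)
    spoken = count_markers(text, SPOKEN_MARKERS)
    if written > spoken:
        return "written"
    if spoken > written:
        return "spoken"
    return "unknown"


if __name__ == "__main__":
    print(guess_register(WRITTEN_EXAMPLE))
    print(guess_register(SPOKEN_EXAMPLE))
```

Substring counting like this will misfire on words that merely contain a marker, which is one reason a learned model is used for the actual conversion.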

---

## Training Methodology

The model was adapted with parameter-efficient fine-tuning on a parallel corpus of written and spoken Burmese, and the trained adapters were then merged into the base model.

### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))
The model was trained on **5,555 high-quality, unique parallel text pairs**. This dataset provides a direct mapping from formal literary structures to their informal colloquial equivalents, filtered to ensure maximum diversity.

### 2. Parameter-Efficient Fine-Tuning (PEFT)
To capture nuanced stylistic shifts without overwriting the base model's linguistic depth, we utilized **Low-Rank Adaptation (LoRA)**:
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
* **Rank (R):** 32 | **Alpha:** 64.
* **Learning Rate:** 8e-5 with a Cosine scheduler.
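
To give a sense of scale for these settings, the sketch below estimates how many trainable parameters the adapters add and what scaling factor `alpha / r` implies. The rank, alpha, and target modules come from this card, and the dictionary keys mirror the `r`, `lora_alpha`, and `target_modules` fields of PEFT's `LoraConfig`. The base-model dimensions (hidden size 1024, 12 encoder and 12 decoder layers) are assumptions not stated here and should be checked against `model.config`.

```python
# Hyperparameters from this card; key names mirror PEFT's LoraConfig fields.
LORA = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj"],
}

# ASSUMPTION: dimensions of the base checkpoint, not stated in this card.
D_MODEL = 1024
ENCODER_LAYERS = 12
DECODER_LAYERS = 12


def lora_scaling(r: int, lora_alpha: int) -> float:
    """Factor applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_params(r: int, d_in: int, d_out: int, n_matrices: int) -> int:
    """Each adapted matrix gains A (r x d_in) and B (d_out x r)."""
    return n_matrices * r * (d_in + d_out)


# An encoder layer has one attention block (self-attention); a decoder layer
# has two (self-attention and cross-attention). Each block has four projections.
attention_blocks = ENCODER_LAYERS + 2 * DECODER_LAYERS
adapted_matrices = attention_blocks * len(LORA["target_modules"])
total = lora_params(LORA["r"], D_MODEL, D_MODEL, adapted_matrices)

print(f"scaling = {lora_scaling(LORA['r'], LORA['lora_alpha'])}")
print(f"{adapted_matrices} adapted matrices, {total:,} LoRA parameters")
```

Under these assumptions the adapters come to roughly 9.4 million parameters, a small fraction of the 600M-parameter base model.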

### 3. Merging Strategy
The LoRA adapters were merged into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. The resulting standalone **2.8 GB** model loads with plain `transformers`, needs no PEFT dependency, and adds no adapter overhead at inference time.
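
Conceptually, merging folds the low-rank update into each adapted weight matrix, `W_merged = W + (alpha / r) * B @ A`, so the merged layer gives the same output as the base layer plus its adapter. The toy example below checks that identity with tiny hand-picked matrices in plain Python. The matrix values and helper names are illustrative assumptions, not weights or code from this model.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(base, delta, scale):
    """Return base + scale * delta, element by element."""
    return [[b + scale * d for b, d in zip(b_row, d_row)]
            for b_row, d_row in zip(base, delta)]


# Toy sizes: d_out = 2, d_in = 3, rank r = 1 (illustrative values only).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]  # frozen base weight, 2 x 3
B = [[0.5], [-1.0]]                      # LoRA B, 2 x 1
A = [[1.0, 2.0, 0.0]]                    # LoRA A, 1 x 3
x = [[1.0], [1.0], [1.0]]                # input column vector, 3 x 1
scaling = 64 / 32                        # alpha / r from this card

# Adapter path: base output plus the scaled low-rank correction.
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scaling)

# Merged path: fold the correction into the weight once, then do one matmul.
W_merged = add_scaled(W, matmul(B, A), scaling)
merged_out = matmul(W_merged, x)

print(adapter_out, merged_out)  # both are [[6.0], [-6.0]]
```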

---

## Evaluation Results

The model was evaluated on **100 unseen test sentences**, where it scored higher on BLEU than its S2W sibling.

### Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| **BERTScore F1** | **0.9693** | Suggests that meaning is largely preserved during style transfer. |
| **chrF** | **78.40** | High character-level overlap with the references, notably when converting formal suffixes. |
| **BLEU** | **19.64** | Higher than S2W, reflecting a more consistent conversion pattern into spoken style. |
| **TER** | **50.74** | Translation Edit Rate, as reported in the model metadata; lower is better. |
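
To make the chrF row more concrete, here is a simplified character n-gram F-score in the spirit of chrF. It is a sketch under stated assumptions (spaces removed, n-gram orders 1 to 6, recall weighted with beta = 2) and will not reproduce the reported score exactly; a maintained implementation such as the one in `sacrebleu` should be used for real evaluation. The function names are hypothetical.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall, then F-beta (0 to 100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    print(simple_chrf("ဖြစ်ခဲ့တယ်။", "ဖြစ်ခဲ့တယ်။"))
    print(simple_chrf("ဖြစ်ခဲ့သည်။", "ဖြစ်ခဲ့တယ်။"))
```

Because the score is built from character n-grams, a sentence that differs from the reference only in its final particle still keeps most of its score, whereas word-level metrics such as BLEU penalize the same difference more heavily.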

### Qualitative Analysis
Manual review by native speakers confirms the model's ability to not only swap particles but also adjust vocabulary (e.g., converting *“အလွန်ပင်”* to *“သိပ်”* or *“အကယ်ပင်”* to *“တကယ်လို့တောင်”*) in a way that feels authentic and human.

---

## 🔗 Related Models in the DatarrX Ecosystem

Explore other specialized models for Myanmar linguistic styles:

* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** The sibling model for converting Spoken Style to formal Written Style.
* **[myX-StyleClassifier](https://huggingface.co/DatarrX/myX-StyleClassifier):** Use this to automatically detect the style of your input text before processing.

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the Merged Model
model_id = "DatarrX/myX-TransStyle-W2S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare Input
prefix = "Rewrite Burmese formal written sentence into spoken Burmese: "
written_text = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
input_text = prefix + written_text

# 3. Generate Spoken Style
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs, 
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။
```
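
The example above handles one sentence with `max_length=160`. For longer passages, one possible approach is to split the text at the Burmese sentence-final mark `။`, prefix each sentence, and generate in a batch. The sketch below assumes that sentence-level splitting is acceptable for this model, which the card does not state; the helper names are hypothetical, and `rewrite_document` expects the `tokenizer` and `model` objects loaded as shown above.

```python
PREFIX = "Rewrite Burmese formal written sentence into spoken Burmese: "
SENTENCE_END = "\u104b"  # the Burmese sentence-final mark "။"


def split_sentences(text: str) -> list:
    """Naively split after each sentence-final mark, keeping the mark."""
    sentences, current = [], ""
    for ch in text:
        current += ch
        if ch == SENTENCE_END:
            sentences.append(current.strip())
            current = ""
    if current.strip():  # trailing text without a final mark
        sentences.append(current.strip())
    return sentences


def build_prompts(text: str) -> list:
    """Prefix every sentence with the instruction used in this card."""
    return [PREFIX + sentence for sentence in split_sentences(text)]


def rewrite_document(text, tokenizer, model, max_length=160, num_beams=5):
    """Rewrite a multi-sentence passage sentence by sentence in one batch."""
    prompts = build_prompts(text)
    if not prompts:
        return ""
    inputs = tokenizer(prompts, return_tensors="pt", padding=True,
                       truncation=True, max_length=max_length)
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
        max_length=max_length,
        num_beams=num_beams,
    )
    return " ".join(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```

Splitting per sentence keeps each input well within the length limit, at the cost of losing cross-sentence context.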

---

## Intended Use & Limitations

### Use Cases
- **Natural AI Personalities:** Converting formal bot responses into natural-sounding speech.
- **Content Localization:** Making formal news or articles more accessible for audio/podcasts.
- **Creative Writing:** Assisting authors in converting narrative descriptions into natural character dialogue.

### Limitations
- **Dialectal Focus:** Primarily focuses on the standard Yangon/Mandalay dialect; regional slang may be less represented.
- **Contextual Nuance:** While meaning is preserved, the "warmth" of the spoken style may vary depending on the complexity of the input.

## Citation

### BibTeX
```BibTeX
@misc{myx_transstyle_w2s_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-W2S: A Written to Spoken Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-W2S}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**  
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*