---
license: mit

datasets:
  - DatarrX/Myanmar-Written-Spoken-Parallel-Corpus

language:
  - my

metrics:
  - bleu
  - chrf
  - ter
  - bertscore

base_model:
  - facebook/nllb-200-distilled-600M

pipeline_tag: text-generation

library_name: transformers

tags:
  - burmese
  - myanmar
  - myanmar-language
  - burmese-nlp
  - style-transfer
  - text-rewriting
  - informal-to-formal
  - spoken-to-written
  - seq2seq
  - nllb
  - lora
  - peft
  - low-resource-language
  - text-generation

model-index:
  - name: myX-TransStyle-S2W
    results:
      - task:
          type: text-generation
          name: Burmese Style Transfer (Spoken to Written)
        dataset:
          name: Custom External Test Set
          type: csv
          config: default
          split: test
        metrics:
          - type: bleu
            value: 12.9445
            name: BLEU
          - type: chrf
            value: 75.5601
            name: chrF
          - type: ter
            value: 58.0189
            name: TER
          - type: bertscore
            value: 0.9685
            name: BERTScore F1

---
# 📝 myX-TransStyle-S2W: A Transformer-based Style Transfer for Myanmar Spoken (ပြောဟန်) to Written (ရေးဟန်)

**myX-TransStyle-S2W** is a specialized Sequence-to-Sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It is designed to transform colloquial **Spoken Burmese (ပြောဟန်)** into its formal **Written Burmese (ရေးဟန်)** counterpart while preserving the original meaning.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Spoken → Written)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)

---

## Linguistic Context: The Diglossia Challenge

Burmese is a **diglossic language**, characterized by a sharp divide between two distinct registers. Understanding this is crucial for effective Myanmar NLP:

* **Spoken Style (ပြောဟန်):** Used in daily life, social media, and verbal communication. It relies on colloquial grammatical markers like **"တယ်"** (sentence-final verb marker) or **"ရဲ့"** (possessive).
* **Written Style (ရေးဟန်):** The standard for news, law, textbooks, and official documents. It uses formal markers such as **"သည်"**, **"၏"**, and **"၍"**.

Most existing AI models sound "robotic" because they are trained primarily on formal web-scraped data. **myX-TransStyle-S2W** bridges this gap by enabling AI to convert natural spoken input into grammatically correct formal documentation.

---

## Training Methodology

The model was adapted with parameter-efficient fine-tuning on a parallel corpus, and the resulting adapters were then merged into the base weights.

### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))
We used **5,555 unique parallel text pairs** from the [MWSPC dataset](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus). The dataset maps informal structures directly to their formal counterparts and was deduplicated and curated for linguistic diversity.

### 2. Parameter-Efficient Fine-Tuning (PEFT)
To capture complex structural transformations without losing the base model's knowledge, we used **Low-Rank Adaptation (LoRA)**:
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
* **Rank (R):** 32 | **Alpha:** 64.
* **Learning Rate:** 8e-5 with a Cosine scheduler.
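
As a rough illustration of what these settings imply, the sketch below computes the LoRA scaling factor and an estimate of the number of trainable adapter parameters. The rank, alpha, and target modules come from this card; the hidden size and layer counts are assumptions about the base model, as is the premise that each decoder layer exposes the four projections in both its self-attention and its cross-attention. `LoraSetup` is a hypothetical helper, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSetup:
    rank: int = 32            # from the model card
    alpha: int = 64           # from the model card
    projections: int = 4      # q_proj, k_proj, v_proj, out_proj (from the card)
    d_model: int = 1024       # ASSUMPTION: hidden size of the base model
    encoder_layers: int = 12  # ASSUMPTION: encoder depth of the base model
    decoder_layers: int = 12  # ASSUMPTION: decoder depth of the base model

    @property
    def scaling(self) -> float:
        # LoRA scales the low-rank update by alpha / r.
        return self.alpha / self.rank

    @property
    def adapted_modules(self) -> int:
        # Encoder layers: one self-attention block each.
        encoder = self.encoder_layers * self.projections
        # Decoder layers: self-attention plus cross-attention (assumed).
        decoder = self.decoder_layers * self.projections * 2
        return encoder + decoder

    @property
    def trainable_params(self) -> int:
        # Each adapted d_model x d_model projection gains two matrices:
        # A with shape (r, d_model) and B with shape (d_model, r).
        per_module = self.rank * (self.d_model + self.d_model)
        return self.adapted_modules * per_module


setup = LoraSetup()
print(f"scaling factor: {setup.scaling}")
print(f"adapted modules: {setup.adapted_modules}")
print(f"trainable LoRA parameters: {setup.trainable_params:,}")
```

Under these assumptions the adapters add roughly 9.4 million trainable parameters, a small fraction of the 600M-parameter base model.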

### 3. Merging Strategy
After training, the LoRA weights were merged back into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. This produces a standalone **2.8 GB** model that can be loaded with `transformers` alone, without the `peft` library at inference time.
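
The idea behind merging can be shown on a toy example. The sketch below is not the code used for this model; it uses tiny hand-written matrices (all values are illustrative) to check that folding the scaled low-rank update into the weight, `W + (alpha / r) * B @ A`, gives the same output as running the base weight and the adapter side by side. The toy rank and alpha are chosen so that the scaling factor is 2, the same ratio as the card's 64 / 32.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(w, delta, scale):
    """Return w + scale * delta, element-wise."""
    return [[wv + scale * dv for wv, dv in zip(wr, dr)] for wr, dr in zip(w, delta)]


# Toy values (illustrative only): a 2x2 base weight and a rank-1 adapter.
rank, alpha = 1, 2
scale = alpha / rank                 # 2.0, same ratio as 64 / 32
W = [[1.0, 0.0], [0.0, 1.0]]         # frozen base weight
B = [[0.5], [1.0]]                   # shape (2, rank)
A = [[2.0, -1.0]]                    # shape (rank, 2)
x = [[3.0], [4.0]]                   # input column vector

# Adapter kept separate: y = W x + scale * B (A x)
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Adapter merged into the weight: W' = W + scale * B A, then y = W' x
W_merged = add_scaled(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)

print(W_merged)
print(y_adapter, y_merged)
```

Because both paths produce the same output, the merged weight can replace the base weight and the separate adapter matrices are no longer needed at inference time.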

---

## Evaluation Results

The model was evaluated on **100 unseen test sentences** across multiple metrics to ensure reliability.

### Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
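| **BERTScore F1** | **0.9685** | High semantic similarity to the references, suggesting that meaning is largely preserved during style transfer. |
| **chrF** | **75.56** | High character-level overlap with the references, consistent with correct handling of Myanmar suffixes and particles. |
| **TER** | **58.02** | Edit rate against a single reference (lower is better); much of the output wording differs from the reference phrasing. |
| **BLEU** | **12.94** | Low n-gram overlap with the single reference; several formal rewrites are often equally valid, which exact-match metrics penalize. |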
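
The gap between the character-level and word-level scores is easier to see on a toy case. The sketch below is a simplified character n-gram F-score written for illustration; it is not the implementation that produced the numbers above (the card does not state which tool was used), and it omits details that real chrF implementations handle. The Latin strings are stand-ins for a sentence in which only a final particle changes.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style score in [0, 100] (illustrative, not sacreBLEU)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # text too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


def word_overlap(hypothesis, reference):
    """Percentage of whitespace-separated hypothesis tokens found in the reference."""
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    return 100 * sum((hyp & ref).values()) / max(sum(hyp.values()), 1)


# Toy pair: the second "word" differs only in its last character.
hyp_text, ref_text = "abcd efgh", "abcd efgx"
chrf_score = simple_chrf(hyp_text, ref_text)
word_score = word_overlap(hyp_text, ref_text)
print(f"character-level: {chrf_score:.1f}, word-level: {word_score:.1f}")
```

A one-character difference invalidates the whole token for word-level matching but only a few character n-grams, which is one plausible reason chrF can stay high while BLEU is low on this task.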

### Qualitative Analysis
Manual review by native speakers confirms that the model excels at swapping spoken particles (e.g., *...တာပါ။*) for formal equivalents (e.g., *...ခြင်းဖြစ်သည်။*). Even when the model deviates from the reference text, the outputs remain linguistically acceptable and natural within a formal context.

---

## 🔗 Related Models in the DatarrX Ecosystem

To get the most out of Myanmar Style Transfer, we recommend using these sibling models:

* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** The inverse model for converting Written Style to Spoken Style.
* **[myX-StyleClassifier](https://huggingface.co/DatarrX/myX-StyleClassifier):** A high-performance classifier to identify whether a sentence is Written or Spoken before applying style transfer.

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the Merged Model
model_id = "DatarrX/myX-TransStyle-S2W"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare Input
prefix = "Rewrite Burmese spoken sentence into formal written Burmese: "
spoken_text = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
input_text = prefix + spoken_text

# 3. Generate Written Style
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs, 
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။
```

---

## Intended Use & Limitations

### Use Cases
- **Formalizing Content:** Converting interview transcripts or casual notes into professional reports.
- **Data Normalization:** Cleaning social media text for downstream NLP tasks.
- **Educational Tools:** Helping students learn the differences between Myanmar registers.

### Limitations
- **Hybrid Ambiguity:** In cases where a sentence structure is valid in both registers, the model may output minimal changes.
- **Domain Specificity:** Performance is optimized for standard Yangon/Mandalay dialects and may vary with heavy regional slang.

## Citation

### BibTeX
```BibTeX
@misc{myx_transstyle_s2w_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-S2W: A Spoken to Written Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-S2W}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**  
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*