---
license: mit
datasets:
- DatarrX/Myanmar-Written-Spoken-Parallel-Corpus
language:
- my
metrics:
- bleu
- chrf
- ter
- bertscore
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: text-generation
library_name: transformers
tags:
- burmese
- myanmar
- myanmar-language
- burmese-nlp
- style-transfer
- text-rewriting
- informal-to-formal
- spoken-to-written
- seq2seq
- nllb
- lora
- peft
- low-resource-language
- text-generation
model-index:
- name: myX-TransStyle-S2W
  results:
  - task:
      type: text-generation
      name: Burmese Style Transfer (Spoken to Written)
    dataset:
      name: Custom External Test Set
      type: csv
      config: default
      split: test
    metrics:
    - type: bleu
      value: 12.9445
      name: BLEU
    - type: chrf
      value: 75.5601
      name: chrF
    - type: ter
      value: 58.0189
      name: TER
    - type: bertscore
      value: 0.9685
      name: BERTScore F1
---
# 📝 myX-TransStyle-S2W: A Transformer-based Style Transfer for Myanmar Spoken (ပြောဟန်) to Written (ရေးဟန်)
**myX-TransStyle-S2W** is a specialized Sequence-to-Sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It is designed to transform colloquial **Spoken Burmese (ပြောဟန်)** into its formal **Written Burmese (ရေးဟန်)** counterpart while strictly preserving the original semantic meaning.
## Model Details
- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Spoken → Written)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)
---
## Linguistic Context: The Diglossia Challenge
Burmese is a **diglossic language**, characterized by a sharp divide between two distinct registers. Understanding this is crucial for effective Myanmar NLP:
* **Spoken Style (ပြောဟန်):** Used in daily life, social media, and verbal communication. It relies on colloquial grammatical markers such as **"တယ်"** (sentence-final verb marker) and **"ရဲ့"** (possessive).
* **Written Style (ရေးဟန်):** The standard for news, law, textbooks, and official documents. It uses formal markers such as **"သည်"**, **"၏"**, and **"၍"**.
Most existing AI models sound "robotic" because they are trained primarily on formal web-scraped data. **myX-TransStyle-S2W** bridges this gap by converting natural spoken input into grammatically correct formal written text.
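As a rough illustration of the markers listed above, the following Python sketch guesses a sentence's register by counting them. The function name `guess_register` and the marker tuples are assumptions made for this example; it is a naive substring heuristic, not how myX-TransStyle-S2W or myX-StyleClassifier works.

```python
# Markers copied from the list above; a real system would need a much
# larger inventory and proper tokenization. Illustrative only.
SPOKEN_MARKERS = ("တယ်", "ရဲ့")
WRITTEN_MARKERS = ("သည်", "၏", "၍")


def guess_register(sentence: str) -> str:
    """Return "spoken", "written", or "unknown" from naive marker counts."""
    spoken = sum(sentence.count(marker) for marker in SPOKEN_MARKERS)
    written = sum(sentence.count(marker) for marker in WRITTEN_MARKERS)
    if spoken > written:
        return "spoken"
    if written > spoken:
        return "written"
    return "unknown"  # no markers found, or a tie (mixed register)


if __name__ == "__main__":
    # Sentence pair taken from the usage example in this card.
    print(guess_register("ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"))
    print(guess_register("ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"))
```

Substring counting will misfire whenever a marker's character sequence appears inside an unrelated word, which is one reason a trained classifier or a Seq2Seq model is preferable in practice.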
---
## Training Methodology
The model was adapted with parameter-efficient fine-tuning on a parallel spoken/written corpus, and the resulting adapters were then merged into a standalone checkpoint.
### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))
We utilized **5,555 high-quality, unique parallel text pairs** from the [MWSPC dataset](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus). This dataset provides a direct mapping between informal and formal structures, curated specifically to remove duplicates and ensure linguistic diversity.
### 2. Parameter-Efficient Fine-Tuning (PEFT)
To capture complex structural transformations without losing the base model's knowledge, we used **Low-Rank Adaptation (LoRA)**:
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
* **Rank (R):** 32 | **Alpha:** 64.
* **Learning Rate:** 8e-5 with a Cosine scheduler.
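To give a sense of scale for these hyperparameters, the sketch below estimates the number of trainable LoRA parameters and evaluates a cosine learning-rate curve. The architecture figures (hidden size 1024, 12 encoder and 12 decoder layers) are assumptions about `nllb-200-distilled-600M`, not values stated in this card, and the schedule assumes no warmup and decay to zero. Both should be checked against the actual training configuration.

```python
import math

RANK, ALPHA = 32, 64           # from the list above
PEAK_LR = 8e-5                 # from the list above
D_MODEL = 1024                 # ASSUMPTION: hidden size of the base model
ENC_LAYERS = DEC_LAYERS = 12   # ASSUMPTION: layer counts of the base model
PROJ_PER_ATTENTION = 4         # q_proj, k_proj, v_proj, out_proj


def lora_trainable_params() -> int:
    """Count LoRA parameters if every attention projection is adapted."""
    # Each adapted d x d projection gains A (r x d) and B (d x r).
    per_module = RANK * (D_MODEL + D_MODEL)
    enc_modules = ENC_LAYERS * PROJ_PER_ATTENTION       # self-attention only
    dec_modules = DEC_LAYERS * PROJ_PER_ATTENTION * 2   # self- and cross-attention
    return per_module * (enc_modules + dec_modules)


def cosine_lr(step: int, total_steps: int) -> float:
    """Cosine decay from PEAK_LR to 0 (ASSUMPTION: no warmup phase)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    print("trainable LoRA parameters:", lora_trainable_params())
    print("LoRA scaling (alpha / r):", ALPHA / RANK)
    print("lr at start / midpoint:", cosine_lr(0, 1000), cosine_lr(500, 1000))
```

Under these assumptions the adapters add roughly 9.4 million trainable parameters, a small fraction of the 600M-parameter base model.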
### 3. Merging Strategy
After training, the LoRA weights were merged back into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. This creates a standalone **2.8 GB** model that does not require the PEFT library at inference time.
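Conceptually, merging folds the low-rank update into each adapted weight matrix: W' = W + (alpha / r) * B A. The toy example below checks this identity with plain Python lists. The matrices and helper names are made up for illustration, and this is not PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged weight matrix."""
    return add(W, scale(matmul(B, A), alpha / r))


# Toy shapes: d_out = 2, d_in = 3, r = 2 (hypothetical values).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]   # frozen base weight
A = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]    # LoRA down-projection, r x d_in
B = [[0.5, 0.0], [1.0, -0.5]]             # LoRA up-projection, d_out x r
ALPHA, R = 4, 2
x = [[1.0], [2.0], [3.0]]                 # input as a column vector

# Base output plus the scaled adapter output, as computed during training.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), ALPHA / R))
# Output of the single merged matrix, as used after merging.
merged_out = matmul(merge_lora(W, A, B, ALPHA, R), x)

print(adapter_out == merged_out)
```

Because the two outputs agree, the adapter matrices can be discarded after merging, which is why the published checkpoint loads with plain `transformers`.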
---
## Evaluation Results
The model was evaluated on **100 unseen test sentences** across multiple metrics to ensure reliability.
### Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| **BERTScore F1** | **0.9685** | Suggests strong meaning preservation during style transfer. |
| **chrF** | **75.56** | High character-level overlap with the reference, consistent with correct handling of Myanmar suffixes. |
| **TER** | **58.02** | Translation Edit Rate; lower is better. |
| **BLEU** | **12.94** | Low exact n-gram overlap with the reference; multiple formal rewrites are often valid, which BLEU penalizes. |
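To make the chrF row concrete, here is a simplified character n-gram F-score in pure Python. It follows the general idea of chrF (character n-gram precision and recall, averaged over n = 1..6 and combined with β = 2), but it is an illustrative sketch with hypothetical function names, not the sacreBLEU implementation, so its numbers will not exactly match the reported score.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams (Unicode code points), ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF on a 0-100 scale; higher means more overlap."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    # Spoken input vs. written output from the usage example in this card.
    spoken = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
    written = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"
    print(round(simple_chrf(spoken, written), 2))
```

Character-level scoring rewards outputs that share most of their characters with the reference even when a few particles differ, which helps explain how chrF can be high while BLEU stays low on the same test set.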
### Qualitative Analysis
Manual review by native speakers confirms that the model excels at swapping spoken particles (e.g., *...တာပါ။*) for formal equivalents (e.g., *...ခြင်းဖြစ်သည်။*). Even when the model deviates from the reference text, the outputs remain linguistically acceptable and natural within a formal context.
---
## 🔗 Related Models in the DatarrX Ecosystem
To get the most out of Myanmar Style Transfer, we recommend using these sibling models:
* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** The inverse model for converting Written Style to Spoken Style.
* **[myX-StyleClassifier](https://huggingface.co/DatarrX/myX-StyleClassifier):** A high-performance classifier to identify whether a sentence is Written or Spoken before applying style transfer.
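
One way to combine these models is to classify each sentence first and rewrite only the spoken ones. The sketch below shows that control flow with injected callables. `classify` and `rewrite` are hypothetical wrappers around myX-StyleClassifier and myX-TransStyle-S2W, and the label string `"spoken"` is an assumption about the classifier's output rather than something this card documents.

```python
from typing import Callable, Iterable, List


def formalize(
    sentences: Iterable[str],
    classify: Callable[[str], str],
    rewrite: Callable[[str], str],
    spoken_label: str = "spoken",  # ASSUMPTION: actual label name may differ
) -> List[str]:
    """Rewrite sentences classified as spoken; pass the rest through unchanged."""
    results = []
    for sentence in sentences:
        if classify(sentence) == spoken_label:
            results.append(rewrite(sentence))
        else:
            results.append(sentence)
    return results


# Stand-in callables so the example runs without downloading any model.
def fake_classify(sentence: str) -> str:
    return "spoken" if sentence.endswith("!") else "written"


def fake_rewrite(sentence: str) -> str:
    return sentence.rstrip("!") + "."


if __name__ == "__main__":
    print(formalize(["hello!", "Good day."], fake_classify, fake_rewrite))
```

Skipping sentences that are already in the written register avoids unnecessary generation calls and leaves formal text untouched.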
---
## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model
model_id = "DatarrX/myX-TransStyle-S2W"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare the input: task prefix followed by the spoken-style sentence
prefix = "Rewrite Burmese spoken sentence into formal written Burmese: "
spoken_text = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
input_text = prefix + spoken_text

# 3. Generate the written-style sentence
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force Burmese (Myanmar script) as the target language
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။
```
---
## Intended Use & Limitations
### Use Cases
- **Formalizing Content:** Converting interview transcripts or casual notes into professional reports.
- **Data Normalization:** Cleaning social media text for downstream NLP tasks.
- **Educational Tools:** Helping students learn the differences between Myanmar registers.
### Limitations
- **Hybrid Ambiguity:** In cases where a sentence structure is valid in both registers, the model may output minimal changes.
- **Domain Specificity:** Performance is optimized for standard Yangon/Mandalay dialects and may vary with heavy regional slang.
## Citation
### BibTeX
```BibTeX
@misc{myx_transstyle_s2w_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-S2W: A Spoken to Written Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-S2W}
}
```
---
## About the Author
**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
**Connect with the Author:**
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)
---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*