Update README.md
README.md
@@ -61,4 +61,137 @@ model-index:
value: 0.9693
name: BERTScore F1

---

# 📝 myX-TransStyle-W2S: A Transformer-based Style Transfer Model for Myanmar Written (ရေးဟန်) to Spoken (ပြောဟန်)

**myX-TransStyle-W2S** is a specialized sequence-to-sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It transforms formal **Written Burmese (ရေးဟန်)** into its natural, colloquial **Spoken Burmese (ပြောဟန်)** counterpart, so that formal documents or news can be rendered as fluid, conversational text while preserving the original meaning as closely as possible (BERTScore F1 of 0.9693 on the test set).

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Written → Spoken)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)

---

## Linguistic Context: The Diglossia Challenge

Burmese is a **diglossic language**: two functional registers coexist, and the gap between them is large.

* **Written Style (ရေးဟန်):** Used in news, law, textbooks, and official documents. It relies on formal grammatical markers such as **"သည်"**, **"၏"**, and **"၍"**.
* **Spoken Style (ပြောဟန်):** Used in daily life, verbal communication, and social media. It uses colloquial markers such as **"တယ်"** (tense), **"ရဲ့"** (possessive), and **"နဲ့"** (conjunction).
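
To make the register contrast concrete, here is a minimal sketch that counts the markers listed above and guesses which register a sentence leans toward. The function names are hypothetical and the marker lists are only the six examples from the bullets, so this is a rough heuristic for illustration, not part of the model or its training pipeline.

```python
# Marker lists copied from the bullets above; illustrative, not exhaustive.
WRITTEN_MARKERS = ("သည်", "၏", "၍")
SPOKEN_MARKERS = ("တယ်", "ရဲ့", "နဲ့")


def count_markers(text: str, markers: tuple[str, ...]) -> int:
    """Count non-overlapping occurrences of every marker in the text."""
    return sum(text.count(marker) for marker in markers)


def guess_register(text: str) -> str:
    """Hypothetical heuristic: compare written-style and spoken-style marker counts."""
    written = count_markers(text, WRITTEN_MARKERS)
    spoken = count_markers(text, SPOKEN_MARKERS)
    if written > spoken:
        return "written"
    if spoken > written:
        return "spoken"
    return "unknown"


print(guess_register("ပုဂံခေတ်သည် အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"))
print(guess_register("ပုဂံခေတ်က အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့တယ်။"))
```

A surface count like this cannot rewrite a sentence, which is why the conversion itself is left to a learned Seq2Seq model.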

**myX-TransStyle-W2S** addresses the stiff, "robotic" tone that formal Burmese gives to AI-generated text by rewriting it in the natural, warm register that native speakers use every day.

---

## Training Methodology

The model was trained with a parameter-efficient adaptation strategy suited to the structural shifts between the two Burmese registers.

### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))
The model was trained on **5,555 high-quality, unique parallel text pairs**. This dataset provides a direct mapping from formal literary structures to their informal colloquial equivalents, filtered to increase diversity.

### 2. Parameter-Efficient Fine-Tuning (PEFT)
To capture nuanced stylistic shifts without overwriting the base model's linguistic knowledge, we used **Low-Rank Adaptation (LoRA)**:

* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
* **Rank (r):** 32 | **Alpha:** 64.
* **Learning Rate:** 8e-5 with a cosine scheduler.
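
As a rough illustration of what these numbers imply, the sketch below computes the standard LoRA scaling factor (alpha / rank), the number of trainable parameters one adapter adds, and a plain cosine decay from the peak learning rate. The hidden size of 1024, the absence of warmup, and the step counts are assumptions made for the example; none of them is stated in this model card.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    """Hyperparameters quoted in the list above."""
    rank: int = 32
    alpha: int = 64
    target_modules: tuple[str, ...] = ("q_proj", "k_proj", "v_proj", "out_proj")

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


def lora_params_per_module(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank


def cosine_lr(step: int, total_steps: int, peak_lr: float = 8e-5) -> float:
    """Cosine decay from peak_lr to zero. ASSUMPTION: no warmup phase."""
    return 0.5 * peak_lr * (1 + math.cos(math.pi * step / total_steps))


settings = LoraSettings()
# ASSUMPTION: 1024 as the attention projection width; check the model config.
hidden_size = 1024
per_module = lora_params_per_module(hidden_size, hidden_size, settings.rank)

print(f"scaling = {settings.scaling}")
print(f"trainable parameters per adapted projection = {per_module:,}")
print(f"learning rate halfway through = {cosine_lr(50, 100):.2e}")
```

With alpha set to twice the rank, the low-rank update is applied at double weight relative to an unscaled adapter.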

### 3. Merging Strategy
The LoRA adapters were merged into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. The resulting standalone **2.8 GB** model loads with plain Transformers and runs inference without the PEFT library or any adapter overhead.
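
Conceptually, merging folds the low-rank update into the dense weight, W' = W + (alpha / rank) · B · A, so the merged layer produces the same output as the base layer plus the adapter path. The toy sketch below checks this on a 2x2 matrix in pure Python; the helper names and all values are made up for illustration and do not reproduce PEFT's implementation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged dense weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


# Toy 2x2 layer with a rank-1 adapter (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # shape (rank, d_in)
B = [[1.0], [2.0]]  # shape (d_out, rank)
x = [[3.0], [1.0]]  # input as a column vector

alpha, rank = 2, 1
merged = merge_lora(W, A, B, alpha, rank)

# Adapter path kept separate: W x + (alpha / rank) * B (A x)
base_out = matmul(W, x)
adapter_delta = matmul(B, matmul(A, x))
adapter_out = [[b[0] + (alpha / rank) * d[0]] for b, d in zip(base_out, adapter_delta)]

# Merged path: a single dense matrix multiply
merged_out = matmul(merged, x)
print(merged_out == adapter_out)
```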

---

## Evaluation Results

The model was evaluated on **100 held-out test sentences** and scored higher than its S2W (spoken-to-written) sibling.

### Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| **BERTScore F1** | **0.9693** | Suggests strong meaning preservation during style transfer. |
| **chrF** | **78.40** | High character-level overlap with the references, including on converted formal suffixes. |
| **BLEU** | **19.64** | Higher than S2W, reflecting a more consistent conversion pattern into spoken style. |
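
For readers unfamiliar with chrF, the sketch below shows the idea behind the metric: precision and recall over character n-grams, combined into an F-score that weights recall more heavily (beta = 2). This is a simplified, hypothetical re-implementation for illustration only. It is not the scorer used to produce the numbers in the table, and it works on Unicode code points rather than Burmese syllables.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified sentence-level character n-gram F-score on a 0-100 scale."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # skip n-gram orders longer than either string
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100 * (1 + beta**2) * p * r / (beta**2 * p + r)


reference = "ပုဂံခေတ်က ဖြစ်ခဲ့တယ်။"
print(simple_chrf(reference, reference))
print(simple_chrf("ပုဂံခေတ်သည် ဖြစ်ခဲ့သည်။", reference))
```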

### Qualitative Analysis
Manual review by native speakers indicates that the model not only swaps particles but also adjusts vocabulary (e.g., converting *“အလွန်ပင်”* to *“သိပ်”* or *“အကယ်ပင်”* to *“တကယ်လို့တောင်”*) in a way that reads as authentic and natural.

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model (the PEFT library is not required)
model_id = "DatarrX/myX-TransStyle-W2S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare the input: task prefix followed by the written-style sentence
prefix = "Rewrite Burmese formal written sentence into spoken Burmese: "
written_text = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
input_text = prefix + written_text

# 3. Generate the spoken-style rewrite
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Start decoding with the Burmese language tag
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။
```
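
The example above rewrites one sentence and caps generation at `max_length=160`. For longer documents, one option is to split the text on the Burmese section mark "။" and prefix each sentence separately before tokenizing the resulting list as a batch (for instance with `padding=True`). The helpers below are a minimal sketch of that preprocessing step; `split_sentences` and `build_inputs` are hypothetical names, and sentence-by-sentence processing is an assumption about usage rather than a documented requirement of the model.

```python
PREFIX = "Rewrite Burmese formal written sentence into spoken Burmese: "
SECTION_MARK = "။"  # Burmese sentence-final punctuation (U+104B)


def split_sentences(document: str) -> list[str]:
    """Split on the section mark, keeping the mark attached to each sentence."""
    pieces = document.split(SECTION_MARK)
    sentences = []
    for index, piece in enumerate(pieces):
        piece = piece.strip()
        if not piece:
            continue
        # Every piece except the last one was followed by a section mark.
        had_mark = index < len(pieces) - 1
        sentences.append(piece + SECTION_MARK if had_mark else piece)
    return sentences


def build_inputs(document: str) -> list[str]:
    """Prefix every sentence with the instruction string from the example above."""
    return [PREFIX + sentence for sentence in split_sentences(document)]


document = "ပုဂံခေတ်သည် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။ ယင်းသည် ထင်ရှားသည်။"
for model_input in build_inputs(document):
    print(model_input)
```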

---

## Intended Use & Limitations

### Use Cases
- **Natural AI Personalities:** Converting formal bot responses into natural-sounding speech.
- **Content Localization:** Making formal news or articles more accessible for audio/podcasts.
- **Creative Writing:** Assisting authors in converting narrative descriptions into natural character dialogue.

### Limitations
- **Dialectal Focus:** Primarily focuses on the standard Yangon/Mandalay dialect; regional slang may be less represented.
- **Contextual Nuance:** While meaning is preserved, the "warmth" of the spoken style may vary depending on the complexity of the input.

## Citation

### BibTeX
```bibtex
@misc{myx_transstyle_w2s_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-W2S: A Written to Spoken Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-W2S}
}
```

---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*