---
license: mit
datasets:
- DatarrX/Myanmar-Written-Spoken-Parallel-Corpus
language:
- my
metrics:
- bleu
- chrf
- ter
- bertscore
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: text-generation
library_name: transformers
tags:
- burmese
- myanmar
- myanmar-language
- burmese-nlp
- style-transfer
- text-rewriting
- informal-to-formal
- spoken-to-written
- seq2seq
- nllb
- lora
- peft
- low-resource-language
- text-generation
model-index:
- name: myX-TransStyle-S2W
  results:
  - task:
      type: text-generation
      name: Burmese Style Transfer (Spoken to Written)
    dataset:
      name: Custom External Test Set
      type: csv
      config: default
      split: test
    metrics:
    - type: bleu
      value: 12.9445
      name: BLEU
    - type: chrf
      value: 75.5601
      name: chrF
    - type: ter
      value: 58.0189
      name: TER
    - type: bertscore
      value: 0.9685
      name: BERTScore F1
---
# 📝 myX-TransStyle-S2W: A Transformer-based Style Transfer Model for Myanmar Spoken (ပြောဟန်) to Written (ရေးဟန်)

**myX-TransStyle-S2W** is a sequence-to-sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It rewrites colloquial **Spoken Burmese (ပြောဟန်)** as its formal **Written Burmese (ရေးဟန်)** counterpart while preserving the original meaning.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Spoken → Written)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)

---

## Linguistic Context: The Diglossia Challenge

Burmese is a **diglossic language**, with a sharp divide between two registers that any Myanmar NLP system has to account for:

* **Spoken Style (ပြောဟန်):** Used in daily life, social media, and verbal communication. It relies on colloquial grammatical markers like **"တယ်"** (tense) or **"ရဲ့"** (possessive).
* **Written Style (ရေးဟန်):** The standard for news, law, textbooks, and official documents. It uses formal markers such as **"သည်"**, **"၏"**, and **"၍"**.

Many existing AI models sound "robotic" in Burmese because they are trained primarily on formal, web-scraped text. **myX-TransStyle-S2W** helps bridge the two registers by converting natural spoken-style input into grammatically correct formal text.
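As a rough illustration of the marker contrast described above, the sketch below counts register markers in a sentence. It is a heuristic under stated assumptions, not a description of how the model works: the marker lists contain only the five examples named in this section, `guess_register` is a hypothetical helper, and plain substring matching can misfire on words that merely contain these character sequences. The two sentences are the spoken/written pair from this card's usage example.

```python
# Markers taken from the examples in this section only; a real list would be longer.
SPOKEN_MARKERS = ("တယ်", "ရဲ့")
WRITTEN_MARKERS = ("သည်", "၏", "၍")


def guess_register(sentence: str) -> str:
    """Hypothetical helper: guess the register by counting marker substrings."""
    spoken = sum(sentence.count(marker) for marker in SPOKEN_MARKERS)
    written = sum(sentence.count(marker) for marker in WRITTEN_MARKERS)
    if spoken > written:
        return "spoken"
    if written > spoken:
        return "written"
    return "unknown"  # no markers found, or a tie


spoken_example = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
written_example = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"

print(guess_register(spoken_example))   # spoken
print(guess_register(written_example))  # written
```

A trained classifier would be the more robust choice for real data; the sketch only makes the surface difference between the two registers concrete.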
---

## Training Methodology

The model was adapted from its NLLB base with parameter-efficient fine-tuning on a parallel spoken/written corpus, as described below.

### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))

We used **5,555 high-quality, unique parallel text pairs** from the [MWSPC dataset](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus). The dataset provides a direct mapping between informal and formal structures and was curated to remove duplicates and ensure linguistic diversity.
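The duplicate removal mentioned above could look roughly like the following sketch. This is an assumption about the curation step, not the authors' actual pipeline: `dedupe_pairs` and `normalize` are hypothetical helpers, pairs are represented as `(spoken, written)` tuples, and normalization is limited to Unicode NFC plus whitespace collapsing. The toy pairs reuse text from this card's usage example.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedupe_pairs(pairs):
    """Keep the first occurrence of each normalized (spoken, written) pair."""
    seen = set()
    unique = []
    for spoken, written in pairs:
        key = (normalize(spoken), normalize(written))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


toy_pairs = [
    ("ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။",
     "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"),
    # Same pair again, differing only in whitespace
    ("ပုဂံခေတ်က  မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။ ",
     "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"),
    ("ဖြစ်ခဲ့တယ်။", "ဖြစ်ခဲ့၏။"),
]

print(len(dedupe_pairs(toy_pairs)))  # 2
```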
### 2. Parameter-Efficient Fine-Tuning (PEFT)

To learn the structural transformations without overwriting the base model's knowledge, we used **Low-Rank Adaptation (LoRA)**:

* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`
* **Rank (r):** 32
* **Alpha:** 64
* **Learning Rate:** 8e-5 with a cosine scheduler
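The hyperparameters above can be written down as a PEFT configuration, sketched below. Only the rank, alpha, and target modules come from this card; the task type `SEQ_2_SEQ_LM` is an assumption based on the Seq2Seq architecture, unspecified options (dropout, bias) are left at PEFT defaults, and the adapter-size estimate assumes the base model uses `d_model = 1024` with 12 encoder and 12 decoder layers, which this card does not state.

```python
# Hyperparameters as listed in this section.
LORA_HPARAMS = {
    "r": 32,
    "lora_alpha": 64,
    "target_modules": ["q_proj", "k_proj", "v_proj", "out_proj"],
}


def build_lora_config():
    """Build a PEFT LoraConfig from the values above (requires `pip install peft`)."""
    # Imported lazily so the arithmetic below runs without PEFT installed.
    from peft import LoraConfig, TaskType

    # task_type is an assumption; the card does not state it.
    return LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, **LORA_HPARAMS)


# Standard LoRA scales the low-rank update by alpha / r.
scaling = LORA_HPARAMS["lora_alpha"] / LORA_HPARAMS["r"]

# Rough adapter size under ASSUMED base-model dimensions (not from this card).
D_MODEL = 1024
ENC_LAYERS, DEC_LAYERS = 12, 12
# Each encoder layer has one attention block; each decoder layer has two
# (self-attention and cross-attention).
attention_blocks = ENC_LAYERS * 1 + DEC_LAYERS * 2
adapted_matrices = attention_blocks * len(LORA_HPARAMS["target_modules"])
# A rank-r adapter on a d x d matrix adds r * (d + d) parameters.
params_per_matrix = LORA_HPARAMS["r"] * (D_MODEL + D_MODEL)
trainable_params = adapted_matrices * params_per_matrix

print(scaling)           # 2.0
print(trainable_params)  # 9437184
```

Under these assumptions the adapters add roughly 9.4M trainable parameters, a small fraction of the base model.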
### 3. Merging Strategy

After training, the LoRA weights were merged back into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. The result is a standalone **2.8 GB** model that does not require the PEFT library for inference.
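To make the merge step concrete, the toy sketch below shows what merging a LoRA adapter means numerically: the low-rank update is folded into the base weight as `W + (alpha / r) * B @ A`, after which the adapter path is no longer needed. The matrices, the input vector, and the helper functions are made up for illustration (with the same `alpha / r = 2` ratio as the settings above); this is not the PEFT implementation itself.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[s * x for x in row] for row in a]


# Toy sizes: 3x3 base weight, rank-2 adapter.
r, alpha = 2, 4
scaling = alpha / r

W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0], [0.5, 0.0, 1.0]]  # base weight
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # 3 x r
A = [[0.5, 0.0, 0.0], [0.0, 0.25, 0.0]]                   # r x 3
x = [[1.0], [2.0], [3.0]]                                 # input column vector

# Unmerged: base path plus scaled adapter path.
adapter_out = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))

# Merged: fold the adapter into the weight once, then use a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
merged_out = matmul(W_merged, x)

print(adapter_out)  # [[8.0], [0.0], [5.5]]
print(merged_out)   # [[8.0], [0.0], [5.5]]
```

Because both paths give the same output, the merged checkpoint can be loaded with plain `transformers` classes.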
---

## Evaluation Results

The model was evaluated on **100 unseen test sentences** using several automatic metrics.

### Performance Metrics

| Metric | Score | Interpretation |
|---|---|---|
| **BERTScore F1** | **0.9685** | High semantic similarity to the references, suggesting that meaning is largely preserved during style transfer. |
| **chrF** | **75.56** | High character-level overlap with the references, consistent with correct handling of Myanmar suffixes. |
| **BLEU** | **12.94** | Low exact n-gram overlap with the single reference; we attribute this to the fact that multiple formal rewrites are often valid. |
| **TER** | **58.02** | Edit rate against the reference (lower is better). |
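To show why a character-level score and a word-level score can diverge on Burmese text, the sketch below compares a simplified character n-gram F-score with a whitespace-token F1 on one sentence pair. It is an illustration only: `simple_chrf` averages per-order F-scores and is not the official chrF implementation, `word_f1` is not BLEU, and neither reproduces the numbers in the table. The pair is the spoken input and written output from this card's usage example, with the unchanged spoken input standing in as the hypothesis.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF-style score in [0, 1]: mean F-beta over n-gram orders."""
    scores = []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if not h or not r:
            continue  # text too short for this n-gram order
        overlap = sum((h & r).values())
        precision = overlap / sum(h.values())
        recall = overlap / sum(r.values())
        if precision + recall == 0:
            scores.append(0.0)
            continue
        scores.append((1 + beta**2) * precision * recall / (beta**2 * precision + recall))
    return sum(scores) / len(scores) if scores else 0.0


def word_f1(hyp: str, ref: str) -> float:
    """F1 over whitespace-separated tokens (a crude word-level proxy)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


hyp = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
ref = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"

print(f"char-level: {simple_chrf(hyp, ref):.2f}")
print(f"word-level: {word_f1(hyp, ref):.2f}")
```

Because Burmese attaches particles directly to the preceding word, changing one particle alters the whole whitespace-delimited token, so word-level overlap drops sharply while most character n-grams still match.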
### Qualitative Analysis

Manual review by native speakers indicates that the model reliably swaps spoken particles (e.g., *...တာပါ။*) for formal equivalents (e.g., *...ခြင်းဖြစ်သည်။*). Even when the model deviates from the reference text, the outputs remain linguistically acceptable and natural within a formal context.

---

## 🔗 Related Models in the DatarrX Ecosystem

To get the most out of Myanmar Style Transfer, we recommend using these sibling models:

* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** The inverse model for converting Written Style to Spoken Style.
* **[myX-StyleClassifier](https://huggingface.co/DatarrX/myX-StyleClassifier):** A classifier that identifies whether a sentence is Written or Spoken before style transfer is applied.

---

## How to Use
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the merged model (no PEFT dependency needed)
model_id = "DatarrX/myX-TransStyle-S2W"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare the input: task prefix + spoken-style sentence
prefix = "Rewrite Burmese spoken sentence into formal written Burmese: "
spoken_text = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
input_text = prefix + spoken_text

# 3. Generate the written-style rewrite
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Force Burmese (Myanmar script) as the target language
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။
```
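For more than one sentence at a time, the same steps can be wrapped in a small batch helper. The sketch below is one possible way to do it, not part of the released code: `build_inputs` and `rewrite_batch` are hypothetical helpers, and the generation settings simply mirror the single-sentence example above. It expects a tokenizer and model that have already been loaded as shown there.

```python
from typing import List

# Task prefix used in the single-sentence example above.
PREFIX = "Rewrite Burmese spoken sentence into formal written Burmese: "


def build_inputs(sentences: List[str]) -> List[str]:
    """Strip whitespace, drop empty lines, and prepend the task prefix."""
    return [PREFIX + s.strip() for s in sentences if s.strip()]


def rewrite_batch(sentences: List[str], tokenizer, model,
                  max_length: int = 160, num_beams: int = 5) -> List[str]:
    """Rewrite a batch of spoken-style sentences with an already loaded model."""
    inputs = tokenizer(
        build_inputs(sentences),
        return_tensors="pt",
        padding=True,       # pad to the longest sentence in the batch
        truncation=True,
        max_length=max_length,
    )
    outputs = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
        max_length=max_length,
        num_beams=num_beams,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


# Only the input preparation is exercised here; no model is downloaded.
print(build_inputs(["  ဖြစ်ခဲ့တယ်။ ", ""]))
```

Calling `rewrite_batch(sentences, tokenizer, model)` returns one written-style string per non-empty input sentence, in order.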
---

## Intended Use & Limitations

### Use Cases

- **Formalizing Content:** Converting interview transcripts or casual notes into professional reports.
- **Data Normalization:** Cleaning social media text for downstream NLP tasks.
- **Educational Tools:** Helping students learn the differences between Myanmar registers.

### Limitations

- **Hybrid Ambiguity:** In cases where a sentence structure is valid in both registers, the model may output minimal changes.
- **Domain Specificity:** Performance is optimized for standard Yangon/Mandalay dialects and may vary with heavy regional slang.
## Citation

### BibTeX

```BibTeX
@misc{myx_transstyle_s2w_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-S2W: A Spoken to Written Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-S2W}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**

[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---

*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*