---
license: mit

datasets:
  - DatarrX/Myanmar-Written-Spoken-Parallel-Corpus

language:
  - my

metrics:
  - bleu
  - chrf
  - ter
  - bertscore

base_model:
  - facebook/nllb-200-distilled-600M

pipeline_tag: text-generation

library_name: transformers

tags:
  - burmese
  - myanmar
  - myanmar-language
  - burmese-nlp
  - style-transfer
  - text-rewriting
  - informal-to-formal
  - spoken-to-written
  - seq2seq
  - nllb
  - lora
  - peft
  - low-resource-language
  - text-generation

model-index:
  - name: myX-TransStyle-S2W
    results:
      - task:
          type: text-generation
          name: Burmese Style Transfer (Spoken to Written)
        dataset:
          name: Custom External Test Set
          type: csv
          config: default
          split: test
        metrics:
          - type: bleu
            value: 12.9445
            name: BLEU
          - type: chrf
            value: 75.5601
            name: chrF
          - type: ter
            value: 58.0189
            name: TER
          - type: bertscore
            value: 0.9685
            name: BERTScore F1

---
# 📝 myX-TransStyle-S2W: A Transformer-based Style Transfer for Myanmar Spoken (ပြောဟန်) to Written (ရေးဟန်)

**myX-TransStyle-S2W** is a specialized sequence-to-sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It transforms colloquial **Spoken Burmese (ပြောဟန်)** into its formal **Written Burmese (ရေးဟန်)** counterpart while preserving the original meaning.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Spoken → Written)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)

---

## Linguistic Context: The Diglossia Challenge

Burmese is a **diglossic language**, characterized by a sharp divide between two distinct registers. Understanding this is crucial for effective Myanmar NLP:

* **Spoken Style (ပြောဟန်):** Used in daily life, social media, and verbal communication. It relies on colloquial grammatical markers such as **"တယ်"** (sentence-final verb marker) or **"ရဲ့"** (possessive).
* **Written Style (ရေးဟန်):** The standard for news, law, textbooks, and official documents. It uses formal markers such as **"သည်"**, **"၏"**, and **"၍"**.

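As a rough illustration of the marker contrast above, the sketch below counts register-specific markers in a sentence. It is a naive heuristic written for this card, not how this model or myX-StyleClassifier works: the function name `guess_register` is hypothetical, the short marker lists are taken from the bullets above, and plain substring matching will misfire on words that merely contain a marker.

```python
# Naive register heuristic: count spoken vs. written markers by substring match.
# The marker lists are illustrative only (taken from the bullets above).
SPOKEN_MARKERS = ("တယ်", "ရဲ့")
WRITTEN_MARKERS = ("သည်", "၏", "၍")


def guess_register(sentence: str) -> str:
    """Return 'spoken', 'written', or 'unknown' based on marker counts."""
    spoken = sum(sentence.count(m) for m in SPOKEN_MARKERS)
    written = sum(sentence.count(m) for m in WRITTEN_MARKERS)
    if spoken > written:
        return "spoken"
    if written > spoken:
        return "written"
    return "unknown"


if __name__ == "__main__":
    print(guess_register("ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"))
```
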
Most existing AI models sound "robotic" in Burmese because they are trained primarily on formal web-scraped data. **myX-TransStyle-S2W** helps bridge the two registers by converting natural spoken-style input into grammatically correct formal text.

---

## Training Methodology

The model was adapted from NLLB-200 with parameter-efficient fine-tuning on a parallel spoken/written corpus, and the resulting adapters were then merged into a standalone checkpoint.

### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))
We used **5,555 high-quality, unique parallel text pairs** from the [MWSPC dataset](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus). The dataset maps informal structures directly to their formal equivalents and was curated to remove duplicates and preserve linguistic diversity.

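The deduplication mentioned above could look roughly like the following sketch. The actual curation pipeline is not described in this card, so the function names, the `(spoken, written)` tuple layout, and the NFC/whitespace normalization are all assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def dedupe_pairs(pairs):
    """Keep the first occurrence of each (spoken, written) pair, preserving order."""
    seen = set()
    unique = []
    for spoken, written in pairs:
        key = (normalize(spoken), normalize(written))
        # Skip rows with an empty side and pairs already seen after normalization.
        if not key[0] or not key[1] or key in seen:
            continue
        seen.add(key)
        unique.append(key)
    return unique
```
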
### 2. Parameter-Efficient Fine-Tuning (PEFT)
To capture complex structural transformations without losing the base model's knowledge, we used **Low-Rank Adaptation (LoRA)**:
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
* **Rank (R):** 32 | **Alpha:** 64.
* **Learning Rate:** 8e-5 with a Cosine scheduler.

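To give a sense of scale for these settings, the sketch below estimates how many trainable parameters the LoRA adapters add and evaluates a cosine learning-rate curve. The architecture figures (hidden size 1024, 12 encoder and 12 decoder layers, with the four projections present in every self-attention and cross-attention block) are assumptions about `nllb-200-distilled-600M`. The schedule assumes no warmup and a hypothetical 1,000-step run, neither of which is stated in this card.

```python
import math

RANK, ALPHA, PEAK_LR = 32, 64, 8e-5          # values from the list above
D_MODEL = 1024                               # assumed hidden size
ENC_LAYERS, DEC_LAYERS = 12, 12              # assumed layer counts
PROJECTIONS = 4                              # q_proj, k_proj, v_proj, out_proj


def lora_param_count() -> int:
    """Each adapted d x d projection adds A (r x d) and B (d x r): 2 * r * d weights."""
    per_projection = 2 * RANK * D_MODEL
    encoder_blocks = ENC_LAYERS * 1          # self-attention only
    decoder_blocks = DEC_LAYERS * 2          # self-attention + cross-attention
    return (encoder_blocks + decoder_blocks) * PROJECTIONS * per_projection


def cosine_lr(step: int, total_steps: int, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` at step 0 to 0 at `total_steps` (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(f"LoRA scaling alpha/r = {ALPHA / RANK}")
print(f"Estimated trainable LoRA parameters: {lora_param_count():,}")
print(f"LR halfway through a 1,000-step run: {cosine_lr(500, 1000):.2e}")
```
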
### 3. Merging Strategy
After training, the LoRA weights were merged back into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. This creates a standalone **2.8 GB** model that does not require additional PEFT libraries for inference.

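Merging works because a LoRA adapter only adds a low-rank term to each adapted weight matrix, so the update can be folded into the matrix once and the adapter discarded. The toy example below checks that identity with plain Python lists. The tiny matrices and helper names are made up for illustration; the real merge is performed by PEFT's `merge_and_unload()` as described above.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A, the merged weight matrix."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 weight with a rank-1 adapter (B is 2x1, A is 1x2).
# alpha=2, rank=1 gives the same scale factor (2.0) as alpha=64, rank=32.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]
A = [[0.5, -0.5]]
x = [[3.0], [1.0]]                      # column-vector input

merged = merge_lora(W, B, A, alpha=2, rank=1)

# Adapter applied at run time: W x + scale * B (A x)
wx = matmul(W, x)
bax = matmul(B, matmul(A, x))
runtime = [[wx[i][0] + 2.0 * bax[i][0]] for i in range(2)]

assert matmul(merged, x) == runtime     # same output without the adapter
print(merged)
```
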
---

## Evaluation Results

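The model was evaluated on **100 unseen test sentences** (a custom external test set) using BERTScore, chrF, BLEU, and TER.

### Performance Metrics
| Metric | Score | Interpretation |
|---|---|---|
| **BERTScore F1** | **0.9685** | High semantic similarity to the references, suggesting that meaning is largely preserved during style transfer. |
| **chrF** | **75.56** | High character-level overlap, consistent with correct handling of Myanmar suffixes and particles. |
| **BLEU** | **12.94** | Low exact n-gram overlap with the single reference; several formal rewrites of one sentence are often equally valid. |
| **TER** | **58.02** | Translation Edit Rate (lower is better); many outputs would need edits to match the reference word for word. |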

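Character-level scoring rewards a hypothesis that differs from its reference only in a few particles. The function below is a simplified character n-gram F-score in the spirit of chrF, written for illustration; it is not the sacreBLEU implementation and will not reproduce the scores in the table. The name `simple_chrf` and the defaults (`max_n=6`, `beta=2.0`) are assumptions.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Average n-gram precision and recall over orders 1..max_n, then combine as F-beta (0-100)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


if __name__ == "__main__":
    spoken = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
    written = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။"
    # Score of the unconverted spoken input against the written reference.
    print(round(simple_chrf(spoken, written), 2))
```
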
### Qualitative Analysis
Manual review by native speakers confirms that the model excels at swapping spoken particles (e.g., *...တာပါ။*) for formal equivalents (e.g., *...ခြင်းဖြစ်သည်။*). Even when the model deviates from the reference text, the outputs remain linguistically acceptable and natural within a formal context.

---

## 🔗 Related Models in the DatarrX Ecosystem

To get the most out of Myanmar Style Transfer, we recommend using these sibling models:

* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** The inverse model for converting Written Style to Spoken Style.
* **[myX-StyleClassifier](https://huggingface.co/DatarrX/myX-StyleClassifier):** A high-performance classifier to identify whether a sentence is Written or Spoken before applying style transfer.

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the Merged Model
model_id = "DatarrX/myX-TransStyle-S2W"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare Input
prefix = "Rewrite Burmese spoken sentence into formal written Burmese: "
spoken_text = "ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့တယ်။"
input_text = prefix + spoken_text

# 3. Generate Written Style
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs, 
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံး အင်ပါယာနိုင်ငံကြီး ဖြစ်ခဲ့၏။
```

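For more than one sentence, the same steps can be wrapped in a small batching helper. This is a sketch rather than an official API: `to_written`, `batched`, and the default `batch_size` are hypothetical names and values, and the function expects the `tokenizer` and `model` objects loaded as in the snippet above.

```python
PREFIX = "Rewrite Burmese spoken sentence into formal written Burmese: "


def batched(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def to_written(sentences, tokenizer, model, batch_size=8, max_length=160, num_beams=5):
    """Convert a list of spoken-style sentences to written style, batch by batch."""
    results = []
    for chunk in batched(sentences, batch_size):
        # Same task prefix as the single-sentence example.
        prompts = [PREFIX + s.strip() for s in chunk]
        inputs = tokenizer(
            prompts,
            return_tensors="pt",
            padding=True,          # pad to the longest prompt in the batch
            truncation=True,
            max_length=max_length,
        )
        outputs = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
            max_length=max_length,
            num_beams=num_beams,
        )
        results.extend(tokenizer.batch_decode(outputs, skip_special_tokens=True))
    return results
```
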
---

## Intended Use & Limitations

### Use Cases
- **Formalizing Content:** Converting interview transcripts or casual notes into professional reports.
- **Data Normalization:** Cleaning social media text for downstream NLP tasks.
- **Educational Tools:** Helping students learn the differences between Myanmar registers.

### Limitations
- **Hybrid Ambiguity:** In cases where a sentence structure is valid in both registers, the model may output minimal changes.
- **Domain Specificity:** Performance is optimized for standard Yangon/Mandalay dialects and may vary with heavy regional slang.

## Citation

### BibTeX
```bibtex
@misc{myx_transstyle_s2w_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-S2W: A Spoken to Written Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-S2W}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**  
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*