---
license: mit
datasets:
- DatarrX/Myanmar-Written-Spoken-Parallel-Corpus
language:
- my
metrics:
- bleu
- chrf
- ter
- bertscore
base_model:
- facebook/nllb-200-distilled-600M
pipeline_tag: text-generation
library_name: transformers
tags:
- burmese
- myanmar
- myanmar-language
- burmese-nlp
- style-transfer
- text-rewriting
- formal-to-informal
- written-to-spoken
- seq2seq
- nllb
- lora
- peft
- low-resource-language
- text-generation
model-index:
- name: myX-TransStyle-W2S
  results:
  - task:
      type: text-generation
      name: Burmese Style Transfer (Written to Spoken)
    dataset:
      name: Custom External Test Set
      type: csv
      config: default
      split: test
    metrics:
    - type: bleu
      value: 19.6381
      name: BLEU
    - type: chrf
      value: 78.3975
      name: chrF
    - type: ter
      value: 50.7353
      name: TER
    - type: bertscore
      value: 0.9693
      name: BERTScore F1
---

# 📝 myX-TransStyle-W2S: A Transformer-based Style Transfer for Myanmar Written (ရေးဟန်) to Spoken (ပြောဟန်)

**myX-TransStyle-W2S** is a specialized Sequence-to-Sequence (Seq2Seq) model developed by **Khant Sint Heinn (Kalix Louis)** under **DatarrX**. It is designed to transform formal **Written Burmese (ရေးဟန်)** into its natural colloquial **Spoken Burmese (ပြောဟန်)** counterpart, so that formal documents or news can be converted into fluid, human-like dialogue while preserving the original meaning.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Architecture:** Fine-tuned NLLB-200 (600M Distilled) with merged LoRA adapters
- **Language:** Burmese (Myanmar)
- **Task:** Text Style Transfer (Written → Spoken)
- **License:** MIT
- **Trained on:** [Myanmar Written-Spoken Parallel Corpus (MWSPC)](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus)

---

## Linguistic Context: The Diglossia Challenge

Burmese is a **diglossic language** with a wide gap between two functional registers:

* **Written Style (ရေးဟန်):** Used in news, law, textbooks, and official documents. It relies on formal grammatical markers such as **"သည်"**, **"၏"**, and **"၍"**.
* **Spoken Style (ပြောဟန်):** Used in daily life, verbal communication, and social media. It uses colloquial markers like **"တယ်"** (tense), **"ရဲ့"** (possessive), and **"နဲ့"** (conjunction).

**myX-TransStyle-W2S** addresses the "robotic" tone of formal machine-generated text by rewriting it in the natural, warm register that native speakers use every day.

---

## Training Methodology

The model was trained with parameter-efficient fine-tuning, targeting the structural shifts between the two Burmese registers.

### 1. The Dataset ([MWSPC](https://huggingface.co/datasets/DatarrX/Myanmar-Written-Spoken-Parallel-Corpus))

The model was trained on **5,555 high-quality, unique parallel text pairs**. This dataset provides a direct mapping from formal literary structures to their informal colloquial equivalents and was filtered for diversity.

### 2. Parameter-Efficient Fine-Tuning (PEFT)

To capture nuanced stylistic shifts without overwriting the base model's general linguistic knowledge, we used **Low-Rank Adaptation (LoRA)**:

* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `out_proj`.
* **Rank (R):** 32 | **Alpha:** 64.
* **Learning Rate:** 8e-5 with a Cosine scheduler.

### 3. Merging Strategy

The LoRA adapters were merged into the base `nllb-200-distilled-600M` model using `merge_and_unload()`. The resulting standalone **2.8 GB** model runs inference without the PEFT library or any adapter overhead.

---

## Evaluation Results

The model was evaluated on **100 unseen test sentences** and scored higher than its S2W sibling.

### Performance Metrics

| Metric | Score | Interpretation |
|---|---|---|
| **BERTScore F1** | **0.9693** | Suggests strong meaning preservation during style transfer. |
| **chrF** | **78.40** | High character-level overlap with the references, particularly in converted formal suffixes. |
| **BLEU** | **19.64** | Higher than S2W, reflecting a more consistent conversion pattern into spoken style. |

### Qualitative Analysis

Manual review by native speakers indicates that the model not only swaps particles but also adjusts vocabulary (e.g., converting *“အလွန်ပင်”* to *“သိပ်”* or *“အကယ်ပင်”* to *“တကယ်လို့တောင်”*) in a way that reads as authentic and human.

---

## 🔗 Related Models in the DatarrX Ecosystem

Explore other specialized models for Myanmar linguistic styles:

* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** The sibling model for converting Spoken Style to formal Written Style.
* **[myX-StyleClassifier](https://huggingface.co/DatarrX/myX-StyleClassifier):** Use this to automatically detect the style of your input text before processing.

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. Load the Merged Model
model_id = "DatarrX/myX-TransStyle-W2S"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

# 2. Prepare Input
prefix = "Rewrite Burmese formal written sentence into spoken Burmese: "
written_text = "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
input_text = prefix + written_text

# 3. Generate Spoken Style
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(
    **inputs,
    # Start decoding with the Burmese language token (mya_Mymr)
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("mya_Mymr"),
    max_length=160,
    num_beams=5,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# Output: ပုဂံခေတ်က မြန်မာနိုင်ငံသမိုင်းမှာ ပထမဆုံး အင်ပါယာနိုင်ငံတော်ကြီးဖြစ်ခဲ့တယ်။
```

---

## Intended Use & Limitations

### Use Cases

- **Natural AI Personalities:** Converting formal bot responses into natural-sounding speech.
- **Content Localization:** Making formal news or articles more accessible for audio/podcasts.
- **Creative Writing:** Assisting authors in converting narrative descriptions into natural character dialogue.

### Limitations

- **Dialectal Focus:** Primarily focuses on the standard Yangon/Mandalay dialect; regional slang may be less represented.
- **Contextual Nuance:** While meaning is preserved, the "warmth" of the spoken style may vary depending on the complexity of the input.

## Citation

### BibTeX

```BibTeX
@misc{myx_transstyle_w2s_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-TransStyle-W2S: A Written to Spoken Burmese Style Transfer Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-TransStyle-W2S}
}
```

---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:** [GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---

*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*
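
The How to Use example earlier in this card rewrites a single sentence. For longer passages, one possible approach (an assumption here, not something this card specifies) is to split the text on the Burmese sentence-final mark `။` and prepend the prompt prefix to each sentence before tokenizing. The sketch below is pure Python; `split_sentences` and `build_inputs` are hypothetical helper names, and only the prefix string is taken from this card.

```python
import re

# Prompt prefix copied from the "How to Use" example in this card.
PREFIX = "Rewrite Burmese formal written sentence into spoken Burmese: "


def split_sentences(text: str) -> list[str]:
    """Split Burmese text into sentences, keeping the trailing '။' on each.

    Hypothetical helper: a simple regex split, not a full sentence segmenter.
    """
    # Each match is a run of non-'။' characters plus an optional closing '။'.
    parts = re.findall(r"[^။]+။?", text)
    # Drop surrounding whitespace and any empty fragments.
    return [p.strip() for p in parts if p.strip()]


def build_inputs(text: str) -> list[str]:
    """Prepend the prompt prefix to every sentence in the text."""
    return [PREFIX + sentence for sentence in split_sentences(text)]


if __name__ == "__main__":
    passage = (
        "ပုဂံခေတ်သည် မြန်မာနိုင်ငံသမိုင်းတွင် ပထမဆုံးသော "
        "အင်ပါယာနိုင်ငံတော်ကြီး ဖြစ်ခဲ့သည်။"
    )
    for model_input in build_inputs(passage):
        print(model_input)
```

Each returned string can be passed to the tokenizer and `model.generate` call from the How to Use example, and the decoded outputs joined back together. Whether sentence-level rewriting works better than passing a whole paragraph has not been measured here.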