---
license: mit
datasets:
- DatarrX/Myanmar-Style-Classification-Corpus
language:
- my
pipeline_tag: text-classification
metrics:
- f1
- accuracy
- precision
- recall
library_name: sklearn
---

# 📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles

**myX-StyleClassifier** is a high-performance machine learning model developed by **Khant Sint Heinn** under **DatarrX** that classifies Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Type:** Ensemble Machine Learning (Voting Classifier)
- **Language(s):** Burmese (Myanmar)
- **License:** MIT
- **Trained on:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)

## Training Methodology

To achieve robust performance beyond simple keyword matching, the model was trained with an **ensemble learning** approach that combines character-level features with several complementary classifiers.

### 1. Feature Engineering
The model utilizes a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer with a character-level N-gram range of **(2, 4)**. This allows the model to capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.
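
As a rough illustration of what character n-grams of length 2 to 4 look like, the sketch below enumerates them in plain Python. The helper `char_ngrams` is hypothetical and is not part of the released model. scikit-learn's `TfidfVectorizer` additionally normalizes whitespace and applies TF-IDF weighting, and the vectorizer settings beyond the n-gram range are not documented in this card.

```python
def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> list[str]:
    """Return every character n-gram of length n_min..n_max (hypothetical helper)."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


# Two short endings that differ only in register.
formal = "သွားပါသည်။"       # ends with the written-style suffix
colloquial = "သွားတယ်။"     # ends with the spoken-style suffix

formal_grams = set(char_ngrams(formal))
colloquial_grams = set(char_ngrams(colloquial))

# Each suffix is three code points long, so it appears as a single 3-gram.
print("သည်" in formal_grams, "တယ်" in formal_grams)
print("တယ်" in colloquial_grams, "သည်" in colloquial_grams)
```

Because the suffixes surface directly as n-gram features, no word segmentation step is needed before vectorization.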

### 2. Ensemble Architecture
We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:
* **Logistic Regression:** Optimized with `C=10.0` for high-precision linear separation.
* **Support Vector Machine (SVC):** Provides robust decision boundaries in the high-dimensional text feature space.
* **Random Forest:** Captures non-linear relationships and the importance of specific n-gram features.

The final configuration was selected via **GridSearchCV**, ensuring the hyperparameters are fine-tuned for the unique structure of the Myanmar language.
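
A minimal sketch of how such a pipeline could be assembled in scikit-learn follows. Only the character n-gram range, `C=10.0`, soft voting, the three estimator types, and the use of `GridSearchCV` come from this card. The `analyzer="char"` setting, the search grid, `cv=2`, all other hyperparameters, and the eight toy sentences are assumptions for illustration and do not reproduce the actual training setup.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data for illustration only (NOT taken from MSCC).
# 0 = Written/Formal, 1 = Spoken/Colloquial
texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",
    "သူသည် စာအုပ် ဖတ်သည်။",
    "မိုးရွာသည်။",
    "သူမသည် ထမင်း စားသည်။",
    "ငါ ကျောင်းသွားမလို့။",
    "သူ စာအုပ် ဖတ်တယ်။",
    "မိုးရွာတယ်။",
    "သူမ ထမင်း စားတယ်။",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Soft voting averages the class probabilities of the three estimators,
# so every estimator must expose predict_proba (hence probability=True for SVC).
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        ("svc", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("clf", ensemble),
])

# Assumed search grid: only the logistic regression C is varied here.
param_grid = {"clf__lr__C": [1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(texts, labels)

probabilities = search.predict_proba(["မိုးရွာတယ်။"])
print(search.best_params_)
print(probabilities)
```

Wrapping the vectorizer and the ensemble in one `Pipeline` means the fitted object accepts raw strings, and the grid search refits the vectorizer inside each fold rather than on the full data.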

## Evaluation Results

The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split). 

### Metrics
| Metric | Score |
|---|---|
| **Accuracy** | **96.00%** |
| **Macro F1-Score** | **0.96** |

### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |

### Evaluation Breakdown (Confusion Matrix)

The following table illustrates how the model performed on 100 unseen test sentences:

| | Predicted Formal | Predicted Colloquial |
|---|:---:|:---:|
| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |

**Key Insights from the Matrix:**
* **True Positives (Formal):** 37 formal sentences were correctly identified.
* **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
* **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
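
The reported metrics can be re-derived from the confusion matrix. The sketch below is a plain-Python check written for this card, not the original evaluation script; the variable and function names are illustrative.

```python
# Confusion matrix from the table above: outer key = actual, inner key = predicted.
confusion = {
    "formal": {"formal": 37, "colloquial": 3},
    "colloquial": {"formal": 1, "colloquial": 59},
}


def per_class_metrics(confusion, cls):
    """Return (precision, recall, f1) for one class."""
    tp = confusion[cls][cls]
    fn = sum(v for pred, v in confusion[cls].items() if pred != cls)
    fp = sum(row[cls] for actual, row in confusion.items() if actual != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[c][c] for c in confusion)
accuracy = correct / total

metrics = {c: per_class_metrics(confusion, c) for c in confusion}
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

print(f"accuracy = {accuracy:.2f}")
for cls, (p, r, f1) in metrics.items():
    print(f"{cls}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro F1 = {macro_f1:.3f}")
```

For example, formal recall is 37 / 40 = 0.925, which the classification report shows rounded to 0.93.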

### Error Analysis (Ambiguity Handling)
In the 4 misclassified cases (4% of the test set), human review confirmed **stylistic ambiguity**: certain Myanmar sentences are "hybrid" or "dual-use," with vocabulary neutral enough to appear in both formal writing and polite daily conversation.


## How to Use
> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.

```python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။", # Formal
    "ငါ ကျောင်းသွားမလို့။",              # Colloquial
    "ခဏစောင့်ပေးပါ။"                   # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts) # Get confidence scores

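for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    # With soft voting, the predicted class is the one with the highest
    # averaged probability, so the row maximum is its confidence.
    confidence = prob.max() * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")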
```
---

## 🔄 Beyond Classification: Style Transfer

Once you have identified the style of your text using **myX-StyleClassifier**, you can use our transformation models to switch between registers:

* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** Convert detected Spoken text into formal Written prose.
* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** Transform detected Written text into natural Spoken dialogue.

---

## Intended Use & Limitations

### Use Cases
- **Style Checking**: Automating the detection of informal language in professional documents.
- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.

### Limitations
- The model may struggle with Internet Slang or Ancient Literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.
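
One way to handle low-confidence predictions is to abstain below a cutoff instead of forcing a label. The sketch below is a hypothetical post-processing helper, not part of the released model. The `0.75` threshold is an arbitrary assumption that should be tuned on your own data, and the code assumes the probability columns follow the label order 0 = Written/Formal, 1 = Spoken/Colloquial.

```python
LABELS = {0: "Written/Formal", 1: "Spoken/Colloquial"}


def label_with_abstention(probabilities, threshold=0.75):
    """Map each probability row to a style label, or "Uncertain" below the threshold.

    `probabilities` is any sequence of rows such as the output of predict_proba.
    The threshold value is an illustrative assumption.
    """
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        if row[best] >= threshold:
            results.append(LABELS[best])
        else:
            results.append("Uncertain")
    return results


# Hand-written example rows standing in for model.predict_proba(...) output.
example_rows = [[0.92, 0.08], [0.10, 0.90], [0.55, 0.45]]
print(label_with_abstention(example_rows))
```

Sentences flagged as "Uncertain" can then be routed to manual review rather than being counted as either register.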

## Citation

### BibTeX
```bibtex
@misc{myx_styleclassifier_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**  
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*