---
license: mit
datasets:
- DatarrX/Myanmar-Style-Classification-Corpus
language:
- my
pipeline_tag: text-classification
metrics:
- f1
- accuracy
- precision
- recall
library_name: sklearn
---
# myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles
**myX-StyleClassifier** is a machine learning model developed by **Khant Sint Heinn** at **DatarrX** that classifies Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.
## Model Details
- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Type:** Ensemble Machine Learning (Voting Classifier)
- **Language(s):** Burmese (Myanmar)
- **License:** MIT
- **Trained on:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)
## Training Methodology
To go beyond simple keyword matching, the model was trained with an **ensemble learning** approach.
### 1. Feature Engineering
The model uses a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer over character-level n-grams in the range **(2, 4)**. This lets the model capture Myanmar grammatical suffixes (e.g., formal "...သည်" vs. colloquial "...တယ်") and complex structural patterns without requiring a custom tokenizer.
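To make the feature space concrete, the sketch below lists the character n-grams of length 2 to 4 for a short string. It is a plain-Python illustration, not scikit-learn's implementation, and `char_ngrams` is a hypothetical helper name. In scikit-learn the corresponding setup would presumably be something like `TfidfVectorizer(analyzer="char", ngram_range=(2, 4))`, though this card does not state the exact arguments used.

```python
def char_ngrams(text, min_n=2, max_n=4):
    """Return all character n-grams of length min_n..max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n across the string, one code point at a time.
        for start in range(len(text) - n + 1):
            grams.append(text[start:start + n])
    return grams


# A Latin-script stand-in; Myanmar text is sliced the same way.
print(char_ngrams("abcd"))
# ['ab', 'bc', 'cd', 'abc', 'bcd', 'abcd']
```

Because sentence-final particles sit at the end of the string, the last few n-grams of a sentence tend to carry most of the register signal.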
### 2. Ensemble Architecture
We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:
* **Logistic Regression:** Optimized with `C=10.0` for high-precision linear separation.
* **Support Vector Machine (SVC):** Providing robust boundaries in high-dimensional text space.
* **Random Forest:** Captures non-linear relationships and specific word importance.
The final configuration was selected via **GridSearchCV**, ensuring the hyperparameters are fine-tuned for the unique structure of the Myanmar language.
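Soft voting averages the class probabilities of the individual models and picks the class with the highest mean. The sketch below shows that rule in plain Python with made-up probabilities; `soft_vote` and the three example rows are illustrative assumptions, not outputs of the released model.

```python
def soft_vote(per_model_probs):
    """Average class probabilities across models; return (label, averaged_probs)."""
    n_models = len(per_model_probs)
    n_classes = len(per_model_probs[0])
    averaged = [
        sum(probs[c] for probs in per_model_probs) / n_models
        for c in range(n_classes)
    ]
    # The predicted label is the index of the largest averaged probability.
    label = max(range(n_classes), key=lambda c: averaged[c])
    return label, averaged


# Hypothetical [P(formal), P(colloquial)] rows from three models.
example = [
    [0.9, 0.1],  # one model is very confident the text is formal
    [0.4, 0.6],  # the other two lean slightly colloquial
    [0.3, 0.7],
]
label, averaged = soft_vote(example)
print(label, [round(p, 2) for p in averaged])  # 0 [0.53, 0.47]
```

In this example a majority (hard) vote would return 1, while soft voting returns 0 because one model is much more confident than the other two. In scikit-learn this behaviour corresponds to `VotingClassifier(voting="soft")`, which requires every estimator to expose `predict_proba` (for `SVC`, that means `probability=True`).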
## Evaluation Results
The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split).
### Metrics
| Metric | Score |
|---|---|
| **Accuracy** | **96.00%** |
| **Macro F1-Score** | **0.96** |
### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |
### Evaluation Breakdown (Confusion Matrix)
The following table illustrates how the model performed on 100 unseen test sentences:
| | Predicted Formal | Predicted Colloquial |
|---|:---:|:---:|
| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |
**Key Insights from the Matrix:**
* **True Positives (Formal):** 37 formal sentences were correctly identified.
* **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
* **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
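As a consistency check, the reported metrics can be recomputed from the confusion matrix alone. The sketch below does this in plain Python; the variable and function names are illustrative, and only the four counts come from the table above.

```python
# Confusion matrix counts: outer key is the actual class, inner key the predicted class.
matrix = {
    "formal": {"formal": 37, "colloquial": 3},
    "colloquial": {"formal": 1, "colloquial": 59},
}


def class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class."""
    tp = matrix[cls][cls]
    fn = sum(n for pred, n in matrix[cls].items() if pred != cls)
    fp = sum(row[cls] for actual, row in matrix.items() if actual != cls)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row.values()) for row in matrix.values())
accuracy = sum(matrix[c][c] for c in matrix) / total
scores = {c: class_metrics(matrix, c) for c in matrix}
# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
for cls, (p, r, f1) in scores.items():
    print(f"{cls}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

For example, Formal precision is 37 / (37 + 1) and Formal recall is 37 / (37 + 3), which round to the values in the classification report.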
### Error Analysis (Ambiguity Handling)
In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
## How to Use
> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.
```Python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts)  # Get confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    # The labels 0 and 1 double as column indices into the probability row.
    confidence = prob[pred] * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
```
---
## Beyond Classification: Style Transfer
Once you have identified the style of your text using **myX-StyleClassifier**, you can use our transformation models to switch between registers:
* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** Convert detected Spoken text into formal Written prose.
* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** Transform detected Written text into natural Spoken dialogue.
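The routing step can be sketched as a lookup from the predicted label to the matching repository. The repo IDs and the 0/1 label convention come from this card; `TRANSFER_MODELS` and `pick_transfer_model` are hypothetical names, and loading or running the transfer models is deliberately left out because their interface is not described here.

```python
# Label convention from this card: 0 = Written/Formal, 1 = Spoken/Colloquial.
TRANSFER_MODELS = {
    1: "DatarrX/myX-TransStyle-S2W",  # spoken input -> written output
    0: "DatarrX/myX-TransStyle-W2S",  # written input -> spoken output
}


def pick_transfer_model(predicted_label):
    """Return the repo ID of the model that converts text away from its detected style."""
    # int() also accepts NumPy integer labels as returned by model.predict().
    return TRANSFER_MODELS[int(predicted_label)]


print(pick_transfer_model(1))  # DatarrX/myX-TransStyle-S2W
```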
---
## Intended Use & Limitations
### Use Cases
- **Style Checking**: Automating the detection of informal language in professional documents.
- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.
### Limitations
- The model may struggle with internet slang or ancient literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may receive lower confidence scores.
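One way to work with low-confidence predictions is to treat them as ambiguous instead of forcing a label. The sketch below assumes a probability row ordered as `[P(formal), P(colloquial)]`; the helper name and the 0.75 threshold are hypothetical and would need tuning on your own data.

```python
def label_with_threshold(probs, threshold=0.75):
    """Map [P(formal), P(colloquial)] to a label, or "Ambiguous" below the threshold."""
    labels = ["Written/Formal", "Spoken/Colloquial"]
    best = max(range(len(probs)), key=lambda i: probs[i])
    # Refuse to commit when the winning class is not confident enough.
    if probs[best] < threshold:
        return "Ambiguous"
    return labels[best]


print(label_with_threshold([0.92, 0.08]))  # Written/Formal
print(label_with_threshold([0.55, 0.45]))  # Ambiguous
```

Each row returned by `model.predict_proba(...)` can be passed to this helper directly.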
## Citation
### BibTeX
```BibTeX
@misc{myx_styleclassifier_2026,
author = {Khant Sint Heinn (Kalix Louis)},
title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
year = {2026},
publisher = {Hugging Face},
organization = {DatarrX},
howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
```
---
## About the Author
**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
**Connect with the Author:**
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)
---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*