---
license: mit
datasets:
- DatarrX/Myanmar-Style-Classification-Corpus
language:
- my
pipeline_tag: text-classification
metrics:
- f1
- accuracy
- precision
- recall
library_name: sklearn
---
# πŸ“ myX-StyleClassifier: A Classifier for Myanmar Spoken (α€•α€Όα€±α€¬α€Ÿα€”α€Ί) and Written (α€›α€±α€Έα€Ÿα€”α€Ί) Styles
**myX-StyleClassifier** is a high-performance Machine Learning model developed by **Khant Sint Heinn** under, **DatarrX** to classify Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.
## Model Details
- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-ထက်စ်](https://huggingface.co/DatarrX)
- **Model Type:** Ensemble Machine Learning (Voting Classifier)
- **Language(s):** Burmese (Myanmar)
- **License:** MIT
- **Trained on:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)
## Training Methodology
To go beyond simple keyword matching, the model combines character-level TF-IDF features with a soft-voting ensemble of three classifiers.
### 1. Feature Engineering
The model uses a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer over character-level n-grams in the range **(2, 4)**. This lets it capture Myanmar grammatical suffixes (e.g., "...သည်" vs. "...တယ်") and other structural patterns without requiring a custom tokenizer.
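To make the feature space concrete, here is a minimal pure-Python sketch of character n-gram extraction over the range (2, 4). It only illustrates which substrings become features; it is not the model's vectorizer, and the helper name `char_ngrams` is hypothetical. scikit-learn's `TfidfVectorizer(analyzer="char", ngram_range=(2, 4))` produces comparable n-grams and additionally applies TF-IDF weighting.

```python
def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> list[str]:
    """Return every substring of length n_min..n_max, ordered by n, then position."""
    return [
        text[i : i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


# The two example sentences used elsewhere in this card.
formal = "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။"
colloquial = "ငါ ကျောင်းသွားမလို့။"

formal_grams = set(char_ngrams(formal))
colloquial_grams = set(char_ngrams(colloquial))

# The formal ending "သည်" is three code points long, so it shows up as a
# 3-gram feature of the formal sentence but not of the colloquial one.
print("သည်" in formal_grams, "သည်" in colloquial_grams)

# N-grams that occur in both sentences carry little style information.
print(len(formal_grams & colloquial_grams) > 0)
```

Because the n-grams are taken over raw code points, no word segmentation step is needed, which is the property the paragraph above relies on.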
### 2. Ensemble Architecture
We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:
* **Logistic Regression:** Uses `C=10.0` for high-precision linear separation.
* **Support Vector Machine (SVC):** Provides robust decision boundaries in the high-dimensional text feature space.
* **Random Forest:** Captures non-linear relationships and the importance of specific n-gram patterns.
The final configuration was selected with **GridSearchCV**, which tunes the hyperparameters by cross-validation on the Myanmar training corpus.
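The architecture described above could be wired together roughly as follows. This is a sketch under stated assumptions, not the released training script: only the character n-gram range `(2, 4)`, soft voting, the three estimator types, `C=10.0` for logistic regression, and the use of `GridSearchCV` come from this card. The `analyzer="char"` setting, the SVC and Random Forest settings, the search grid, and the toy data (the two example sentences from this card, repeated) are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data only: the real model was trained on the MSCC dataset.
formal = "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။"
colloquial = "ငါ ကျောင်းသွားမလို့။"
texts = [formal, colloquial] * 20
labels = [0, 1] * 20  # 0 = Written/Formal, 1 = Spoken/Colloquial

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        # Soft voting averages class probabilities, so SVC needs probability=True.
        ("svc", SVC(kernel="linear", probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

pipeline = Pipeline(
    [
        # Assumed analyzer; the card only states "character-level".
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
        ("clf", ensemble),
    ]
)

# Hypothetical grid: the values actually searched are not documented here.
search = GridSearchCV(pipeline, {"clf__lr__C": [1.0, 10.0]}, cv=2, scoring="f1_macro")
search.fit(texts, labels)

predictions = search.predict([formal, colloquial])
probabilities = search.predict_proba([formal, colloquial])
print(predictions, probabilities.round(2))
```

Wrapping the vectorizer and the ensemble in one `Pipeline` means the fitted object accepts raw strings, which matches how the published `model.joblib` is called in the usage section.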
## Evaluation Results
The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split).
### Metrics
| Metric | Score |
|---|---|
| **Accuracy** | **96.00%** |
| **Macro F1-Score** | **0.96** |
### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |
### Evaluation Breakdown (Confusion Matrix)
The following table illustrates how the model performed on 100 unseen test sentences:
| | Predicted Formal | Predicted Colloquial |
|---|:---:|:---:|
| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |
**Key Insights from the Matrix:**
* **True Positives (Formal):** 37 formal sentences were correctly identified.
* **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
* **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
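The per-class figures in the Classification Report can be re-derived from the confusion matrix. The short check below does this in plain Python; the function name `per_class_metrics` is illustrative. For example, Formal recall is 37/40 = 0.925 (reported as 0.93), and the macro F1 of about 0.958 rounds to the reported 0.96.

```python
# Confusion matrix from this card: rows = actual, columns = predicted,
# class order = (Formal, Colloquial).
matrix = [[37, 3],
          [1, 59]]


def per_class_metrics(cm: list[list[int]], k: int) -> tuple[float, float, float]:
    """Precision, recall and F1 for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column total
    actual_k = sum(cm[k])                    # row total (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


formal = per_class_metrics(matrix, 0)
colloquial = per_class_metrics(matrix, 1)
accuracy = (matrix[0][0] + matrix[1][1]) / sum(map(sum, matrix))
macro_f1 = (formal[2] + colloquial[2]) / 2

print(f"accuracy={accuracy:.2f}, macro F1={macro_f1:.3f}")
# accuracy=0.96, macro F1=0.958
```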
### Error Analysis (Ambiguity Handling)
In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
## How to Use
> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.
```python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the ensemble model
model = joblib.load(checkpoint_path)

# 3. Predict styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts)  # Per-class confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    # Indexing by the label works because the class labels are 0 and 1
    confidence = prob[pred] * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
```
---
## 🔄 Beyond Classification: Style Transfer
Once you have identified the style of your text using **myX-StyleClassifier**, you can use our transformation models to switch between registers:
* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** Convert detected Spoken text into formal Written prose.
* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** Transform detected Written text into natural Spoken dialogue.
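As a sketch of how the classifier's output could drive this hand-off, the hypothetical helper below maps a predicted label to the repository of the model that converts text to the opposite register. It only selects a repository ID; loading and running the style-transfer models is not covered by this card and is not shown.

```python
# Label convention from this card: 0 = Written/Formal, 1 = Spoken/Colloquial.
TRANSFER_MODEL_BY_LABEL = {
    1: "DatarrX/myX-TransStyle-S2W",  # spoken input  -> written output
    0: "DatarrX/myX-TransStyle-W2S",  # written input -> spoken output
}


def pick_transfer_model(predicted_label: int) -> str:
    """Return the repo ID of the model that converts text to the other register."""
    try:
        return TRANSFER_MODEL_BY_LABEL[int(predicted_label)]
    except KeyError:
        raise ValueError(f"Unknown style label: {predicted_label!r}") from None


print(pick_transfer_model(1))
# DatarrX/myX-TransStyle-S2W
```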
---
## Intended Use & Limitations
### Use Cases
- **Style Checking**: Automating the detection of informal language in professional documents.
- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.
### Limitations
- The model may struggle with internet slang or ancient literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.
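One way to act on these limitations is to treat low-confidence predictions as undecided rather than forcing a label. The helper below is a sketch: the name `label_with_threshold` and the 0.75 cut-off are illustrative assumptions, not part of the released model, and a suitable threshold would need to be chosen on validation data. It expects one row of `predict_proba` output ordered as (Formal, Colloquial).

```python
from typing import Sequence

LABELS = ("Written/Formal", "Spoken/Colloquial")  # index 0 and 1, as in this card


def label_with_threshold(probs: Sequence[float], threshold: float = 0.75) -> str:
    """Return the style label, or "Ambiguous" if the top probability is below threshold."""
    if len(probs) != len(LABELS):
        raise ValueError("expected one probability per class")
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best] if probs[best] >= threshold else "Ambiguous"


print(label_with_threshold([0.93, 0.07]))  # Written/Formal
print(label_with_threshold([0.55, 0.45]))  # Ambiguous
```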
## Citation
### BibTeX
```BibTeX
@misc{myx_styleclassifier_2026,
author = {Khant Sint Heinn (Kalix Louis)},
title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
year = {2026},
publisher = {Hugging Face},
organization = {DatarrX},
howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
```
---
## About the Author
**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
**Connect with the Author:**
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)
---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*