---
license: mit
datasets:
- DatarrX/Myanmar-Style-Classification-Corpus
language:
- my
pipeline_tag: text-classification
metrics:
- f1
- accuracy
- precision
- recall
library_name: sklearn
---

# myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles

**myX-StyleClassifier** is a machine learning model developed by **Khant Sint Heinn** under **DatarrX** that classifies Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-ထက်စ်](https://huggingface.co/DatarrX)
- **Model Type:** Ensemble Machine Learning (Voting Classifier)
- **Language(s):** Burmese (Myanmar)
- **License:** MIT
- **Trained on:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)

## Training Methodology

To go beyond simple keyword matching, the model combines character-level TF-IDF features with a soft-voting ensemble of three classifiers.

### 1. Feature Engineering
The model uses a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer over character-level n-grams with an n-gram range of **(2, 4)**. This lets the model capture Myanmar grammatical suffixes (e.g., "...သည်" vs. "...တယ်") and other structural patterns without requiring a custom tokenizer.

### 2. Ensemble Architecture
We implemented a **Soft Voting Classifier**, which averages the class probabilities predicted by three diverse algorithms:
* **Logistic Regression:** Optimized with `C=10.0` for high-precision linear separation.
* **Support Vector Machine (SVC):** Provides robust decision boundaries in the high-dimensional text feature space.
* **Random Forest:** Captures non-linear relationships and the importance of specific n-gram features.

The final configuration was selected via **GridSearchCV**, a cross-validated search over candidate hyperparameter values.
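
The architecture above could be assembled roughly as follows. This is a minimal sketch, not the actual training script: the SVC kernel, the Random Forest size, `max_iter`, the parameter grid, and the toy sentences are all assumptions made for illustration, and only `C=10.0`, soft voting, and the (2, 4) character n-grams come from this card.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        ("svc", SVC(probability=True)),  # soft voting needs class probabilities
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("clf", ensemble),
])

# Toy placeholder data (NOT the MSCC corpus): verb stems with a written
# suffix (label 0) or a spoken suffix (label 1).
stems = ["သွား", "စား", "လာ", "နေ", "ဖတ်", "ရေး", "အိပ်", "ကြည့်"]
texts = [s + "သည်။" for s in stems] + [s + "တယ်။" for s in stems]
labels = [0] * len(stems) + [1] * len(stems)

# Hypothetical grid: the grid actually searched is not documented here.
param_grid = {"clf__lr__C": [1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)
print(search.best_params_)
```

Wrapping the vectorizer and the ensemble in one `Pipeline` means the fitted object accepts raw strings, which matches how the released model is called in the usage example.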

## Evaluation Results

The model was evaluated on a blind test set of **100 unseen sentences** (not included in the training/validation split).

### Metrics
| Metric | Score |
|---|---|
| **Accuracy** | **96.00%** |
| **Macro F1-Score** | **0.96** |

### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |

### Evaluation breakdown (Confusion Matrix)

The following table illustrates how the model performed on 100 unseen test sentences:

| | Predicted Formal | Predicted Colloquial |
|---|:---:|:---:|
| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |

**Key Insights from the Matrix:**
* **True Positives (Formal):** 37 formal sentences were correctly identified.
* **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
* **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
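
The scores in the classification report can be re-derived from the confusion matrix. The short sketch below does that arithmetic in plain Python; the helper name `per_class_metrics` is made up for this example.

```python
# Confusion matrix from the table above.
# Rows = actual class, columns = predicted class; index 0 = Formal, 1 = Colloquial.
confusion = [[37, 3],
             [1, 59]]

def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = matrix[cls][cls]
    fp = sum(row[cls] for row in matrix) - tp   # predicted cls, actually another class
    fn = sum(matrix[cls]) - tp                  # actually cls, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in confusion)
accuracy = (confusion[0][0] + confusion[1][1]) / total
formal = per_class_metrics(confusion, 0)
colloquial = per_class_metrics(confusion, 1)
macro_f1 = (formal[2] + colloquial[2]) / 2

print(f"accuracy   = {accuracy:.3f}")                      # 0.960
print("formal     = " + ", ".join(f"{v:.3f}" for v in formal))      # 0.974, 0.925, 0.949
print("colloquial = " + ", ".join(f"{v:.3f}" for v in colloquial))  # 0.952, 0.983, 0.967
print(f"macro F1   = {macro_f1:.3f}")                      # 0.958
```

Rounded to two decimals, these values agree with the reported accuracy, per-class scores, and macro F1.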

### Error Analysis (Ambiguity Handling)
In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
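
One possible way to surface such hybrid sentences in practice is to flag predictions whose top class probability is low. The sketch below assumes a `predict_proba`-style array as input; the function name `flag_ambiguous` and the 0.65 threshold are hypothetical choices, not values used or validated by the authors.

```python
import numpy as np

def flag_ambiguous(probabilities, threshold=0.65):
    """Return a boolean mask that is True where the top class probability is below threshold."""
    probabilities = np.asarray(probabilities, dtype=float)
    return probabilities.max(axis=1) < threshold

# Made-up probabilities for three sentences (columns: Formal, Colloquial).
example = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]
print(flag_ambiguous(example).tolist())  # [False, False, True]
```

Flagged sentences could then be routed to human review instead of being assigned a hard label.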


## How to Use
> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.

```Python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts) # Get confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    confidence = prob[pred] * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
```
---

## 🔄 Beyond Classification: Style Transfer

Once you have identified the style of your text using **myX-StyleClassifier**, you can use our transformation models to switch between registers:

* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** Convert detected Spoken text into formal Written prose.
* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** Transform detected Written text into natural Spoken dialogue.
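
The hand-off from classification to style transfer can be sketched as a simple lookup from the predicted label to the matching repository. The function name `pick_transfer_model` is hypothetical, and how the transfer models themselves are loaded and run is not covered by this card, so the sketch stops at choosing the repository ID.

```python
# Label convention from this card: 0 = Written/Formal, 1 = Spoken/Colloquial.
TRANSFER_REPOS = {
    1: "DatarrX/myX-TransStyle-S2W",  # detected Spoken  -> convert to Written
    0: "DatarrX/myX-TransStyle-W2S",  # detected Written -> convert to Spoken
}

def pick_transfer_model(predicted_label):
    """Return the repo ID of the model that converts text out of the detected style."""
    return TRANSFER_REPOS[int(predicted_label)]

print(pick_transfer_model(1))  # DatarrX/myX-TransStyle-S2W
print(pick_transfer_model(0))  # DatarrX/myX-TransStyle-W2S
```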

---

## Intended Use & Limitations

### Use Cases
- **Style Checking**: Automating the detection of informal language in professional documents.
- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.
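
For the dataset-filtering use case, a helper along these lines could keep only the sentences predicted as one style with enough confidence. It is a sketch under assumptions: `filter_by_style` and the 0.8 cutoff are illustrative, the loaded model is assumed to expose `predict_proba` and `classes_` like a scikit-learn classifier, and `_FixedProbaModel` is a stand-in so the example runs without downloading anything.

```python
import numpy as np

def filter_by_style(model, texts, target_label, min_confidence=0.8):
    """Keep texts assigned to target_label with probability >= min_confidence."""
    probabilities = np.asarray(model.predict_proba(texts))
    column = list(model.classes_).index(target_label)
    return [t for t, p in zip(texts, probabilities[:, column]) if p >= min_confidence]

class _FixedProbaModel:
    """Stand-in for the real classifier: returns preset probabilities."""
    classes_ = [0, 1]  # 0 = Written/Formal, 1 = Spoken/Colloquial

    def __init__(self, probabilities):
        self._probabilities = probabilities

    def predict_proba(self, texts):
        return self._probabilities[: len(texts)]

demo_texts = ["sentence A", "sentence B", "sentence C"]
demo_model = _FixedProbaModel([[0.9, 0.1], [0.3, 0.7], [0.15, 0.85]])
print(filter_by_style(demo_model, demo_texts, target_label=1))  # ['sentence C']
```

With the real model, `demo_model` would be replaced by the object returned from `joblib.load`.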

### Limitations
- The model may struggle with Internet Slang or Ancient Literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.

## Citation

### BibTeX
```BibTeX
@misc{myx_styleclassifier_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**  
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---
*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*