Update README.md
README.md
CHANGED
@@ -11,4 +11,137 @@ metrics:
- precision
- recall
library_name: sklearn
---

# 📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles

**myX-StyleClassifier** is a machine learning model developed by **Khant Sint Heinn** under **DatarrX** that classifies Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
- **Model Type:** Ensemble Machine Learning (Voting Classifier)
- **Language(s):** Burmese (Myanmar)
- **License:** MIT
- **Parent Dataset:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)

## Training Methodology

To achieve robust performance beyond simple keyword matching, the model combines character-level TF-IDF features with an **ensemble learning** approach.

### 1. Feature Engineering
The model uses a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer with a character-level N-gram range of **(2, 4)**. This allows the model to capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.
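
The feature extraction described above can be sketched as follows. This is a minimal illustration, not the released training code: the README states only the character-level (2, 4) range, so `analyzer="char"` and all other settings left at their defaults are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character-level TF-IDF over 2- to 4-character n-grams.
# analyzer="char" is an assumption; "char_wb" would be another plausible choice.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))

texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # formal example from this README
    "ငါ ကျောင်းသွားမလို့။",  # colloquial example from this README
]
features = vectorizer.fit_transform(texts)

# The three characters before the final "။" of the formal sentence form
# the sentence-final suffix; it becomes a feature without any tokenizer.
formal_suffix = texts[0][-4:-1]
suffix_is_feature = formal_suffix in vectorizer.vocabulary_
```

Because the n-grams are sliced from raw characters, sentence-final particles become features directly, which is why no word segmentation step is needed.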

### 2. Ensemble Architecture
We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:
* **Logistic Regression:** Optimized with `C=10.0` for high-precision linear separation.
* **Support Vector Machine (SVC):** Provides robust decision boundaries in high-dimensional text space.
* **Random Forest:** Captures non-linear relationships and the importance of specific features.

The final configuration was selected via cross-validated grid search (**GridSearchCV**), so the hyperparameters are tuned on Myanmar text rather than left at library defaults.
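
A soft-voting pipeline of this shape could look like the sketch below. It is an illustration under stated assumptions, not the author's training script: the toy sentences, the step names (`tfidf`, `clf`, `lr`, `svc`, `rf`), the parameter grid, and every hyperparameter other than `C=10.0` are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data (0 = formal, 1 = colloquial), repeated so that
# cross-validation has enough rows. The real model is trained on MSCC.
formal = ["သူ စာဖတ်သည်။", "မိုးရွာသည်။", "သူ အိပ်သည်။", "ကျွန်တော် သွားသည်။"]
colloquial = ["သူ စာဖတ်တယ်။", "မိုးရွာတယ်။", "သူ အိပ်တယ်။", "ကျွန်တော် သွားတယ်။"]
texts = (formal + colloquial) * 3
labels = ([0] * len(formal) + [1] * len(colloquial)) * 3

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        # probability=True lets SVC contribute class probabilities to soft voting.
        ("svc", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average predicted probabilities across the three models
)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("clf", ensemble),
])

# Assumed search space; the grid actually used is not documented here.
param_grid = {"clf__lr__C": [1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=3, scoring="f1_macro")
search.fit(texts, labels)

best_params = search.best_params_
probabilities = search.predict_proba(["သူ စာဖတ်တယ်။"])
```

Wrapping the vectorizer and the ensemble in one `Pipeline` means the fitted object accepts raw strings, which matches how the released model is called in the usage section.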

## Evaluation Results

The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split).

### Metrics
| Metric | Score |
|---|---|
| **Accuracy** | **96.00%** |
| **Macro F1-Score** | **0.96** |

### Classification Report
| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |

### Evaluation Breakdown (Confusion Matrix)

The following table shows how the model performed on the 100 unseen test sentences:

| | Predicted Formal | Predicted Colloquial |
|---|:---:|:---:|
| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |

**Key Insights from the Matrix:**
* **True Positives (Formal):** 37 formal sentences were correctly identified.
* **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
* **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
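
The reported scores can be cross-checked directly from the confusion matrix. The sketch below uses only the four counts from the table; the variable names are illustrative.

```python
import numpy as np

# Rows are actual classes, columns are predicted classes, in the order
# [Formal (0), Colloquial (1)], copied from the confusion matrix.
confusion = np.array([[37, 3],
                      [1, 59]])

true_positives = np.diag(confusion)
precision = true_positives / confusion.sum(axis=0)  # divide by predicted totals
recall = true_positives / confusion.sum(axis=1)     # divide by actual totals
f1 = 2 * precision * recall / (precision + recall)

accuracy = true_positives.sum() / confusion.sum()
macro_f1 = f1.mean()  # unweighted mean of the per-class F1 scores
```

The row sums (40 and 60) match the support column, and the resulting precision, recall, and F1 values agree with the classification report up to rounding.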

### Error Analysis (Ambiguity Handling)
In the 4 misclassified cases (4% of the test set), human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.


## How to Use
> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.

```python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts)  # Get confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    # Indexing by the label assumes the class order is [0, 1].
    confidence = prob[pred] * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
```

## Intended Use & Limitations

### Use Cases
- **Style Checking**: Automating the detection of informal language in professional documents.
- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.

### Limitations
- The model may struggle with internet slang or ancient literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.
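
Since ambiguous sentences tend to receive lower confidence, one way to handle them downstream is to flag predictions below a cutoff for manual review. The helper below is a hypothetical sketch: `flag_low_confidence` and the `0.75` threshold are illustrative choices, not part of the released model.

```python
import numpy as np

def flag_low_confidence(probabilities, threshold=0.75):
    """Return (predicted class indices, mask of predictions below threshold)."""
    probabilities = np.asarray(probabilities, dtype=float)
    predicted = probabilities.argmax(axis=1)   # most likely class per row
    confidence = probabilities.max(axis=1)     # probability of that class
    return predicted, confidence < threshold

# Hypothetical predict_proba output for three sentences.
example = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]
predicted, needs_review = flag_low_confidence(example)
```

In practice the input would be the array returned by `model.predict_proba(...)`, and the threshold would be chosen on held-out data.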

## Citation

### BibTeX
```bibtex
@misc{myx_styleclassifier_2026,
  author = {Khant Sint Heinn (Kalix Louis)},
  title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
  year = {2026},
  publisher = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
```
---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:**
[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)