 - precision
 - recall
 library_name: sklearn
+---
+# 📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles
+**myX-StyleClassifier** is a high-performance machine learning model developed by **Khant Sint Heinn** under **DatarrX** that classifies Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.
+## Model Details
+- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
+- **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
+- **Model Type:** Ensemble Machine Learning (Voting Classifier)
+- **Language(s):** Burmese (Myanmar)
+- **License:** MIT
+- **Parent Dataset:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)
+## Training Methodology
+To achieve robust performance beyond simple keyword matching, the model was trained with an **ensemble learning** approach: character-level TF-IDF features feed a soft-voting ensemble of three classifiers.
+### 1. Feature Engineering
+The model utilizes a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer with a character-level N-gram range of **(2, 4)**. This allows the model to capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.
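To make the character n-gram idea concrete, here is a minimal pure-Python sketch (not the model's actual vectorizer) that enumerates n-grams for n = 2 to 4 and checks which sentence-final suffix each example contains. The helper `char_ngrams` and the colloquial sentence are illustrative assumptions. N-grams are counted in Unicode code points, so a suffix such as "သည်" (three code points) appears as a single 3-gram.

```python
def char_ngrams(text, n_min=2, n_max=4):
    """Return every contiguous character n-gram with n_min <= n <= n_max."""
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]

# Formal example taken from this model card.
formal = "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။"
# Hypothetical colloquial example, written only for this illustration.
colloquial = "ငါ ကျောင်းသွားတယ်။"

formal_grams = set(char_ngrams(formal))
colloquial_grams = set(char_ngrams(colloquial))

# The formal suffix shows up only in the formal sentence, and vice versa.
print("သည်" in formal_grams, "သည်" in colloquial_grams)
print("တယ်" in colloquial_grams, "တယ်" in formal_grams)
```

A TF-IDF vectorizer then weights each such n-gram by how distinctive it is across the corpus, which is why no word tokenizer is needed.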
+### 2. Ensemble Architecture
+We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:
+* **Logistic Regression:** Tuned to `C=10.0` for high-precision linear separation.
+* **Support Vector Machine (SVC):** Provides robust decision boundaries in the high-dimensional text space.
+* **Random Forest:** Captures non-linear relationships and the importance of specific words.
+The final configuration was selected via **GridSearchCV**, so the hyperparameters were chosen by cross-validated search rather than set by hand.
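The architecture described above could be assembled roughly as follows. This is a sketch under stated assumptions, not the released training script: the step names (`tfidf`, `vote`, `lr`, `svc`, `rf`), the default RBF kernel for the SVC, the forest size, the search grid, the scoring metric, and the toy sentences are all hypothetical. Only the (2, 4) character n-gram range, `C=10.0`, soft voting, and the use of `GridSearchCV` come from this card.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data, NOT from MSCC. 0 = Written/Formal, 1 = Spoken/Colloquial.
texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",
    "သူသည် စာအုပ်ကို ဖတ်သည်။",
    "မိုးရွာနေပါသည်။",
    "ကျွန်ုပ်တို့သည် ထမင်းစားကြသည်။",
    "ငါ ကျောင်းသွားမလို့။",
    "သူ စာအုပ်ဖတ်တယ်။",
    "မိုးရွာနေတယ်။",
    "ငါတို့ ထမင်းစားကြတယ်။",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        # Soft voting averages class probabilities, so SVC needs probability=True.
        ("svc", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("vote", ensemble),
])

# Assumed grid: only the logistic-regression C is searched in this sketch.
search = GridSearchCV(
    pipeline,
    param_grid={"vote__lr__C": [1.0, 10.0]},
    cv=StratifiedKFold(n_splits=2),
    scoring="f1_macro",
)
search.fit(texts, labels)

best_model = search.best_estimator_
probabilities = best_model.predict_proba(texts)
print(search.best_params_)
print(probabilities.shape)
```

With only eight sentences the resulting scores are meaningless; the point is the wiring. Because the vectorizer sits inside the pipeline, each cross-validation fold refits TF-IDF on its own training split.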
+## Evaluation Results
+The model was evaluated on a blind test set of **100 unseen sentences** that were excluded from the training/validation split.
+### Metrics
+| Metric | Score |
+|---|---|
+| **Accuracy** | **96.00%** |
+| **Macro F1-Score** | **0.96** |
+### Classification Report
+| Class | Precision | Recall | F1-Score | Support |
+|---|---|---|---|---|
+| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
+| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |
+### Evaluation Breakdown (Confusion Matrix)
+The following table illustrates how the model performed on 100 unseen test sentences:
+| | Predicted Formal | Predicted Colloquial |
+|---|:---:|:---:|
+| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
+| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |
+**Key Insights from the Matrix:**
+* **Correct (Formal):** 37 of 40 formal sentences were correctly identified.
+* **Correct (Colloquial):** 59 of 60 colloquial sentences were correctly identified.
+* **Misclassifications:** Only 4 of 100 sentences were misclassified (3 formal sentences predicted as colloquial, 1 colloquial sentence predicted as formal), primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
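The reported scores can be re-derived from the confusion matrix alone. The short sketch below does that arithmetic in plain Python; the helper name `per_class_metrics` is made up for this example.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted.
# Index 0 = Formal, index 1 = Colloquial.
matrix = [
    [37, 3],
    [1, 59],
]
class_names = ["Formal", "Colloquial"]

def per_class_metrics(matrix, k):
    """Precision, recall and F1 for class k of a square confusion matrix."""
    tp = matrix[k][k]
    fp = sum(row[k] for i, row in enumerate(matrix) if i != k)  # predicted k, actually other
    fn = sum(matrix[k]) - tp                                    # actually k, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[k][k] for k in range(len(matrix))) / total
reports = [per_class_metrics(matrix, k) for k in range(len(matrix))]
macro_f1 = sum(f1 for _, _, f1 in reports) / len(reports)

print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
for name, (p, r, f1) in zip(class_names, reports):
    print(f"{name}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
```

For the Formal class this gives precision 37/38 and recall 37/40, which round to the values in the classification report.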
+### Error Analysis (Ambiguity Handling)
+In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
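One possible way to surface such hybrid sentences downstream is to treat low-confidence predictions as "Ambiguous" instead of forcing a label. The sketch below is an assumption, not a feature of the released model: the function name, the 0.70 threshold, and the sample probability rows are all hypothetical, and it assumes the probability columns are ordered as [Formal, Colloquial].

```python
def style_or_ambiguous(prob_row, threshold=0.70):
    """Map one row of class probabilities [P(formal), P(colloquial)] to a label.

    `threshold` is a hypothetical cut-off chosen for illustration only.
    """
    labels = ("Written/Formal", "Spoken/Colloquial")
    best = max(range(len(prob_row)), key=lambda k: prob_row[k])
    if prob_row[best] < threshold:
        return "Ambiguous"
    return labels[best]

# Made-up probability rows, used only to exercise the function.
rows = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]
results = [style_or_ambiguous(row) for row in rows]
print(results)
```

A suitable threshold would need to be chosen on held-out data; the value here is a placeholder.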
+## How to Use
+> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.
+```Python
+import joblib
+from huggingface_hub import hf_hub_download
+# 1. Download the model from Hugging Face Hub
+repo_id = "DatarrX/myX-StyleClassifier"
+filename = "model.joblib"
+checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)
+# 2. Load the Ensemble Model
+model = joblib.load(checkpoint_path)
+# 3. Predict Styles
+# 0 = Written/Formal, 1 = Spoken/Colloquial
+sample_texts = [
+    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။", # Formal
+    "ငါ ကျောင်းသွားမလို့။",              # Colloquial
+    "ခဏစောင့်ပေးပါ။"                   # Ambiguous/Polite
+]
+predictions = model.predict(sample_texts)
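+probabilities = model.predict_proba(sample_texts)  # One row per text; columns follow model.classes_
+for text, pred, prob in zip(sample_texts, predictions, probabilities):
+    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
+    # Look up the predicted class's column rather than assuming label == column index
+    confidence = prob[list(model.classes_).index(pred)] * 100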
+    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
+```
+## Intended Use & Limitations
+### Use Cases
+- **Style Checking**: Automating the detection of informal language in professional documents.
+- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
+- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.
+### Limitations
+- The model may struggle with internet slang or ancient literary Burmese that deviates from modern standard registers.
+- Sentences that lack specific grammatical particles (suffixes) may receive lower confidence scores.
+## Citation
+### BibTeX
+```BibTeX
+@misc{myx_styleclassifier_2026,
+  author = {Khant Sint Heinn (Kalix Louis)},
+  title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
+  year = {2026},
+  publisher = {Hugging Face},
+  organization = {DatarrX},
+  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
+}
+```
+---
+## About the Author
+**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
+He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
+Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
+His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
+**Connect with the Author:**
+[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)