| --- |
| license: mit |
| datasets: |
| - DatarrX/Myanmar-Style-Classification-Corpus |
| language: |
| - my |
| pipeline_tag: text-classification |
| metrics: |
| - f1 |
| - accuracy |
| - precision |
| - recall |
| library_name: sklearn |
| --- |
| |
| # myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles |
|
|
| **myX-StyleClassifier** is a machine learning model developed by **Khant Sint Heinn** under **DatarrX** that classifies Myanmar (Burmese) text into two linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**. |
|
|
| ## Model Details |
|
|
| - **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis) |
| - **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX) |
| - **Model Type:** Ensemble Machine Learning (Voting Classifier) |
| - **Language(s):** Burmese (Myanmar) |
| - **License:** MIT |
| - **Trained on:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus) |
|
|
| ## Training Methodology |
|
|
| To go beyond simple keyword matching, the model combines character-level features with an **ensemble** of three classifiers. |
|
|
| ### 1. Feature Engineering |
| The model uses a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer over character-level n-grams in the range **(2, 4)**. This lets it capture Myanmar grammatical suffixes (e.g., "...သည်" vs. "...တယ်") and other structural patterns without requiring a custom tokenizer. |
|
|
| ### 2. Ensemble Architecture |
| We implemented a **Soft Voting Classifier** that combines three algorithms: |
| * **Logistic Regression:** Uses `C=10.0` (weak regularization) for linear separation. |
| * **Support Vector Machine (SVC):** Provides robust decision boundaries in the high-dimensional text feature space. |
| * **Random Forest:** Captures non-linear relationships and the importance of specific features. |
|
|
| The final configuration was selected with **GridSearchCV**, which tunes the hyperparameters by cross-validated grid search. |
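The architecture described above could be assembled roughly as follows. This is a minimal sketch, not the released training script: the toy data, step names (`tfidf`, `vote`, `lr`, `svc`, `rf`), the linear SVC kernel, the forest size, and the tiny search grid are all assumptions. Only soft voting, the three algorithm families, `C=10.0`, the (2, 4) n-gram range, and the use of GridSearchCV come from this card.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data (hypothetical): verb stems plus a formal or colloquial ending.
stems = ["သွား", "စား", "လာ", "နေ"]
formal = [s + e for s in stems for e in ("သည်။", "ပါသည်။")]
colloquial = [s + e for s in stems for e in ("တယ်။", "ပါတယ်။")]
texts = formal + colloquial
labels = [0] * len(formal) + [1] * len(colloquial)  # 0 = formal, 1 = colloquial

# Soft voting averages the class probabilities of the three estimators,
# so SVC needs probability=True.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        ("svc", SVC(kernel="linear", probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("vote", ensemble),
])

# Tiny illustrative grid; the real search space is not documented here.
search = GridSearchCV(pipeline, {"vote__lr__C": [1.0, 10.0]}, cv=2)
search.fit(texts, labels)

proba = search.predict_proba(["စားပါသည်။", "စားတယ်။"])
print(search.best_params_)
print(proba.shape)  # one row per input, one column per class
```

Wrapping the vectorizer and the ensemble in one `Pipeline` means the fitted object accepts raw strings, which matches how the published model is called in the usage section.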
|
|
| ## Evaluation Results |
|
|
| The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split). |
|
|
| ### Metrics |
| | Metric | Score | |
| |---|---| |
| | **Accuracy** | **96.00%** | |
| | **Macro F1-Score** | **0.96** | |
|
|
| ### Classification Report |
| | Class | Precision | Recall | F1-Score | Support | |
| |---|---|---|---|---| |
| | **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 | |
| | **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 | |
|
|
| ### Evaluation breakdown (Confusion Matrix) |
|
|
| The following table illustrates how the model performed on 100 unseen test sentences: |
|
|
| | | Predicted Formal | Predicted Colloquial | |
| |---|:---:|:---:| |
| | **Actual Formal** | **37** (Correct) | **3** (Misclassified) | |
| | **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) | |
|
|
| **Key Insights from the Matrix:** |
| * **True Positives (Formal):** 37 formal sentences were correctly identified. |
| * **True Positives (Colloquial):** 59 colloquial sentences were correctly identified. |
| * **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style. |
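The reported scores can be cross-checked directly from the four confusion-matrix counts. The short sketch below recomputes accuracy, per-class precision, recall, and F1, and the macro F1 using the standard definitions; the helper name `prf` is only illustrative.

```python
# Counts taken from the confusion matrix in this section.
formal_as_formal, formal_as_colloquial = 37, 3
colloquial_as_formal, colloquial_as_colloquial = 1, 59


def prf(tp, fp, fn):
    """Return (precision, recall, F1) for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = (formal_as_formal + formal_as_colloquial
         + colloquial_as_formal + colloquial_as_colloquial)
accuracy = (formal_as_formal + colloquial_as_colloquial) / total

# For each class, false positives come from the other class's row.
formal = prf(formal_as_formal, colloquial_as_formal, formal_as_colloquial)
colloquial = prf(colloquial_as_colloquial, formal_as_colloquial, colloquial_as_formal)
macro_f1 = (formal[2] + colloquial[2]) / 2

print(f"accuracy = {accuracy:.4f}")
print("formal     P/R/F1 =", [f"{v:.4f}" for v in formal])
print("colloquial P/R/F1 =", [f"{v:.4f}" for v in colloquial])
print(f"macro F1 = {macro_f1:.4f}")
```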
|
|
| ### Error Analysis (Ambiguity Handling) |
| In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation. |
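One possible way to surface such ambiguous sentences downstream is to treat low-confidence predictions as a third outcome. The helper below is a hypothetical sketch, not part of the released model: the function name, the 0.75 threshold, and the example probability rows are all assumptions, and it expects rows ordered as `[P(formal), P(colloquial)]`.

```python
def style_or_ambiguous(prob_row, threshold=0.75):
    """Map one probability row [P(formal), P(colloquial)] to a label.

    The 0.75 threshold is a hypothetical value, not one used by the authors.
    """
    p_formal, p_colloquial = prob_row
    if max(p_formal, p_colloquial) < threshold:
        return "Ambiguous"
    return "Written/Formal" if p_formal > p_colloquial else "Spoken/Colloquial"


# Hand-written example rows (not real model output).
for row in ([0.95, 0.05], [0.40, 0.60], [0.10, 0.90]):
    print(row, "->", style_or_ambiguous(row))
```

In practice the threshold would need to be tuned on held-out data before relying on it.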
|
|
|
|
| ## How to Use |
| > To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed. |
| |
| ```Python |
| import joblib |
| from huggingface_hub import hf_hub_download |
|
|
| # 1. Download the model from Hugging Face Hub |
| repo_id = "DatarrX/myX-StyleClassifier" |
| filename = "model.joblib" |
| checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename) |
|
|
| # 2. Load the Ensemble Model |
| model = joblib.load(checkpoint_path) |
| |
| # 3. Predict Styles |
| # 0 = Written/Formal, 1 = Spoken/Colloquial |
| sample_texts = [ |
|     "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal: "I go to school." |
|     "ငါ ကျောင်းသွားမလို့။",  # Colloquial: "I'm about to go to school." |
|     "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite: "Please wait a moment." |
| ] |
| |
| predictions = model.predict(sample_texts) |
| probabilities = model.predict_proba(sample_texts) # Get confidence scores |
| |
| for text, pred, prob in zip(sample_texts, predictions, probabilities): |
|     label = "Spoken/Colloquial" if pred == 1 else "Written/Formal" |
|     confidence = prob[pred] * 100  # assumes model.classes_ is [0, 1] |
|     print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)") |
| ``` |
| --- |
| |
| ## Beyond Classification: Style Transfer
|
|
| Once you have identified the style of your text using **myX-StyleClassifier**, you can use our transformation models to switch between registers: |
|
|
| * **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** Convert detected Spoken text into formal Written prose. |
| * **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** Transform detected Written text into natural Spoken dialogue. |
|
|
| --- |
|
|
| ## Intended Use & Limitations |
|
|
| ### Use Cases |
| - **Style Checking**: Automating the detection of informal language in professional documents. |
| - **Chatbot Alignment**: Ensuring AI responses match the user's preferred register. |
| - **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models. |
|
|
| ### Limitations |
| - The model may struggle with internet slang or archaic literary Burmese that deviates from modern standard registers. |
| - Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores. |
|
|
| ## Citation |
|
|
| ### BibTeX |
| ```BibTeX |
| @misc{myx_styleclassifier_2026, |
| author = {Khant Sint Heinn (Kalix Louis)}, |
| title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model}, |
| year = {2026}, |
| publisher = {Hugging Face}, |
| organization = {DatarrX}, |
| howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier} |
| } |
| ``` |
| --- |
|
|
| ## About the Author |
|
|
| **Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology. |
|
|
| He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications. |
|
|
| Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications. |
|
|
| His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation. |
|
|
| **Connect with the Author:** |
| [GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis) |
|
|
| --- |
| *Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.* |