---
license: mit
datasets:
- DatarrX/Myanmar-Style-Classification-Corpus
language:
- my
pipeline_tag: text-classification
metrics:
- f1
- accuracy
- precision
- recall
library_name: sklearn
---

# 📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles

**myX-StyleClassifier** is a scikit-learn ensemble model developed by **Khant Sint Heinn** under **DatarrX** to classify Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.

## Model Details

- **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
- **Organization:** [DatarrX | ဒေတာ-ထက်စ်](https://huggingface.co/DatarrX)
- **Model Type:** Ensemble Machine Learning (Voting Classifier)
- **Language(s):** Burmese (Myanmar)
- **License:** MIT
- **Trained on:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)

## Training Methodology

To achieve robust performance beyond simple keyword matching, the model was trained using an **Advanced Ensemble Learning** approach.

### 1. Feature Engineering

The model uses a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer with a character-level N-gram range of **(2, 4)**. This lets the model capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.

### 2. Ensemble Architecture

We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:

* **Logistic Regression:** Optimized with `C=10.0` for high-precision linear separation.
* **Support Vector Machine (SVC):** Provides robust boundaries in high-dimensional text space.
* **Random Forest:** Captures non-linear relationships and specific word importance.
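
To make the soft-voting step concrete, here is a minimal pure-Python sketch of how three classifiers' probabilities could be combined. It assumes equal weights (the scikit-learn default when no weights are given); the probability values and the `model_probs` dictionary are made up for illustration and are not outputs of the released model.

```python
from collections import Counter

# Hypothetical per-model probabilities for one sentence, ordered as
# [P(Formal = 0), P(Colloquial = 1)]. These numbers are illustrative only.
model_probs = {
    "logistic_regression": [0.45, 0.55],
    "svc": [0.40, 0.60],
    "random_forest": [0.95, 0.05],
}


def soft_vote(probs_by_model):
    """Average class probabilities across models and pick the largest mean."""
    n_models = len(probs_by_model)
    n_classes = len(next(iter(probs_by_model.values())))
    mean_probs = [
        sum(p[c] for p in probs_by_model.values()) / n_models
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=lambda c: mean_probs[c])
    return label, mean_probs


def hard_vote(probs_by_model):
    """Each model votes for its own top class; the majority class wins."""
    votes = [
        max(range(len(p)), key=lambda c: p[c])
        for p in probs_by_model.values()
    ]
    return Counter(votes).most_common(1)[0][0]


soft_label, mean_probs = soft_vote(model_probs)
hard_label = hard_vote(model_probs)
print("soft vote:", soft_label, [round(p, 2) for p in mean_probs])
print("hard vote:", hard_label)
```

With these made-up numbers, two of the three models lean slightly toward class 1, so a majority (hard) vote picks Colloquial, while the averaged probabilities favor class 0 because one model is very confident. Soft voting lets a confident estimator outweigh two uncertain ones.
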

The final configuration was selected via **GridSearchCV**, ensuring the hyperparameters are fine-tuned for the unique structure of the Myanmar language.

## Evaluation Results

The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split).

### Metrics

| Metric | Score |
|---|---|
| **Accuracy** | **96.00%** |
| **Macro F1-Score** | **0.96** |

### Classification Report

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
| **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |

### Evaluation breakdown (Confusion Matrix)

The following table illustrates how the model performed on 100 unseen test sentences:

| | Predicted Formal | Predicted Colloquial |
|---|:---:|:---:|
| **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
| **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |

**Key Insights from the Matrix:**

* **True Positives (Formal):** 37 formal sentences were correctly identified.
* **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
* **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.

### Error Analysis (Ambiguity Handling)

In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.

## How to Use

> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.

```python
import joblib
from huggingface_hub import hf_hub_download

# 1. Download the model from Hugging Face Hub
repo_id = "DatarrX/myX-StyleClassifier"
filename = "model.joblib"
checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

# 2. Load the Ensemble Model
model = joblib.load(checkpoint_path)

# 3. Predict Styles
# 0 = Written/Formal, 1 = Spoken/Colloquial
sample_texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
]

predictions = model.predict(sample_texts)
probabilities = model.predict_proba(sample_texts)  # Get confidence scores

for text, pred, prob in zip(sample_texts, predictions, probabilities):
    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
    confidence = prob[pred] * 100
    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
```

---

## 🔄 Beyond Classification: Style Transfer

Once you have identified the style of your text using **myX-StyleClassifier**, you can use our transformation models to switch between registers:

* **[myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W):** Convert detected Spoken text into formal Written prose.
* **[myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S):** Transform detected Written text into natural Spoken dialogue.

---

## Intended Use & Limitations

### Use Cases

- **Style Checking**: Automating the detection of informal language in professional documents.
- **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
- **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.

### Limitations

- The model may struggle with Internet Slang or Ancient Literary Burmese that deviates from modern standard registers.
- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.
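
Because ambiguous sentences tend to receive lower confidence scores, one option is to treat low-confidence predictions as a third "Ambiguous" bucket rather than forcing a label. The sketch below is only an illustration: the `label_with_threshold` helper and the `0.75` cutoff are hypothetical and not part of the released model, and the probabilities are made-up stand-ins for the output of `model.predict_proba(...)`.

```python
def label_with_threshold(probabilities, threshold=0.75):
    """Map rows of [P(formal), P(colloquial)] to labels.

    Rows whose top probability is below `threshold` (a hypothetical cutoff)
    are reported as "Ambiguous" instead of being forced into a class.
    """
    names = {0: "Written/Formal", 1: "Spoken/Colloquial"}
    results = []
    for row in probabilities:
        pred = max(range(len(row)), key=lambda c: row[c])
        confidence = row[pred]
        label = names[pred] if confidence >= threshold else "Ambiguous"
        results.append((label, confidence))
    return results


# Made-up probabilities standing in for model.predict_proba(sample_texts).
example_probs = [[0.97, 0.03], [0.08, 0.92], [0.56, 0.44]]

for label, confidence in label_with_threshold(example_probs):
    print(f"{label} ({confidence:.2%} confidence)")
```

The cutoff should be tuned on held-out data for the application at hand; a stricter value sends more sentences to manual review, while a looser one accepts more borderline predictions.
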

## Citation

### BibTeX

```bibtex
@misc{myx_styleclassifier_2026,
  author       = {Khant Sint Heinn (Kalix Louis)},
  title        = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
  year         = {2026},
  publisher    = {Hugging Face},
  organization = {DatarrX},
  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
}
```

---

## About the Author

**Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

**Connect with the Author:** [GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

---

*Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.*