Text Classification
Scikit-learn
Joblib
Burmese
Commit c3a90ee (verified), committed by kalixlouiis · 1 Parent(s): 5a3c992

Update README.md

Files changed (1)
  1. README.md +134 -1
README.md CHANGED
@@ -11,4 +11,137 @@ metrics:
11
  - precision
12
  - recall
13
  library_name: sklearn
14
- ---
+ ---
15
+
16
+ # 📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles
17
+
18
+ **myX-StyleClassifier** is a machine learning model developed by **Khant Sint Heinn** under **DatarrX** that classifies Myanmar (Burmese) text into two distinct linguistic registers: **Written Style (Formal)** and **Spoken Style (Colloquial)**.
19
+
20
+ ## Model Details
21
+
22
+ - **Developed by:** [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
23
+ - **Organization:** [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
24
+ - **Model Type:** Ensemble Machine Learning (Voting Classifier)
25
+ - **Language(s):** Burmese (Myanmar)
26
+ - **License:** MIT
27
+ - **Parent Dataset:** [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)
28
+
29
+ ## Training Methodology
30
+
31
+ To perform robustly beyond simple keyword matching, the model is trained as an ensemble: three classifiers vote over character-level TF-IDF features.
32
+
33
+ ### 1. Feature Engineering
34
+ The model utilizes a **TF-IDF (Term Frequency-Inverse Document Frequency)** vectorizer with a character-level N-gram range of **(2, 4)**. This allows the model to capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.
35
+
36
+ ### 2. Ensemble Architecture
37
+ We implemented a **Soft Voting Classifier** that combines the strengths of three diverse algorithms:
38
+ * **Logistic Regression:** Uses `C=10.0` (weaker regularization than the scikit-learn default) for high-precision linear separation.
+ * **Support Vector Machine (SVC):** Provides robust decision boundaries in the high-dimensional TF-IDF feature space.
+ * **Random Forest:** Captures non-linear relationships and the importance of specific n-grams.
41
+
42
+ The final configuration was selected via **GridSearchCV**, so the hyperparameters were chosen by cross-validated search rather than hand-picked.
43
+
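The architecture described above can be sketched roughly as follows. This is not the training script: the toy sentences, the linear SVC kernel, the forest size, the search grid, and the step names (`tfidf`, `clf`, `lr`, `svc`, `rf`) are all assumptions made so the example runs on its own. Only the soft voting, the three estimator types, `C=10.0`, the (2, 4) character n-grams, and the use of `GridSearchCV` come from the README.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data for illustration only (NOT the MSCC corpus): the same verb stems
# with a written-style ending (label 0) or a spoken-style ending (label 1).
stems = ["ကျောင်းသွား", "စာဖတ်", "ထမင်းစား", "အလုပ်လုပ်",
         "အိမ်ပြန်", "ရေသောက်", "စာရေး", "ဈေးသွား"]
texts = [s + "ပါသည်။" for s in stems] + [s + "တယ်။" for s in stems]
labels = [0] * len(stems) + [1] * len(stems)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        # probability=True is required so SVC can take part in soft voting.
        ("svc", SVC(kernel="linear", probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("clf", ensemble),
])

# Hypothetical search grid; the grid actually searched is not documented.
param_grid = {"clf__lr__C": [1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

proba = search.predict_proba(["ရုံးသွားပါသည်။", "ရုံးသွားတယ်။"])
print(search.best_params_, proba.round(2))
```

Wrapping the vectorizer and the ensemble in one `Pipeline` is what lets the saved model accept raw strings in `predict` and `predict_proba`.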
44
+ ## Evaluation Results
45
+
46
+ The model was validated against a blind test set of **100 unseen sentences** (not included in the training/validation split).
47
+
48
+ ### Metrics
49
+ | Metric | Score |
+ |---|---|
+ | **Accuracy** | **96.00%** |
+ | **Macro F1-Score** | **0.96** |
53
+
54
+ ### Classification Report
55
+ | Class | Precision | Recall | F1-Score | Support |
+ |---|---|---|---|---|
+ | **Formal (0)** | 0.97 | 0.93 | 0.95 | 40 |
+ | **Colloquial (1)** | 0.95 | 0.98 | 0.97 | 60 |
59
+
60
+ ### Evaluation Breakdown (Confusion Matrix)
61
+
62
+ The following table illustrates how the model performed on 100 unseen test sentences:
63
+
64
+ | | Predicted Formal | Predicted Colloquial |
+ |---|:---:|:---:|
+ | **Actual Formal** | **37** (Correct) | **3** (Misclassified) |
+ | **Actual Colloquial** | **1** (Misclassified) | **59** (Correct) |
68
+
69
+ **Key Insights from the Matrix:**
70
+ * **True Positives (Formal):** 37 formal sentences were correctly identified.
71
+ * **True Positives (Colloquial):** 59 colloquial sentences were correctly identified.
72
+ * **Misclassifications:** Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
73
+
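As a consistency check, the reported scores can be recomputed from the four confusion-matrix counts alone. The sketch below uses plain Python; the helper name `per_class_metrics` is introduced here for illustration.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted,
# index 0 = Formal, index 1 = Colloquial.
cm = [[37, 3],
      [1, 59]]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum (the class support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total
formal = per_class_metrics(cm, 0)
colloquial = per_class_metrics(cm, 1)
macro_f1 = (formal[2] + colloquial[2]) / 2  # unweighted mean of per-class F1

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f}")
print("formal     P/R/F1:", [f"{v:.3f}" for v in formal])
print("colloquial P/R/F1:", [f"{v:.3f}" for v in colloquial])
```

The counts agree with the classification report: Formal recall is 37/40 = 0.925 (shown as 0.93), Formal precision is 37/38, and Colloquial precision is 59/62.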
74
+ ### Error Analysis (Ambiguity Handling)
75
+ In the 4% of cases where the model failed, human review confirmed **stylistic ambiguity**. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
76
+
77
+
78
+ ## How to Use
79
+ > To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.
80
+
81
+ ```python
+ import joblib
+ from huggingface_hub import hf_hub_download
+
+ # 1. Download the model file from the Hugging Face Hub
+ repo_id = "DatarrX/myX-StyleClassifier"
+ filename = "model.joblib"
+ checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)
+
+ # 2. Load the ensemble model
+ model = joblib.load(checkpoint_path)
+
+ # 3. Predict styles
+ # 0 = Written/Formal, 1 = Spoken/Colloquial
+ sample_texts = [
+     "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
+     "ငါ ကျောင်းသွားမလို့။",  # Colloquial
+     "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
+ ]
+
+ predictions = model.predict(sample_texts)
+ probabilities = model.predict_proba(sample_texts)  # Per-class confidence scores
+
+ for text, pred, prob in zip(sample_texts, predictions, probabilities):
+     label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
+     # The labels 0 and 1 double as column indices into the probability row
+     confidence = prob[pred] * 100
+     print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
+ ```
109
+
110
+ ## Intended Use & Limitations
111
+
112
+ ### Use Cases
113
+ - **Style Checking**: Automating the detection of informal language in professional documents.
114
+ - **Chatbot Alignment**: Ensuring AI responses match the user's preferred register.
115
+ - **NLP Pre-processing**: Filtering datasets for fine-tuning specific language models.
116
+
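For the style-checking use case, one possible pattern is to act only on confident predictions and route the rest to a human reviewer. The sketch below is hypothetical: the function name `triage_by_register`, the 0.8 threshold, and the `StubModel` with its fixed scores are illustrative assumptions, not part of the released model. It relies only on `predict_proba` returning one `[P(formal), P(colloquial)]` row per sentence, which follows from the 0/1 label mapping.

```python
def triage_by_register(model, sentences, threshold=0.8):
    """Label each sentence 'formal', 'colloquial', or 'uncertain'.

    `model` is any object whose predict_proba returns one
    [P(formal), P(colloquial)] row per sentence (labels 0 and 1).
    """
    results = []
    rows = model.predict_proba(sentences)
    for sentence, (p_formal, p_colloquial) in zip(sentences, rows):
        if p_colloquial >= threshold:
            tag = "colloquial"
        elif p_formal >= threshold:
            tag = "formal"
        else:
            tag = "uncertain"  # e.g. hybrid sentences without clear particles
        results.append((sentence, tag))
    return results


class StubModel:
    """Stand-in with fixed, made-up scores so the sketch runs offline."""

    def predict_proba(self, sentences):
        canned = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]
        return canned[: len(sentences)]


samples = ["sentence A", "sentence B", "sentence C"]  # placeholder inputs
report = triage_by_register(StubModel(), samples)
print(report)
```

With the real model, pass the object returned by `joblib.load` in place of `StubModel()`; the threshold would need to be chosen on your own data.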
117
+ ### Limitations
118
+ - The model may struggle with internet slang or ancient literary Burmese that deviates from modern standard registers.
119
+ - Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.
120
+
121
+ ## Citation
122
+
123
+ ### BibTeX
124
+ ```bibtex
+ @misc{myx_styleclassifier_2026,
+   author       = {Khant Sint Heinn (Kalix Louis)},
+   title        = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
+   year         = {2026},
+   publisher    = {Hugging Face},
+   organization = {DatarrX},
+   howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
+ }
+ ```
134
+ ---
135
+
136
+ ## About the Author
137
+
138
+ **Khant Sint Heinn**, working under the name **Kalix Louis**, is a **Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development**. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.
139
+
140
+ He is currently the **Lead Developer at DatarrX**, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.
141
+
142
+ Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.
143
+
144
+ His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.
145
+
146
+ **Connect with the Author:**
147
+ [GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)