	---
	license: mit
	datasets:
	- DatarrX/Myanmar-Style-Classification-Corpus
	language:
	- my
	pipeline_tag: text-classification
	metrics:
	- f1
	- accuracy
	- precision
	- recall
	library_name: sklearn
	---

	# 📝 myX-StyleClassifier: A Classifier for Myanmar Spoken (ပြောဟန်) and Written (ရေးဟန်) Styles

	myX-StyleClassifier is a machine learning model developed by Khant Sint Heinn at DatarrX that classifies Myanmar (Burmese) text into two linguistic registers: Written Style (Formal) and Spoken Style (Colloquial).

	## Model Details

	- Developed by: [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
	- Organization: [DatarrX | ဒေတာ-အက်စ်](https://huggingface.co/DatarrX)
	- Model Type: Ensemble Machine Learning (Voting Classifier)
	- Language(s): Burmese (Myanmar)
	- License: MIT
	- Trained on: [Myanmar Style Classification Corpus (MSCC)](https://huggingface.co/datasets/DatarrX/Myanmar-Style-Classification-Corpus)

	## Training Methodology

	To generalize beyond simple keyword matching, the model pairs character-level TF-IDF features with a soft-voting ensemble of three classifiers.

	### 1. Feature Engineering
	The model utilizes a TF-IDF (Term Frequency-Inverse Document Frequency) vectorizer with a character-level N-gram range of (2, 4). This allows the model to capture the nuances of Myanmar grammatical suffixes (e.g., "...သည်" vs "...တယ်") and complex structural patterns without requiring a custom tokenizer.
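
As a rough illustration of this feature step, the sketch below builds a character-level TF-IDF vectorizer with scikit-learn's `TfidfVectorizer`. The `analyzer="char"` setting and the two example sentences are assumptions made for illustration; the model card states only that the n-grams are character-level with a range of (2, 4).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative sentences only (not drawn from the training corpus).
texts = [
    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # formal ending: ...သည်
    "ငါ ကျောင်းသွားတယ်။",  # colloquial ending: ...တယ်
]

# Assumption: plain character n-grams ("char"); the card does not say
# whether "char" or "char_wb" was used.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))
features = vectorizer.fit_transform(texts)

vocabulary = vectorizer.vocabulary_  # maps each n-gram to a column index

# The three code points before the sentence mark "။" form the suffix.
formal_suffix = texts[0][-4:-1]
colloquial_suffix = texts[1][-4:-1]

print("matrix shape:", features.shape)
print(repr(formal_suffix), "in vocabulary:", formal_suffix in vocabulary)
print(repr(colloquial_suffix), "in vocabulary:", colloquial_suffix in vocabulary)
```

Because the suffixes become features directly, no word segmentation is needed before vectorizing.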

	### 2. Ensemble Architecture
	We implemented a Soft Voting Classifier that combines the strengths of three diverse algorithms:
	* Logistic Regression: Tuned with `C=10.0` (weaker regularization than the scikit-learn default) for a sharp linear decision boundary.
	* Support Vector Machine (SVC): Provides robust decision boundaries in the high-dimensional TF-IDF space.
	* Random Forest: Captures non-linear relationships and the importance of individual n-gram features.

	The final configuration was selected with `GridSearchCV`, which tunes the hyperparameters by cross-validation on the training data.
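
The architecture described above could be wired together roughly as follows. This is a minimal sketch, not the author's training script: the toy sentences are synthetic (not the MSCC corpus), the step names (`tfidf`, `vote`, `lr`, `svc`, `rf`), the search grid, `cv=2`, and every hyperparameter other than `C=10.0` are assumptions chosen so the example runs quickly.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Synthetic toy data (NOT the MSCC corpus): verb stems + register endings.
stems = ["ကျောင်းသွား", "စာဖတ်", "ထမင်းစား", "အလုပ်လုပ်", "အိပ်"]
formal = [stem + ending for stem in stems for ending in ("သည်။", "ပါသည်။")]
colloquial = [stem + ending for stem in stems for ending in ("တယ်။", "ပါတယ်။")]
texts = formal + colloquial
labels = [0] * len(formal) + [1] * len(colloquial)  # 0 = formal, 1 = colloquial

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(C=10.0, max_iter=1000)),
        # Soft voting averages probabilities, so SVC needs probability=True.
        ("svc", SVC(probability=True, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="soft",
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 4))),
    ("vote", ensemble),
])

# Hypothetical search grid; the values actually searched are not documented.
param_grid = {"vote__lr__C": [1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=2, scoring="f1_macro")
search.fit(texts, labels)

model = search.best_estimator_
new_texts = ["ရေသောက်သည်။", "ရေသောက်တယ်။"]
probabilities = model.predict_proba(new_texts)  # one row per text, one column per class
print("best params:", search.best_params_)
print("predictions:", model.predict(new_texts))
```

Wrapping the vectorizer and the ensemble in one `Pipeline` lets the fitted object accept raw strings directly, which is how the published model is called in the usage section.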

	## Evaluation Results

	The model was validated against a blind test set of 100 unseen sentences (not included in the training/validation split).

	### Metrics
	| Metric | Score |
	|---|---|
	| Accuracy | 96.00% |
	| Macro F1-Score | 0.96 |

	### Classification Report
	| Class | Precision | Recall | F1-Score | Support |
	|---|---|---|---|---|
	| Formal (0) | 0.97 | 0.93 | 0.95 | 40 |
	| Colloquial (1) | 0.95 | 0.98 | 0.97 | 60 |

	### Evaluation breakdown (Confusion Matrix)

	The following table illustrates how the model performed on 100 unseen test sentences:

	| | Predicted Formal | Predicted Colloquial |
	|---|:---:|:---:|
	| Actual Formal | 37 (Correct) | 3 (Misclassified) |
	| Actual Colloquial | 1 (Misclassified) | 59 (Correct) |

	Key Insights from the Matrix:
	* True Positives (Formal): 37 formal sentences were correctly identified.
	* True Positives (Colloquial): 59 colloquial sentences were correctly identified.
	* Misclassifications: Only 4 out of 100 sentences were misclassified, primarily due to "Hybrid" linguistic features where the sentence structure could reasonably belong to either style.
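
The reported accuracy, per-class precision, recall, and F1 can be re-derived from the four counts in the confusion matrix. The short check below uses only those counts; the variable names are illustrative.

```python
# Confusion matrix from the table above: rows = actual, columns = predicted.
labels = ["Formal", "Colloquial"]
matrix = [
    [37, 3],   # actual Formal
    [1, 59],   # actual Colloquial
]

total = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / total

metrics = {}
for i, name in enumerate(labels):
    true_pos = matrix[i][i]
    predicted = sum(row[i] for row in matrix)  # column sum
    actual = sum(matrix[i])                    # row sum (support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    metrics[name] = {"precision": precision, "recall": recall, "f1": f1, "support": actual}

macro_f1 = sum(m["f1"] for m in metrics.values()) / len(metrics)

print(f"accuracy = {accuracy:.4f}, macro F1 = {macro_f1:.4f}")
for name, m in metrics.items():
    print(f"{name}: precision={m['precision']:.4f} recall={m['recall']:.4f} "
          f"f1={m['f1']:.4f} support={m['support']}")
```

For example, Formal precision is 37 / (37 + 1) and Formal recall is 37 / 40, which agree with the classification report up to rounding.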

	### Error Analysis (Ambiguity Handling)
	In the 4% of cases where the model failed, human review confirmed stylistic ambiguity. Certain Myanmar sentences are "Hybrid" or "Dual-use," where the vocabulary is neutral enough to be used in both formal writing and polite daily conversation.
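
One possible way to surface such hybrid sentences downstream is to treat low-confidence predictions as ambiguous instead of forcing a label. The helper below is hypothetical and not part of the released model; the 0.70 threshold is an arbitrary assumption, and the probability rows are made-up stand-ins for `predict_proba` output.

```python
def label_with_ambiguity(probabilities, threshold=0.70):
    """Map [p_formal, p_colloquial] rows to a label, or "Ambiguous" below threshold."""
    names = ["Written/Formal", "Spoken/Colloquial"]
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        results.append(names[best] if row[best] >= threshold else "Ambiguous")
    return results

# Made-up probability rows standing in for model.predict_proba(...) output.
example_rows = [[0.95, 0.05], [0.10, 0.90], [0.55, 0.45]]
flags = label_with_ambiguity(example_rows)
print(flags)
```

A suitable threshold would need to be chosen on held-out data rather than taken from this sketch.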


	## How to Use
	> To use this model, you need `scikit-learn`, `joblib`, and `huggingface_hub` installed.

	```Python
	import joblib
	from huggingface_hub import hf_hub_download

	# 1. Download the model from Hugging Face Hub
	repo_id = "DatarrX/myX-StyleClassifier"
	filename = "model.joblib"
	checkpoint_path = hf_hub_download(repo_id=repo_id, filename=filename)

	# 2. Load the Ensemble Model
	model = joblib.load(checkpoint_path)

	# 3. Predict Styles
	# 0 = Written/Formal, 1 = Spoken/Colloquial
	sample_texts = [
	    "ကျွန်ုပ်သည် ကျောင်းသို့ သွားပါသည်။",  # Formal
	    "ငါ ကျောင်းသွားမလို့။",  # Colloquial
	    "ခဏစောင့်ပေးပါ။",  # Ambiguous/Polite
	]

	predictions = model.predict(sample_texts)
	probabilities = model.predict_proba(sample_texts) # Get confidence scores

	for text, pred, prob in zip(sample_texts, predictions, probabilities):
	    label = "Spoken/Colloquial" if pred == 1 else "Written/Formal"
	    confidence = prob[pred] * 100
	    print(f"Text: {text} | Style: {label} ({confidence:.2f}% confidence)")
	```
	---

	## 🔄 Beyond Classification: Style Transfer

	Once you have identified the style of your text using myX-StyleClassifier, you can use our transformation models to switch between registers:

	* [myX-TransStyle-S2W](https://huggingface.co/DatarrX/myX-TransStyle-S2W): Convert detected Spoken text into formal Written prose.
	* [myX-TransStyle-W2S](https://huggingface.co/DatarrX/myX-TransStyle-W2S): Transform detected Written text into natural Spoken dialogue.
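
Routing from a classifier prediction to the matching transfer model could look like the sketch below. The helper name `pick_transfer_model` is hypothetical; it only returns a repository ID and makes no assumption about how the transfer models themselves are loaded or run.

```python
# Repository IDs taken from the links above.
TRANSFER_MODELS = {
    1: "DatarrX/myX-TransStyle-S2W",  # Spoken/Colloquial -> Written
    0: "DatarrX/myX-TransStyle-W2S",  # Written/Formal -> Spoken
}

def pick_transfer_model(predicted_label):
    """Return the repo ID of the model that converts away from the detected style."""
    return TRANSFER_MODELS[int(predicted_label)]

print(pick_transfer_model(1))
print(pick_transfer_model(0))
```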

	---

	## Intended Use & Limitations

	### Use Cases
	- Style Checking: Automating the detection of informal language in professional documents.
	- Chatbot Alignment: Ensuring AI responses match the user's preferred register.
	- NLP Pre-processing: Filtering datasets for fine-tuning specific language models.

	### Limitations
	- The model may struggle with Internet Slang or Ancient Literary Burmese that deviates from modern standard registers.
	- Sentences that lack specific grammatical particles (suffixes) may result in lower confidence scores.

	## Citation

	### BibTeX
	```BibTeX
	@misc{myx_styleclassifier_2026,
	  author = {Khant Sint Heinn (Kalix Louis)},
	  title = {myX-StyleClassifier: A Robust Myanmar Style Classification Model},
	  year = {2026},
	  publisher = {Hugging Face},
	  organization = {DatarrX},
	  howpublished = {https://huggingface.co/DatarrX/myX-StyleClassifier}
	}
	```
	---

	## About the Author

	Khant Sint Heinn, working under the name Kalix Louis, is a Machine Learning Engineer focused on Natural Language Processing (NLP), data foundations, and open-source AI development. His work is centered on improving support for the Burmese (Myanmar) language in modern AI systems by building high-quality datasets, practical tools, and scalable infrastructure for language technology.

	He is currently the Lead Developer at DatarrX, where he develops data pipelines, manages large-scale data collection workflows, and helps create open-source resources for researchers, developers, and organizations. His experience includes data engineering, web scripting, dataset curation, and building systems that support real-world machine learning applications.

	Khant Sint Heinn is especially interested in advancing low-resource languages and making AI more accessible to underrepresented communities. Through his open-source contributions, he works to strengthen the Burmese (Myanmar) tech ecosystem and provide reliable building blocks for future language models, search systems, and intelligent applications.

	His goal is simple: to turn limited language resources into practical opportunities through clean data, useful tools, and community-driven innovation.

	Connect with the Author:
	[GitHub](https://github.com/kalixlouiis) | [Hugging Face](https://huggingface.co/kalixlouiis) | [Kaggle](https://www.kaggle.com/organizations/kalixlouiis)

	---
	Developed with ❤️ by [DatarrX](https://huggingface.co/DatarrX) to empower the Myanmar AI ecosystem.