	---
	license: apache-2.0
	datasets:
	- kalixlouiis/myX-Corpus
	language:
	- my
	- en
	metrics:
	- perplexity
	pipeline_tag: feature-extraction
	tags:
	- tokenizer
	- burmese
	- myanmar
	- nlp
	- sentencepiece
	- unigram
	- syllable-aware
	- datarrx
	---
	# DatarrX / myX-Tokenizer ⚔️

	myX-Tokenizer is a high-performance, syllable-aware Unigram tokenizer built specifically for the Burmese language. Developed by [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis) under [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX), it aims to close a gap in Myanmar Natural Language Processing (NLP) by segmenting text efficiently and along linguistically meaningful boundaries.

	## 🎯 Core Objectives

	Current tokenization methods for Burmese often suffer from excessive character-level fragmentation or a lack of understanding of syllabic structures. myX-Tokenizer addresses these issues through:

	* Syllabic Integrity: Optimized to preserve the structural meaning of Burmese syllables, preventing meaningless character splits.
	* Bilingual Optimization: Expertly handles code-mixed (Burmese + English) contexts, maintaining high efficiency for both languages within a single string.
	* LLM Compatibility: Designed to reduce token counts for Large Language Models (LLMs), effectively lowering inference latency and computational costs.
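One way to make the token-count objective measurable is to compare the number of pieces with the number of characters and UTF-8 bytes in the input. The following is a minimal sketch: `fragmentation_stats` is a hypothetical helper that is not part of this repository, and the piece list in the example is hand-written for illustration rather than real tokenizer output.

```python
def fragmentation_stats(text: str, pieces: list[str]) -> dict[str, float]:
    """Summarize how coarsely `text` was segmented into `pieces`.

    `pieces` is whatever list of tokens a tokenizer returned for `text`,
    for example the output of SentencePiece's `encode_as_pieces`.
    """
    if not pieces:
        raise ValueError("pieces must not be empty")
    n_chars = len(text)                  # Unicode code points
    n_bytes = len(text.encode("utf-8"))  # Burmese code points take 3 bytes each
    return {
        "tokens": len(pieces),
        "chars_per_token": n_chars / len(pieces),
        "bytes_per_token": n_bytes / len(pieces),
    }


# Hand-written, hypothetical segmentation, used only to exercise the helper.
example_text = "မြန်မာစာ NLP"
example_pieces = ["▁မြန်မာ", "စာ", "▁NLP"]
stats = fragmentation_stats(example_text, example_pieces)
print(stats)
```

Higher `chars_per_token` values mean less fragmentation. Running the same helper over the output of two tokenizers on the same text gives a like-for-like comparison.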

	---

	## 🛠️ Technical Specifications

	This model was trained directly on cleaned raw text without heavy pre-processing to ensure the highest degree of data fidelity.

	* Algorithm: Unigram Language Model (a probabilistic model that scores candidate segmentations by likelihood rather than applying fixed merge rules as BPE does, which suits morphologically rich text).
	* Vocabulary Size: 64,000.
	* Normalization: NFKC (Normalization Form KC).
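	* Key Features:
	  * Byte-fallback: Characters missing from the vocabulary are encoded as UTF-8 byte tokens instead of a single unknown token.
	  * Split Digits: Every digit is emitted as its own token, so numbers are segmented consistently.
	  * Dummy Prefix: A word-boundary marker (`▁`) is prepended to the input, so the first word is treated like any other word.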
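The specifications above map onto standard SentencePiece trainer options. The sketch below shows one way they could be expressed; it is not the authors' actual training script. The file name `corpus.txt` and the prefix `myx_sketch` are placeholders, and any option not listed above is left at the library default.

```python
def build_trainer_args(corpus_path: str, model_prefix: str) -> dict:
    """Collect SentencePiece trainer options that mirror the listed specifications."""
    return {
        "input": corpus_path,            # raw text, one sentence per line (assumed layout)
        "model_prefix": model_prefix,    # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 64000,
        "normalization_rule_name": "nfkc",
        "byte_fallback": True,
        "split_digits": True,
        "add_dummy_prefix": True,
    }


def train(corpus_path: str, model_prefix: str) -> None:
    """Train a model; requires the `sentencepiece` package and a sufficiently large corpus."""
    import sentencepiece as spm  # imported lazily so the options can be inspected without it

    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))


# Placeholder paths, for illustration only.
args = build_trainer_args("corpus.txt", "myx_sketch")
print(args["model_type"], args["vocab_size"])
```

Keeping the options in a plain dictionary makes them easy to log next to the trained model, which helps when comparing vocabulary sizes or normalization settings later.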

	### Training Data
	Trained on 1.5 million high-quality, mixed Burmese-English sentences selected from the [kalixlouiis/myX-Corpus](https://huggingface.co/datasets/kalixlouiis/myX-Corpus).



	---

	## ⚠️ Limitations & Considerations

	* Orthographic Sensitivity: Tokenization quality is highly dependent on the correct spelling of the source text.
	* English-Only Performance: While highly efficient for mixed text, token counts may be slightly higher than global tokenizers in purely English contexts.
	* Domain Variance: Rare Pali/Sanskrit loanwords or ancient scripts may revert to character-level tokenization.
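To check whether a given text runs into these limitations, it can help to measure how much of its segmentation fell back to bytes or single characters. The helper below is a hypothetical sketch, not part of this repository. It assumes byte-fallback pieces are rendered in the `<0xNN>` form that SentencePiece uses, and the example pieces are hand-written rather than real tokenizer output.

```python
import re

# SentencePiece renders byte-fallback pieces as <0x00> ... <0xFF>.
BYTE_PIECE = re.compile(r"^<0x[0-9A-Fa-f]{2}>$")
WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece puts before a word


def fallback_report(pieces: list[str]) -> dict[str, float]:
    """Report how much of a segmentation fell back to bytes or single characters."""
    if not pieces:
        return {"byte_ratio": 0.0, "single_char_ratio": 0.0}
    byte_pieces = sum(1 for p in pieces if BYTE_PIECE.match(p))
    single_chars = sum(
        1
        for p in pieces
        if not BYTE_PIECE.match(p) and len(p.replace(WORD_BOUNDARY, "")) == 1
    )
    return {
        "byte_ratio": byte_pieces / len(pieces),
        "single_char_ratio": single_chars / len(pieces),
    }


# Hand-written pieces for illustration, not real tokenizer output.
print(fallback_report(["\u2581ab", "c", "<0xE1>", "<0x80>"]))
# {'byte_ratio': 0.5, 'single_char_ratio': 0.25}
```

A high `single_char_ratio` on a document suggests misspellings or rare loanwords, while a non-zero `byte_ratio` points to characters that are absent from the vocabulary.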

	---

	## 💻 Usage Guide

	To use this model, you need the `sentencepiece` library, plus `huggingface_hub` to download the model file. You can load and use the model directly with the following snippet:

	```python
	import sentencepiece as spm
	from huggingface_hub import hf_hub_download

	# Download the model from Hugging Face
	model_path = hf_hub_download(repo_id="DatarrX/myX-Tokenizer", filename="myX-Tokenizer.model")

	# Initialize the processor
	sp = spm.SentencePieceProcessor(model_file=model_path)

	# Tokenize example text
	text = "မြန်မာစာ NLP နည်းပညာ ဖွံ့ဖြိုးတိုးတက်ရေးအတွက် ကျွန်တော်တို့ ကြိုးစားနေပါသည်။"
	tokens = sp.encode_as_pieces(text)

	print(f"Tokens: {tokens}")
	```
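
Beyond listing pieces, the same processor can produce integer ids for a model and decode them back into text. The helper below is a small sketch: `roundtrip` is a hypothetical name, and `processor` is assumed to be a loaded `sentencepiece.SentencePieceProcessor` such as `sp` in the snippet above.

```python
def roundtrip(processor, text: str) -> tuple[list[int], str]:
    """Encode `text` to ids and decode it back with a SentencePiece-style processor.

    `processor` is expected to provide `encode(text, out_type=int)` and
    `decode(ids)`, as `sentencepiece.SentencePieceProcessor` does.
    """
    ids = processor.encode(text, out_type=int)
    return ids, processor.decode(ids)
```

With a loaded processor, `ids, restored = roundtrip(sp, text)` returns the id sequence and the decoded string. Because the tokenizer applies NFKC normalization, `restored` may differ from `text` when the input contains compatibility characters such as full-width letters.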

	## ✍️ Project Authors
	- Developer: [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
	- Organization: [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)

	## Citation

	If you use this tokenizer in your research or project, please cite it as follows:

	### APA 7th Edition
	Khant Sint Heinn. (2026). myX-Tokenizer: A Syllable-aware Bilingual Unigram Tokenizer for Burmese and English (Version 1.0) [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer

	### BibTeX
	```BibTeX
	@software{khantsintheinn2026myxtokenizer,
	author = {Khant Sint Heinn},
	title = {myX-Tokenizer: A Syllable-aware Bilingual Unigram Tokenizer for Burmese and English},
	version = {1.0},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/DatarrX/myX-Tokenizer},
	note = {Developed under DatarrX (Myanmar Open Source NGO)}
	}
	```

	We are committed to advancing the Burmese NLP ecosystem. For feedback or collaboration, please use the Hugging Face Discussion tab.

	---
	# DatarrX - myX-Tokenizer

	A syllable-aware Unigram tokenizer built specifically for the Burmese language. This model is published by [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX) and was created primarily by [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis).

	## 🎯 Core Concept & Motivation

	This model was created to address the tokenization difficulties currently faced in Myanmar NLP.

	* Syllable-aware Efficiency: To segment text according to its regular syllable structure, without breaking it into meaningless characters.
	* Bilingual Optimization: To produce compact tokenization results even for code-mixed text that combines Burmese and English.
	* LLM Inference Efficiency: To reduce token counts when used with Large Language Models (LLMs), thereby lowering computational (inference) cost.

	---

	## 🛠️ Technical Specifications

	This model was trained directly on already-cleaned raw text, without any additional pre-processing.

	* Algorithm: Unigram Language Model (being more probabilistic than BPE, it can better reflect the nature of the language)
	* Vocab Size: 64,000
	* Normalization: NFKC (Normalization Form KC)
	* Features: Includes Byte-fallback (for out-of-vocabulary characters) and Split Digits.
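
The effect of these settings can be illustrated in plain Python without loading the model. The functions below only mimic what NFKC normalization, digit splitting, and byte fallback do to a string. They are hypothetical illustrations, not the SentencePiece implementation, and the real tokenizer may differ in details such as whitespace handling.

```python
import re
import unicodedata


def nfkc(text: str) -> str:
    """Apply Unicode NFKC, the normalization form named in the specifications."""
    return unicodedata.normalize("NFKC", text)


def split_digits(text: str) -> list[str]:
    """Mimic the effect of digit splitting: every digit stands alone."""
    return re.findall(r"\d|\D+", text)


def byte_fallback(char: str) -> list[str]:
    """Spell a character as its UTF-8 bytes, written as <0xNN> pieces."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


# Full-width "NLP 123" written with escapes; NFKC folds it to ASCII.
print(nfkc("\uff2e\uff2c\uff30 \uff11\uff12\uff13"))  # NLP 123
print(split_digits("year2026"))                       # ['year', '2', '0', '2', '6']
print(byte_fallback("\u20ac"))                        # ['<0xE2>', '<0x82>', '<0xAC>']
```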

	### Training Data
	This model was trained on the [kalixlouiis/myX-Corpus](https://huggingface.co/datasets/kalixlouiis/myX-Corpus). A random sample of 1.5 million high-quality sentences was drawn from that corpus for training.
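
The card does not say how the random sample was drawn. One common single-pass approach, shown here purely as an assumption, is reservoir sampling (Algorithm R), which picks a fixed number of lines uniformly without loading the whole corpus into memory. The function name and the synthetic sentences below are illustrative only.

```python
import random
from typing import Iterable


def reservoir_sample(lines: Iterable[str], k: int, seed: int = 0) -> list[str]:
    """Draw up to `k` lines uniformly at random in one pass (Algorithm R)."""
    rng = random.Random(seed)
    sample: list[str] = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            sample.append(line)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = line
    return sample


# Synthetic stand-in for a corpus; a real run would iterate over corpus lines.
picked = reservoir_sample((f"sentence {n}" for n in range(1000)), k=5)
print(len(picked))  # 5
```

Fixing `seed` makes the draw reproducible, which matters when the sampled subset is used to train a tokenizer that others may want to rebuild.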

	---

	## ⚠️ Limitations & Bias

	* Syllable Consistency: If the source text is misspelled, the tokenization result may be affected.
	* Bilingual Trade-off: Because the tokenizer was built with Burmese as the priority, purely English sentences may yield slightly higher token counts than other global tokenizers, but it performs better on text mixed with Burmese.
	* Domain Specificity: For ancient literature or very rare Pali/Sanskrit texts, segmentation may fall back to the character level.

	---

	## 💻 How to Use

	Using this model requires the `sentencepiece` library, plus `huggingface_hub` to download the model file. It can be loaded and used directly with the following code:

	```python
	import sentencepiece as spm
	from huggingface_hub import hf_hub_download

	# Download the model file
	model_path = hf_hub_download(repo_id="DatarrX/myX-Tokenizer", filename="myX-Tokenizer.model")

	# Load Tokenizer
	sp = spm.SentencePieceProcessor(model_file=model_path)

	# Test Sentence
	text = "မြန်မာစာ NLP နည်းပညာ ဖွံ့ဖြိုးတိုးတက်ရေးအတွက် ကျွန်တော်တို့ ကြိုးစားနေပါသည်။"
	print(f"Pieces: {sp.encode_as_pieces(text)}")
	```

	For suggestions or questions about this model, please get in touch through the Hugging Face Discussion tab. We are continually working toward the advancement of Myanmar NLP.

	## License 📜

	This project is licensed under the Apache License 2.0.

	### What does this mean?
	The Apache License 2.0 is a permissive license that allows you to:

	* Commercial Use: You can use this tokenizer for commercial purposes.
	* Modification: You can modify the model or the code for your specific needs.
	* Distribution: You can share and distribute the original or modified versions.
	* Sublicensing: You can grant sublicenses to others.

	### Conditions:
	* Attribution: You must give appropriate credit to the author (Khant Sint Heinn) and the organization (DatarrX).
	* License Notice: You must include a copy of the license and any original copyright notice in your distribution.

	For more details, you can read the full license text at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0).