	---
	license: apache-2.0
	datasets:
	- kalixlouiis/raw-data
	language:
	- my
	pipeline_tag: feature-extraction
	new_version: DatarrX/myX-Tokenizer
	---
	# DatarrX - myX-Tokenizer-Unigram ⚙️

	myX-Tokenizer-Unigram is a specialized tokenizer for the Burmese language based on the Unigram Language Model algorithm. Developed by [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis) under [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX), it is designed for probabilistic subword segmentation of Burmese text.

	## 🎯 Objectives & Characteristics

	* Unigram Excellence: Uses probabilistic subword tokenization, scoring candidate segmentations by piece probability rather than applying a fixed sequence of merges, which often aligns better with the morphological structure of Burmese than BPE.
	* Native Burmese Specialist: Trained exclusively on a Burmese-only corpus of 1.5 million cleaned sentences to keep the vocabulary focused on Burmese script.
	* Optimized Efficiency: Trained on a sampled, cleaned subset of the source data to balance segmentation quality against model size.
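
To make the contrast with BPE concrete, here is a minimal sketch of how a Unigram model chooses a segmentation: each candidate split is scored by the sum of its pieces' log-probabilities, and the best split is found with a Viterbi search. The vocabulary and probabilities are invented for illustration and are not taken from the released model; `viterbi_segment` is a hypothetical helper, not part of any library.

```python
import math


def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each known piece to its log-probability; a segmentation's
    score is the sum of the log-probabilities of its pieces.
    """
    n = len(text)
    max_len = max(len(p) for p in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical toy vocabulary built from three syllable strings.
a, b, c = "မြန်", "မာ", "စာ"
toy_vocab = {
    a + b: math.log(0.20),      # frequent two-syllable piece
    c: math.log(0.15),
    a: math.log(0.05),
    b: math.log(0.05),
    a + b + c: math.log(0.01),  # rare whole-word piece
}

print(viterbi_segment(a + b + c, toy_vocab))
```

With these made-up numbers the two-piece split wins because 0.20 × 0.15 is larger than both the whole-word probability (0.01) and the three-piece product.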

	## 🛠️ Technical Specifications

	* Algorithm: Unigram Language Model.
	* Vocabulary Size: 64,000.
	* Normalization: NFKC.
	* Features: Byte-fallback, Split Digits, and Dummy Prefix.
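
The options above are standard SentencePiece settings. The sketch below imitates their effect in plain Python to give some intuition; it approximates the behaviour rather than reproducing SentencePiece's implementation, and the function names `preprocess`, `split_digits`, and `byte_fallback` are hypothetical.

```python
import unicodedata

WORD_BOUNDARY = "\u2581"  # the "▁" marker SentencePiece uses for whitespace


def preprocess(text):
    """Apply NFKC normalization, then the dummy-prefix whitespace convention."""
    text = unicodedata.normalize("NFKC", text)
    # Spaces become the boundary marker, and the dummy prefix adds one
    # marker at the very start of the input.
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)


def split_digits(piece):
    """Break a piece so that every digit stands alone; other runs stay intact."""
    out, run = [], ""
    for ch in piece:
        if ch.isdigit():
            if run:
                out.append(run)
                run = ""
            out.append(ch)
        else:
            run += ch
    if run:
        out.append(run)
    return out


def byte_fallback(ch):
    """Represent an out-of-vocabulary character as UTF-8 byte pieces."""
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


print(preprocess("ＡＢ 12"))              # full-width letters fold to ASCII: ▁AB▁12
print(split_digits(WORD_BOUNDARY + "2026"))
print(byte_fallback("€"))                # ['<0xE2>', '<0x82>', '<0xAC>']
```

Byte-fallback means an unseen character degrades into a few byte pieces instead of a single unknown token, so decoding can still reconstruct the original text.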

	### Training Data
	Trained on 1.5 million cleaned Burmese sentences drawn from the [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data) dataset.

	## ⚠️ Important Considerations (Limitations)

	* Limited English Support: This tokenizer is strictly a Burmese-script specialist. English and other Latin-script text is handled poorly and may be split into many very short subword pieces.
	* Script Sensitivity: Optimized for modern Burmese orthography; segmentation quality may vary with older orthography or heavy use of specialized Pali/Sanskrit loanwords.
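
One way to work around the English limitation is to check how much of an input is Myanmar script before choosing a tokenizer. The sketch below is only an illustration: the helper names and the 0.5 threshold are assumptions, not part of this project.

```python
# Unicode blocks that hold Myanmar-script characters.
MYANMAR_BLOCKS = (
    (0x1000, 0x109F),  # Myanmar
    (0xAA60, 0xAA7F),  # Myanmar Extended-A
    (0xA9E0, 0xA9FF),  # Myanmar Extended-B
)


def is_myanmar(ch):
    """True if the character's code point lies in a Myanmar Unicode block."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in MYANMAR_BLOCKS)


def myanmar_ratio(text):
    """Fraction of non-whitespace characters that are Myanmar script."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    return sum(is_myanmar(ch) for ch in chars) / len(chars)


def needs_fallback(text, threshold=0.5):
    """Flag inputs that are mostly non-Burmese (threshold is an arbitrary choice)."""
    return myanmar_ratio(text) < threshold


print(myanmar_ratio("Unigram algorithm"))  # 0.0
print(needs_fallback("Unigram algorithm"))  # True
```

Inputs flagged by `needs_fallback` could be routed to a multilingual tokenizer, while mostly Burmese text goes through this one.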

	## Citation

	If you use this tokenizer in your research or project, please cite it as follows:

	### APA 7th Edition
	Khant Sint Heinn. (2026). myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer (Version 1.0) [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-Unigram

	### BibTeX
	```BibTeX
	@software{khantsintheinn2026unigram,
	author = {Khant Sint Heinn},
	title = {myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer},
	version = {1.0},
	year = {2026},
	publisher = {Hugging Face},
	url = {https://huggingface.co/DatarrX/myX-Tokenizer-Unigram},
	note = {Burmese-only training corpus}
	}
	```

	---

	# DatarrX - myX-Tokenizer-Unigram (Burmese)

	myX-Tokenizer-Unigram is a tokenizer built specifically for the Burmese language using the Unigram Language Model algorithm. It is published by [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX) and was primarily created by [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis).

	## 🎯 Objectives and Distinguishing Features

	* Unigram's advantage: Segments text on the basis of probability, unlike BPE, so that the output fits the structural character of written Burmese more closely.
	* Burmese specialization: Trained on Burmese text only, so that Burmese sentences are segmented more precisely along meaningful units.
	* Systematic training: Built to high quality standards from 1.5 million sentences.

	## 🛠️ Technical Information

	* Algorithm: Unigram Language Model.
	* Vocab Size: 64,000.
	* Normalization: NFKC.
	* Features: Includes Byte-fallback, Split Digits, and Dummy Prefix.

	### Dataset Used
	Uses 1.5 million cleaned Burmese sentences from [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data).

	## ⚠️ Limitations to Be Aware Of

	* Weak English support: Because this model is intended for Burmese only, it segments English words very poorly and tends to break them into very small pieces.
	* Orthographic standard: Because it is based on modern Burmese orthography, segmentation may differ for texts that make heavy use of Pali/Sanskrit.

	---

	## 💻 How to Use

	```python
	import sentencepiece as spm
	from huggingface_hub import hf_hub_download

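	# Download the SentencePiece model file from the Hugging Face Hub (cached locally after the first call).
	model_path = hf_hub_download(repo_id="DatarrX/myX-Tokenizer-Unigram", filename="myX-Tokenizer.model")
	# Load the model into a SentencePiece processor.
	sp = spm.SentencePieceProcessor(model_file=model_path)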

	text = "မြန်မာစာကို Unigram algorithm နဲ့ စနစ်တကျ ဖြတ်တောက်ကြည့်ခြင်း။"
	# Print the subword pieces produced for the mixed Burmese/English sample sentence.
	print(sp.encode_as_pieces(text))
	```
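
Beyond listing pieces, it can be useful to look at token IDs, a decode round-trip, and alternative segmentations sampled from the Unigram model (subword regularization). The helper below is a sketch: `inspect_tokenization` is a hypothetical name, and `sp` is assumed to be the `SentencePieceProcessor` loaded above.

```python
def inspect_tokenization(sp, text, n_samples=3):
    """Collect IDs, a decode round-trip, and sampled segmentations for `text`.

    `sp` is expected to behave like sentencepiece.SentencePieceProcessor.
    """
    ids = sp.encode(text, out_type=int)
    # With sampling enabled, the Unigram model draws a segmentation at random
    # instead of always returning the single best one.
    samples = [
        sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1)
        for _ in range(n_samples)
    ]
    return {
        "ids": ids,
        "roundtrip": sp.decode(ids),
        "pieces_per_char": len(ids) / max(len(text), 1),
        "samples": samples,
    }
```

Calling `inspect_tokenization(sp, text)` returns a dictionary. Because the tokenizer applies NFKC normalization, the `roundtrip` string may differ from the input wherever normalization changed characters. A high `pieces_per_char` value on Latin-script input is one sign of the over-splitting described under the limitations.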

	## ✍️ Project Authors
	- Developer: [Khant Sint Heinn (Kalix Louis)](https://huggingface.co/kalixlouiis)
	- Organization: [DatarrX (Myanmar Open Source NGO)](https://huggingface.co/DatarrX)

	## License 📜

	This project is licensed under the Apache License 2.0.

	### What does this mean?
	The Apache License 2.0 is a permissive license that allows you to:

	* Commercial Use: You can use this tokenizer for commercial purposes.
	* Modification: You can modify the model or the code for your specific needs.
	* Distribution: You can share and distribute the original or modified versions.
	* Sublicensing: You can grant sublicenses to others.

	### Conditions
	* Attribution: You must give appropriate credit to the author (Khant Sint Heinn) and the organization (DatarrX).
	* License Notice: You must include a copy of the license and any original copyright notice in your distribution.

	For more details, you can read the full license text at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0).