---
license: apache-2.0
datasets:
- kalixlouiis/raw-data
language:
- my
pipeline_tag: feature-extraction
new_version: DatarrX/myX-Tokenizer
---
# DatarrX - myX-Tokenizer-Unigram ⚙️

**myX-Tokenizer-Unigram** is a tokenizer for the Burmese language based on the **Unigram Language Model** algorithm, which segments text into subwords probabilistically. It was developed by [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis) under [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX).

## 🎯 Objectives & Characteristics

* **Unigram Excellence:** Uses probabilistic subword tokenization, which often aligns better with the morphological structure of Burmese than BPE does.
* **Native Burmese Specialist:** Trained exclusively on Burmese text (1.5 million cleaned sentences) for reliable handling of Burmese script.
* **Optimized Efficiency:** Trained on a sampled, cleaned subset of the source dataset to balance tokenization quality against model size.
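
To make "probabilistic segmentation" concrete: a Unigram tokenizer assigns each vocabulary piece a probability and picks the segmentation whose pieces have the highest combined probability. The sketch below shows that search (Viterbi decoding) in plain Python. The pieces and probabilities in `TOY_LOG_PROBS` are invented for illustration and are not taken from the released model; `viterbi_segment` is a hypothetical helper, not part of SentencePiece.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (pieces, score) for the most probable segmentation of `text`."""
    n = len(text)
    max_len = max(len(piece) for piece in log_probs)
    # best_score[i]: best log-probability of any segmentation of text[:i]
    # back[i]: start index of the last piece in that segmentation
    best_score = [-math.inf] * (n + 1)
    back = [-1] * (n + 1)
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece not in log_probs or best_score[start] == -math.inf:
                continue
            score = best_score[start] + log_probs[piece]
            if score > best_score[end]:
                best_score[end] = score
                back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the text to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best_score[n]

# Toy pieces and probabilities, invented for illustration only.
MYAN, MA, SA = "မြန်", "မာ", "စာ"
TOY_LOG_PROBS = {
    MYAN + MA: math.log(0.05),  # the two syllables as a single piece
    MYAN: math.log(0.03),
    MA: math.log(0.02),
    SA: math.log(0.04),
}

pieces, score = viterbi_segment(MYAN + MA + SA, TOY_LOG_PROBS)
print(pieces)
```

With these toy numbers the two-piece split wins over the three-piece split because 0.05 × 0.04 is larger than 0.03 × 0.02 × 0.04. A real Unigram model learns its piece probabilities from the training corpus.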

## 🛠️ Technical Specifications

* **Algorithm:** Unigram Language Model.
* **Vocabulary Size:** 64,000.
* **Normalization:** NFKC.
* **Features:** Byte-fallback, Split Digits, and Dummy Prefix.
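
The listed features can be approximated in a few lines of standard-library Python. This is a rough sketch of what each option does conceptually, not SentencePiece's actual implementation. All function names are hypothetical, and the digit rule (ASCII `0`-`9` only) is an assumption, since the card does not say how Myanmar digits are treated.

```python
import unicodedata

WORD_BOUNDARY = "\u2581"  # "▁", the marker SentencePiece uses in place of spaces

def normalize_nfkc(text):
    """NFKC folds compatibility forms, e.g. full-width digits become ASCII digits."""
    return unicodedata.normalize("NFKC", text)

def add_dummy_prefix(text):
    """Mark spaces with the boundary symbol and prepend one at the start."""
    return WORD_BOUNDARY + text.replace(" ", WORD_BOUNDARY)

def split_digits(pieces):
    """Split every ASCII digit into its own piece (assumed digit rule)."""
    out = []
    for piece in pieces:
        buffer = ""
        for ch in piece:
            if ch in "0123456789":
                if buffer:
                    out.append(buffer)
                    buffer = ""
                out.append(ch)
            else:
                buffer += ch
        if buffer:
            out.append(buffer)
    return out

def byte_fallback(piece):
    """Represent an out-of-vocabulary piece as UTF-8 byte tokens such as <0xE2>."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

print(normalize_nfkc("\uff12\uff10\uff12\uff16"))
print(add_dummy_prefix("a b"))
print(split_digits(["ab12c"]))
print(byte_fallback("\u20ac"))
```

Byte-fallback is what lets the tokenizer encode characters missing from its vocabulary without emitting an unknown token, at the cost of several byte pieces per character.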

### Training Data
Trained on **1.5 million** cleaned Burmese sentences drawn from the [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data) dataset.
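
For readers who want to train a comparable tokenizer, the specifications above map onto SentencePiece trainer options roughly as follows. This is a sketch under assumptions: the card does not publish the actual training command, so `corpus_path`, `model_prefix`, and the choice of `"nfkc"` as the normalization rule name are guesses, and any option not listed in the specifications is left at its library default.

```python
def build_trainer_args(corpus_path, model_prefix):
    """Collect SentencePiece trainer options matching the card's specifications."""
    return {
        "input": corpus_path,               # one sentence per line (assumed layout)
        "model_prefix": model_prefix,       # writes <prefix>.model and <prefix>.vocab
        "model_type": "unigram",
        "vocab_size": 64000,
        "normalization_rule_name": "nfkc",  # assumption: plain NFKC rule
        "byte_fallback": True,
        "split_digits": True,
        "add_dummy_prefix": True,
    }

def train(corpus_path, model_prefix):
    """Run training; sentencepiece is imported lazily so the config can be inspected without it."""
    import sentencepiece as spm
    spm.SentencePieceTrainer.train(**build_trainer_args(corpus_path, model_prefix))

args = build_trainer_args("burmese_sentences.txt", "my_unigram")
print(args["model_type"], args["vocab_size"])
```

Calling `train("burmese_sentences.txt", "my_unigram")` would start an actual training run; both file names here are placeholders.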

## ⚠️ Important Considerations (Limitations)

* **Limited English Support:** This model is strictly a Burmese script specialist. English text is handled poorly: Latin-script words may be split into many short subword pieces.
* **Script Sensitivity:** Optimized for modern Burmese script; performance may vary with older orthography or heavy use of specialized Pali/Sanskrit loanwords.
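
One simple way to check how strongly a given text is fragmented is to measure the average number of characters per piece: the lower the value, the more splitting occurs. The helper below is a hypothetical sketch that accepts any function returning a list of pieces. The character-level stand-in tokenizer is only there so the example runs without downloading the model.

```python
def chars_per_piece(encode, text):
    """Average number of characters covered by each piece returned by `encode`."""
    pieces = encode(text)
    if not pieces:
        raise ValueError("tokenizer returned no pieces")
    return len(text) / len(pieces)

def char_level_encode(text):
    """Stand-in tokenizer for demonstration: one piece per non-space character."""
    return [ch for ch in text if not ch.isspace()]

ratio = chars_per_piece(char_level_encode, "Unigram algorithm")
print(round(ratio, 2))
```

With the real tokenizer loaded as a SentencePiece processor, passing its `encode_as_pieces` method as `encode` and comparing the ratio on Burmese and English samples gives a quick view of the limitation described above.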

## Citation

If you use this tokenizer in your research or project, please cite it as follows:

### APA 7th Edition
Khant Sint Heinn. (2026). *myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-Unigram

### BibTeX
```BibTeX
@software{khantsintheinn2026unigram,
  author = {Khant Sint Heinn},
  title = {myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer},
  version = {1.0},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DatarrX/myX-Tokenizer-Unigram},
  note = {Burmese-only training corpus}
}
```

---

# DatarrX - myX-Tokenizer-Unigram (Burmese Version)

**myX-Tokenizer-Unigram** is a tokenizer built specifically for the Burmese language using the Unigram Language Model algorithm. The model is published by [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX) and was primarily created by [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis).

## 🎯 Objectives & Characteristics

* **Unigram's Advantage:** Segments text on the basis of probability, which is intended to match the morphological nature of Burmese more closely than BPE.
* **Burmese Specialist:** Trained only on Burmese text so that it can segment Burmese text more accurately along meaningful units.
* **Systematic Training:** Built to high quality standards using 1.5 million sentences.

## 🛠️ Technical Specifications

* **Algorithm:** Unigram Language Model.
* **Vocab Size:** 64,000.
* **Normalization:** NFKC.
* **Features:** Includes Byte-fallback, Split Digits, and Dummy Prefix.

### Dataset Used
Uses **1.5 million** cleaned Burmese sentences from [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data).

## ⚠️ Limitations to Be Aware Of

* **Weak English Support:** Because this model is intended only for Burmese, it is very weak at segmenting English words, which tend to break into very small pieces.
* **Orthographic Standard:** Because it is based on modern Burmese orthography, segmentation may differ on text with heavy Pali/Sanskrit usage.

---

## 💻 How to Use

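```python
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the SentencePiece model file from the Hugging Face Hub
# (the file is cached locally after the first call).
model_path = hf_hub_download(
    repo_id="DatarrX/myX-Tokenizer-Unigram",
    filename="myX-Tokenizer.model",
)

# Load the tokenizer from the downloaded model file.
sp = spm.SentencePieceProcessor(model_file=model_path)

# Segment a mixed Burmese/English sentence into subword pieces.
text = "မြန်မာစာကို Unigram algorithm နဲ့ စနစ်တကျ ဖြတ်တောက်ကြည့်ခြင်း။"
print(sp.encode_as_pieces(text))
```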

## ✍️ Project Authors
- Developer: [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis)
- Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)

## License 📜

This project is licensed under the **Apache License 2.0**.

### What does this mean?
The Apache License 2.0 is a permissive license that allows you to:

* **Commercial Use:** You can use this tokenizer for commercial purposes.
* **Modification:** You can modify the model or the code for your specific needs.
* **Distribution:** You can share and distribute the original or modified versions.
* **Sublicensing:** You can grant sublicenses to others.

### Conditions:
* **Attribution:** You must give appropriate credit to the author (**Khant Sint Heinn**) and the organization (**DatarrX**).
* **License Notice:** You must include a copy of the license and any original copyright notice in your distribution.

For more details, you can read the full license text at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0).