---
library_name: transformers
license: apache-2.0
datasets:
- kalixlouiis/raw-data
language:
- my
new_version: DatarrX/myX-Tokenizer
pipeline_tag: feature-extraction
---
# DatarrX / myX-Tokenizer-BPE ⚙️

**myX-Tokenizer-BPE** is a Byte Pair Encoding (BPE) tokenizer trained specifically for the Burmese language. Developed by [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis) under [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX), it serves as a BPE baseline for Burmese NLP tasks.

## 🎯 Objectives & Characteristics

* **BPE Baseline:** Provides a standard BPE-based segmentation for Burmese text.
* **Burmese Focus:** Trained exclusively on Burmese text, so it is specialized for Burmese script.
* **Memory Efficiency:** Trained with a RAM-efficient approach on a corpus of 1.5 million sentences.

## 🛠️ Technical Specifications

* **Algorithm:** Byte Pair Encoding (BPE).
* **Vocabulary Size:** 64,000.
* **Normalization:** NFKC.
* **Features:** Byte-fallback, Split Digits, and Dummy Prefix.
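
To make the listed features concrete, here is a minimal, standard-library-only sketch of what each one does to input text. It is an approximation for illustration, not the tokenizer's actual implementation: it uses Python's `unicodedata` NFKC (SentencePiece ships its own normalization rules, which can differ in edge cases), and the helper names `normalize_nfkc`, `add_dummy_prefix`, `split_digits`, and `byte_fallback` are hypothetical.

```python
import re
import unicodedata

SPACE_MARK = "\u2581"  # "▁", the marker SentencePiece uses for whitespace


def normalize_nfkc(text):
    """Fold compatibility characters (e.g. full-width letters) to canonical forms."""
    return unicodedata.normalize("NFKC", text)


def add_dummy_prefix(text):
    """Prepend a whitespace marker and mark the remaining spaces the same way."""
    return SPACE_MARK + text.replace(" ", SPACE_MARK)


def split_digits(text):
    """Isolate every ASCII digit so a number is never merged into one piece."""
    return re.findall(r"[0-9]|[^0-9]+", text)


def byte_fallback(char):
    """Represent an out-of-vocabulary character as its UTF-8 bytes."""
    return [f"<0x{b:02X}>" for b in char.encode("utf-8")]


normalized = normalize_nfkc("\uff22\uff30\uff25")  # full-width "ＢＰＥ" -> "BPE"
prefixed = add_dummy_prefix("BPE baseline")        # "▁BPE▁baseline"

print(normalized)
print(split_digits("v2026"))
print(byte_fallback("\u00e9"))  # "é" occupies two bytes in UTF-8
```

In the real tokenizer these steps happen inside SentencePiece; the sketch only shows the effect each option has on the text before and during segmentation.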

### Training Data
Trained on **1.5 million** cleaned, Burmese-only sentences drawn from [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data).
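
The exact training command is not published in this card. As a hedged sketch, the specifications listed above map onto standard SentencePiece trainer options roughly as follows. The option names are existing SentencePiece flags, but the mapping itself is an assumption, the paths `corpus.txt` and `myX-Tokenizer` are placeholders, and any memory-related settings the authors used are unknown and therefore omitted.

```python
# Options inferred from the Technical Specifications list (assumed mapping).
train_options = {
    "model_type": "bpe",
    "vocab_size": 64000,
    "normalization_rule_name": "nfkc",
    "byte_fallback": True,
    "split_digits": True,
    "add_dummy_prefix": True,
}


def to_cli_flags(options):
    """Render options as `--key=value` flags in the style used by `spm_train`."""
    parts = []
    for key, value in options.items():
        if isinstance(value, bool):
            value = str(value).lower()
        parts.append(f"--{key}={value}")
    return " ".join(parts)


print(to_cli_flags(train_options))

# With the sentencepiece package installed, the same options could be passed as
# keyword arguments (both paths are placeholders):
#   import sentencepiece as spm
#   spm.SentencePieceTrainer.train(
#       input="corpus.txt", model_prefix="myX-Tokenizer", **train_options)
```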

## ⚠️ Important Considerations (Limitations)

* **English Language Weakness:** Because this model was trained purely on Burmese data, it handles English poorly: Latin-script text is often fragmented into single characters.
* **BPE Nature:** This BPE version segments text differently from our Unigram models, which may affect downstream NLP tasks. Choose between them based on your task.
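
One way to quantify the fragmentation described above is to count pieces per character: a value near 1.0 means the tokenizer is producing roughly one piece per character. The sketch below is a hypothetical helper (`pieces_per_char` is not part of any library), and the piece lists are hand-written illustrations rather than real output from this tokenizer. In practice you would pass the result of `sp.encode_as_pieces(text)`.

```python
SPACE_MARK = "\u2581"  # whitespace marker used in SentencePiece pieces


def pieces_per_char(pieces, text):
    """Average number of pieces per non-space character of `text`."""
    n_chars = sum(1 for ch in text if not ch.isspace())
    # Ignore pieces that consist only of the whitespace marker.
    n_pieces = sum(1 for p in pieces if p.strip(SPACE_MARK))
    return n_pieces / n_chars if n_chars else 0.0


# Hand-written, illustrative piece lists (NOT actual tokenizer output).
fragmented = [SPACE_MARK, "t", "o", "k", "e", "n"]
compact = [SPACE_MARK + "tok", "en"]

print(pieces_per_char(fragmented, "token"))  # 1.0: one piece per character
print(pieces_per_char(compact, "token"))     # 0.4: two pieces for five characters
```

Note that characters are counted as Unicode code points, so a single Burmese syllable, which is usually written with several code points, counts as several characters.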

## Citation

If you use this tokenizer in your research or project, please cite it as follows:

### APA 7th Edition
Khant Sint Heinn. (2026). *myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE

### BibTeX
```BibTeX
@software{khantsintheinn2026bpe,
  author = {Khant Sint Heinn},
  title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
  version = {1.0},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
  note = {BPE algorithm based on Burmese raw data}
}
```

---

# DatarrX - myX-Tokenizer-BPE (မြန်မာဘာသာ) ⚙️

**myX-Tokenizer-BPE** သည် Byte Pair Encoding (BPE) algorithm ကို အသုံးပြု၍ မြန်မာဘာသာစကားအတွက် အထူးရည်ရွယ် တည်ဆောက်ထားသော Tokenizer ဖြစ်ပါသည်။ ဤ Model ကို [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX) မှ ထုတ်ဝေခြင်းဖြစ်ပြီး [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis) မှ အဓိက ဖန်တီးထားခြင်း ဖြစ်ပါသည်။

## 🎯 ရည်ရွယ်ချက်နှင့် ထူးခြားချက်များ

* **BPE အခြေခံ:** မြန်မာစာသားများကို BPE နည်းပညာဖြင့် ဖြတ်တောက်ရာတွင် စံနှုန်းတစ်ခုအဖြစ် အသုံးပြုနိုင်ရန်။
* **မြန်မာစာ သီးသန့်:** ဤ Model ကို မြန်မာစာသား သီးသန့်ဖြင့်သာ လေ့ကျင့်ထားသဖြင့် ဗမာ(မြန်မာ)စာအရေးအသားများအတွက် အထူးပြုထားပါသည်။
* **အရည်အသွေးမြင့် Training:** စာကြောင်းပေါင်း ၁.၅ သန်းကို အသုံးပြု၍ RAM-efficient ဖြစ်သော နည်းလမ်းဖြင့် တည်ဆောက်ထားပါသည်။

## 🛠️ နည်းပညာဆိုင်ရာ အချက်အလက်များ

* **Algorithm:** Byte Pair Encoding (BPE)။
* **Vocab Size:** 64,000။
* **Normalization:** NFKC။
* **Features:** Byte-fallback, Split Digits နှင့် Dummy Prefix အင်္ဂါရပ်များ ပါဝင်ပါသည်။

### အသုံးပြုထားသော Dataset
[kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/raw-data) ထဲမှ သန့်စင်ပြီးသား မြန်မာစာကြောင်းပေါင်း **၁.၅ သန်း (1.5 Million)** ကို အသုံးပြုထားပါသည်။

## ⚠️ သိထားရန် ကန့်သတ်ချက်များ

* **အင်္ဂလိပ်စာ အားနည်းမှု:** ဤ Model ကို မြန်မာစာ သီးသန့်ဖြင့်သာ Train ထားခြင်းကြောင့် အင်္ဂလိပ်စာလုံးများကို ဖြတ်တောက်ရာတွင် အလွန်အားနည်းပြီး စာလုံးတစ်လုံးချင်းစီ ကွဲထွက်သွားတတ်ပါသည်။
* **BPE ၏ သဘာဝ:** ကျွန်တော်တို့၏ Unigram model များနှင့် ယှဉ်ပါက ဖြတ်တောက်ပုံခြင်း ကွဲပြားနိုင်သဖြင့် မိမိအသုံးပြုမည့် task အပေါ် မူတည်၍ ရွေးချယ်ရန် လိုအပ်ပါသည်။

---

## 💻 How to Use (အသုံးပြုနည်း)

```python
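# Requires: pip install sentencepiece huggingface_hub
import sentencepiece as spm
from huggingface_hub import hf_hub_download

# Download the SentencePiece model file from the Hugging Face Hub (cached locally).
model_path = hf_hub_download(repo_id="DatarrX/myX-Tokenizer-BPE", filename="myX-Tokenizer.model")
sp = spm.SentencePieceProcessor(model_file=model_path)

text = "မြန်မာစာကို BPE algorithm နဲ့ ဖြတ်တောက်ကြည့်ခြင်း။"

# Segment the text into subword pieces.
print(sp.encode_as_pieces(text))

# Encode to vocabulary IDs, then decode back to text.
ids = sp.encode(text)
print(sp.decode(ids))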
```

## ✍️ Project Authors
- Developer: [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis)
- Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)

## License 📜

This project is licensed under the **Apache License 2.0**.

### What does this mean?
The Apache License 2.0 is a permissive license that allows you to:

* **Commercial Use:** You can use this tokenizer for commercial purposes.
* **Modification:** You can modify the model or the code for your specific needs.
* **Distribution:** You can share and distribute the original or modified versions.
* **Sublicensing:** You can grant sublicenses to others.

### Conditions:
* **Attribution:** You must give appropriate credit to the author (**Khant Sint Heinn**) and the organization (**DatarrX**).
* **License Notice:** You must include a copy of the license and any original copyright notice in your distribution.

For more details, you can read the full license text at [http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0).