Update README.md
README.md
CHANGED
|
@@ -5,7 +5,7 @@ datasets:
|
|
| 5 |
- kalixlouiis/raw-data
|
| 6 |
language:
|
| 7 |
- my
|
| 8 |
-
new_version: DatarrX/myX-Tokenizer
|
| 9 |
pipeline_tag: feature-extraction
|
| 10 |
---
|
| 11 |
# DatarrX / myX-Tokenizer-BPE ⚙️
|
|
@@ -33,6 +33,26 @@ Trained on [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/ra
|
|
| 33 |
* **English Language Weakness:** Since this model was trained purely on Burmese data, it is notably weak in processing English text, often leading to excessive character-level fragmentation for Latin scripts.
|
| 34 |
* **BPE Nature:** Compared to our Unigram models, this BPE version may offer different segmentation logic which might affect certain downstream NLP tasks.
|
| 35 |
|
| 36 |
---
|
| 37 |
|
| 38 |
# DatarrX - myX-Tokenizer-BPE (Burmese) ⚙️
|
|
@@ -78,3 +98,23 @@ print(sp.encode_as_pieces(text))
|
|
| 78 |
# ✍️ Project Authors
|
| 79 |
- Developer: [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis)
|
| 80 |
- Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)
|
| 5 |
- kalixlouiis/raw-data
|
| 6 |
language:
|
| 7 |
- my
|
| 8 |
+
new_version: DatarrX/myX-Tokenizer
|
| 9 |
pipeline_tag: feature-extraction
|
| 10 |
---
|
| 11 |
# DatarrX / myX-Tokenizer-BPE ⚙️
|
|
|
|
| 33 |
* **English Language Weakness:** Since this model was trained purely on Burmese data, it is notably weak in processing English text, often leading to excessive character-level fragmentation for Latin scripts.
|
| 34 |
* **BPE Nature:** Compared to our Unigram models, this BPE version may offer different segmentation logic which might affect certain downstream NLP tasks.
|
| 35 |
|
| 36 |
+
## Citation
|
| 37 |
+
|
| 38 |
+
If you use this tokenizer in your research or project, please cite it as follows:
|
| 39 |
+
|
| 40 |
+
### APA 7th Edition
|
| 41 |
+
Khant Sint Heinn. (2026). *myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE
|
| 42 |
+
|
| 43 |
+
### BibTeX
|
| 44 |
+
```BibTeX
|
| 45 |
+
@software{khantsintheinn2026bpe,
|
| 46 |
+
author = {Khant Sint Heinn},
|
| 47 |
+
title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
|
| 48 |
+
version = {1.0},
|
| 49 |
+
year = {2026},
|
| 50 |
+
publisher = {Hugging Face},
|
| 51 |
+
url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
|
| 52 |
+
note = {BPE algorithm based on Burmese raw data}
|
| 53 |
+
}
|
| 54 |
+
```
|
| 55 |
+
|
| 56 |
---
|
| 57 |
|
| 58 |
# DatarrX - myX-Tokenizer-BPE (Burmese) ⚙️
|
|
|
|
| 98 |
# ✍️ Project Authors
|
| 99 |
- Developer: [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis)
|
| 100 |
- Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)
|
| 101 |
+
|
| 102 |
+
## Citation
|
| 103 |
+
|
| 104 |
+
If you use this model in your research, we kindly ask that you cite it as follows:
|
| 105 |
+
|
| 106 |
+
### APA 7th Edition
|
| 107 |
+
Khant Sint Heinn. (2026). *myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE
|
| 108 |
+
|
| 109 |
+
### BibTeX
|
| 110 |
+
```BibTeX
|
| 111 |
+
@software{khantsintheinn2026bpe,
|
| 112 |
+
author = {Khant Sint Heinn},
|
| 113 |
+
title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
|
| 114 |
+
version = {1.0},
|
| 115 |
+
year = {2026},
|
| 116 |
+
publisher = {Hugging Face},
|
| 117 |
+
url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
|
| 118 |
+
note = {BPE algorithm based on Burmese raw data}
|
| 119 |
+
}
|
| 120 |
+
```
|
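
The context lines in this diff note that the tokenizer fragments Latin-script text heavily. One rough way to check that claim is to count pieces per non-space character. The sketch below is only an illustration: `pieces_per_char` and `char_level_encode` are hypothetical helper names that are not part of the repository, and the stand-in encoder simply mimics the worst case of one piece per character. With the real model, `sp.encode_as_pieces` (the call shown in the README's usage snippet) could be passed in as `encode` instead.

```python
from typing import Callable, List


def pieces_per_char(text: str, encode: Callable[[str], List[str]]) -> float:
    """Return the number of pieces produced per non-space character.

    A value near 1.0 suggests character-level fragmentation;
    smaller values mean longer pieces on average.
    """
    chars = sum(1 for ch in text if not ch.isspace())
    if chars == 0:
        return 0.0
    return len(encode(text)) / chars


def char_level_encode(text: str) -> List[str]:
    # Hypothetical stand-in for a tokenizer that has no useful merges
    # for a script: it emits one piece per non-space character.
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    sample = "hello world"
    # Worst case: every character becomes its own piece.
    print(pieces_per_char(sample, char_level_encode))
    # Whitespace splitting as a coarse comparison point.
    print(pieces_per_char(sample, str.split))
```

Comparing the ratio for an English sentence against a Burmese sentence of similar length would give a quick, informal sense of the gap described in the limitations; it is not a substitute for a proper evaluation.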