Update README.md
README.md CHANGED
@@ -31,6 +31,24 @@ Trained on the [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouii
 * **Limited English Support:** This model is strictly a Burmese script specialist. It has significant limitations in processing English text, which may result in excessive subword splitting for Latin characters.
 * **Script Sensitivity:** Optimized for modern Burmese script; performance may vary with older orthography or heavy use of specialized Pali/Sanskrit loanwords.
 
+## Citation
+
+If you use this tokenizer in your research or project, please cite it as follows:
+
+### APA 7th Edition
+Khant Sint Heinn. (2026). *myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-Unigram
+
+### BibTeX
+@software{khantsintheinn2026unigram,
+author = {Khant Sint Heinn},
+title = {myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer},
+version = {1.0},
+year = {2026},
+publisher = {Hugging Face},
+url = {https://huggingface.co/DatarrX/myX-Tokenizer-Unigram},
+note = {Burmese-only training corpus}
+}
+
 ---
 
 # DatarrX - myX-Tokenizer-Unigram (Burmese)
@@ -75,4 +93,22 @@ print(sp.encode_as_pieces(text))
 
 # ✍️ Project Authors
 - Developer: [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis)
-- Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)
+- Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)
+
+## Citation
+
+If you have used this model in your research work, we kindly request that you cite it as follows.
+
+### APA 7th Edition
+Khant Sint Heinn. (2026). *myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-Unigram
+
+### BibTeX
+@software{khantsintheinn2026unigram,
+author = {Khant Sint Heinn},
+title = {myX-Tokenizer-Unigram: Probabilistic Burmese Script Tokenizer},
+version = {1.0},
+year = {2026},
+publisher = {Hugging Face},
+url = {https://huggingface.co/DatarrX/myX-Tokenizer-Unigram},
+note = {Burmese-only training corpus}
+}