Commit 134468a by kalixlouiis (verified) · 1 parent: c67e29c

Update README.md

Files changed (1)
  1. README.md +41 -1
README.md CHANGED
@@ -5,7 +5,7 @@ datasets:
  - kalixlouiis/raw-data
  language:
  - my
- new_version: DatarrX/myX-Tokenizer-Unigram
+ new_version: DatarrX/myX-Tokenizer
  pipeline_tag: feature-extraction
  ---
  # DatarrX / myX-Tokenizer-BPE ⚙️
@@ -33,6 +33,26 @@ Trained on [kalixlouiis/raw-data](https://huggingface.co/datasets/kalixlouiis/ra
  * **English Language Weakness:** Since this model was trained purely on Burmese data, it is notably weak in processing English text, often leading to excessive character-level fragmentation for Latin scripts.
  * **BPE Nature:** Compared to our Unigram models, this BPE version may offer different segmentation logic which might affect certain downstream NLP tasks.
 
+ ## Citation
+
+ If you use this tokenizer in your research or project, please cite it as follows:
+
+ ### APA 7th Edition
+ Khant Sint Heinn. (2026). *myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE
+
+ ### BibTeX
+ ```BibTeX
+ @software{khantsintheinn2026bpe,
+ author = {Khant Sint Heinn},
+ title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
+ version = {1.0},
+ year = {2026},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
+ note = {BPE algorithm based on Burmese raw data}
+ }
+ ```
+
  ---
 
  # DatarrX - myX-Tokenizer-BPE (မြန်မာဘာသာ) ⚙️
@@ -78,3 +98,23 @@ print(sp.encode_as_pieces(text))
  # ✍️ Project Authors
  - Developer: [**Khant Sint Heinn (Kalix Louis)**](https://huggingface.co/kalixlouiis)
  - Organization: [**DatarrX (Myanmar Open Source NGO)**](https://huggingface.co/DatarrX)
+
+ ## Citation
+
+ If you use this model in your research, we kindly ask that you cite it as follows.
+
+ ### APA 7th Edition
+ Khant Sint Heinn. (2026). *myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese (Version 1.0)* [Computer software]. Hugging Face. https://huggingface.co/DatarrX/myX-Tokenizer-BPE
+
+ ### BibTeX
+ ```BibTeX
+ @software{khantsintheinn2026bpe,
+ author = {Khant Sint Heinn},
+ title = {myX-Tokenizer-BPE: Byte Pair Encoding Baseline for Burmese},
+ version = {1.0},
+ year = {2026},
+ publisher = {Hugging Face},
+ url = {https://huggingface.co/DatarrX/myX-Tokenizer-BPE},
+ note = {BPE algorithm based on Burmese raw data}
+ }
+ ```