---
inference: false
library_name: transformers
base_model: google/gemma-3-12b-it
language:
- uk
datasets:
- Goader/kobza
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- gemma-3-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---

### By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.

### How is this possible
More than 16 of the world's most popular writing systems were analyzed.
Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned, yielding the per-script counts in the table below.
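As a rough illustration of how such an audit could work, here is a minimal sketch that classifies vocabulary tokens by Unicode code-point range. The `SCRIPT_RANGES` subset and the `script_of` helper are illustrative assumptions, not the released analysis code, and `google/gemma-3-12b-it` is a gated repository, so this assumes you have access:

```python
# Illustrative sketch: audit a tokenizer vocabulary by Unicode script.
# The ranges below are a small assumed subset; a real analysis would
# use full Unicode script data.
from collections import Counter
from transformers import AutoTokenizer

SCRIPT_RANGES = {  # (start, end) code-point ranges, assumed subset
    "Han": [(0x4E00, 0x9FFF)],
    "Bengali": [(0x0980, 0x09FF)],
    "Thai": [(0x0E00, 0x0E7F)],
    "Hangul": [(0xAC00, 0xD7AF), (0x1100, 0x11FF)],
    "Cyrillic": [(0x0400, 0x04FF)],
}

def script_of(token: str) -> str | None:
    """Return the first script whose ranges match any character of the token."""
    for ch in token:
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= ord(ch) <= hi for lo, hi in ranges):
                return name
    return None

tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
counts = Counter(script_of(t) for t in tok.get_vocab())
for name, n in counts.most_common():
    if name is not None:
        # pruning ~4/5 of a script would free roughly n - n // 5 slots
        print(f"{name}: {n} tokens, ~{n - n // 5} candidates to prune")
```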

### Replaced tokens
|Writing system|Tokens removed|Tokens retained|
|-|-|-|
|Han (Chinese)|16,488|4,122|
|Devanagari (Hindi)|10,976|2,743|
|Bengali|7,983|1,995|
|Arabic|6,730|1,682|
|Hiragana / Katakana (Japanese)|3,944|985|
|Hangul (Korean)|3,744|935|
|Tamil|3,080|770|
|Thai|1,740|435|
|Malayalam|1,566|391|
|Telugu|1,428|356|
|Gujarati|1,080|270|
|Kannada|1,016|253|
|Ethiopic|691|172|
|Hebrew|670|167|
|Khmer|481|119|
|Sinhala|435|108|
|Myanmar|410|102|
|Lao|243|60|
|Gurmukhi|215|53|
|Tibetan|107|26|
|Oriya|100|25|
|Cyrillic|13,398|0|
|Gemma-3 \<unused-*\>|6,139|102|

## Feature Overview:

1. +81,492 new Cyrillic BPE tokens trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only the tokens listed in the `Replaced tokens` table were replaced; no tokens from any other writing system were affected.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings (see the sketch after this list).
4. Vocabulary size, special-token set, pre/post-tokenization logic, and output formatting match Gemma-3 one-for-one.
5. Reasoning tokens `<think>` and `</think>` are included.
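
Because unchanged tokens keep their original IDs, the corresponding rows of a Gemma-3 embedding matrix can be carried over directly. A rough sketch of that idea follows; the variable names, the assumed hidden size, and the mean-initialization scheme for new rows are illustrative assumptions, not a prescribed recipe:

```python
# Sketch: reuse Gemma-3 embedding rows for token IDs whose string is
# unchanged, and re-initialize rows for the newly added Cyrillic tokens.
import torch
from transformers import AutoTokenizer

old_tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
new_tok = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

old_vocab, new_vocab = old_tok.get_vocab(), new_tok.get_vocab()

# IDs whose token string survived the replacement keep their row.
kept_ids = [i for t, i in new_vocab.items() if old_vocab.get(t) == i]

hidden = 3840  # assumed hidden size; read it from the model config in practice
old_emb = torch.randn(len(old_vocab), hidden)        # stand-in for real weights
new_emb = old_emb.mean(0).repeat(len(new_vocab), 1)  # init new rows at the mean
new_emb[kept_ids] = old_emb[kept_ids]                # copy preserved rows verbatim

print(f"reused {len(kept_ids)} of {len(new_vocab)} rows")
```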

## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
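To see the gain concretely, one could compare the count against the stock Gemma-3 tokenizer on the same sentence (a sketch; `google/gemma-3-12b-it` is gated, so this assumes you have access, and the stock token count is not claimed here):

```python
from transformers import AutoTokenizer

text = "Всі красиві зберігають оптимізм"
for repo in ("lapa-llm/tokenizer", "google/gemma-3-12b-it"):
    tok = AutoTokenizer.from_pretrained(repo)
    n = len(tok(text, add_special_tokens=False).input_ids)
    print(f"{repo}: {n} tokens")
```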
## This tokenizer uses the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).

### "fixed" means that we removed the condition that allowed adding an empty `<think></think>` pair for the hybrid approach. This significantly speeds up tokenization.