---
inference: false
library_name: transformers
base_model: google/gemma-3-12b-it
language:
- uk
datasets:
- Goader/kobza
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- gemma-3-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---
# Lapa Tokenizer

Built with the same approach as Tereshchenko Blue, now trained on the full Kobza corpus.

By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
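Since tokens were swapped rather than added, the new vocabulary should be exactly the same size as Gemma-3's. A quick sanity check (a sketch; loading the Gemma-3 tokenizer requires accepting its license on the Hub):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
lapa = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# Both should report the same ~256K vocabulary: tokens were swapped, not added.
assert len(base) == len(lapa)
print(len(lapa))
```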
## How is this possible?
More than 16 of the world's most popular writing systems were analyzed. Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned.
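The exact analysis pipeline isn't published on this page, but per-script tallies like the table below can be approximated with the standard library alone. A rough sketch, assuming a token's script is the dominant Unicode-name prefix among its alphabetic characters (a heuristic for illustration, not the authors' method):

```python
import unicodedata
from collections import Counter

from transformers import AutoTokenizer

def dominant_script(token: str) -> str:
    # The first word of a character's Unicode name approximates its script,
    # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC", "CJK UNIFIED ..." -> "CJK".
    prefixes = []
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            prefixes.append(name.split()[0])
    return Counter(prefixes).most_common(1)[0][0] if prefixes else "OTHER"

tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
counts = Counter(dominant_script(t) for t in tok.get_vocab())
print(counts.most_common(25))  # rough per-script token counts of the vocabulary
```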
## Replaced tokens
| Writing system | Tokens removed | Tokens retained |
|---|---|---|
| Han (Chinese) | 16,488 | 4,122 |
| Devanagari (Hindi) | 10,976 | 2,743 |
| Bengali | 7,983 | 1,995 |
| Arabic | 6,730 | 1,682 |
| Hiragana / Katakana (Japanese) | 3,944 | 985 |
| Hangul (Korean) | 3,744 | 935 |
| Tamil | 3,080 | 770 |
| Thai | 1,740 | 435 |
| Malayalam | 1,566 | 391 |
| Telugu | 1,428 | 356 |
| Gujarati | 1,080 | 270 |
| Kannada | 1,016 | 253 |
| Ethiopic | 691 | 172 |
| Hebrew | 670 | 167 |
| Khmer | 481 | 119 |
| Sinhala | 435 | 108 |
| Myanmar | 410 | 102 |
| Lao | 243 | 60 |
| Gurmukhi | 215 | 53 |
| Tibetan | 107 | 26 |
| Oriya | 100 | 25 |
| Cyrillic | 13,398 | 0 |
| Gemma-3 `<unused-*>` | 6,139 | 102 |
## Feature overview
- +81,492 new Cyrillic BPE tokens trained on the full Kobza corpus plus the Cyrillic slice of the Crimean Tatar corpus.
- Only tokens from the **Replaced tokens** table above were replaced; no tokens from any other writing system were affected.
- Unchanged tokens keep their original IDs, enabling direct reuse of Gemma-3 embeddings (see the sketch after this list).
- Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
- Reasoning tokens
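Because unchanged tokens keep their IDs, reusing Gemma-3's input embeddings reduces to re-initializing only the rows whose tokens were replaced. Below is a minimal sketch of that idea; the mean/std re-initialization is an assumption for illustration, not a documented recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
lapa_tok = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")

emb = model.get_input_embeddings().weight.data  # (vocab_size, hidden_dim)

# IDs whose surface form differs between the two vocabularies were replaced
# and need fresh embeddings; every other row can be reused as-is.
changed_ids = [
    i for tok, i in lapa_tok.get_vocab().items()
    if base_tok.convert_ids_to_tokens(i) != tok
]

mean, std = emb.mean(dim=0), emb.std(dim=0)
for i in changed_ids:
    emb[i] = torch.normal(mean, std)  # naive mean/std draw; other schemes exist
```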
## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# "Всі красиві зберігають оптимізм" ("All the beautiful ones keep their optimism")
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
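For comparison, the same sentence run through the original Gemma-3 tokenizer side by side (a quick sketch; exact token counts depend on the input):

```python
from transformers import AutoTokenizer

text = "Всі красиві зберігають оптимізм"
for name in ("google/gemma-3-12b-it", "lapa-llm/tokenizer"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text, add_special_tokens=False).input_ids
    print(f"{name}: {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")
```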