---
inference: false
library_name: transformers
base_model: google/gemma-3-12b-it
language:
- uk
datasets:
- Goader/kobza
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- gemma-3-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---
### Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).
By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
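Because tokens are swapped in place rather than appended, the total vocabulary size should match the base tokenizer exactly. A minimal check (assuming access to the gated `google/gemma-3-12b-it` repository):

```python
from transformers import AutoTokenizer

lapa = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
gemma = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")

# Tokens were replaced in place, not appended, so the sizes are equal.
print(len(lapa), len(gemma))  # expected: identical vocabulary sizes
```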
### How is this possible?
More than 16 of the world's most widely used writing systems were analyzed.
Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned.
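The exact pruning procedure is not published here, but the idea can be illustrated with a rough Unicode-based heuristic. A sketch (the keyword list and function are illustrative only, not the method actually used):

```python
import unicodedata

# Illustrative keyword list for scripts distant from Ukraine; the real
# selection was done per writing system, as the table below shows.
DISTANT_SCRIPTS = ("CJK", "BENGALI", "THAI", "HIRAGANA", "KATAKANA", "HANGUL")

def in_distant_script(token: str) -> bool:
    """True if any character's Unicode name mentions a distant script."""
    return any(
        any(kw in unicodedata.name(ch, "") for kw in DISTANT_SCRIPTS)
        for ch in token
    )

print(in_distant_script("你好"))    # True  -> pruning candidate
print(in_distant_script("слово"))  # False -> Cyrillic, handled separately
```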
### Replaced tokens
|Writing system|Tokens removed|Tokens retained|
|-|-|-|
|Han (Chinese)|16,488|4,122|
|Devanagari (Hindi)|10,976|2,743|
|Bengali|7,983|1,995|
|Arabic|6,730|1,682|
|Hiragana / Katakana (Japanese)|3,944|985|
|Hangul (Korean)|3,744|935|
|Tamil|3,080|770|
|Thai|1,740|435|
|Malayalam|1,566|391|
|Telugu|1,428|356|
|Gujarati|1,080|270|
|Kannada|1,016|253|
|Ethiopic|691|172|
|Hebrew|670|167|
|Khmer|481|119|
|Sinhala|435|108|
|Myanmar|410|102|
|Lao|243|60|
|Gurmukhi|215|53|
|Tibetan|107|26|
|Oriya|100|25|
|Cyrillic|13,398|0|
|Gemma-3 \<unused-*\>|6,139|102|
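Text in the pruned scripts remains encodable: like the base Gemma-3 tokenizer, characters without a dedicated token fall back to byte-level pieces, just less compactly. A hedged illustration (exact token counts will vary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# Thai kept only ~20% of its tokens, so Thai text tokenizes into a longer
# mix of retained pieces and byte-fallback tokens, but never fails.
ids = tokenizer("สวัสดี", add_special_tokens=False).input_ids
print(len(ids), tokenizer.convert_ids_to_tokens(ids))
```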
## Feature Overview
1. +81,492 new Cyrillic BPE tokens trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza) plus the Cyrillic slice of the [Crimean Tatar corpus](https://huggingface.co/datasets/QIRIM/crh_monocorpus).
2. Only tokens from the `Replaced tokens` table above were replaced; tokens from all other writing systems were left untouched.
3. Unchanged tokens preserve their IDs, enabling direct reuse of Gemma-3 embeddings (see the verification sketch after this list).
4. Vocabulary size, special-token set, pre-/post-tokenization logic, and output formatting match Gemma-3 one-for-one.
5. Reasoning tokens `<think>` and `</think>` are included.
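Point 3 can be verified directly: every token present in both vocabularies should map to the same ID. A minimal sketch (assuming access to the gated base tokenizer):

```python
from transformers import AutoTokenizer

gemma = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
lapa = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

gv, lv = gemma.get_vocab(), lapa.get_vocab()
shared = set(gv) & set(lv)

# Preserved tokens keep identical IDs, so Gemma-3 embedding rows for these
# positions can be copied over unchanged; only new tokens need retraining.
print(len(shared), all(gv[t] == lv[t] for t in shared))
```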
## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
### "fixed" - means that we remove condition that allow to add empty `<think></think>` for hybrid approach. This significantly speeds up tokenization.