---
inference: false
library_name: transformers
base_model: google/gemma-3-12b-it
language:
- uk
datasets:
- Goader/kobza
- QIRIM/crh_monocorpus
multilinguality:
- multilingual
tags:
- gemma-3-tokenizer
- ukraine
- corpus-linguistics
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---
# Lapa Tokenizer

Built with the same approach as Tereshchenko Blue, now trained on the full Kobza corpus.

By adding more than 80K Ukrainian tokens without removing any English or EU-language tokens, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
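Since tokens were swapped rather than added, the new vocabulary should be exactly the same size as Gemma-3's. A quick sanity check (a sketch; loading the Gemma-3 tokenizer requires accepting its license on the Hub):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
lapa = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# Both should report the same ~256K vocabulary: tokens were swapped, not added.
assert len(base) == len(lapa)
print(len(lapa))
```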
## How is this possible?
More than 16 of the world's most popular writing systems were analyzed. Roughly four-fifths of the tokens in scripts geographically and culturally distant from Ukraine (for example Bengali, Thai, Chinese, Japanese, and Korean) were pruned.
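The exact analysis pipeline isn't published on this page, but per-script tallies like the table below can be approximated with the standard library alone. A rough sketch, assuming a token's script is the dominant Unicode-name prefix among its alphabetic characters (a heuristic for illustration, not the authors' method):

```python
import unicodedata
from collections import Counter

from transformers import AutoTokenizer

def dominant_script(token: str) -> str:
    # The first word of a character's Unicode name approximates its script,
    # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC", "CJK UNIFIED ..." -> "CJK".
    prefixes = []
    for ch in token:
        if not ch.isalpha():
            continue
        name = unicodedata.name(ch, "")
        if name:
            prefixes.append(name.split()[0])
    return Counter(prefixes).most_common(1)[0][0] if prefixes else "OTHER"

tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
counts = Counter(dominant_script(t) for t in tok.get_vocab())
print(counts.most_common(25))  # rough per-script token counts of the vocabulary
```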
## Replaced tokens
| Writing system | Tokens removed | Tokens retained |
|---|---|---|
| Han (Chinese) | 16,488 | 4,122 |
| Devanagari (Hindi) | 10,976 | 2,743 |
| Bengali | 7,983 | 1,995 |
| Arabic | 6,730 | 1,682 |
| Hiragana / Katakana (Japanese) | 3,944 | 985 |
| Hangul (Korean) | 3,744 | 935 |
| Tamil | 3,080 | 770 |
| Thai | 1,740 | 435 |
| Malayalam | 1,566 | 391 |
| Telugu | 1,428 | 356 |
| Gujarati | 1,080 | 270 |
| Kannada | 1,016 | 253 |
| Ethiopic | 691 | 172 |
| Hebrew | 670 | 167 |
| Khmer | 481 | 119 |
| Sinhala | 435 | 108 |
| Myanmar | 410 | 102 |
| Lao | 243 | 60 |
| Gurmukhi | 215 | 53 |
| Tibetan | 107 | 26 |
| Oriya | 100 | 25 |
| Cyrillic | 13,398 | 0 |
| Gemma-3 `<unused-*>` | 6,139 | 102 |
## Feature overview
- +81,492 new Cyrillic BPE tokens trained on the full Kobza corpus plus the Cyrillic slice of the Crimean Tatar corpus.
- Only tokens from the **Replaced tokens** table above were replaced; no tokens from any other writing system were affected.
- Unchanged tokens keep their original IDs, enabling direct reuse of Gemma-3 embeddings (see the sketch after this list).
- Vocabulary size, special-token set, pre/post-tokenisation logic, and output formatting match Gemma-3 one-for-one.
- Reasoning tokens
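Because unchanged tokens keep their IDs, reusing Gemma-3's input embeddings reduces to re-initializing only the rows whose tokens were replaced. Below is a minimal sketch of that idea; the mean/std re-initialization is an assumption for illustration, not a documented recipe:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_tok = AutoTokenizer.from_pretrained("google/gemma-3-12b-it")
lapa_tok = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
model = AutoModelForCausalLM.from_pretrained("google/gemma-3-12b-it")

emb = model.get_input_embeddings().weight.data  # (vocab_size, hidden_dim)

# IDs whose surface form differs between the two vocabularies were replaced
# and need fresh embeddings; every other row can be reused as-is.
changed_ids = [
    i for tok, i in lapa_tok.get_vocab().items()
    if base_tok.convert_ids_to_tokens(i) != tok
]

mean, std = emb.mean(dim=0), emb.std(dim=0)
for i in changed_ids:
    emb[i] = torch.normal(mean, std)  # naive mean/std draw; other schemes exist
```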
## Simple example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")

# "Всі красиві зберігають оптимізм" ("All the beautiful ones keep their optimism")
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
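For comparison, the same sentence run through the original Gemma-3 tokenizer side by side (a quick sketch; exact token counts depend on the input):

```python
from transformers import AutoTokenizer

text = "Всі красиві зберігають оптимізм"
for name in ("google/gemma-3-12b-it", "lapa-llm/tokenizer"):
    tok = AutoTokenizer.from_pretrained(name)
    ids = tok(text, add_special_tokens=False).input_ids
    print(f"{name}: {len(ids)} tokens -> {tok.convert_ids_to_tokens(ids)}")
```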