Update README.md
README.md CHANGED
@@ -16,6 +16,8 @@ tags:
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---

+### Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).
+
### By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.

### How is this possible?
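As a quick sanity check of the fixed-vocabulary claim in the hunk above, the sketch below loads the released tokenizer and counts its entries. This is only an illustration (not part of the README's own code); the exact number printed depends on how many special tokens ship with the released files.

```python
# Minimal sketch: confirm the Ukrainized tokenizer keeps the original Gemma-3
# vocabulary size (~256K entries) despite the added Ukrainian tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
print(len(tokenizer))  # total vocabulary size, expected to stay at the original ~256K
```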
@@ -64,6 +66,5 @@ tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
-## Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).

### "fixed" means that we removed the condition that allows adding an empty <think></think> for the hybrid approach. This significantly speeds up tokenization.