transhumanist-already-exists committed on
Commit 2702d91 · verified · 1 Parent(s): 0abf5ea

Update README.md

Files changed (1): README.md (+2 -1)
README.md CHANGED
@@ -16,6 +16,8 @@ tags:
 pretty_name: “gemma-3 - ukrainized gemma tokenizer”
 ---
 
+### Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).
+
 ### By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.
 
 ### How is this possible?
@@ -64,6 +66,5 @@ tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
 toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
 print(len(toks.input_ids))  # only 4 tokens 💪🏻
 ```
-## Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).
 
 ### "fixed" means that we removed the condition that allowed adding an empty `<think></think>` block for the hybrid approach. This significantly speeds up tokenization.
 