Update README.md
README.md CHANGED
@@ -16,6 +16,8 @@ tags:
pretty_name: "gemma-3 - ukrainized gemma tokenizer"
---

+### Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).
+
### By adding more than 80K Ukrainian tokens **without removing any English or EU-language tokens**, the Lapa Tokenizer makes Ukrainian the core language of the multilingual Gemma-3 tokenizer while keeping the vocabulary fixed at its original size of 256K tokens.

### How is this possible?
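As a quick sanity check of the fixed-vocabulary claim in the hunk above, the sketch below loads the released tokenizer and counts its entries. This is only an illustration (not part of the README's own code); the exact number printed depends on how many special tokens ship with the released files.

```python
# Minimal sketch: confirm the Ukrainized tokenizer keeps the original Gemma-3
# vocabulary size (~256K entries) despite the added Ukrainian tokens.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
print(len(tokenizer))  # total vocabulary size, expected to stay at the original ~256K
```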
@@ -64,6 +66,5 @@ tokenizer = AutoTokenizer.from_pretrained("lapa-llm/tokenizer")
toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(len(toks.input_ids))  # only 4 tokens 💪🏻
```
-## Using the same approach as [Tereshchenko Blue](https://huggingface.co/transhumanist-already-exists/tereshchenkoblue-tokenizer), now trained on the full [Kobza corpus](https://huggingface.co/datasets/Goader/kobza).

### "fixed" means that we removed the condition that allows adding an empty <think></think> for the hybrid approach. This significantly speeds up tokenization.