# Karpotron Tokenizer

## Overview
Karpotron is a specialized tokenizer based on NVIDIA's Nemotron-3-Nano-30B tokenizer, optimized for Ukrainian language processing. It adds 28,065 Ukrainian tokens while keeping the original vocabulary size of 131,072 by pruning tokens from writing systems that are geographically and culturally distant from Ukraine.
## Key Features
- **+28,065 new Cyrillic BPE tokens:**
  - 152 base Cyrillic letters
  - 27,913 Ukrainian word tokens from lapa-llm/tokenizer
- **No removal of English or EU language tokens.** Only non-essential tokens from distant writing systems were replaced.
- **Latin-safe removal.** Latin tokens with diacritics used across multiple European languages (ü, ö, ç, ã, á, é, etc.) were preserved.
- **Identical specifications.** The vocabulary size (131,072) and byte-level BPE encoding match the original Nemotron tokenizer.
## Replaced Tokens by Writing System
The tokenizer replaced 24,360 tokens from 17 writing systems while adding 28,065 Ukrainian tokens:
| Writing System | Original | Removed | Retained | % Removed |
|---|---|---|---|---|
| Arabic | 9,400 | 8,458 | 942 | 90.0% |
| Hangul (Korean) | 4,492 | 3,330 | 1,162 | 74.1% |
| Han (Chinese) | 3,767 | 2,364 | 1,403 | 62.8% |
| Devanagari (Hindi) | 1,554 | 1,155 | 399 | 74.3% |
| Hebrew | 1,002 | 722 | 280 | 72.1% |
| Telugu | 920 | 675 | 245 | 73.4% |
| Bengali | 839 | 615 | 224 | 73.3% |
| Armenian | 1,121 | 549 | 572 | 49.0% |
| Thai | 567 | 423 | 144 | 74.6% |
| Kannada | 570 | 419 | 151 | 73.5% |
| Tamil | 539 | 394 | 145 | 73.1% |
| Malayalam | 406 | 283 | 123 | 69.7% |
| Georgian | 513 | 240 | 273 | 46.8% |
| Hiragana/Katakana (Japanese) | 1,623 | 208 | 1,415 | 12.8% |
| Gujarati | 204 | 136 | 68 | 66.7% |
| Gurmukhi | 155 | 111 | 44 | 71.6% |
| Myanmar | 234 | 96 | 138 | 41.0% |
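The Retained and % Removed columns follow directly from the Original and Removed counts. A quick sanity check over two rows of the table above:

```python
# Verify: retained = original - removed, pct_removed = removed / original * 100.
rows = {
    "Arabic": (9_400, 8_458),
    "Hangul (Korean)": (4_492, 3_330),
}

for script, (original, removed) in rows.items():
    retained = original - removed
    pct_removed = round(removed / original * 100, 1)
    print(f"{script}: retained={retained}, removed={pct_removed}%")
# Arabic: retained=942, removed=90.0%
# Hangul (Korean): retained=1162, removed=74.1%
```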
**Fully preserved:**
- Latin scripts (English, Spanish, French, German, Italian, Portuguese, Dutch, Danish, Swedish, Polish)
- Greek (1,507 tokens, 100% retained)
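Script-level pruning decisions like those above can be approximated with the standard library: Unicode character names begin with the script name, which gives a simple classifier. This is an illustrative sketch, not necessarily the exact method used to build Karpotron:

```python
import unicodedata

def dominant_script(token: str) -> str:
    """Return the Unicode script prefix (e.g. 'GREEK', 'ARABIC') of the
    first letter in the token, or 'OTHER' for non-letter content."""
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script, e.g.
            # 'GREEK SMALL LETTER OMEGA', 'CYRILLIC SMALL LETTER EM'.
            return unicodedata.name(ch).split(" ")[0]
    return "OTHER"

print(dominant_script("ω"))     # GREEK    -> fully preserved
print(dominant_script("мова"))  # CYRILLIC -> expanded
```

In practice a byte-level BPE vocabulary would first need each token decoded back to text before such classification.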
## Metrics

*Acknowledgement: evaluation results provided by Andrii Sameliuk.*
Each cell shows the total token count with tokens per word in parentheses; lower is better. Corpus word counts are listed in the first row.

| Tokenizer | lang-uk/malyuk [100k] | allenai/c4 (en) [100k] | allenai/c4 (es,fr,it,de) [100k] | QIRIM/crh (Cyrillic) [94] | allenai/c4 (ru) [100k] | allenai/c4 (bg) [100k] | allenai/c4 (be) [100k] |
|---|---|---|---|---|---|---|---|
| *Word count* | 22,898,164 | 36,170,971 | 198,173,216 | 1,868,259 | 42,557,519 | 44,627,199 | 43,153,645 |
| Qwen/Qwen3-8B | 84,408,084 (3.686) | 46,884,593 (1.296) | 395,581,536 (1.996) | 7,956,741 (4.259) | 116,115,062 (2.728) | 132,597,427 (2.971) | 173,571,099 (4.022) |
| meta-llama/Llama-3.1-8B-Instruct | 57,226,997 (2.499) | 46,085,724 (1.274) | 382,143,751 (1.928) | 7,386,873 (3.954) | 104,974,733 (2.467) | 119,123,733 (2.669) | 150,189,294 (3.480) |
| microsoft/Phi-4-mini-instruct | 59,447,036 (2.596) | 45,423,925 (1.256) | 335,188,687 (1.691) | 5,995,822 (3.209) | 91,824,464 (2.158) | 102,472,523 (2.296) | 119,587,038 (2.771) |
| CohereLabs/aya-expanse-8b | 50,973,632 (2.226) | 47,364,187 (1.309) | 353,221,932 (1.782) | 6,614,719 (3.541) | 93,089,697 (2.187) | 112,612,668 (2.523) | 141,262,943 (3.273) |
| google/gemma-3-12b-it | 57,388,402 (2.506) | 47,285,432 (1.307) | 354,241,840 (1.788) | 6,240,944 (3.341) | 95,520,817 (2.245) | 103,950,626 (2.329) | 131,398,147 (3.045) |
| nvidia/NVIDIA-Nemotron-3-Nano-30B | 62,087,149 (2.711) | 47,630,139 (1.317) | 365,218,644 (1.843) | 6,623,516 (3.545) | 107,233,038 (2.520) | 108,691,963 (2.436) | 135,489,439 (3.140) |
| **karpotron-tokenizer (ours)** | **46,456,626 (2.029)** 🤩 | 47,650,584 (1.317) | 365,285,307 (1.843) | 7,519,362 (4.025) | 132,519,787 (3.114) | 131,626,936 (2.949) | 158,657,784 (3.677) |

Observations:
- **Ukrainian (malyuk):** ~1.34x improvement over Nemotron, i.e. faster inference/training and a larger effective context window
- **English and EU languages:** unchanged
- **QIRIM/crh:** slightly worse
- **Russian:** drops (the vocabulary is UA-centric)
- **Bulgarian:** drops slightly
- **Belarusian:** drops
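The tokens-per-word figures are simply total token count divided by corpus word count, and the headline ~1.34x Ukrainian improvement is the ratio of Nemotron's token count to ours on lang-uk/malyuk (all numbers taken from the table):

```python
MALYUK_WORDS = 22_898_164  # word count of the Ukrainian evaluation corpus

nemotron_tokens = 62_087_149
karpotron_tokens = 46_456_626

print(round(nemotron_tokens / MALYUK_WORDS, 3))      # 2.711 toks/word
print(round(karpotron_tokens / MALYUK_WORDS, 3))     # 2.029 toks/word
print(round(nemotron_tokens / karpotron_tokens, 2))  # 1.34 (x fewer tokens)
```

Fewer tokens per word means the same Ukrainian text consumes less of the context window and fewer decoding steps.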
## Usage Example
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/karpotron-tokenizer"
)

toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # Only 5 tokens 💪🏻
```
## Model Contents

- `tokenizer.json` - byte-level tokenizer spec (131,072 tokens, 252,034 merges)
- `tokenizer_config.json` - configuration metadata
- `special_tokens_map.json` - special token mappings (identical to Nemotron)
- `merge_info.json` - information about removed and added tokens
## Embedding Initialization

For the newly added tokens in Nemotron models, you can:

- use embedding-transfer tools such as FOCUS or ZeTT, or
- initialize the new embeddings randomly and train them with a warm-up schedule.

The 103,007 unchanged tokens retain their original IDs, so their existing embeddings can be reused directly.
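A minimal sketch of the copy-then-initialize step, in plain Python with toy sizes. The set of unchanged token IDs would come from `merge_info.json` in the real workflow (its exact schema is an assumption here, not documented above):

```python
import random

EMB_DIM = 4     # toy embedding width (real models use thousands)
VOCAB_SIZE = 8  # toy vocabulary (real: 131_072)

# Toy stand-ins: the old embedding matrix and the IDs that survived the
# merge unchanged (in the real tokenizer: 103,007 IDs).
old_embeddings = [[float(i)] * EMB_DIM for i in range(VOCAB_SIZE)]
unchanged_ids = {0, 1, 2, 5, 7}

rng = random.Random(0)
new_embeddings = []
for token_id in range(VOCAB_SIZE):
    if token_id in unchanged_ids:
        # Unchanged tokens keep their IDs, so the old row is reused as-is.
        new_embeddings.append(list(old_embeddings[token_id]))
    else:
        # New Ukrainian tokens: small random init; these would then be
        # trained with a warm-up schedule (or set via FOCUS/ZeTT instead).
        new_embeddings.append([rng.gauss(0.0, 0.02) for _ in range(EMB_DIM)])

print(new_embeddings[5])  # reused row: [5.0, 5.0, 5.0, 5.0]
```

The same row-copying pattern applies to a real `nn.Embedding` weight tensor; only the matrix type and sizes change.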
## Citation

```bibtex
@misc{zaduha2026post9793,
  author       = {Bohdan Didenko},
  title        = {Post \#9793 on Telegram Channel Zaduha},
  howpublished = {\url{https://t.me/zaduha/9793}},
  month        = jan,
  year         = {2026},
  note         = {[Online; accessed 31 January 2026]}
}
```
## Base Model

- nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16