Karpotron Tokenizer

Overview

Karpotron is a specialized tokenizer based on NVIDIA's Nemotron-3-Nano-30B tokenizer, optimized for Ukrainian language processing. It adds 28,065 Ukrainian tokens while keeping the original vocabulary size of 131,072 by pruning tokens from writing systems that are geographically and culturally distant from Ukraine.

Key Features

  1. +28,065 new Cyrillic BPE tokens

  2. No removal of English or EU language tokens - only non-essential tokens from distant writing systems were replaced

  3. Latin-safe removal - preserved Latin tokens with diacritics used in multiple European languages (ü, ö, ç, ã, á, é, etc.)

  4. Identical specifications - vocab size (131,072) and byte-level BPE encoding match the original Nemotron

Replaced Tokens by Writing System

The tokenizer replaced 24,360 tokens from 17 writing systems while adding 28,065 Ukrainian tokens:

Writing System                 Original   Removed   Retained   % Removed
Arabic                            9,400     8,458        942       90.0%
Hangul (Korean)                   4,492     3,330      1,162       74.1%
Han (Chinese)                     3,767     2,364      1,403       62.8%
Devanagari (Hindi)                1,554     1,155        399       74.3%
Hebrew                            1,002       722        280       72.1%
Telugu                              920       675        245       73.4%
Bengali                             839       615        224       73.3%
Armenian                          1,121       549        572       49.0%
Thai                                567       423        144       74.6%
Kannada                             570       419        151       73.5%
Tamil                               539       394        145       73.1%
Malayalam                           406       283        123       69.7%
Georgian                            513       240        273       46.8%
Hiragana/Katakana (Japanese)      1,623       208      1,415       12.8%
Gujarati                            204       136         68       66.7%
Gurmukhi                            155       111         44       71.6%
Myanmar                             234        96        138       41.0%
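The Retained and % Removed columns follow directly from the Original and Removed counts; a quick sanity check for the first few scripts:

```python
# Recompute retention figures from the table above
# (script -> (original token count, tokens removed)).
stats = {
    "Arabic": (9_400, 8_458),
    "Hangul (Korean)": (4_492, 3_330),
    "Han (Chinese)": (3_767, 2_364),
}
for script, (original, removed) in stats.items():
    retained = original - removed
    pct_removed = 100 * removed / original
    print(f"{script}: retained={retained}, removed={pct_removed:.1f}%")
```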

Fully preserved:

  • Latin scripts (English, Spanish, French, German, Italian, Portuguese, Dutch, Danish, Swedish, Polish)
  • Greek (1,507 tokens, 100% retained)

Metrics

Acknowledgement: evaluation results were provided by Andrii Sameliuk.

Corpora (word counts):

  • uk  - lang-uk/malyuk [100k]: 22,898,164 words
  • en  - allenai/c4 (en) [100k]: 36,170,971 words
  • eu  - allenai/c4 (es,fr,it,de) [100k]: 198,173,216 words
  • crh - QIRIM/crh (Cyrillic) [94]: 1,868,259 words
  • ru  - allenai/c4 (ru) [100k]: 42,557,519 words
  • bg  - allenai/c4 (bg) [100k]: 44,627,199 words
  • be  - allenai/c4 (be) [100k]: 43,153,645 words

Tokens per word (lower is better):

Tokenizer                            uk      en      eu      crh     ru      bg      be
Qwen/Qwen3-8B                        3.686   1.296   1.996   4.259   2.728   2.971   4.022
meta-llama/Llama-3.1-8B-Instruct     2.499   1.274   1.928   3.954   2.467   2.669   3.480
microsoft/Phi-4-mini-instruct        2.596   1.256   1.691   3.209   2.158   2.296   2.771
CohereLabs/aya-expanse-8b            2.226   1.309   1.782   3.541   2.187   2.523   3.273
google/gemma-3-12b-it                2.506   1.307   1.788   3.341   2.245   2.329   3.045
nvidia/NVIDIA-Nemotron-3-Nano-30B    2.711   1.317   1.843   3.545   2.520   2.436   3.140
karpotron-tokenizer (Ours)           2.029   1.317   1.843   4.025   3.114   2.949   3.677

Token counts:

Tokenizer                            uk           en           eu            crh         ru            bg            be
Qwen/Qwen3-8B                        84,408,084   46,884,593   395,581,536   7,956,741   116,115,062   132,597,427   173,571,099
meta-llama/Llama-3.1-8B-Instruct     57,226,997   46,085,724   382,143,751   7,386,873   104,974,733   119,123,733   150,189,294
microsoft/Phi-4-mini-instruct        59,447,036   45,423,925   335,188,687   5,995,822   91,824,464    102,472,523   119,587,038
CohereLabs/aya-expanse-8b            50,973,632   47,364,187   353,221,932   6,614,719   93,089,697    112,612,668   141,262,943
google/gemma-3-12b-it                57,388,402   47,285,432   354,241,840   6,240,944   95,520,817    103,950,626   131,398,147
nvidia/NVIDIA-Nemotron-3-Nano-30B    62,087,149   47,630,139   365,218,644   6,623,516   107,233,038   108,691,963   135,489,439
karpotron-tokenizer (Ours)           46,456,626   47,650,584   365,285,307   7,519,362   132,519,787   131,626,936   158,657,784

Comments:

  • Ukrainian: ~1.34x improvement over Nemotron - faster inference/training and a larger effective context window
  • English unchanged; EU languages unchanged
  • QIRIM slightly worse
  • Russian drops (the vocabulary is UA-centric); Bulgarian drops slightly; Belarusian drops
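The headline Ukrainian numbers can be re-derived from the word and token counts above:

```python
# Ukrainian corpus (lang-uk/malyuk, 100k sample)
words = 22_898_164
nemotron_tokens = 62_087_149
karpotron_tokens = 46_456_626

karpotron_tpw = karpotron_tokens / words          # tokens per word
improvement = nemotron_tokens / karpotron_tokens  # vs. the base tokenizer

print(round(karpotron_tpw, 3))  # 2.029
print(round(improvement, 2))    # 1.34
```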

Usage Example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/karpotron-tokenizer"
)

toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # Only 5 tokens 💪🏻
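To reproduce the tokens-per-word metric on your own corpus, a small helper like the following works with any Hugging Face tokenizer (the function name and the whitespace-based word count are our own conventions, not part of this repository):

```python
def tokens_per_word(tokenizer, texts):
    """Total subword tokens divided by total whitespace-separated words.

    `tokenizer` is any callable with the Hugging Face interface, e.g. the
    AutoTokenizer loaded in the snippet above.
    """
    words = sum(len(text.split()) for text in texts)
    tokens = sum(
        len(tokenizer(text, add_special_tokens=False).input_ids)
        for text in texts
    )
    return tokens / words
```

For example, `tokens_per_word(tokenizer, corpus_lines)` on a Ukrainian corpus should land near the 2.029 reported above.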

Model Contents

  • tokenizer.json - Byte-level tokenizer spec (131,072 tokens, 252,034 merges)
  • tokenizer_config.json - Configuration metadata
  • special_tokens_map.json - Special token mappings (identical to Nemotron)
  • merge_info.json - Information about removed and added tokens

Embedding Initialization

For newly added tokens in Nemotron models, you can:

  • Use tokenizer-transfer tools such as FOCUS or ZeTT
  • Initialize embeddings randomly with warm-up schedule training
  • Unchanged tokens (103,007) retain original IDs and can reuse existing embeddings
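As a minimal alternative to FOCUS/ZeTT, a common heuristic is to initialize the replaced rows around the mean of the retained embeddings. A sketch with stand-in weights (the hidden size, noise scale, and the assumption that new tokens occupy the tail IDs are ours, not properties of the released files):

```python
import numpy as np

VOCAB = 131_072
HIDDEN = 64               # tiny placeholder; use the real model's hidden size
N_UNCHANGED = 103_007     # tokens that retain their original IDs
N_NEW = VOCAB - N_UNCHANGED  # 28,065 replaced slots

rng = np.random.default_rng(0)
old_emb = rng.normal(size=(VOCAB, HIDDEN)).astype(np.float32)  # stand-in matrix

# New rows: mean of the retained embeddings plus small per-dimension noise,
# so training starts from a plausible region of embedding space.
kept = old_emb[:N_UNCHANGED]
new_rows = kept.mean(axis=0) + 0.02 * kept.std(axis=0) * rng.normal(
    size=(N_NEW, HIDDEN)
)

new_emb = old_emb.copy()
new_emb[N_UNCHANGED:] = new_rows  # assumes replaced tokens sit at the tail IDs
```

Pair this with a warm-up schedule that trains only the new rows first, as suggested above.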

Citation

@misc{zaduha2026post9793,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9793 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9793}",
  month        = jan,
  year         = {2026},
  note         = "[Online; accessed 31 January 2026]"
}
