Karpotron Tokenizer

Overview

Karpotron is a specialized tokenizer based on NVIDIA's Nemotron-3-Nano-30B tokenizer, optimized for Ukrainian language processing. It adds 28,065 Ukrainian tokens while keeping the original vocabulary size of 131,072 by pruning tokens from writing systems that are geographically and culturally distant from Ukraine.

Key Features

  1. +28,065 new Cyrillic BPE tokens

  2. No removal of English or EU language tokens - only non-essential tokens from distant writing systems were replaced

  3. Latin-safe removal - preserved Latin tokens with diacritics used in multiple European languages (ü, ö, ç, ã, á, é, etc.)

  4. Identical specifications - vocab size (131,072) and byte-level BPE encoding match the original Nemotron

Replaced Tokens by Writing System

The tokenizer replaced 24,360 tokens from 17 writing systems while adding 28,065 Ukrainian tokens:

Writing System                 Original   Removed   Retained   % Removed
Arabic                            9,400     8,458        942       90.0%
Hangul (Korean)                   4,492     3,330      1,162       74.1%
Han (Chinese)                     3,767     2,364      1,403       62.8%
Devanagari (Hindi)                1,554     1,155        399       74.3%
Hebrew                            1,002       722        280       72.1%
Telugu                              920       675        245       73.4%
Bengali                             839       615        224       73.3%
Armenian                          1,121       549        572       49.0%
Thai                                567       423        144       74.6%
Kannada                             570       419        151       73.5%
Tamil                               539       394        145       73.1%
Malayalam                           406       283        123       69.7%
Georgian                            513       240        273       46.8%
Hiragana/Katakana (Japanese)      1,623       208      1,415       12.8%
Gujarati                            204       136         68       66.7%
Gurmukhi                            155       111         44       71.6%
Myanmar                             234        96        138       41.0%
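The Retained and % Removed columns follow directly from the Original and Removed counts; a quick sanity check for the first few scripts:

```python
# Recompute retention figures from the table above
# (script -> (original token count, tokens removed)).
stats = {
    "Arabic": (9_400, 8_458),
    "Hangul (Korean)": (4_492, 3_330),
    "Han (Chinese)": (3_767, 2_364),
}
for script, (original, removed) in stats.items():
    retained = original - removed
    pct_removed = 100 * removed / original
    print(f"{script}: retained={retained}, removed={pct_removed:.1f}%")
```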

Fully preserved:

  • Latin scripts (English, Spanish, French, German, Italian, Portuguese, Dutch, Danish, Swedish, Polish)
  • Greek (1,507 tokens, 100% retained)

Metrics

Acknowledgement: evaluation results were provided by Andrii Sameliuk.

Corpora (word counts):

  • uk  - lang-uk/malyuk [100k]: 22,898,164 words
  • en  - allenai/c4 (en) [100k]: 36,170,971 words
  • eu  - allenai/c4 (es,fr,it,de) [100k]: 198,173,216 words
  • crh - QIRIM/crh (Cyrillic) [94]: 1,868,259 words
  • ru  - allenai/c4 (ru) [100k]: 42,557,519 words
  • bg  - allenai/c4 (bg) [100k]: 44,627,199 words
  • be  - allenai/c4 (be) [100k]: 43,153,645 words

Tokens per word (lower is better):

Tokenizer                            uk      en      eu      crh     ru      bg      be
Qwen/Qwen3-8B                        3.686   1.296   1.996   4.259   2.728   2.971   4.022
meta-llama/Llama-3.1-8B-Instruct     2.499   1.274   1.928   3.954   2.467   2.669   3.480
microsoft/Phi-4-mini-instruct        2.596   1.256   1.691   3.209   2.158   2.296   2.771
CohereLabs/aya-expanse-8b            2.226   1.309   1.782   3.541   2.187   2.523   3.273
google/gemma-3-12b-it                2.506   1.307   1.788   3.341   2.245   2.329   3.045
nvidia/NVIDIA-Nemotron-3-Nano-30B    2.711   1.317   1.843   3.545   2.520   2.436   3.140
karpotron-tokenizer (Ours)           2.029   1.317   1.843   4.025   3.114   2.949   3.677

Token counts:

Tokenizer                            uk           en           eu            crh         ru            bg            be
Qwen/Qwen3-8B                        84,408,084   46,884,593   395,581,536   7,956,741   116,115,062   132,597,427   173,571,099
meta-llama/Llama-3.1-8B-Instruct     57,226,997   46,085,724   382,143,751   7,386,873   104,974,733   119,123,733   150,189,294
microsoft/Phi-4-mini-instruct        59,447,036   45,423,925   335,188,687   5,995,822   91,824,464    102,472,523   119,587,038
CohereLabs/aya-expanse-8b            50,973,632   47,364,187   353,221,932   6,614,719   93,089,697    112,612,668   141,262,943
google/gemma-3-12b-it                57,388,402   47,285,432   354,241,840   6,240,944   95,520,817    103,950,626   131,398,147
nvidia/NVIDIA-Nemotron-3-Nano-30B    62,087,149   47,630,139   365,218,644   6,623,516   107,233,038   108,691,963   135,489,439
karpotron-tokenizer (Ours)           46,456,626   47,650,584   365,285,307   7,519,362   132,519,787   131,626,936   158,657,784

Comments:

  • Ukrainian: ~1.34x improvement over Nemotron - faster inference/training and a larger effective context window
  • English unchanged; EU languages unchanged
  • QIRIM slightly worse
  • Russian drops (the vocabulary is UA-centric); Bulgarian drops slightly; Belarusian drops
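The headline Ukrainian numbers can be re-derived from the word and token counts above:

```python
# Ukrainian corpus (lang-uk/malyuk, 100k sample)
words = 22_898_164
nemotron_tokens = 62_087_149
karpotron_tokens = 46_456_626

karpotron_tpw = karpotron_tokens / words          # tokens per word
improvement = nemotron_tokens / karpotron_tokens  # vs. the base tokenizer

print(round(karpotron_tpw, 3))  # 2.029
print(round(improvement, 2))    # 1.34
```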

Usage Example

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "transhumanist-already-exists/karpotron-tokenizer"
)

toks = tokenizer("Всі красиві зберігають оптимізм", add_special_tokens=False)
print(toks.input_ids)  # Only 5 tokens 💪🏻
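To reproduce the tokens-per-word metric on your own corpus, a small helper like the following works with any Hugging Face tokenizer (the function name and the whitespace-based word count are our own conventions, not part of this repository):

```python
def tokens_per_word(tokenizer, texts):
    """Total subword tokens divided by total whitespace-separated words.

    `tokenizer` is any callable with the Hugging Face interface, e.g. the
    AutoTokenizer loaded in the snippet above.
    """
    words = sum(len(text.split()) for text in texts)
    tokens = sum(
        len(tokenizer(text, add_special_tokens=False).input_ids)
        for text in texts
    )
    return tokens / words
```

For example, `tokens_per_word(tokenizer, corpus_lines)` on a Ukrainian corpus should land near the 2.029 reported above.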

Model Contents

  • tokenizer.json - Byte-level tokenizer spec (131,072 tokens, 252,034 merges)
  • tokenizer_config.json - Configuration metadata
  • special_tokens_map.json - Special token mappings (identical to Nemotron)
  • merge_info.json - Information about removed and added tokens

Embedding Initialization

For newly added tokens in Nemotron models, you can:

  • Use tokenizer-transfer tools such as FOCUS or ZeTT
  • Initialize embeddings randomly with warm-up schedule training
  • Unchanged tokens (103,007) retain original IDs and can reuse existing embeddings
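As a minimal alternative to FOCUS/ZeTT, a common heuristic is to initialize the replaced rows around the mean of the retained embeddings. A sketch with stand-in weights (the hidden size, noise scale, and the assumption that new tokens occupy the tail IDs are ours, not properties of the released files):

```python
import numpy as np

VOCAB = 131_072
HIDDEN = 64               # tiny placeholder; use the real model's hidden size
N_UNCHANGED = 103_007     # tokens that retain their original IDs
N_NEW = VOCAB - N_UNCHANGED  # 28,065 replaced slots

rng = np.random.default_rng(0)
old_emb = rng.normal(size=(VOCAB, HIDDEN)).astype(np.float32)  # stand-in matrix

# New rows: mean of the retained embeddings plus small per-dimension noise,
# so training starts from a plausible region of embedding space.
kept = old_emb[:N_UNCHANGED]
new_rows = kept.mean(axis=0) + 0.02 * kept.std(axis=0) * rng.normal(
    size=(N_NEW, HIDDEN)
)

new_emb = old_emb.copy()
new_emb[N_UNCHANGED:] = new_rows  # assumes replaced tokens sit at the tail IDs
```

Pair this with a warm-up schedule that trains only the new rows first, as suggested above.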

Citation

@misc{zaduha2026post9793,
  author       = "{Bohdan Didenko}",
  title        = "{Post \#9793 on Telegram Channel Zaduha}",
  howpublished = "\url{https://t.me/zaduha/9793}",
  month        = jan,
  year         = {2026},
  note         = "[Online; accessed 31 January 2026]"
}
