CUTE

Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE

CUTE is a code-aware tokenizer built on a single architectural idea: map high-savings multi-byte patterns to atomic Unicode codepoints before byte-level BPE sees them. On 1,500 held-out Python files from The Stack, CUTE produces fewer tokens per file than eight widely-used baselines — including OpenAI's cl100k_base and o200k_base, LLaMA-3's SentencePiece BPE, and two SentencePiece Unigram variants — while re-encoding every file to byte-identical source, a roundtrip matched in this comparison only by the byte-level OpenAI and GPT-2 tokenizers, none of which compresses as well.

Compression (1,500 held-out Python files, The Stack)

Tokenizer                            mean tokens/file  bytes/token  vs CUTE   roundtrip
CUTE                                 1,767             4.42         baseline  1500 / 1500
OpenAI cl100k_base                   1,874             4.17         +6.0%     1500 / 1500
OpenAI o200k_base                    1,886             4.14         +6.7%     1500 / 1500
LLaMA-3 (SentencePiece BPE)          1,872             4.17         +5.9%      686 / 1500
StarCoder2                           2,210             3.53         +25.1%     685 / 1500
XLM-RoBERTa (SentencePiece Unigram)  2,438             3.20         +38.0%       0 / 1500
CodeLlama                            2,573             3.03         +45.6%    1493 / 1500
T5 (SentencePiece Unigram)           2,706             2.89         +53.2%       0 / 1500
GPT-2                                3,581             2.18         +102.7%   1500 / 1500

"vs CUTE" is the extra token cost the baseline pays per file relative to CUTE; LLM API spend is linear in this number.
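
For example, recomputing two rows of the table in plain Python:

cute, starcoder2, gpt2 = 1767, 2210, 3581    # mean tokens per file, from the table above
print(f"{starcoder2 / cute - 1:+.1%}")       # +25.1%
print(f"{gpt2 / cute - 1:+.1%}")             # +102.7%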

Latency (p50 across the full 1,500-file Stack-Python holdout)

Tokenizer                            encode p50  decode p50
OpenAI cl100k_base                     1,338 µs      120 µs
OpenAI o200k_base                      1,760 µs      126 µs
CUTE                                   1,822 µs      263 µs
T5 (SentencePiece Unigram)             3,121 µs      479 µs
CodeLlama                              3,162 µs    1,885 µs
XLM-RoBERTa (SentencePiece Unigram)    3,272 µs      440 µs
LLaMA-3 (SentencePiece BPE)            3,753 µs      792 µs
StarCoder2                             4,316 µs      775 µs
GPT-2                                  4,467 µs      911 µs

CUTE has the third-fastest encode and the third-fastest decode in the field, behind only OpenAI's cl100k_base and o200k_base. v1.0.2's cute-bpe Rust hot path runs ~6× faster than v1.0.1 on a short Python sample (1,526 µs → 254 µs end-to-end; ~5× faster on the cargo bench of the core encoder). On the full 1,500-file holdout median, CUTE beats every open-source baseline (LLaMA-3, StarCoder2, CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency, while pairing its byte-perfect 1500 / 1500 roundtrip with the best compression in the comparison.

How it works

  1. A frequency-weighted, savings-ranked selection pass mines high-value multi-byte patterns (identifiers, and common slices like "(self", "=None", ":\n") from a code corpus; a scoring sketch follows this list.
  2. Selected patterns are mapped one-to-one to supplementary-plane Private-Use-Area (PUA) codepoints (U+F0000+). The BMP-PUA range is deliberately skipped to avoid colliding with literal PUA characters that appear in real source code.
  3. A byte-level BPE trainer runs on the PUA-pre-substituted stream, so semantic anchors are visible to the merge algorithm and can compose freely with whitespace and punctuation (e.g. Ġ + ⟦def⟧); a minimal training sketch follows this list.
  4. A second savings pass adds the top-6,000 high-frequency compound patterns as atomic AddedTokens.
  5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass substitutes PUA codepoints; a purpose-built Rust BPE encoder (cute-bpe, modeled on tiktoken's linear-scan-min-rank merge loop) then performs the byte-level BPE pass. A plain-Python stand-in for steps 5 and 6 follows this list.
  6. At decode time, the inverse PUA map restores the original source text — byte-for-byte identical.
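
The selection heuristic itself is not spelled out above; a minimal sketch of the idea in step 1, ranking candidate patterns by occurrences times bytes saved, might look like the following (the candidate list, scoring function, and top_k are illustrative, not the project's actual values):

from collections import Counter

def rank_patterns(corpus: str, candidates: list[str], top_k: int = 5) -> list[tuple[str, int]]:
    """Rank candidate multi-byte patterns by a simple savings heuristic:
    occurrences in the corpus times the bytes each occurrence would collapse."""
    scores = Counter()
    for pat in candidates:
        count = corpus.count(pat)                       # frequency weight
        saved = count * (len(pat.encode("utf-8")) - 1)  # bytes saved if pat becomes one codepoint
        if saved > 0:
            scores[pat] = saved
    return scores.most_common(top_k)

corpus = "def f(self, x=None):\n    return x\n" * 100
print(rank_patterns(corpus, ["(self", "=None", "):\n", "return "]))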
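
Steps 3 and 4 map onto standard Hugging Face tokenizers components (the Properties section below pins the saved model to BPE with a ByteLevel pre-tokenizer and decoder). A minimal training sketch under those assumptions, with an invented two-anchor map, an illustrative vocab_size, and a made-up compound AddedToken:

from tokenizers import AddedToken, Tokenizer, decoders, models, pre_tokenizers, trainers

# In CUTE the training stream is already PUA-pre-substituted; here U+F0000 stands in
# for "(self" and U+F0001 for "=None", so each anchor reaches the trainer as one fixed
# codepoint (a fixed byte run after ByteLevel pre-tokenization).
anchored_lines = ["def hello\U000F0000, x\U000F0001):\n    return x\n"] * 200

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=1024,                                       # illustrative size
    show_progress=False,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # keep all 256 byte symbols
)
tokenizer.train_from_iterator(anchored_lines, trainer=trainer)

# Step 4 in miniature: register a high-frequency compound pattern as an atomic AddedToken.
tokenizer.add_tokens([AddedToken("import numpy as np", normalized=False)])

print(tokenizer.encode("def hello\U000F0000, x\U000F0001):").tokens)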
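
Steps 5 and 6 in miniature: a plain-Python stand-in for the Aho-Corasick substitution pass and its inverse (the pattern map is invented; in the real pipeline the substituted string then goes through the cute-bpe byte-level BPE encoder, omitted here):

# Illustrative pattern-to-codepoint map; CUTE learns its map from a corpus and keeps
# it in the supplementary-plane PUA (U+F0000 and above).
PUA_MAP = {"(self": "\U000F0000", "=None": "\U000F0001", "):\n": "\U000F0002"}
INV_MAP = {cp: pat for pat, cp in PUA_MAP.items()}

def pua_substitute(text: str) -> str:
    # Longest-pattern-first greedy replacement as a simple stand-in for the
    # leftmost-longest Aho-Corasick automaton used by the Rust encoder.
    for pat in sorted(PUA_MAP, key=len, reverse=True):
        text = text.replace(pat, PUA_MAP[pat])
    return text

def pua_restore(text: str) -> str:
    for cp, pat in INV_MAP.items():
        text = text.replace(cp, pat)
    return text

src = "def hello(self, x=None):\n    return x\n"
anchored = pua_substitute(src)          # byte-level BPE would run on this string
assert pua_restore(anchored) == src     # the inverse map restores the source byte-for-byte
print(len(src), "codepoints ->", len(anchored))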

Use it

Via the standalone package

pip install cute-tokenizer

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"

For tight inference loops where BatchEncoding machinery is overhead, use fast_encode / fast_decode — these go straight to the Rust cute-bpe encoder/decoder:

ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
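
A rough micro-benchmark of the two paths, reusing tok from the snippet above (hypothetical harness; absolute numbers depend on your machine):

import timeit

snippet = "def hello(): return 42"
hf_path = timeit.timeit(lambda: tok(snippet, add_special_tokens=False).input_ids, number=1000)
fast_path = timeit.timeit(lambda: tok.fast_encode(snippet), number=1000)
print(f"BatchEncoding path: {hf_path * 1e3:.0f} µs/call   fast_encode: {fast_path * 1e3:.0f} µs/call")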

Via Hugging Face AutoTokenizer

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)

trust_remote_code=True is required because the wrapper class (CUTETokenizerFast) runs PUA pre-substitution before delegating to the byte-level BPE encoder.

Properties

  • Byte-equal roundtrip on 1,500 / 1,500 Python holdout files.
  • Deterministic tokenizer.json within a fixed (OS, python, tokenizers, _accel, corpus_hash, seed) host tuple. Cross-platform byte-identity of trained artifacts is not part of the contract.
  • Atomicity invariants asserted on every save: the model is BPE, the decoder is ByteLevel, the pre-tokenizer is ByteLevel, and every mapped PUA codepoint has a vocab id.
  • No BMP-PUA collisions — mappings live in the supplementary planes only, so literal BMP-PUA characters in real source code (TypeScript Unicode tables, CJK fonts) roundtrip unchanged; a quick check follows this list.
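
A quick way to exercise that last property, reusing the load_default_tokenizer / fast_encode API shown above (the test string is illustrative):

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
# U+E000 is a literal BMP Private-Use-Area character, the kind that appears in icon-font
# or CJK tooling code; CUTE's anchors live in the supplementary planes, so it must survive.
src = 'ICON = "\ue000"  # private-use glyph\n'
assert tok.fast_decode(tok.fast_encode(src)) == src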

Citation

@software{cute_tokenizer_2026,
  author  = {Eid, Hussein},
  title   = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
  year    = {2026},
  url     = {https://github.com/HusseinEid101/CUTE},
  version = {1.0.2}
}

License

MIT. Source, training scripts, benchmark suite, and full reproduction instructions live at https://github.com/HusseinEid101/CUTE.
