CUTE
Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE
CUTE is a code-aware tokenizer built on a single architectural idea:
map high-savings multi-byte patterns to atomic Unicode codepoints
before byte-level BPE sees them. On 1,500 held-out Python files from
The Stack, CUTE produces fewer tokens per file than eight widely used
baselines, including OpenAI's cl100k_base and o200k_base, LLaMA-3's
SentencePiece BPE, and two SentencePiece Unigram variants, while
re-encoding every file back to byte-identical source.
Compression (1,500 held-out Python files, The Stack)
| Tokenizer | mean tokens/file | bytes/token | vs CUTE | byte-exact roundtrip |
|---|---|---|---|---|
| CUTE | 1,767 | 4.42 | — | 1500 / 1500 |
| OpenAI cl100k_base | 1,874 | 4.17 | +6.0% | 1500 / 1500 |
| OpenAI o200k_base | 1,886 | 4.14 | +6.7% | 1500 / 1500 |
| LLaMA-3 (SentencePiece BPE) | 1,872 | 4.17 | +5.9% | 686 / 1500 |
| StarCoder2 | 2,210 | 3.53 | +25.1% | 685 / 1500 |
| XLM-RoBERTa (SentencePiece Unigram) | 2,438 | 3.20 | +38.0% | 0 / 1500 |
| CodeLlama | 2,573 | 3.03 | +45.6% | 1493 / 1500 |
| T5 (SentencePiece Unigram) | 2,706 | 2.89 | +53.2% | 0 / 1500 |
| GPT-2 | 3,581 | 2.18 | +102.7% | 1500 / 1500 |
The "vs CUTE" column is the extra token cost each baseline pays per file; LLM API spend is linear in this number.
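As a rough illustration of that linearity, the per-file means above translate directly into relative spend. The price and request volume below are placeholders, not real quotes:

```python
# Illustrative only: the per-token price is a placeholder, not a real API quote.
price_per_mtok = 10.00       # hypothetical $ per 1M input tokens
files_per_day = 100_000      # hypothetical request volume

mean_tokens = {"CUTE": 1_767, "cl100k_base": 1_874, "o200k_base": 1_886}  # from the table above

for name, toks in mean_tokens.items():
    daily_cost = toks * files_per_day / 1_000_000 * price_per_mtok
    print(f"{name:12} ${daily_cost:,.2f}/day")
```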
Latency (p50 across the full 1,500-file Stack-Python holdout)
| Tokenizer | encode p50 | decode p50 |
|---|---|---|
| OpenAI cl100k_base | 1,338 µs | 120 µs |
| OpenAI o200k_base | 1,760 µs | 126 µs |
| CUTE | 1,822 µs | 263 µs |
| T5 (SentencePiece Unigram) | 3,121 µs | 479 µs |
| CodeLlama | 3,162 µs | 1,885 µs |
| XLM-RoBERTa (SentencePiece Unigram) | 3,272 µs | 440 µs |
| LLaMA-3 (SentencePiece BPE) | 3,753 µs | 792 µs |
| StarCoder2 | 4,316 µs | 775 µs |
| GPT-2 | 4,467 µs | 911 µs |
CUTE has the third-fastest encode and the third-fastest decode in the
field, behind only OpenAI's cl100k_base and o200k_base. v1.0.2's
cute-bpe Rust hot path runs ~6× faster than v1.0.1 on a short Python
sample (1,526 µs → 254 µs end-to-end; ~5× faster on the cargo bench
of the core encoder). On the full 1,500-file holdout median, CUTE
beats every open-source baseline (LLaMA-3, StarCoder2, CodeLlama,
GPT-2, T5, XLM-RoBERTa) on both encode and decode latency, while
maintaining a byte-perfect 1,500 / 1,500 roundtrip.
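For context on how per-file p50 numbers like these can be reproduced, here is a minimal timing sketch. The corpus path is a placeholder; the actual benchmark harness lives in the repo's benchmark suite:

```python
import time
from pathlib import Path
from statistics import median

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
# Hypothetical local copy of the 1,500-file Python holdout; substitute your own path.
files = sorted(Path("holdout_python").glob("*.py"))

encode_us, decode_us = [], []
for path in files:
    src = path.read_text(encoding="utf-8")

    t0 = time.perf_counter()
    ids = tok.fast_encode(src)      # straight to the Rust cute-bpe encoder
    t1 = time.perf_counter()
    out = tok.fast_decode(ids)
    t2 = time.perf_counter()

    assert out == src               # byte-exact roundtrip
    encode_us.append((t1 - t0) * 1e6)
    decode_us.append((t2 - t1) * 1e6)

print(f"encode p50: {median(encode_us):.0f} µs, decode p50: {median(decode_us):.0f} µs")
```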
How it works
- A frequency-weighted, savings-ranked selection pass mines high-value multi-byte patterns (identifiers and common slices such as `(self,`, `=None,`, `:\n`) from a code corpus.
- Selected patterns are mapped one-to-one to supplementary-plane Private-Use-Area (PUA) codepoints (U+F0000 and up). The BMP PUA range is deliberately skipped to avoid colliding with literal PUA characters that appear in real source code.
- A byte-level BPE trainer runs on the PUA-pre-substituted stream, so semantic anchors are visible to the merge algorithm and can compose freely with whitespace and punctuation (e.g. `Ġ` + `⟦def⟧`).
- A second savings pass adds the top-6,000 high-frequency compound patterns as atomic `AddedToken`s.
- At encode time, an Aho-Corasick (leftmost-longest) Rust pass substitutes PUA codepoints; a purpose-built Rust BPE encoder (`cute-bpe`, modeled on tiktoken's linear-scan-min-rank merge loop) then performs the byte-level BPE pass.
- At decode time, the inverse PUA map restores the original source text, byte-for-byte identical (a simplified sketch of this round trip follows the list).
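Below is a minimal pure-Python sketch of the pre-substitution round trip, not the shipped implementation: the pattern table and codepoint assignments are invented for illustration, a greedy leftmost-longest scan stands in for the Rust Aho-Corasick pass, and the byte-level BPE stage is omitted entirely.

```python
# Illustrative sketch only. The pattern table and codepoint assignments are made up;
# the shipped tokenizer derives them from corpus statistics.
PUA_BASE = 0xF0000  # supplementary-plane PUA, so BMP-PUA literals in source are untouched

PATTERNS = ["def ", "return ", "(self,", "=None,"]           # stand-in for the mined pattern set
TO_PUA = {p: chr(PUA_BASE + i) for i, p in enumerate(PATTERNS)}
FROM_PUA = {v: k for k, v in TO_PUA.items()}

def pre_substitute(text: str) -> str:
    """Greedy leftmost-longest replacement of known patterns by PUA anchors."""
    out, i = [], 0
    ordered = sorted(PATTERNS, key=len, reverse=True)         # longest match wins
    while i < len(text):
        for p in ordered:
            if text.startswith(p, i):
                out.append(TO_PUA[p])
                i += len(p)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

def undo_substitute(text: str) -> str:
    """Inverse PUA map: restores the original source byte-for-byte."""
    return "".join(FROM_PUA.get(ch, ch) for ch in text)

src = "def hello(self, x=None, y=None): return 42"
anchored = pre_substitute(src)       # this stream is what byte-level BPE would see
assert undo_substitute(anchored) == src
```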
Use it
Via the standalone package
```bash
pip install cute-tokenizer
```

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"
```
For tight inference loops where the `BatchEncoding` machinery is overhead, use `fast_encode` / `fast_decode`; these go straight to the Rust `cute-bpe` encoder/decoder:
```python
ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
```
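A quick way to see the compression difference on your own files (requires `tiktoken`; the file path is a placeholder and per-file numbers will vary):

```python
from pathlib import Path

import tiktoken
from cute_tokenizer import load_default_tokenizer

src = Path("my_module.py").read_text(encoding="utf-8")   # any Python source file

cute = load_default_tokenizer()
cl100k = tiktoken.get_encoding("cl100k_base")

n_cute = len(cute.fast_encode(src))
n_cl100k = len(cl100k.encode(src))
print(f"CUTE: {n_cute} tokens | cl100k_base: {n_cl100k} tokens "
      f"({(n_cl100k / n_cute - 1) * 100:+.1f}% vs CUTE)")
```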
Via Hugging Face AutoTokenizer
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```

`trust_remote_code=True` is required because the wrapper class (`CUTETokenizerFast`) runs PUA pre-substitution before delegating to the byte-level BPE encoder.
Properties
- Byte-equal roundtrip on 1,500 / 1,500 Python holdout files.
- Deterministic `tokenizer.json` within a fixed `(OS, python, tokenizers, _accel, corpus_hash, seed)` host fingerprint. Cross-platform byte-identity of trained artifacts is not part of the contract.
- Atomicity invariants asserted on every save: the model is `BPE`, the decoder is `ByteLevel`, the pre-tokenizer is `ByteLevel`, and every mapped PUA codepoint has a vocab id (a validation sketch follows this list).
- No BMP-PUA collisions: mappings live in the supplementary planes only, so literal BMP-PUA characters in real source code (TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
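A sketch of what those save-time assertions amount to, written as an external check on a saved `tokenizer.json` (the file layout follows the Hugging Face `tokenizers` serialization; the PUA-codepoint check is omitted because it also needs the PUA mapping table):

```python
# Minimal external check of the three structural invariants on a saved
# tokenizer.json (Hugging Face `tokenizers` layout).
import json

with open("tokenizer.json", encoding="utf-8") as f:
    spec = json.load(f)

assert spec["model"]["type"] == "BPE"
assert spec["decoder"]["type"] == "ByteLevel"
assert spec["pre_tokenizer"]["type"] == "ByteLevel"
print("tokenizer.json structural invariants hold")
```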
Citation
```bibtex
@software{cute_tokenizer_2026,
  author  = {Eid, Hussein},
  title   = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
  year    = {2026},
  url     = {https://github.com/HusseinEid101/CUTE},
  version = {1.0.2}
}
```
License
MIT. Source, training scripts, benchmark suite, and full reproduction instructions live at https://github.com/HusseinEid101/CUTE.