CUTE

Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE

CUTE is a code-aware tokenizer built on a single architectural idea: map high-savings multi-byte patterns to atomic Unicode codepoints before byte-level BPE sees them. On 1,500 held-out Python files from The Stack, CUTE produces fewer tokens per file than eight widely-used baselines — including OpenAI's cl100k_base and o200k_base, LLaMA-3's SentencePiece BPE, and two SentencePiece Unigram variants — while re-encoding every file to byte-identical source, a roundtrip matched in this comparison only by the byte-level OpenAI and GPT-2 tokenizers, none of which compresses as well.

Compression (1,500 held-out Python files, The Stack)

Tokenizer                            mean tokens/file  bytes/token  vs CUTE   roundtrip
CUTE                                 1,767             4.42         baseline  1500 / 1500
OpenAI cl100k_base                   1,874             4.17         +6.0%     1500 / 1500
OpenAI o200k_base                    1,886             4.14         +6.7%     1500 / 1500
LLaMA-3 (SentencePiece BPE)          1,872             4.17         +5.9%      686 / 1500
StarCoder2                           2,210             3.53         +25.1%     685 / 1500
XLM-RoBERTa (SentencePiece Unigram)  2,438             3.20         +38.0%       0 / 1500
CodeLlama                            2,573             3.03         +45.6%    1493 / 1500
T5 (SentencePiece Unigram)           2,706             2.89         +53.2%       0 / 1500
GPT-2                                3,581             2.18         +102.7%   1500 / 1500

"vs CUTE" is the extra token cost the baseline pays per file relative to CUTE; LLM API spend is linear in this number.
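
For example, recomputing two rows of the table in plain Python:

cute, starcoder2, gpt2 = 1767, 2210, 3581    # mean tokens per file, from the table above
print(f"{starcoder2 / cute - 1:+.1%}")       # +25.1%
print(f"{gpt2 / cute - 1:+.1%}")             # +102.7%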

Latency (p50 across the full 1,500-file Stack-Python holdout)

Tokenizer                            encode p50  decode p50
OpenAI cl100k_base                     1,338 µs      120 µs
OpenAI o200k_base                      1,760 µs      126 µs
CUTE                                   1,822 µs      263 µs
T5 (SentencePiece Unigram)             3,121 µs      479 µs
CodeLlama                              3,162 µs    1,885 µs
XLM-RoBERTa (SentencePiece Unigram)    3,272 µs      440 µs
LLaMA-3 (SentencePiece BPE)            3,753 µs      792 µs
StarCoder2                             4,316 µs      775 µs
GPT-2                                  4,467 µs      911 µs

CUTE has the third-fastest encode and the third-fastest decode in the field, behind only OpenAI's cl100k_base and o200k_base. v1.0.2's cute-bpe Rust hot path runs ~6× faster than v1.0.1 on a short Python sample (1,526 µs → 254 µs end-to-end; ~5× faster on the cargo bench of the core encoder). On the full 1,500-file holdout median, CUTE beats every open-source baseline (LLaMA-3, StarCoder2, CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency, while pairing its byte-perfect 1500 / 1500 roundtrip with the best compression in the comparison.

How it works

  1. A frequency-weighted, savings-ranked selection pass mines high-value multi-byte patterns (identifiers, and common slices like "(self", "=None", ":\n") from a code corpus; a scoring sketch follows this list.
  2. Selected patterns are mapped one-to-one to supplementary-plane Private-Use-Area (PUA) codepoints (U+F0000+). The BMP-PUA range is deliberately skipped to avoid colliding with literal PUA characters that appear in real source code.
  3. A byte-level BPE trainer runs on the PUA-pre-substituted stream, so semantic anchors are visible to the merge algorithm and can compose freely with whitespace and punctuation (e.g. Ġ + ⟦def⟧); a minimal training sketch follows this list.
  4. A second savings pass adds the top-6,000 high-frequency compound patterns as atomic AddedTokens.
  5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass substitutes PUA codepoints; a purpose-built Rust BPE encoder (cute-bpe, modeled on tiktoken's linear-scan-min-rank merge loop) then performs the byte-level BPE pass. A plain-Python stand-in for steps 5 and 6 follows this list.
  6. At decode time, the inverse PUA map restores the original source text — byte-for-byte identical.
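
The selection heuristic itself is not spelled out above; a minimal sketch of the idea in step 1, ranking candidate patterns by occurrences times bytes saved, might look like the following (the candidate list, scoring function, and top_k are illustrative, not the project's actual values):

from collections import Counter

def rank_patterns(corpus: str, candidates: list[str], top_k: int = 5) -> list[tuple[str, int]]:
    """Rank candidate multi-byte patterns by a simple savings heuristic:
    occurrences in the corpus times the bytes each occurrence would collapse."""
    scores = Counter()
    for pat in candidates:
        count = corpus.count(pat)                       # frequency weight
        saved = count * (len(pat.encode("utf-8")) - 1)  # bytes saved if pat becomes one codepoint
        if saved > 0:
            scores[pat] = saved
    return scores.most_common(top_k)

corpus = "def f(self, x=None):\n    return x\n" * 100
print(rank_patterns(corpus, ["(self", "=None", "):\n", "return "]))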
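
Steps 3 and 4 map onto standard Hugging Face tokenizers components (the Properties section below pins the saved model to BPE with a ByteLevel pre-tokenizer and decoder). A minimal training sketch under those assumptions, with an invented two-anchor map, an illustrative vocab_size, and a made-up compound AddedToken:

from tokenizers import AddedToken, Tokenizer, decoders, models, pre_tokenizers, trainers

# In CUTE the training stream is already PUA-pre-substituted; here U+F0000 stands in
# for "(self" and U+F0001 for "=None", so each anchor reaches the trainer as one fixed
# codepoint (a fixed byte run after ByteLevel pre-tokenization).
anchored_lines = ["def hello\U000F0000, x\U000F0001):\n    return x\n"] * 200

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=1024,                                       # illustrative size
    show_progress=False,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # keep all 256 byte symbols
)
tokenizer.train_from_iterator(anchored_lines, trainer=trainer)

# Step 4 in miniature: register a high-frequency compound pattern as an atomic AddedToken.
tokenizer.add_tokens([AddedToken("import numpy as np", normalized=False)])

print(tokenizer.encode("def hello\U000F0000, x\U000F0001):").tokens)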
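
Steps 5 and 6 in miniature: a plain-Python stand-in for the Aho-Corasick substitution pass and its inverse (the pattern map is invented; in the real pipeline the substituted string then goes through the cute-bpe byte-level BPE encoder, omitted here):

# Illustrative pattern-to-codepoint map; CUTE learns its map from a corpus and keeps
# it in the supplementary-plane PUA (U+F0000 and above).
PUA_MAP = {"(self": "\U000F0000", "=None": "\U000F0001", "):\n": "\U000F0002"}
INV_MAP = {cp: pat for pat, cp in PUA_MAP.items()}

def pua_substitute(text: str) -> str:
    # Longest-pattern-first greedy replacement as a simple stand-in for the
    # leftmost-longest Aho-Corasick automaton used by the Rust encoder.
    for pat in sorted(PUA_MAP, key=len, reverse=True):
        text = text.replace(pat, PUA_MAP[pat])
    return text

def pua_restore(text: str) -> str:
    for cp, pat in INV_MAP.items():
        text = text.replace(cp, pat)
    return text

src = "def hello(self, x=None):\n    return x\n"
anchored = pua_substitute(src)          # byte-level BPE would run on this string
assert pua_restore(anchored) == src     # the inverse map restores the source byte-for-byte
print(len(src), "codepoints ->", len(anchored))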

Use it

Via the standalone package

pip install cute-tokenizer

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"

For tight inference loops where BatchEncoding machinery is overhead, use fast_encode / fast_decode — these go straight to the Rust cute-bpe encoder/decoder:

ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
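
A rough micro-benchmark of the two paths, reusing tok from the snippet above (hypothetical harness; absolute numbers depend on your machine):

import timeit

snippet = "def hello(): return 42"
hf_path = timeit.timeit(lambda: tok(snippet, add_special_tokens=False).input_ids, number=1000)
fast_path = timeit.timeit(lambda: tok.fast_encode(snippet), number=1000)
print(f"BatchEncoding path: {hf_path * 1e3:.0f} µs/call   fast_encode: {fast_path * 1e3:.0f} µs/call")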

Via Hugging Face AutoTokenizer

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)

trust_remote_code=True is required because the wrapper class (CUTETokenizerFast) runs PUA pre-substitution before delegating to the byte-level BPE encoder.

Properties

  • Byte-equal roundtrip on 1,500 / 1,500 Python holdout files.
  • Deterministic tokenizer.json within a fixed (OS, python, tokenizers, _accel, corpus_hash, seed) host tuple. Cross-platform byte-identity of trained artifacts is not part of the contract.
  • Atomicity invariants asserted on every save: the model is BPE, the decoder is ByteLevel, the pre-tokenizer is ByteLevel, and every mapped PUA codepoint has a vocab id.
  • No BMP-PUA collisions — mappings live in the supplementary planes only, so literal BMP-PUA characters in real source code (TypeScript Unicode tables, CJK fonts) roundtrip unchanged; a quick check follows this list.
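
A quick way to exercise that last property, reusing the load_default_tokenizer / fast_encode API shown above (the test string is illustrative):

from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
# U+E000 is a literal BMP Private-Use-Area character, the kind that appears in icon-font
# or CJK tooling code; CUTE's anchors live in the supplementary planes, so it must survive.
src = 'ICON = "\ue000"  # private-use glyph\n'
assert tok.fast_decode(tok.fast_encode(src)) == src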

Citation

@software{cute_tokenizer_2026,
  author  = {Eid, Hussein},
  title   = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
  year    = {2026},
  url     = {https://github.com/HusseinEid101/CUTE},
  version = {1.0.2}
}

License

MIT. Source, training scripts, benchmark suite, and full reproduction instructions live at https://github.com/HusseinEid101/CUTE.
