---
license: mit
library_name: tokenizers
tags:
- code
- tokenizer
- byte-level-bpe
- private-use-area
- lossless-roundtrip
- the-stack
language:
- code
---

# CUTE

**Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE**

CUTE is a code-aware tokenizer built on a single architectural idea: replace high-savings multi-byte patterns with atomic Unicode codepoints *before* byte-level BPE sees them. On 1,500 held-out Python files from The Stack, CUTE produces fewer tokens per file than eight widely used baselines, including OpenAI's `cl100k_base` and `o200k_base`, LLaMA-3's SentencePiece BPE, and two SentencePiece Unigram variants. It is also one of only four tokenizers in this comparison (alongside `cl100k_base`, `o200k_base`, and GPT-2) that round-trip every file to byte-identical source.

## Compression (1,500 held-out Python files, The Stack)

| Tokenizer                            | mean tok | bytes/tok | vs CUTE | roundtrip   |
|--------------------------------------|---------:|----------:|--------:|-------------|
| **CUTE**                             |    1,767 |      4.42 |       — | 1500 / 1500 |
| OpenAI cl100k_base                   |    1,874 |      4.17 |   +6.0% | 1500 / 1500 |
| OpenAI o200k_base                    |    1,886 |      4.14 |   +6.7% | 1500 / 1500 |
| LLaMA-3 (SentencePiece BPE)          |    1,872 |      4.17 |   +5.9% | 686 / 1500  |
| StarCoder2                           |    2,210 |      3.53 |  +25.1% | 685 / 1500  |
| XLM-RoBERTa (SentencePiece Unigram)  |    2,438 |      3.20 |  +38.0% | 0 / 1500    |
| CodeLlama                            |    2,573 |      3.03 |  +45.6% | 1493 / 1500 |
| T5 (SentencePiece Unigram)           |    2,706 |      2.89 |  +53.2% | 0 / 1500    |
| GPT-2                                |    3,581 |      2.18 | +102.7% | 1500 / 1500 |

`vs CUTE` is the percentage of extra tokens the baseline spends per file relative to CUTE. For LLM APIs billed per token, cost scales linearly with token count, so this is also the extra spend.
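
The `vs CUTE` column can be reproduced approximately from the mean token counts in the compression table. The sketch below assumes the column is the relative increase in mean tokens over CUTE; the helper name `extra_cost_pct` is hypothetical and not part of the package.

```python
# Mean tokens per file, copied from the compression table.
CUTE_MEAN_TOKENS = 1767
BASELINE_MEAN_TOKENS = {
    "OpenAI o200k_base": 1886,
    "StarCoder2": 2210,
    "GPT-2": 3581,
}


def extra_cost_pct(baseline_mean: float, cute_mean: float = CUTE_MEAN_TOKENS) -> float:
    """Relative token overhead of a baseline versus CUTE, in percent.

    Assumption: `vs CUTE` = (baseline_mean / cute_mean - 1) * 100.
    """
    return (baseline_mean / cute_mean - 1.0) * 100.0


if __name__ == "__main__":
    for name, mean_tokens in BASELINE_MEAN_TOKENS.items():
        print(f"{name}: +{extra_cost_pct(mean_tokens):.1f}%")
```

Because the table's means are rounded to whole tokens, recomputed values for some rows (for example `cl100k_base`) can land 0.1 point away from the published column, presumably because the published figures were derived from unrounded means.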

## Latency (p50 across the full 1,500-file Stack-Python holdout)

| Tokenizer                            |   encode p50 |   decode p50 |
|--------------------------------------|-------------:|-------------:|
| OpenAI cl100k_base                   |     1,338 µs |       120 µs |
| OpenAI o200k_base                    |     1,760 µs |       126 µs |
| **CUTE**                             | **1,822 µs** |   **263 µs** |
| T5 (SentencePiece Unigram)           |     3,121 µs |       479 µs |
| CodeLlama                            |     3,162 µs |     1,885 µs |
| XLM-RoBERTa (SentencePiece Unigram)  |     3,272 µs |       440 µs |
| LLaMA-3 (SentencePiece BPE)          |     3,753 µs |       792 µs |
| StarCoder2                           |     4,316 µs |       775 µs |
| GPT-2                                |     4,467 µs |       911 µs |

CUTE has the **third-fastest encode** and **third-fastest decode** in the field, behind only OpenAI's `cl100k_base` and `o200k_base`. In v1.0.2, the `cute-bpe` Rust hot path runs ~6× faster than in v1.0.1 on a short Python sample (1,526 µs → 254 µs end-to-end; ~5× faster on the cargo bench of the core encoder). On the full 1,500-file holdout median, CUTE beats every remaining baseline (LLaMA-3, StarCoder2, CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency while keeping a byte-perfect 1500 / 1500 roundtrip, which only `cl100k_base`, `o200k_base`, and GPT-2 also achieve in this comparison.

## How it works

1. A frequency-weighted, savings-ranked selection pass mines high-value multi-byte patterns (identifiers, common slices like `(self`, `=None`, `:\n`) from a code corpus.
2. Selected patterns are mapped one-to-one to **supplementary-plane Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range is deliberately skipped to avoid colliding with literal PUA characters that appear in real source code.
3. A byte-level BPE trainer runs on the **PUA-pre-substituted stream**, so semantic anchors are visible to the merge algorithm and can compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
4. A second savings pass adds the top-6,000 high-frequency compound patterns as atomic `AddedToken`s.
5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass replaces the mapped patterns with their PUA codepoints; a purpose-built Rust BPE encoder (`cute-bpe`, modeled on tiktoken's linear-scan-min-rank merge loop) then performs the byte-level BPE pass.
6. At decode time, the inverse PUA map restores the original source text, byte-for-byte identical.

## Use it

### Via the standalone package

```bash
pip install cute-tokenizer
```

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"
```

For tight inference loops where the `BatchEncoding` machinery is pure overhead, use `fast_encode` / `fast_decode`, which go straight to the Rust `cute-bpe` encoder/decoder:

```python
ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
```

### Via Hugging Face AutoTokenizer

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```

`trust_remote_code=True` is required because the wrapper class (`CUTETokenizerFast`) runs PUA pre-substitution before delegating to the byte-level BPE encoder.

## Properties

- **Byte-equal roundtrip** on 1,500 / 1,500 Python holdout files.
- **Deterministic `tokenizer.json`** within a fixed `(OS, python, tokenizers, _accel, corpus_hash, seed)` host tuple. Cross-platform byte-identity of trained artifacts is not part of the contract.
- **Atomicity invariants** asserted on every save: model is `BPE`, decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every mapping PUA codepoint has a vocab id.
- **No BMP-PUA collisions**: mappings live in the supplementary planes only, so literal BMP-PUA characters in real source code (TypeScript Unicode tables, CJK fonts) roundtrip unchanged.

## Citation

```bibtex
@software{cute_tokenizer_2026,
  author  = {Eid, Hussein},
  title   = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
  year    = {2026},
  url     = {https://github.com/HusseinEid101/CUTE},
  version = {1.0.2}
}
```

## License

MIT. Source, training scripts, benchmark suite, and full reproduction instructions live at .
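
## Appendix: PUA substitution sketch

The substitution and restoration steps from "How it works" (steps 2, 5, and 6) can be illustrated with a toy roundtrip. This is a minimal sketch, not the library's implementation: the pattern list and the names `substitute` and `restore` are hypothetical, a length-sorted regex alternation stands in for the Aho-Corasick leftmost-longest pass, and the byte-level BPE stage is omitted entirely.

```python
import re

# Hypothetical toy patterns; the real mapping is mined from a code corpus.
PATTERNS = ["def", "(self", "=None", ":\n"]

# Start of Supplementary Private Use Area-A (plane 15), outside the BMP-PUA.
PUA_BASE = 0xF0000

# One-to-one map from each pattern to a single PUA codepoint, plus its inverse.
FORWARD = {pattern: chr(PUA_BASE + i) for i, pattern in enumerate(PATTERNS)}
INVERSE = {codepoint: pattern for pattern, codepoint in FORWARD.items()}

# Longest alternatives first, so the leftmost match is also the longest one.
_PATTERN_RE = re.compile(
    "|".join(re.escape(p) for p in sorted(PATTERNS, key=len, reverse=True))
)
# Matches any single codepoint that this toy mapping has assigned.
_PUA_RE = re.compile("[" + "".join(INVERSE) + "]")


def substitute(text: str) -> str:
    """Replace each mapped pattern with its PUA codepoint."""
    # Toy simplification: reject input that already contains a mapped
    # codepoint, since restoring it would not give back the original text.
    if _PUA_RE.search(text):
        raise ValueError("input already contains a mapped PUA codepoint")
    return _PATTERN_RE.sub(lambda m: FORWARD[m.group(0)], text)


def restore(text: str) -> str:
    """Apply the inverse map to recover the original source text."""
    return _PUA_RE.sub(lambda m: INVERSE[m.group(0)], text)


if __name__ == "__main__":
    source = "def f(self, x=None):\n    return x\n"
    encoded = substitute(source)
    assert restore(encoded) == source  # lossless roundtrip
    print(len(source), "->", len(encoded), "codepoints")
```

Each mapped pattern collapses to one codepoint, which is what lets a downstream BPE trainer treat it as a single unit when learning merges.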