| ---
|
| license: mit
|
| library_name: tokenizers
|
| tags:
|
| - code
|
| - tokenizer
|
| - byte-level-bpe
|
| - private-use-area
|
| - lossless-roundtrip
|
| - the-stack
|
| language:
|
| - code
|
| ---
|
|
|
| # CUTE
|
|
|
| **Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE**
|
|
|
| CUTE is a code-aware tokenizer built on a single architectural idea:
|
| replace high-savings multi-byte patterns with atomic Unicode codepoints
|
| *before* byte-level BPE sees them. On 1,500 held-out Python files from
|
| The Stack, CUTE produces fewer tokens per file than every baseline in
|
| the comparison below, including OpenAI's `cl100k_base` and `o200k_base`,
|
| LLaMA-3's SentencePiece BPE, and the SentencePiece Unigram variants. It
|
| also decodes every file back to byte-identical source, which only
|
| `cl100k_base`, `o200k_base`, and GPT-2 match among the baselines.
|
|
|
| ## Compression (1,500 held-out Python files, The Stack)
|
|
|
| | Tokenizer | mean tok | bytes/tok | vs CUTE | roundtrip |
|
| |--------------------------------------|---------:|----------:|--------:|-------------|
|
| | **CUTE** | 1,767 | 4.42 | — | 1500 / 1500 |
|
| | OpenAI cl100k_base | 1,874 | 4.17 | +6.0% | 1500 / 1500 |
|
| | OpenAI o200k_base | 1,886 | 4.14 | +6.7% | 1500 / 1500 |
|
| | LLaMA-3 (SentencePiece BPE) | 1,872 | 4.17 | +5.9% | 686 / 1500 |
|
| | StarCoder2 | 2,210 | 3.53 | +25.1% | 685 / 1500 |
|
| | XLM-RoBERTa (SentencePiece Unigram) | 2,438 | 3.20 | +38.0% | 0 / 1500 |
|
| | CodeLlama | 2,573 | 3.03 | +45.6% | 1493 / 1500 |
|
| | T5 (SentencePiece Unigram) | 2,706 | 2.89 | +53.2% | 0 / 1500 |
|
| | GPT-2 | 3,581 | 2.18 | +102.7% | 1500 / 1500 |
|
|
|
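| `vs CUTE` is the percentage of extra tokens the baseline emits per file,
|
| relative to CUTE's mean. For token-priced LLM APIs, spend scales
|
| linearly with token count, so this is also the extra cost per file.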
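
As a quick check of how the `vs CUTE` column relates to the `mean tok` column, the sketch below recomputes the overhead from a few of the rounded means in the table. The name `overhead_pct` is illustrative and not part of the package. Because the published means are rounded, a recomputed value can differ from the table in the last digit.

```python
# Mean tokens per file, copied from the compression table above.
MEAN_TOKENS = {
    "CUTE": 1767,
    "OpenAI o200k_base": 1886,
    "StarCoder2": 2210,
    "GPT-2": 3581,
}


def overhead_pct(baseline_mean: float, cute_mean: float = MEAN_TOKENS["CUTE"]) -> float:
    """Extra tokens a baseline emits relative to CUTE, as a percentage."""
    return (baseline_mean / cute_mean - 1.0) * 100.0


for name, mean in MEAN_TOKENS.items():
    if name != "CUTE":
        print(f"{name}: +{overhead_pct(mean):.1f}%")
```

If a provider bills per token, the same ratio applies to cost: a baseline at +25% tokens costs about 25% more to send the same files.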
|
|
|
| ## Latency (p50 across the full 1,500-file Stack-Python holdout)
|
|
|
| | Tokenizer | encode p50 | decode p50 |
|
| |--------------------------------------|-----------:|-----------:|
|
| | OpenAI cl100k_base | 1,338 µs | 120 µs |
|
| | OpenAI o200k_base | 1,760 µs | 126 µs |
|
| | **CUTE** | **1,822 µs** | **263 µs** |
|
| | T5 (SentencePiece Unigram) | 3,121 µs | 479 µs |
|
| | CodeLlama | 3,162 µs | 1,885 µs |
|
| | XLM-RoBERTa (SentencePiece Unigram) | 3,272 µs | 440 µs |
|
| | LLaMA-3 (SentencePiece BPE) | 3,753 µs | 792 µs |
|
| | StarCoder2 | 4,316 µs | 775 µs |
|
| | GPT-2 | 4,467 µs | 911 µs |
|
|
|
| CUTE has the **third-fastest encode** and **third-fastest decode** in the
|
| field, behind only OpenAI's `cl100k_base` and `o200k_base`. v1.0.2's
|
| cute-bpe Rust hot path runs ~6× faster than v1.0.1 on a short Python
|
| sample (1,526 µs → 254 µs end-to-end; ~5× faster on the cargo bench
|
| of the core encoder). On the full 1,500-file holdout median, CUTE
|
| beats every other baseline (LLaMA-3, StarCoder2,
|
| CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency,
|
| while preserving a byte-perfect 1500 / 1500 roundtrip, which the
|
| compression table shows only for CUTE, `cl100k_base`, `o200k_base`, and GPT-2.
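
For readers who want to take a comparable measurement, here is a minimal sketch of a per-call p50 timer. It assumes one timed call per input and a short warm-up. The helper name `p50_latency_us` and the warm-up count are hypothetical, and the benchmark suite in the repository may measure differently.

```python
import statistics
import time
from typing import Callable, Iterable


def p50_latency_us(fn: Callable[[str], object], inputs: Iterable[str], warmup: int = 3) -> float:
    """Median wall-clock time of fn(item) in microseconds, one sample per input."""
    items = list(inputs)
    # Warm caches and lazy initialisation before timing anything.
    for item in items[:warmup]:
        fn(item)
    samples = []
    for item in items:
        start = time.perf_counter_ns()
        fn(item)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


# Stand-in workload; swap in a tokenizer's encode function to measure it.
p50 = p50_latency_us(lambda s: s.encode("utf-8"), ["def hello(): return 42"] * 200)
print(f"encode p50: {p50:.1f} µs")
```

The median is less sensitive than the mean to a few slow outliers, such as a garbage-collection pause or a cold cache.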
|
|
|
| ## How it works
|
|
|
| 1. A frequency-weighted, savings-ranked selection pass mines
|
| high-value multi-byte patterns (identifiers, common slices like
|
| `(self`, `=None`, `:\n`) from a code corpus.
|
| 2. Selected patterns are mapped one-to-one to **supplementary-plane
|
| Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range
|
| is deliberately skipped to avoid colliding with literal PUA
|
| characters that appear in real source code.
|
| 3. A byte-level BPE trainer runs on the **PUA-pre-substituted stream**,
|
| so semantic anchors are visible to the merge algorithm and can
|
| compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
|
| 4. A second savings pass adds the top-6,000 high-frequency compound
|
| patterns as atomic `AddedToken`s.
|
| 5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass
|
| replaces each mapped pattern with its PUA codepoint; a purpose-built Rust BPE encoder
|
| (`cute-bpe`, modeled on tiktoken's linear-scan-min-rank merge loop)
|
| then performs the byte-level BPE pass.
|
| 6. At decode time, the inverse PUA map restores the original source
|
| text — byte-for-byte identical.
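
The substitution and its inverse (steps 2, 5, and 6) can be sketched in pure Python. This is an illustration under stated assumptions, not the shipped implementation: the pattern list is made up, a regex alternation sorted longest-first stands in for the Aho-Corasick automaton, and the sketch assumes the input contains no literal codepoints from the mapped range.

```python
import re

# Hypothetical pattern table; the real one is mined from a code corpus.
PATTERNS = ["(self", "=None", ":\n", "def", "return"]

# U+F0000 is the first codepoint of Supplementary Private Use Area-A.
PUA_BASE = 0xF0000

forward = {p: chr(PUA_BASE + i) for i, p in enumerate(PATTERNS)}
inverse = {cp: p for p, cp in forward.items()}

# Longest alternatives first, so the leftmost match is also the longest one.
_matcher = re.compile(
    "|".join(re.escape(p) for p in sorted(PATTERNS, key=len, reverse=True))
)
_pua = re.compile("[" + "".join(inverse) + "]")


def substitute(text: str) -> str:
    """Replace each mapped pattern with its single PUA codepoint."""
    return _matcher.sub(lambda m: forward[m.group(0)], text)


def restore(text: str) -> str:
    """Inverse map: expand each PUA codepoint back to its pattern."""
    return _pua.sub(lambda m: inverse[m.group(0)], text)


src = "def f(self, x=None):\n    return x"
packed = substitute(src)
assert restore(packed) == src
print(len(src), "->", len(packed), "codepoints")
```

The BPE trainer and encoder then see `packed` rather than `src`, so each anchor enters the merge process as one unit.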
|
|
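
Step 5 refers to a linear-scan min-rank merge loop. The sketch below shows that general technique on a toy vocabulary. It is not the `cute-bpe` source: the `ranks` table is invented, and the sketch assumes that a token's rank doubles as its id and that every single byte in the input has an entry.

```python
def bpe_encode(piece: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest rank."""
    parts = [bytes([b]) for b in piece]
    while len(parts) > 1:
        best_rank, best_i = None, -1
        # Linear scan over adjacent pairs for the minimum-rank merge.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_rank is None:
            break  # no adjacent pair is in the vocabulary
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]


# Toy vocabulary (hypothetical): lower rank means the merge was learned earlier.
ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"abc": 4}
print(bpe_encode(b"abcab", ranks))  # -> [4, 3]
```

Each pass rescans the whole piece, which is quadratic in the worst case but fast in practice when the pre-tokenizer keeps pieces short.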
|
| ## Use it
|
|
|
| ### Via the standalone package
|
|
|
| ```bash
|
| pip install cute-tokenizer
|
| ```
|
|
|
| ```python
|
| from cute_tokenizer import load_default_tokenizer
|
|
|
| tok = load_default_tokenizer()
|
| ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
|
| text = tok.decode(ids, skip_special_tokens=True)
|
| assert text == "def hello(): return 42"
|
| ```
|
|
|
| For tight inference loops where `BatchEncoding` machinery is overhead,
|
| use `fast_encode` / `fast_decode` — these go straight to the Rust
|
| `cute-bpe` encoder/decoder:
|
|
|
| ```python
|
| ids = tok.fast_encode("def hello(): return 42")
|
| text = tok.fast_decode(ids)
|
| ```
|
|
|
| ### Via Hugging Face AutoTokenizer
|
|
|
| ```python
|
| from transformers import AutoTokenizer
|
|
|
| tok = AutoTokenizer.from_pretrained(
|
| "HusseinEid/cute-tokenizer",
|
| trust_remote_code=True,
|
| )
|
| ids = tok("class Foo: pass", add_special_tokens=False).input_ids
|
| text = tok.decode(ids, skip_special_tokens=True)
|
| ```
|
|
|
| `trust_remote_code=True` is required because the wrapper class
|
| (`CUTETokenizerFast`) runs PUA pre-substitution before delegating to
|
| the byte-level BPE encoder.
|
|
|
| ## Properties
|
|
|
| - **Byte-equal roundtrip** on 1,500 / 1,500 Python holdout files.
|
| - **Deterministic `tokenizer.json`** within a fixed
|
| `(OS, python, tokenizers, _accel, corpus_hash, seed)` tuple.
|
| Cross-platform byte-identity of trained artifacts is not part of
|
| the contract.
|
| - **Atomicity invariants** asserted on every save: model is `BPE`,
|
| decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every mapping
|
| PUA codepoint has a vocab id.
|
| - **No BMP-PUA collisions** — mappings live in the supplementary
|
| planes only, so literal BMP-PUA characters in real source code
|
| (TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
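
A roundtrip check like the one behind the first property can be written as a small, tokenizer-agnostic harness. This is a sketch with a hypothetical helper name, `roundtrip_report`; how the repository's benchmark suite performs the check is not shown here. The two toy tokenizers exist only to show one passing and one failing case.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def roundtrip_report(
    sources: Iterable[str],
    encode: Callable[[str], Sequence],
    decode: Callable[[Sequence], str],
) -> Tuple[int, int, List[int]]:
    """Return (passed, total, failing_indices) for byte-equal roundtrips."""
    failures = []
    total = 0
    for i, src in enumerate(sources):
        total += 1
        # Compare UTF-8 bytes so any normalisation or whitespace loss shows up.
        if decode(encode(src)).encode("utf-8") != src.encode("utf-8"):
            failures.append(i)
    return total - len(failures), total, failures


files = ["x = 1\n", "def f():\n\treturn 1\n", "a b"]

# Lossless toy tokenizer: raw UTF-8 bytes as ids.
lossless = roundtrip_report(
    files, lambda s: list(s.encode("utf-8")), lambda ids: bytes(ids).decode("utf-8")
)
# Lossy toy tokenizer: whitespace splitting discards newlines and tabs.
lossy = roundtrip_report(files, lambda s: s.split(), lambda toks: " ".join(toks))

print(lossless)  # (3, 3, [])
print(lossy)     # (1, 3, [0, 1])
```

To check CUTE, pass `tok.fast_encode` and `tok.fast_decode` from the usage section as the two callables.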
|
|
|
| ## Citation
|
|
|
| ```bibtex
|
| @software{cute_tokenizer_2026,
|
| author = {Eid, Hussein},
|
| title = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
|
| year = {2026},
|
| url = {https://github.com/HusseinEid101/CUTE},
|
| version = {1.0.2}
|
| }
|
| ```
|
|
|
| ## License
|
|
|
| MIT. Source, training scripts, benchmark suite, and full reproduction
|
| instructions live at <https://github.com/HusseinEid101/CUTE>.
|
|
|