---
license: mit
library_name: tokenizers
tags:
- code
- tokenizer
- byte-level-bpe
- private-use-area
- lossless-roundtrip
- the-stack
language:
- code
---
# CUTE
**Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE**
CUTE is a code-aware tokenizer built on a single architectural idea:
replace high-savings multi-byte patterns with atomic Unicode codepoints
*before* byte-level BPE sees them. On 1,500 held-out Python files from
The Stack, CUTE produces fewer tokens per file than every baseline in
the comparison below, including OpenAI's `cl100k_base` and `o200k_base`,
LLaMA-3's SentencePiece BPE, and the SentencePiece Unigram models, and
it decodes every file back to byte-identical source.
## Compression (1,500 held-out Python files, The Stack)
| Tokenizer | mean tok | bytes/tok | vs CUTE | roundtrip |
|--------------------------------------|---------:|----------:|--------:|-------------|
| **CUTE** | 1,767 | 4.42 | — | 1500 / 1500 |
| OpenAI cl100k_base | 1,874 | 4.17 | +6.0% | 1500 / 1500 |
| OpenAI o200k_base | 1,886 | 4.14 | +6.7% | 1500 / 1500 |
| LLaMA-3 (SentencePiece BPE) | 1,872 | 4.17 | +5.9% | 686 / 1500 |
| StarCoder2 | 2,210 | 3.53 | +25.1% | 685 / 1500 |
| XLM-RoBERTa (SentencePiece Unigram) | 2,438 | 3.20 | +38.0% | 0 / 1500 |
| CodeLlama | 2,573 | 3.03 | +45.6% | 1493 / 1500 |
| T5 (SentencePiece Unigram) | 2,706 | 2.89 | +53.2% | 0 / 1500 |
| GPT-2 | 3,581 | 2.18 | +102.7% | 1500 / 1500 |
`vs CUTE` is the percentage of extra tokens the baseline emits per file
relative to CUTE, based on mean token counts. Because LLM APIs bill per
token, spend scales linearly with this number.
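As a rough illustration, the `vs CUTE` column can be approximated from the mean token counts in the table. The function name `extra_cost_pct` is hypothetical and not part of the package; this is a sketch of the arithmetic only.

```python
def extra_cost_pct(baseline_mean_tokens: float, cute_mean_tokens: float) -> float:
    """Extra tokens a baseline emits, as a percentage of CUTE's token count."""
    return 100.0 * (baseline_mean_tokens - cute_mean_tokens) / cute_mean_tokens


# Mean tokens per file, copied from the compression table.
CUTE_MEAN = 1767
BASELINE_MEANS = {"o200k_base": 1886, "StarCoder2": 2210, "GPT-2": 3581}

for name, mean in BASELINE_MEANS.items():
    print(f"{name}: +{extra_cost_pct(mean, CUTE_MEAN):.1f}%")
```

Because the table reports rounded means, a value recomputed this way can differ from the published column in the last digit for some rows.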
## Latency (p50 across the full 1,500-file Stack-Python holdout)
| Tokenizer | encode p50 | decode p50 |
|--------------------------------------|-----------:|-----------:|
| OpenAI cl100k_base | 1,338 µs | 120 µs |
| OpenAI o200k_base | 1,760 µs | 126 µs |
| **CUTE** | **1,822 µs** | **263 µs** |
| T5 (SentencePiece Unigram) | 3,121 µs | 479 µs |
| CodeLlama | 3,162 µs | 1,885 µs |
| XLM-RoBERTa (SentencePiece Unigram) | 3,272 µs | 440 µs |
| LLaMA-3 (SentencePiece BPE) | 3,753 µs | 792 µs |
| StarCoder2 | 4,316 µs | 775 µs |
| GPT-2 | 4,467 µs | 911 µs |
CUTE has the **third-fastest encode** and **third-fastest decode** in the
field, behind only OpenAI's `cl100k_base` and `o200k_base`. In v1.0.2, the
`cute-bpe` Rust hot path runs ~6× faster than in v1.0.1 on a short Python
sample (1,526 µs → 254 µs end-to-end) and ~5× faster on the cargo bench
of the core encoder. On the full 1,500-file holdout median, CUTE is
faster than the six remaining baselines (LLaMA-3, StarCoder2,
CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode, while
keeping a byte-perfect 1500 / 1500 roundtrip.
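For readers who want to reproduce a p50 figure on their own files, a minimal timing harness might look like the sketch below. It is not the project's benchmark suite: `p50_latency_us` and `toy_encode` are hypothetical names, and the stand-in encoder should be replaced with the tokenizer call under test.

```python
import statistics
import time


def p50(samples_us):
    """Median (p50) of a list of per-call latencies in microseconds."""
    return statistics.median(samples_us)


def p50_latency_us(fn, inputs, warmup=1):
    """Time one call of fn per input and return the median latency in µs."""
    for item in inputs[:warmup]:  # warm caches before measuring
        fn(item)
    samples = []
    for item in inputs:
        start = time.perf_counter_ns()
        fn(item)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return p50(samples)


def toy_encode(text):
    """Stand-in encoder (raw UTF-8 bytes); swap in a real tokenizer here."""
    return list(text.encode("utf-8"))


files = ["def f():\n    return 1\n", "x = [i for i in range(10)]\n"]
print(f"encode p50: {p50_latency_us(toy_encode, files):.1f} µs")
```

The median is used rather than the mean so that a few very large files or scheduler hiccups do not dominate the reported number.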
## How it works
1. A frequency-weighted, savings-ranked selection pass mines
high-value multi-byte patterns (identifiers, common slices like
`(self`, `=None`, `:\n`) from a code corpus.
2. Selected patterns are mapped one-to-one to **supplementary-plane
Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range
is deliberately skipped to avoid colliding with literal PUA
characters that appear in real source code.
3. A byte-level BPE trainer runs on the **PUA-pre-substituted stream**,
so semantic anchors are visible to the merge algorithm and can
compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
4. A second savings pass adds the top-6,000 high-frequency compound
patterns as atomic `AddedToken`s.
5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass
replaces matched patterns with their PUA codepoints; a purpose-built
Rust BPE encoder (`cute-bpe`, modeled on tiktoken's
linear-scan-min-rank merge loop) then performs the byte-level BPE pass.
6. At decode time, the inverse PUA map restores the original source
text — byte-for-byte identical.
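The substitution and restoration steps above (2, 5, and 6) can be sketched in a few lines of Python. This is a toy model, not the shipped implementation: the pattern list is hand-picked rather than mined, a length-sorted regex alternation stands in for the Rust Aho-Corasick pass, and the helper names (`build_maps`, `substitute`, `restore`) are hypothetical. It also assumes the input contains none of the mapped PUA codepoints.

```python
import re

PUA_BASE = 0xF0000  # first codepoint of Supplementary Private Use Area-A


def build_maps(patterns):
    """Assign each pattern one supplementary-plane PUA character."""
    forward = {p: chr(PUA_BASE + i) for i, p in enumerate(patterns)}
    inverse = {ch: p for p, ch in forward.items()}
    return forward, inverse


def substitute(text, forward):
    """Replace patterns leftmost-longest (regex stand-in for Aho-Corasick)."""
    if not forward:
        return text
    # Longest alternatives first, so the longest pattern wins at each position.
    ordered = sorted(forward, key=len, reverse=True)
    matcher = re.compile("|".join(re.escape(p) for p in ordered))
    return matcher.sub(lambda m: forward[m.group(0)], text)


def restore(text, inverse):
    """Map PUA characters back to their patterns; other characters pass through."""
    return "".join(inverse.get(ch, ch) for ch in text)


patterns = ["(self", "=None", ":\n", "def", "return"]
forward, inverse = build_maps(patterns)

source = "def f(self, x=None):\n    return x\n"
packed = substitute(source, forward)

assert restore(packed, inverse) == source  # lossless roundtrip
print(len(source), "->", len(packed), "characters before BPE")
```

In this toy run each matched pattern collapses to a single character, which is what lets the later BPE pass treat it as one unit and merge it with neighbouring whitespace or punctuation.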
## Use it
### Via the standalone package
```bash
pip install cute-tokenizer
```
```python
from cute_tokenizer import load_default_tokenizer
tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"
```
For tight inference loops where `BatchEncoding` machinery is overhead,
use `fast_encode` / `fast_decode` — these go straight to the Rust
`cute-bpe` encoder/decoder:
```python
ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
```
### Via Hugging Face AutoTokenizer
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
"HusseinEid/cute-tokenizer",
trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```
`trust_remote_code=True` is required because the wrapper class
(`CUTETokenizerFast`) runs PUA pre-substitution before delegating to
the byte-level BPE encoder.
## Properties
- **Byte-equal roundtrip** on 1,500 / 1,500 Python holdout files.
- **Deterministic `tokenizer.json`** within a fixed
`(OS, python, tokenizers, _accel, corpus_hash, seed)` host tuple.
Cross-platform byte-identity of trained artifacts is not part of
the contract.
- **Atomicity invariants** asserted on every save: model is `BPE`,
decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every mapped
PUA codepoint has a vocab id.
- **No BMP-PUA collisions** — mappings live in the supplementary
planes only, so literal BMP-PUA characters in real source code
(TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
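A check in the spirit of these properties could look like the sketch below. It is an assumption-laden illustration, not the project's save-time assertion code: `find_violations` is a hypothetical helper, the config dict mimics the `type` fields of a Hugging Face `tokenizer.json`, and the vocab-id check is omitted because it depends on how the vocabulary stores byte-level strings.

```python
# Unicode private-use ranges (inclusive bounds).
BMP_PUA = range(0xE000, 0xF8FF + 1)
SUPPLEMENTARY_PUA = (
    range(0xF0000, 0xFFFFD + 1),    # Supplementary Private Use Area-A
    range(0x100000, 0x10FFFD + 1),  # Supplementary Private Use Area-B
)


def is_supplementary_pua(ch):
    """True if ch is a single character in a supplementary-plane PUA range."""
    return len(ch) == 1 and any(ord(ch) in r for r in SUPPLEMENTARY_PUA)


def find_violations(config, mapping):
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    expected = {"model": "BPE", "decoder": "ByteLevel", "pre_tokenizer": "ByteLevel"}
    for section, wanted in expected.items():
        actual = (config.get(section) or {}).get("type")
        if actual != wanted:
            problems.append(f"{section} is {actual!r}, expected {wanted!r}")
    if len(set(mapping.values())) != len(mapping):
        problems.append("mapping is not one-to-one")
    for pattern, ch in mapping.items():
        if not is_supplementary_pua(ch):
            problems.append(f"{pattern!r} maps outside the supplementary PUA")
    return problems


# Toy inputs: a config shaped like tokenizer.json and a two-entry mapping.
config = {
    "model": {"type": "BPE"},
    "decoder": {"type": "ByteLevel"},
    "pre_tokenizer": {"type": "ByteLevel"},
}
mapping = {"(self": chr(0xF0000), "=None": chr(0xF0001)}
print(find_violations(config, mapping))
```

Returning a list of problems instead of raising on the first failure makes it easier to report every broken invariant in one pass.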
## Citation
```bibtex
@software{cute_tokenizer_2026,
author = {Eid, Hussein},
title = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
year = {2026},
url = {https://github.com/HusseinEid101/CUTE},
version = {1.0.2}
}
```
## License
MIT. Source, training scripts, benchmark suite, and full reproduction
instructions live at <https://github.com/HusseinEid101/CUTE>.