---
license: mit
library_name: tokenizers
tags:
- code
- tokenizer
- byte-level-bpe
- private-use-area
- lossless-roundtrip
- the-stack
language:
- code
---
# CUTE
**Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE**
CUTE is a code-aware tokenizer built on a single architectural idea:
substitute high-savings multi-byte patterns with atomic Unicode codepoints
*before* byte-level BPE sees them. On 1,500 held-out Python files from
The Stack, CUTE produces fewer tokens per file than every baseline
listed below, including OpenAI's `cl100k_base` and `o200k_base`,
LLaMA-3's SentencePiece BPE, and the SentencePiece Unigram variants,
while decoding every file back to byte-identical source.
## Compression (1,500 held-out Python files, The Stack)
| Tokenizer | mean tok | bytes/tok | vs CUTE | roundtrip |
|--------------------------------------|---------:|----------:|--------:|-------------|
| **CUTE** | 1,767 | 4.42 | — | 1500 / 1500 |
| OpenAI cl100k_base | 1,874 | 4.17 | +6.0% | 1500 / 1500 |
| OpenAI o200k_base | 1,886 | 4.14 | +6.7% | 1500 / 1500 |
| LLaMA-3 (SentencePiece BPE) | 1,872 | 4.17 | +5.9% | 686 / 1500 |
| StarCoder2 | 2,210 | 3.53 | +25.1% | 685 / 1500 |
| XLM-RoBERTa (SentencePiece Unigram) | 2,438 | 3.20 | +38.0% | 0 / 1500 |
| CodeLlama | 2,573 | 3.03 | +45.6% | 1493 / 1500 |
| T5 (SentencePiece Unigram) | 2,706 | 2.89 | +53.2% | 0 / 1500 |
| GPT-2 | 3,581 | 2.18 | +102.7% | 1500 / 1500 |
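`vs CUTE` is the extra tokens the baseline emits per file, relative to
CUTE's mean. For token-billed LLM APIs, spend scales linearly with
token count, so this percentage is also the extra cost per file.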
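
As a quick check, the `vs CUTE` column can be recomputed from the `mean tok` column. The helper below is an illustrative sketch; `overhead_pct` is a hypothetical name, not part of the package. Because the published means are rounded, the last digit can differ slightly from the table for some rows.

```python
def overhead_pct(baseline_mean_tokens: float, cute_mean_tokens: float) -> float:
    """Extra tokens a baseline emits per file, as a percentage of CUTE's mean."""
    return 100.0 * (baseline_mean_tokens - cute_mean_tokens) / cute_mean_tokens


# Mean token counts taken from the compression table above.
CUTE_MEAN = 1767
for name, mean_tok in [("o200k_base", 1886), ("StarCoder2", 2210), ("GPT-2", 3581)]:
    print(f"{name}: +{overhead_pct(mean_tok, CUTE_MEAN):.1f}%")
```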
## Latency (p50 across the full 1,500-file Stack-Python holdout)
| Tokenizer | encode p50 | decode p50 |
|--------------------------------------|-----------:|-----------:|
| OpenAI cl100k_base | 1,338 µs | 120 µs |
| OpenAI o200k_base | 1,760 µs | 126 µs |
| **CUTE** | **1,822 µs** | **263 µs** |
| T5 (SentencePiece Unigram) | 3,121 µs | 479 µs |
| CodeLlama | 3,162 µs | 1,885 µs |
| XLM-RoBERTa (SentencePiece Unigram) | 3,272 µs | 440 µs |
| LLaMA-3 (SentencePiece BPE) | 3,753 µs | 792 µs |
| StarCoder2 | 4,316 µs | 775 µs |
| GPT-2 | 4,467 µs | 911 µs |
CUTE has the **third-fastest encode** and **third-fastest decode** in
the field, behind only OpenAI's `cl100k_base` and `o200k_base`. The
`cute-bpe` Rust hot path in v1.0.2 runs ~6× faster than v1.0.1 on a
short Python sample (1,526 µs → 254 µs end-to-end) and ~5× faster on
the cargo bench of the core encoder. On the full 1,500-file holdout
median, CUTE beats every remaining baseline (LLaMA-3, StarCoder2,
CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency,
while keeping a byte-perfect 1500 / 1500 roundtrip.
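
For readers who want to take a comparable measurement on their own files, here is a minimal sketch of a per-input p50 timer. It is an assumption about methodology, not the benchmark suite from the repository: `p50_latency_us`, the best-of-`repeats` policy, and the lack of a warm-up phase are all illustrative choices.

```python
import statistics
import time
from typing import Callable, Iterable


def p50_latency_us(fn: Callable, inputs: Iterable, repeats: int = 5) -> float:
    """Median over inputs of the best-of-`repeats` wall time per call, in µs."""
    per_input_us = []
    for item in inputs:
        best_ns = float("inf")
        for _ in range(repeats):
            start = time.perf_counter_ns()
            fn(item)
            best_ns = min(best_ns, time.perf_counter_ns() - start)
        per_input_us.append(best_ns / 1_000)
    # statistics.median raises StatisticsError if `inputs` was empty.
    return statistics.median(per_input_us)
```

Any encoder callable can be passed as `fn`, for example `tok.fast_encode` with a list of source strings as `inputs`.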
## How it works
1. A frequency-weighted, savings-ranked selection pass mines
high-value multi-byte patterns (identifiers, common slices like
`(self`, `=None`, `:\n`) from a code corpus.
2. Selected patterns are mapped one-to-one to **supplementary-plane
Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range
is deliberately skipped to avoid colliding with literal PUA
characters that appear in real source code.
3. A byte-level BPE trainer runs on the **PUA-pre-substituted stream**,
so semantic anchors are visible to the merge algorithm and can
compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
4. A second savings pass adds the top-6,000 high-frequency compound
patterns as atomic `AddedToken`s.
5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass
replaces each mapped pattern with its PUA codepoint; a purpose-built
Rust BPE encoder (`cute-bpe`, modeled on tiktoken's
linear-scan-min-rank merge loop) then performs the byte-level BPE pass.
6. At decode time, the inverse PUA map restores the original source
text — byte-for-byte identical.
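
The substitution and its inverse (steps 2, 5, and 6) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the shipped implementation: `PATTERNS` is a hypothetical hand-picked list rather than a mined one, `substitute` and `restore` are made-up names, and a regex with longest-first alternation stands in for the Aho-Corasick Rust pass.

```python
import re

# Hypothetical patterns; the real list is mined from a code corpus.
PATTERNS = ["(self", "=None", ":\n", "def", "return"]

PUA_BASE = 0xF0000  # first codepoint of Supplementary Private Use Area-A

# One-to-one mapping between patterns and supplementary-plane PUA codepoints.
FORWARD = {p: chr(PUA_BASE + i) for i, p in enumerate(PATTERNS)}
INVERSE = {cp: p for p, cp in FORWARD.items()}

# Regex alternation takes the first alternative that matches at the leftmost
# position, so sorting by length (longest first) gives leftmost-longest
# behaviour for literal patterns.
_MATCHER = re.compile(
    "|".join(re.escape(p) for p in sorted(PATTERNS, key=len, reverse=True))
)


def substitute(text: str) -> str:
    """Replace each mapped pattern with its single PUA codepoint."""
    return _MATCHER.sub(lambda m: FORWARD[m.group(0)], text)


def restore(text: str) -> str:
    """Inverse map: expand every PUA codepoint back to its pattern."""
    # Limitation of this sketch: input that already contains a mapped
    # supplementary PUA codepoint would not round-trip.
    return "".join(INVERSE.get(ch, ch) for ch in text)


src = "def f(self, x=None):\n    return x\n"
assert restore(substitute(src)) == src
```

In the real pipeline the substituted string, not the raw source, is what the byte-level BPE stage consumes.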
## Use it
### Via the standalone package
```bash
pip install cute-tokenizer
```
```python
from cute_tokenizer import load_default_tokenizer
tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"
```
For tight inference loops where `BatchEncoding` machinery is overhead,
use `fast_encode` / `fast_decode` — these go straight to the Rust
`cute-bpe` encoder/decoder:
```python
ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
```
### Via Hugging Face AutoTokenizer
```python
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```
`trust_remote_code=True` is required because the wrapper class
(`CUTETokenizerFast`) runs PUA pre-substitution before delegating to
the byte-level BPE encoder.
## Properties
- **Byte-equal roundtrip** on 1,500 / 1,500 Python holdout files.
- **Deterministic `tokenizer.json`** within a fixed
`(OS, python, tokenizers, _accel, corpus_hash, seed)` host tuple.
Cross-platform byte-identity of trained artifacts is not part of
the contract.
- **Atomicity invariants** asserted on every save: model is `BPE`,
decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every mapping
PUA codepoint has a vocab id.
- **No BMP-PUA collisions** — mappings live in the supplementary
planes only, so literal BMP-PUA characters in real source code
(TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
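
The byte-equal roundtrip property can be checked on any set of files with a small harness like the one below. It is a sketch, not the repository's benchmark code: `count_roundtrips` is a hypothetical helper that accepts any encode/decode pair, so the same loop works for CUTE or for a baseline.

```python
from typing import Callable, Iterable, List, Tuple


def count_roundtrips(
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
    texts: Iterable[str],
) -> Tuple[int, int]:
    """Return (files that round-trip byte-for-byte, total files)."""
    ok = total = 0
    for text in texts:
        total += 1
        # Compare at the byte level, matching the "byte-equal" claim.
        if decode(encode(text)).encode("utf-8") == text.encode("utf-8"):
            ok += 1
    return ok, total
```

With the package installed, a call such as `count_roundtrips(tok.fast_encode, tok.fast_decode, sources)` would report the same kind of `ok / total` figure shown in the roundtrip column.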
## Citation
```bibtex
@software{cute_tokenizer_2026,
author = {Eid, Hussein},
title = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
year = {2026},
url = {https://github.com/HusseinEid101/CUTE},
version = {1.0.2}
}
```
## License
MIT. Source, training scripts, benchmark suite, and full reproduction
instructions live at <https://github.com/HusseinEid101/CUTE>.