---
license: mit
library_name: tokenizers
tags:
- code
- tokenizer
- byte-level-bpe
- private-use-area
- lossless-roundtrip
- the-stack
language:
- code
---


# CUTE

**Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE**

CUTE is a code-aware tokenizer built on a single architectural idea:
replace high-savings multi-byte patterns with atomic Unicode codepoints
*before* byte-level BPE sees them. On 1,500 held-out Python files from
The Stack, CUTE produces fewer tokens per file than the eight widely used
baselines in the table below, including OpenAI's `cl100k_base` and
`o200k_base`, LLaMA-3's SentencePiece BPE, and two SentencePiece Unigram
variants. It also decodes every file back to byte-identical source,
which only four of the nine tokenizers compared manage.

## Compression (1,500 held-out Python files, The Stack)

| Tokenizer                            | mean tok | bytes/tok | vs CUTE | roundtrip   |
|--------------------------------------|---------:|----------:|--------:|-------------|
| **CUTE**                             |    1,767 |      4.42 |       — | 1500 / 1500 |
| OpenAI cl100k_base                   |    1,874 |      4.17 |   +6.0% | 1500 / 1500 |
| OpenAI o200k_base                    |    1,886 |      4.14 |   +6.7% | 1500 / 1500 |
| LLaMA-3 (SentencePiece BPE)          |    1,872 |      4.17 |   +5.9% |  686 / 1500 |
| StarCoder2                           |    2,210 |      3.53 |  +25.1% |  685 / 1500 |
| XLM-RoBERTa (SentencePiece Unigram)  |    2,438 |      3.20 |  +38.0% |    0 / 1500 |
| CodeLlama                            |    2,573 |      3.03 |  +45.6% | 1493 / 1500 |
| T5 (SentencePiece Unigram)           |    2,706 |      2.89 |  +53.2% |    0 / 1500 |
| GPT-2                                |    3,581 |      2.18 | +102.7% | 1500 / 1500 |

`vs CUTE` is the percentage of extra tokens the baseline emits per file
relative to CUTE. Because LLM APIs bill per token, spend scales
linearly with this number.
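
As a worked example, the sketch below recomputes the `vs CUTE` column from the rounded `mean tok` values in the table. The function name `overhead_vs_cute` is illustrative and not part of the package. Because it starts from rounded means, its last digit can differ slightly from the published column for some rows.

```python
# Mean tokens per file for CUTE, copied from the table above (rounded).
CUTE_MEAN_TOKENS = 1767


def overhead_vs_cute(baseline_mean_tokens: float,
                     cute_mean_tokens: float = CUTE_MEAN_TOKENS) -> float:
    """Extra tokens a baseline emits relative to CUTE, as a percentage."""
    return (baseline_mean_tokens / cute_mean_tokens - 1.0) * 100.0


# Three rows from the table: (name, rounded mean tokens per file).
for name, mean_tokens in [("o200k_base", 1886), ("StarCoder2", 2210), ("GPT-2", 3581)]:
    print(f"{name}: +{overhead_vs_cute(mean_tokens):.1f}%")
# o200k_base: +6.7%
# StarCoder2: +25.1%
# GPT-2: +102.7%
```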

## Latency (p50 across the full 1,500-file Stack-Python holdout)

| Tokenizer                            | encode p50 | decode p50 |
|--------------------------------------|-----------:|-----------:|
| OpenAI cl100k_base                   |   1,338 µs |     120 µs |
| OpenAI o200k_base                    |   1,760 µs |     126 µs |
| **CUTE**                             | **1,822 µs** | **263 µs** |
| T5 (SentencePiece Unigram)           |   3,121 µs |     479 µs |
| CodeLlama                            |   3,162 µs |   1,885 µs |
| XLM-RoBERTa (SentencePiece Unigram)  |   3,272 µs |     440 µs |
| LLaMA-3 (SentencePiece BPE)          |   3,753 µs |     792 µs |
| StarCoder2                           |   4,316 µs |     775 µs |
| GPT-2                                |   4,467 µs |     911 µs |

CUTE is **third-fastest on encode** and **third-fastest on decode** in
the field, behind only OpenAI's `cl100k_base` and `o200k_base`. The
`cute-bpe` Rust hot path in v1.0.2 runs ~6× faster than v1.0.1 on a
short Python sample (1,526 µs → 254 µs end-to-end; ~5× faster on the
cargo bench of the core encoder). On the full 1,500-file holdout median,
CUTE beats every non-OpenAI tokenizer in the table (LLaMA-3, StarCoder2,
CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency,
while keeping a byte-perfect 1500 / 1500 roundtrip, which among those
six only GPT-2 matches.
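
The card does not spell out the timing methodology, so the following is only a minimal sketch of one way a per-file p50 could be measured. The helper `p50_latency_us`, the best-of-`repeats` policy, and the stand-in `utf8_encode` function are all assumptions made for illustration; the benchmark suite in the repository may do this differently.

```python
import statistics
import time
from typing import Callable, Iterable, List


def p50_latency_us(fn: Callable[[str], object],
                   inputs: Iterable[str],
                   repeats: int = 5) -> float:
    """Median over inputs of the best-of-`repeats` wall time per call, in microseconds."""
    per_input: List[float] = []
    for text in inputs:
        best_ns = None
        for _ in range(repeats):
            start = time.perf_counter_ns()
            fn(text)
            elapsed = time.perf_counter_ns() - start
            best_ns = elapsed if best_ns is None else min(best_ns, elapsed)
        per_input.append(best_ns / 1_000)  # nanoseconds -> microseconds
    return statistics.median(per_input)


# Stand-in encoder so the sketch runs without any tokenizer installed.
def utf8_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


samples = ["def hello(): return 42", "class Foo: pass", "x = [i * i for i in range(10)]"]
print(f"encode p50: {p50_latency_us(utf8_encode, samples):.1f} µs")
```

To time a real tokenizer, pass its encode callable in place of `utf8_encode` and the holdout file contents in place of `samples`.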

## How it works

1. A frequency-weighted, savings-ranked selection pass mines
   high-value multi-byte patterns (identifiers, common slices like
   `(self`, `=None`, `:\n`) from a code corpus.
2. Selected patterns are mapped one-to-one to **supplementary-plane
   Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range
   is deliberately skipped to avoid colliding with literal PUA
   characters that appear in real source code.
3. A byte-level BPE trainer runs on the **PUA-pre-substituted stream**,
   so semantic anchors are visible to the merge algorithm and can
   compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
4. A second savings pass adds the top-6,000 high-frequency compound
   patterns as atomic `AddedToken`s.
5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass
   substitutes PUA codepoints; a purpose-built Rust BPE encoder
   (`cute-bpe`, modeled on tiktoken's linear-scan-min-rank merge loop)
   then performs the byte-level BPE pass.
6. At decode time, the inverse PUA map restores the original source
   text — byte-for-byte identical.
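
The pre-substitution and inverse-map steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the package's implementation: it swaps the Aho-Corasick Rust pass for a longest-first regex alternation, uses a small hand-picked pattern list instead of a mined one, and assumes the input contains no supplementary-plane PUA characters of its own. All function names here are hypothetical.

```python
import re
from typing import Dict, List, Tuple

SUPPLEMENTARY_PUA_START = 0xF0000  # first codepoint of Supplementary Private Use Area-A


def build_pua_maps(patterns: List[str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Assign each pattern one supplementary-plane PUA codepoint; return forward and inverse maps."""
    forward = {p: chr(SUPPLEMENTARY_PUA_START + i) for i, p in enumerate(patterns)}
    inverse = {cp: p for p, cp in forward.items()}
    return forward, inverse


def substitute(text: str, forward: Dict[str, str]) -> str:
    """Replace patterns with PUA codepoints, preferring the longest pattern at each match position."""
    # Regex stand-in for Aho-Corasick: longest-first alternation picks the
    # longest pattern that starts at the leftmost matching position.
    ordered = sorted(forward, key=len, reverse=True)
    matcher = re.compile("|".join(re.escape(p) for p in ordered))
    return matcher.sub(lambda m: forward[m.group(0)], text)


def restore(text: str, inverse: Dict[str, str]) -> str:
    """Invert the substitution by mapping each PUA codepoint back to its pattern."""
    # Assumes the original text had no supplementary PUA characters of its own.
    return "".join(inverse.get(ch, ch) for ch in text)


# Hypothetical pattern list; the real one is mined from a code corpus.
forward, inverse = build_pua_maps(["(self", "=None", ":\n", "def", "return"])
source = "def f(self, x=None):\n    return x\n"
packed = substitute(source, forward)

assert restore(packed, inverse) == source  # lossless roundtrip
assert len(packed) < len(source)           # each pattern collapsed to one codepoint
```

In the real pipeline the `packed` stream, not the raw source, is what the byte-level BPE trainer and encoder see, which is why the substituted codepoints can take part in merges with neighbouring whitespace and punctuation.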

## Use it

### Via the standalone package

```bash
pip install cute-tokenizer
```

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"
```

For tight inference loops where `BatchEncoding` machinery is overhead,
use `fast_encode` / `fast_decode` — these go straight to the Rust
`cute-bpe` encoder/decoder:

```python
ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
```

### Via Hugging Face AutoTokenizer

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```

`trust_remote_code=True` is required because the wrapper class
(`CUTETokenizerFast`) runs PUA pre-substitution before delegating to
the byte-level BPE encoder.

## Properties

- **Byte-equal roundtrip** on 1,500 / 1,500 Python holdout files.
- **Deterministic `tokenizer.json`** within a fixed
  `(OS, python, tokenizers, _accel, corpus_hash, seed)` host tuple.
  Cross-platform byte-identity of trained artifacts is not part of
  the contract.
- **Atomicity invariants** asserted on every save: model is `BPE`,
  decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, and every
  mapped PUA codepoint has a vocab id.
- **No BMP-PUA collisions** — mappings live in the supplementary
  planes only, so literal BMP-PUA characters in real source code
  (TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
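
To make the BMP versus supplementary-plane distinction concrete, here is a small sketch that classifies codepoints using the private-use ranges defined by the Unicode Standard. The helper names and the `check_mapping_codepoints` function are illustrative assumptions and are not taken from the CUTE codebase.

```python
from typing import Iterable

# Private-use ranges as defined by the Unicode Standard.
BMP_PUA = range(0xE000, 0xF8FF + 1)
SUPPLEMENTARY_PUA_A = range(0xF0000, 0xFFFFD + 1)
SUPPLEMENTARY_PUA_B = range(0x100000, 0x10FFFD + 1)


def is_bmp_pua(ch: str) -> bool:
    """True if `ch` is a private-use character in the Basic Multilingual Plane."""
    return ord(ch) in BMP_PUA


def is_supplementary_pua(ch: str) -> bool:
    """True if `ch` lies in Supplementary Private Use Area-A or -B."""
    cp = ord(ch)
    return cp in SUPPLEMENTARY_PUA_A or cp in SUPPLEMENTARY_PUA_B


def check_mapping_codepoints(codepoints: Iterable[str]) -> None:
    """Raise if any mapping codepoint falls outside the supplementary PUA planes."""
    for ch in codepoints:
        if not is_supplementary_pua(ch):
            raise ValueError(f"U+{ord(ch):04X} is not a supplementary-plane PUA codepoint")


check_mapping_codepoints([chr(0xF0000), chr(0xF0001)])  # passes silently

# A literal BMP-PUA character, as might appear in a font or Unicode table,
# is never mistaken for a mapping codepoint.
assert is_bmp_pua("\ue000") and not is_supplementary_pua("\ue000")
```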

## Citation

```bibtex
@software{cute_tokenizer_2026,
  author  = {Eid, Hussein},
  title   = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
  year    = {2026},
  url     = {https://github.com/HusseinEid101/CUTE},
  version = {1.0.2}
}
```

## License

MIT. Source, training scripts, benchmark suite, and full reproduction
instructions live at <https://github.com/HusseinEid101/CUTE>.