---

license: mit
library_name: tokenizers
tags:
- code
- tokenizer
- byte-level-bpe
- private-use-area
- lossless-roundtrip
- the-stack
language:
- code
---


# CUTE

**Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE**

CUTE is a code-aware tokenizer built on a single architectural idea:
substitute high-savings multi-byte patterns with atomic Unicode codepoints
*before* byte-level BPE sees them. On 1,500 held-out Python files from
The Stack, CUTE produces fewer tokens per file than every baseline in
the table below, including OpenAI's `cl100k_base` and `o200k_base`,
LLaMA-3's BPE, and the SentencePiece Unigram models, and it decodes
every file back to byte-identical source, which five of the baselines
do not.

## Compression (1,500 held-out Python files, The Stack)

| Tokenizer                            | mean tok | bytes/tok | vs CUTE | roundtrip   |
|--------------------------------------|---------:|----------:|--------:|-------------|
| **CUTE**                             |    1,767 |      4.42 |       — | 1500 / 1500 |
| OpenAI cl100k_base                   |    1,874 |      4.17 |   +6.0% | 1500 / 1500 |
| OpenAI o200k_base                    |    1,886 |      4.14 |   +6.7% | 1500 / 1500 |
| LLaMA-3 (SentencePiece BPE)          |    1,872 |      4.17 |   +5.9% |  686 / 1500 |
| StarCoder2                           |    2,210 |      3.53 |  +25.1% |  685 / 1500 |
| XLM-RoBERTa (SentencePiece Unigram)  |    2,438 |      3.20 |  +38.0% |    0 / 1500 |
| CodeLlama                            |    2,573 |      3.03 |  +45.6% | 1493 / 1500 |
| T5 (SentencePiece Unigram)           |    2,706 |      2.89 |  +53.2% |    0 / 1500 |
| GPT-2                                |    3,581 |      2.18 | +102.7% | 1500 / 1500 |

`vs CUTE` is the relative increase in mean tokens per file when the
baseline is used instead of CUTE. Because LLM APIs bill per token,
spend scales linearly with token count.
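
As a worked example, the `vs CUTE` column can be reproduced from the mean token counts. This is an illustrative sketch: the helper name `relative_overhead` is hypothetical and not part of the CUTE package. Because the table's means are rounded, recomputing a row from them can differ from the published figure in the last digit.

```python
def relative_overhead(baseline_tokens: float, cute_tokens: float) -> float:
    """Extra tokens the baseline emits, as a percentage of CUTE's count."""
    return (baseline_tokens / cute_tokens - 1.0) * 100.0


# Mean tokens per file, copied from the table above.
CUTE_MEAN = 1767
GPT2_MEAN = 3581

gpt2_overhead = relative_overhead(GPT2_MEAN, CUTE_MEAN)
print(f"GPT-2 vs CUTE: +{gpt2_overhead:.1f}%")
```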

## Latency (p50 across the full 1,500-file Stack-Python holdout)

| Tokenizer                            | encode p50 | decode p50 |
|--------------------------------------|-----------:|-----------:|
| OpenAI cl100k_base                   |   1,338 µs |     120 µs |
| OpenAI o200k_base                    |   1,760 µs |     126 µs |
| **CUTE**                             | **1,822 µs** | **263 µs** |
| T5 (SentencePiece Unigram)           |   3,121 µs |     479 µs |
| CodeLlama                            |   3,162 µs |   1,885 µs |
| XLM-RoBERTa (SentencePiece Unigram)  |   3,272 µs |     440 µs |
| LLaMA-3 (SentencePiece BPE)          |   3,753 µs |     792 µs |
| StarCoder2                           |   4,316 µs |     775 µs |
| GPT-2                                |   4,467 µs |     911 µs |

CUTE is **third-fastest on encode** and **third-fastest on decode** in
the field, behind only OpenAI's `cl100k_base` and `o200k_base`. The
`cute-bpe` Rust hot path in v1.0.2 runs ~6× faster than v1.0.1 on a
short Python sample (1,526 µs → 254 µs end-to-end; ~5× faster on the
cargo bench of the core encoder). On the full 1,500-file holdout
median, CUTE beats every non-OpenAI tokenizer in the comparison
(LLaMA-3, StarCoder2, CodeLlama, GPT-2, T5, XLM-RoBERTa) on both
encode and decode latency, while preserving a byte-perfect
1500 / 1500 roundtrip that, among those six, only GPT-2 matches.
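
For readers who want to take a comparable measurement, here is a minimal sketch of a per-input p50 timer. The benchmark harness itself is not described here, so the warm-up policy, the best-of-`repeats` choice, and the helper name `p50_latency_us` are all assumptions. A stand-in UTF-8 encoder is used so the sketch runs without any tokenizer installed.

```python
import statistics
import time
from typing import Callable, Iterable


def p50_latency_us(fn: Callable[[str], object], inputs: Iterable[str], repeats: int = 5) -> float:
    """Median over inputs of the best-of-`repeats` wall-clock time, in microseconds."""
    per_input = []
    for text in inputs:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter_ns()
            fn(text)
            best = min(best, time.perf_counter_ns() - start)
        per_input.append(best / 1_000)  # nanoseconds -> microseconds
    return statistics.median(per_input)


# Stand-in encoder (an assumption for illustration, not a real tokenizer).
def utf8_encode(text: str) -> list:
    return list(text.encode("utf-8"))


samples = ["def hello(): return 42", "class Foo: pass", "x = [i * i for i in range(10)]"]
p50 = p50_latency_us(utf8_encode, samples)
print(f"encode p50: {p50:.1f} µs")
```

To time a real tokenizer, pass its encode callable in place of `utf8_encode` and feed it the file contents you want to measure.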

## How it works

1. A frequency-weighted, savings-ranked selection pass mines
   high-value multi-byte patterns (identifiers, common slices like
   `(self`, `=None`, `:\n`) from a code corpus.
2. Selected patterns are mapped one-to-one to **supplementary-plane
   Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range
   is deliberately skipped to avoid colliding with literal PUA
   characters that appear in real source code.
3. A byte-level BPE trainer runs on the **PUA-pre-substituted stream**,
   so semantic anchors are visible to the merge algorithm and can
   compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
4. A second savings pass adds the top-6,000 high-frequency compound
   patterns as atomic `AddedToken`s.
5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass
   substitutes PUA codepoints; a purpose-built Rust BPE encoder
   (`cute-bpe`, modeled on tiktoken's linear-scan-min-rank merge loop)
   then performs the byte-level BPE pass.
6. At decode time, the inverse PUA map restores the original source
   text — byte-for-byte identical.
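
The substitution and inverse-map steps above can be sketched in pure Python. This is a toy illustration under stated assumptions, not the shipped implementation: the pattern list is hypothetical (the real one is mined from a corpus), and a regex with longest-first alternation stands in for the Rust Aho-Corasick pass. The BPE stage is omitted entirely.

```python
import re

PUA_BASE = 0xF0000  # first codepoint of Supplementary Private Use Area-A

# Hypothetical pattern list, chosen only for this example.
PATTERNS = ["(self", "=None", ":\n", "def", "return"]

# One-to-one map: pattern -> single supplementary-plane PUA character.
TO_PUA = {p: chr(PUA_BASE + i) for i, p in enumerate(PATTERNS)}
FROM_PUA = {ord(c): p for p, c in TO_PUA.items()}

# Longest alternatives first, so each match position prefers the longest pattern.
_MATCHER = re.compile(
    "|".join(re.escape(p) for p in sorted(PATTERNS, key=len, reverse=True))
)


def substitute(source: str) -> str:
    """Replace each listed pattern with its PUA anchor."""
    # Sketch-only guard: refuse input that already uses a mapped codepoint,
    # since restoring it would not be lossless.
    if any(ord(ch) in FROM_PUA for ch in source):
        raise ValueError("input already contains a mapped PUA codepoint")
    return _MATCHER.sub(lambda m: TO_PUA[m.group(0)], source)


def restore(substituted: str) -> str:
    """Inverse map: expand every PUA anchor back to its original pattern."""
    return substituted.translate(FROM_PUA)


source = "def f(self, x=None):\n    return x\n"
anchored = substitute(source)
assert restore(anchored) == source
print(len(source), "chars ->", len(anchored), "chars after substitution")
```

Each anchor is a single character in the substituted stream, which is what lets a downstream BPE trainer treat the whole pattern as one unit.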

## Use it

### Via the standalone package

```bash
pip install cute-tokenizer
```

```python
from cute_tokenizer import load_default_tokenizer

tok = load_default_tokenizer()
ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
assert text == "def hello(): return 42"
```

For tight inference loops where `BatchEncoding` machinery is overhead,
use `fast_encode` / `fast_decode` — these go straight to the Rust
`cute-bpe` encoder/decoder:

```python
ids = tok.fast_encode("def hello(): return 42")
text = tok.fast_decode(ids)
```

### Via Hugging Face AutoTokenizer

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "HusseinEid/cute-tokenizer",
    trust_remote_code=True,
)
ids = tok("class Foo: pass", add_special_tokens=False).input_ids
text = tok.decode(ids, skip_special_tokens=True)
```

`trust_remote_code=True` is required because the wrapper class
(`CUTETokenizerFast`) runs PUA pre-substitution before delegating to
the byte-level BPE encoder.

## Properties

- **Byte-equal roundtrip** on 1,500 / 1,500 Python holdout files.
- **Deterministic `tokenizer.json`** for a fixed
  `(OS, python, tokenizers, _accel, corpus_hash, seed)` configuration.
  Cross-platform byte-identity of trained artifacts is not part of
  the contract.
- **Atomicity invariants** asserted on every save: model is `BPE`,
  decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every mapping
  PUA codepoint has a vocab id.
- **No BMP-PUA collisions** — mappings live in the supplementary
  planes only, so literal BMP-PUA characters in real source code
  (TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
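
To make the BMP versus supplementary-plane distinction concrete, the sketch below classifies a character against Unicode's three Private Use ranges. The range boundaries come from the Unicode Standard; the helper name `pua_kind` is hypothetical and not part of the CUTE package.

```python
# Unicode's three Private Use ranges (inclusive bounds).
BMP_PUA = range(0xE000, 0xF8FF + 1)
SUPPLEMENTARY_PUA_A = range(0xF0000, 0xFFFFD + 1)
SUPPLEMENTARY_PUA_B = range(0x100000, 0x10FFFD + 1)


def pua_kind(ch: str) -> str:
    """Return 'bmp', 'supplementary', or 'none' for a single character."""
    cp = ord(ch)
    if cp in BMP_PUA:
        return "bmp"
    if cp in SUPPLEMENTARY_PUA_A or cp in SUPPLEMENTARY_PUA_B:
        return "supplementary"
    return "none"


literal_glyph = "\ue0a0"     # a BMP-PUA character, as used by some icon fonts
anchor = chr(0xF0000)        # first supplementary-plane PUA codepoint

print(pua_kind(literal_glyph), len(literal_glyph.encode("utf-8")))
print(pua_kind(anchor), len(anchor.encode("utf-8")))
```

A literal BMP-PUA character in source code therefore never falls in the range that a supplementary-plane mapping would occupy.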

## Citation

```bibtex
@software{cute_tokenizer_2026,
  author  = {Eid, Hussein},
  title   = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
  year    = {2026},
  url     = {https://github.com/HusseinEid101/CUTE},
  version = {1.0.2}
}
```

## License

MIT. Source, training scripts, benchmark suite, and full reproduction
instructions live at <https://github.com/HusseinEid101/CUTE>.