	---
	license: mit
	library_name: tokenizers
	tags:
	- code
	- tokenizer
	- byte-level-bpe
	- private-use-area
	- lossless-roundtrip
	- the-stack
	language:
	- code
	---

	# CUTE

	Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE

	CUTE is a code-aware tokenizer built on a single architectural idea:
	replace high-savings multi-byte patterns with atomic Unicode codepoints
	before byte-level BPE sees them. On 1,500 held-out Python files from
	The Stack, CUTE produces fewer tokens per file than the eight widely used
	baselines in the table below, including OpenAI's `cl100k_base` and
	`o200k_base`, LLaMA-3's SentencePiece BPE, and two SentencePiece Unigram
	variants. It is also one of four tokenizers in this comparison (with
	`cl100k_base`, `o200k_base`, and GPT-2) that restore every file to
	byte-identical source.

	## Compression (1,500 held-out Python files, The Stack)

	| Tokenizer | mean tokens/file | bytes/token | vs CUTE | byte-exact roundtrip (files) |
	|--------------------------------------|---------:|----------:|--------:|-------------|
	| CUTE | 1,767 | 4.42 | n/a | 1500 / 1500 |
	| OpenAI cl100k_base | 1,874 | 4.17 | +6.0% | 1500 / 1500 |
	| OpenAI o200k_base | 1,886 | 4.14 | +6.7% | 1500 / 1500 |
	| LLaMA-3 (SentencePiece BPE) | 1,872 | 4.17 | +5.9% | 686 / 1500 |
	| StarCoder2 | 2,210 | 3.53 | +25.1% | 685 / 1500 |
	| XLM-RoBERTa (SentencePiece Unigram) | 2,438 | 3.20 | +38.0% | 0 / 1500 |
	| CodeLlama | 2,573 | 3.03 | +45.6% | 1493 / 1500 |
	| T5 (SentencePiece Unigram) | 2,706 | 2.89 | +53.2% | 0 / 1500 |
	| GPT-2 | 3,581 | 2.18 | +102.7% | 1500 / 1500 |

	`vs CUTE` is the relative increase in mean tokens per file that the
	baseline incurs over CUTE. Where an LLM API bills per token, spend
	scales linearly with this number.
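
As a quick check on how the `vs CUTE` column relates to the mean token counts, the overhead can be recomputed from the table. This is a minimal sketch: the function name `overhead_vs_cute` is illustrative and not part of the package, and because the published means are rounded, recomputed values can differ from the column in the last digit.

```python
def overhead_vs_cute(baseline_mean_tokens: float, cute_mean_tokens: float) -> float:
    """Relative extra tokens a baseline spends per file, in percent."""
    return (baseline_mean_tokens / cute_mean_tokens - 1.0) * 100.0


CUTE_MEAN = 1767  # mean tokens per file for CUTE, taken from the table

# Two rows from the table: (tokenizer name, mean tokens per file).
for name, mean_tokens in [("o200k_base", 1886), ("GPT-2", 3581)]:
    print(f"{name}: +{overhead_vs_cute(mean_tokens, CUTE_MEAN):.1f}%")
```

Under per-token billing, the same ratio applies directly to cost: a baseline at +6.7% would be expected to cost about 6.7% more for the same files.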

	## Latency (p50 across the full 1,500-file Stack-Python holdout)

	| Tokenizer | encode p50 | decode p50 |
	|--------------------------------------|-----------:|-----------:|
	| OpenAI cl100k_base | 1,338 µs | 120 µs |
	| OpenAI o200k_base | 1,760 µs | 126 µs |
	| CUTE | 1,822 µs | 263 µs |
	| T5 (SentencePiece Unigram) | 3,121 µs | 479 µs |
	| CodeLlama | 3,162 µs | 1,885 µs |
	| XLM-RoBERTa (SentencePiece Unigram) | 3,272 µs | 440 µs |
	| LLaMA-3 (SentencePiece BPE) | 3,753 µs | 792 µs |
	| StarCoder2 | 4,316 µs | 775 µs |
	| GPT-2 | 4,467 µs | 911 µs |

	CUTE has the third-fastest encode and the third-fastest decode in the
	field, behind only OpenAI's `cl100k_base` and `o200k_base`. v1.0.2's
	cute-bpe Rust hot path runs ~6× faster than v1.0.1 on a short Python
	sample (1,526 µs → 254 µs end-to-end; ~5× faster on the cargo bench
	of the core encoder). On the full 1,500-file holdout median, CUTE
	beats every other baseline in the table (LLaMA-3, StarCoder2,
	CodeLlama, GPT-2, T5, XLM-RoBERTa) on both encode and decode latency.
	Of those six, only GPT-2 matches CUTE's byte-perfect 1500 / 1500
	roundtrip.
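
A per-file median (p50) latency of this kind can be measured along the following lines. This is only a sketch: `p50_latency_us` is a hypothetical helper, not the project's benchmark suite, and it times a single call per input with no warm-up or repeated runs. The workload is a UTF-8 stand-in, not a real tokenizer.

```python
import statistics
import time
from typing import Callable, Iterable


def p50_latency_us(fn: Callable[[str], object], inputs: Iterable[str]) -> float:
    """Median wall-clock time of fn(x) over all inputs, in microseconds."""
    samples = []
    for x in inputs:
        start = time.perf_counter_ns()
        fn(x)
        samples.append((time.perf_counter_ns() - start) / 1_000)
    return statistics.median(samples)


# Stand-in workload: plain UTF-8 encoding instead of a tokenizer's encode call.
files = ["def hello(): return 42\n" * n for n in range(1, 50)]
print(f"encode p50: {p50_latency_us(str.encode, files):.1f} µs")
```

Any callable that takes a string can be passed as `fn`, so the same helper could time a tokenizer's encode method over a list of file contents.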

	## How it works

	1. A frequency-weighted, savings-ranked selection pass mines
	high-value multi-byte patterns (identifiers, common slices like
	`(self`, `=None`, `:\n`) from a code corpus.
	2. Selected patterns are mapped one-to-one to **supplementary-plane
	Private-Use-Area (PUA) codepoints** (`U+F0000+`). The BMP-PUA range
	is deliberately skipped to avoid colliding with literal PUA
	characters that appear in real source code.
	3. A byte-level BPE trainer runs on the PUA-pre-substituted stream,
	so semantic anchors are visible to the merge algorithm and can
	compose freely with whitespace and punctuation (e.g. `Ġ + ⟦def⟧`).
	4. A second savings pass adds the top-6,000 high-frequency compound
	patterns as atomic `AddedToken`s.
	5. At encode time, an Aho-Corasick (leftmost-longest) Rust pass
	replaces each matched pattern with its PUA codepoint; a purpose-built
	Rust BPE encoder (`cute-bpe`, modeled on tiktoken's
	linear-scan-min-rank merge loop) then performs the byte-level BPE pass.
	6. At decode time, the inverse PUA map restores the original source
	text, byte-for-byte identical.
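
The substitution and its inverse (steps 2, 5, and 6) can be illustrated with a small pure-Python sketch. Several things here are assumptions made for illustration: the pattern list is invented, not the mined one; a regex alternation sorted longest-first stands in for the Aho-Corasick pass; and the names `substitute` and `restore` are hypothetical. The BPE stage itself is not shown.

```python
import re

# Hypothetical pattern list; the real one is mined from a code corpus.
PATTERNS = ["(self", "=None", ":\n", "def", "return"]

PUA_BASE = 0xF0000  # first codepoint of Supplementary Private Use Area-A

# One-to-one map: pattern -> single supplementary-plane PUA character.
TO_PUA = {p: chr(PUA_BASE + i) for i, p in enumerate(PATTERNS)}
FROM_PUA = {c: p for p, c in TO_PUA.items()}

# Longest alternatives first, so the regex engine picks the longest pattern
# at each leftmost match position (a stand-in for Aho-Corasick).
_SUBST = re.compile(
    "|".join(re.escape(p) for p in sorted(PATTERNS, key=len, reverse=True))
)
_PUA_CHARS = re.compile("[" + "".join(FROM_PUA) + "]")


def substitute(text: str) -> str:
    # Assumes the input contains none of the mapped PUA characters itself.
    return _SUBST.sub(lambda m: TO_PUA[m.group(0)], text)


def restore(text: str) -> str:
    return _PUA_CHARS.sub(lambda m: FROM_PUA[m.group(0)], text)


src = "def f(self, x=None):\n    return x\n"
packed = substitute(src)
assert restore(packed) == src  # the inverse map recovers the source exactly
print(len(src), "->", len(packed), "codepoints before BPE")
```

Each matched pattern collapses to one codepoint, so the byte-level BPE stage sees a shorter stream in which every anchor is a single unit.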

	## Use it

	### Via the standalone package

	```bash
	pip install cute-tokenizer
	```

	```python
	from cute_tokenizer import load_default_tokenizer

	tok = load_default_tokenizer()
	ids = tok("def hello(): return 42", add_special_tokens=False).input_ids
	text = tok.decode(ids, skip_special_tokens=True)
	assert text == "def hello(): return 42"
	```

	For tight inference loops where `BatchEncoding` machinery is overhead,
	use `fast_encode` / `fast_decode` — these go straight to the Rust
	`cute-bpe` encoder/decoder:

	```python
	ids = tok.fast_encode("def hello(): return 42")
	text = tok.fast_decode(ids)
	```

	### Via Hugging Face AutoTokenizer

	```python
	from transformers import AutoTokenizer

	tok = AutoTokenizer.from_pretrained(
	    "HusseinEid/cute-tokenizer",
	    trust_remote_code=True,
	)
	ids = tok("class Foo: pass", add_special_tokens=False).input_ids
	text = tok.decode(ids, skip_special_tokens=True)
	```

	`trust_remote_code=True` is required because the wrapper class
	(`CUTETokenizerFast`) runs PUA pre-substitution before delegating to
	the byte-level BPE encoder.

	## Properties

	- Byte-equal roundtrip on 1,500 / 1,500 Python holdout files.
	- Deterministic `tokenizer.json` within a fixed
	`(OS, python, tokenizers, _accel, corpus_hash, seed)` host tuple.
	Cross-platform byte-identity of trained artifacts is not part of
	the contract.
	- Atomicity invariants asserted on every save: model is `BPE`,
	decoder is `ByteLevel`, pre-tokenizer is `ByteLevel`, every mapping
	PUA codepoint has a vocab id.
	- No BMP-PUA collisions — mappings live in the supplementary
	planes only, so literal BMP-PUA characters in real source code
	(TypeScript Unicode tables, CJK fonts) roundtrip unchanged.
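
The roundtrip property can be checked for any tokenizer with a helper along these lines. It is a sketch under stated assumptions: `count_roundtrips` is a hypothetical name, and the `encode` / `decode` functions below are UTF-8 stand-ins, not CUTE. One sample deliberately contains a literal BMP-PUA character (`U+E000`) to exercise the case described in the last bullet.

```python
from typing import Callable, Iterable, Sequence


def count_roundtrips(
    texts: Iterable[str],
    encode: Callable[[str], Sequence[int]],
    decode: Callable[[Sequence[int]], str],
) -> int:
    """Number of texts for which decode(encode(text)) is byte-identical."""
    ok = 0
    for text in texts:
        restored = decode(encode(text))
        if restored.encode("utf-8") == text.encode("utf-8"):
            ok += 1
    return ok


# Stand-in codec: "token ids" are simply the UTF-8 byte values.
def encode(s: str) -> list:
    return list(s.encode("utf-8"))


def decode(ids: Sequence[int]) -> str:
    return bytes(ids).decode("utf-8")


samples = [
    "def hello(): return 42\n",
    "icon = '\ue000'  # literal BMP-PUA character\n",
]
print(count_roundtrips(samples, encode, decode), "/", len(samples))
```

With the package installed, `tok.fast_encode` and `tok.fast_decode` from the usage section could be passed in place of the stand-ins.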

	## Citation

	```bibtex
	@software{cute_tokenizer_2026,
	author = {Eid, Hussein},
	title = {CUTE: Compact Unicode Token Encoding via Semantic-Anchored Byte-level BPE},
	year = {2026},
	url = {https://github.com/HusseinEid101/CUTE},
	version = {1.0.2}
	}
	```

	## License

	MIT. Source, training scripts, benchmark suite, and full reproduction
	instructions live at <https://github.com/HusseinEid101/CUTE>.