toke BPE Tokenizer

A purpose-built 16K BPE tokenizer for the toke programming language, achieving 52% average token reduction vs cl100k_base across 42 benchmark programs.

Key Facts

Property	Value
Vocab size	16,384
Training data	25,953 toke programs + 698 loke production modules
Average reduction	52% vs cl100k_base
Best case	76% reduction (simple loop)
String handling	String literal contents replaced with a placeholder before training
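
The string-handling step in the table can be sketched as follows. This is a minimal illustration, not the project's actual preprocessing: it assumes toke string literals are double-quoted with backslash escapes, and the placeholder text `STR` and the helper name `mask_strings` are hypothetical, since the source does not name the placeholder that was used.

```python
import re

# Hypothetical placeholder text; the real one is not stated in this README.
PLACEHOLDER = "STR"

# Assumed literal syntax: double quotes, with backslash-escaped characters inside.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')


def mask_strings(source: str) -> str:
    """Replace the contents of every string literal with the placeholder."""
    # A function is used as the replacement so backslashes are never
    # interpreted as regex group references.
    return STRING_LITERAL.sub(lambda _match: f'"{PLACEHOLDER}"', source)


print(mask_strings('x="hello world";y="a\\"b";'))  # x="STR";y="STR";
```

Masking literals this way keeps free-form text out of the training corpus, so merges are spent on language syntax rather than on whatever happens to appear inside strings.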

This is NOT the model's tokenizer

The toke code generation model (karwalski/toke) uses Qwen's 151K vocab tokenizer internally. This tokenizer measures how efficiently toke code could be tokenized by a future toke-native model.

Usage

from tokenizers import Tokenizer

# Load the tokenizer from its JSON file.
tok = Tokenizer.from_file("tokenizer_v03.json")

# A recursive Fibonacci function written in toke.
code = 'm=fib;f=fib(n:i64):i64{if(n<2){<n};<fib(n-1)+fib(n-2)};'
result = tok.encode(code)
print(f"{len(result.ids)} tokens")  # 19 tokens (vs 49 with cl100k_base)
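
The reduction figure for a single program follows from the two token counts. The sketch below shows one way to compute it and to obtain the cl100k_base count with the `tiktoken` package. The helper names `reduction_percent` and `compare` are illustrative assumptions, not part of this repository, and how the 52% average was aggregated across the 42 benchmark programs is not stated here.

```python
def reduction_percent(baseline_tokens: int, toke_tokens: int) -> float:
    """Percentage of tokens saved relative to the baseline count."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (1.0 - toke_tokens / baseline_tokens)


def compare(code: str, tokenizer_path: str = "tokenizer_v03.json") -> float:
    """Tokenize `code` with both tokenizers and return the reduction in percent."""
    # Imported lazily so reduction_percent() works without either package installed.
    import tiktoken
    from tokenizers import Tokenizer

    toke_tok = Tokenizer.from_file(tokenizer_path)
    cl100k = tiktoken.get_encoding("cl100k_base")

    toke_count = len(toke_tok.encode(code).ids)
    cl100k_count = len(cl100k.encode(code))
    return reduction_percent(cl100k_count, toke_count)


# Counts quoted in the usage snippet: 49 tokens (cl100k_base) vs 19 (toke).
print(f"{reduction_percent(49, 19):.1f}% reduction")  # 61.2% reduction
```

Calling `compare(code)` with the Fibonacci snippet requires `tokenizer_v03.json` locally and network access the first time `tiktoken` fetches the cl100k_base vocabulary.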

Interactive Demo

Try the tokenizer in your browser at tokelang.dev/tokenizer, which highlights token boundaries in colour and shows them side by side with cl100k_base.
