toke BPE Tokenizer

A purpose-built 16K BPE tokenizer for the toke programming language, achieving 52% average token reduction vs cl100k_base across 42 benchmark programs.

Key Facts

| Property | Value |
| --- | --- |
| Vocab size | 16,384 |
| Training data | 25,953 toke programs + 698 loke production modules |
| Average reduction | 52% vs cl100k_base |
| Best case | 76% reduction (simple loop) |
| String handling | Contents replaced with placeholder before training |
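The "String handling" entry implies a preprocessing pass over the training corpus. The sketch below shows one way such a pass might look. It assumes that toke string literals are double-quoted with backslash escapes, and it uses a hypothetical placeholder `"S"`; this card specifies neither the literal syntax nor the actual placeholder, and the sample program is invented for illustration.

```python
import re

# Assumption: string literals are double-quoted and may contain backslash escapes.
STRING_LITERAL = re.compile(r'"(?:\\.|[^"\\])*"')

# Hypothetical placeholder; the real one used for training is not documented here.
PLACEHOLDER = '"S"'


def mask_strings(source: str) -> str:
    """Replace every string literal in `source` with a fixed placeholder."""
    return STRING_LITERAL.sub(PLACEHOLDER, source)


# Illustrative input only; not taken from the training data.
print(mask_strings('m=hello;f=main():i64{print("Hello, world!");<0};'))
# m=hello;f=main():i64{print("S");<0};
```

Masking literals this way would keep arbitrary text from consuming vocabulary slots, so merges are learned from code structure rather than from string contents.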

This is NOT the model's tokenizer

The toke code generation model (karwalski/toke) uses Qwen's 151K vocab tokenizer internally. This tokenizer measures how efficiently toke code could be tokenized by a future toke-native model.

Usage

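from tokenizers import Tokenizer

# Load the trained tokenizer from its serialized JSON file.
tok = Tokenizer.from_file("tokenizer_v03.json")

# A recursive Fibonacci function written in toke.
code = 'm=fib;f=fib(n:i64):i64{if(n<2){<n};<fib(n-1)+fib(n-2)};'
result = tok.encode(code)
print(f"{len(result.ids)} tokens")  # 19 tokens (vs 49 with cl100k_base)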
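For this one snippet, the counts in the comment above (19 tokens against 49) work out to a reduction of (49 - 19) / 49, or roughly 61%, which is above the 52% average. A small helper for that arithmetic is sketched below. The function name `reduction_percent` is hypothetical, and the baseline count is assumed to come from a separate cl100k_base encoder that is not shown here.

```python
def reduction_percent(baseline_tokens: int, new_tokens: int) -> float:
    """Return the percentage of tokens saved relative to the baseline count."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return 100.0 * (baseline_tokens - new_tokens) / baseline_tokens


# Counts taken from the usage example: 49 tokens with cl100k_base, 19 with this tokenizer.
print(f"{reduction_percent(49, 19):.1f}% fewer tokens")  # 61.2% fewer tokens
```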

Interactive Demo

Try the tokenizer in your browser at tokelang.dev/tokenizer, where token boundaries are highlighted in colour, side by side with cl100k.
