Nanochat tokenizer (this copy)
This directory is a copy of the tokenizer produced by the speedrun (trained with scripts/tok_train). The canonical location used at runtime is $NANOCHAT_BASE_DIR/tokenizer (default: ~/.cache/nanochat/tokenizer/).
Details
| Property | Value |
|---|---|
| Type | BPE (Byte Pair Encoding), GPT-4 style |
| Implementation | RustBPETokenizer: trained with rustbpe, inference with tiktoken |
| Vocab size | 32,768 (2^15) |
| Merge count | 32,503 merges (vocab_size − 256 base byte tokens − num_special_tokens) |
Special tokens (9)
Used for document boundaries and chat turn structure:
- <|bos|>: beginning of sequence (document start)
- <|user_start|>, <|user_end|>: user message boundaries
- <|assistant_start|>, <|assistant_end|>: assistant message boundaries
- <|python_start|>, <|python_end|>: Python REPL tool call
- <|output_start|>, <|output_end|>: Python REPL output
(Defined in nanochat/tokenizer.py as SPECIAL_TOKENS.)
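The vocabulary budget can be checked with a little arithmetic. The sketch below assumes the usual byte-level BPE layout (256 base byte tokens, then the learned merges, then the special tokens). The token strings are copied from the list above; the list is an illustrative copy, not an import from nanochat/tokenizer.py.

```python
# Illustrative copy of the special tokens listed in this README.
SPECIAL_TOKENS = [
    "<|bos|>",
    "<|user_start|>", "<|user_end|>",
    "<|assistant_start|>", "<|assistant_end|>",
    "<|python_start|>", "<|python_end|>",
    "<|output_start|>", "<|output_end|>",
]

VOCAB_SIZE = 32_768     # 2**15
NUM_BYTE_TOKENS = 256   # assumption: one base token per possible byte value

# Whatever is left after byte tokens and special tokens is filled by merges.
num_merges = VOCAB_SIZE - NUM_BYTE_TOKENS - len(SPECIAL_TOKENS)
print(len(SPECIAL_TOKENS), num_merges)  # 9 32503
```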
Pre-tokenizer pattern
GPT-4-style regex, except that digit runs are split with \p{N}{1,2} instead of GPT-4's \p{N}{1,3}, which suits smaller vocabularies. See SPLIT_PATTERN in nanochat/tokenizer.py.
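To see what the digit cap changes, here is a minimal sketch that isolates only the number piece of the pattern. It is not the full SPLIT_PATTERN. It uses the standard-library re module with \d as a stand-in for \p{N}, because re does not support \p{...} classes; split_digits is a hypothetical helper.

```python
import re

def split_digits(text: str, max_run: int) -> list[str]:
    """Split digit runs into chunks of at most max_run digits.

    Hypothetical helper: \\d approximates \\p{N} for plain ASCII digits.
    """
    return re.findall(r"\d{1,%d}" % max_run, text)

print(split_digits("1234567", 2))  # ['12', '34', '56', '7']
print(split_digits("1234567", 3))  # ['123', '456', '7']
```

With a cap of 2 there are at most 110 distinct digit chunks (10 one-digit plus 100 two-digit), compared with 1,110 for a cap of 3, so fewer vocabulary slots can be taken up by numbers.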
Files in this directory
| File | Description |
|---|---|
| tokenizer.pkl | Pickled tiktoken.Encoding (mergeable_ranks, special_tokens, pattern). Load with RustBPETokenizer.from_directory(this_dir). |
| token_bytes.pt | torch.tensor of shape (vocab_size,), dtype int32: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation. |
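As a rough illustration of how a per-token byte-length table feeds a bits-per-byte metric, the sketch below uses plain Python lists instead of tensors. The function name and the exact masking rule are assumptions; the evaluation code in nanochat may differ in detail.

```python
import math

def bits_per_byte(losses_nats, token_ids, token_bytes):
    """Total cross-entropy (in nats) converted to bits, per byte of target text.

    Tokens whose byte length is 0 (e.g. special tokens) are skipped in both
    the loss sum and the byte count. Hypothetical helper, not nanochat's API.
    """
    total_nats = 0.0
    total_bytes = 0
    for loss, tid in zip(losses_nats, token_ids):
        nbytes = token_bytes[tid]
        if nbytes > 0:
            total_nats += loss
            total_bytes += nbytes
    if total_bytes == 0:
        return float("nan")
    return total_nats / (math.log(2) * total_bytes)

# Toy table: id 0 plays the role of a special token (0 bytes).
toy_token_bytes = [0, 1, 3, 2]
toy_ids = [0, 1, 2, 3]
# Each counted token costs exactly ln(2) nats per byte, i.e. 1 bit per byte.
toy_losses = [5.0, math.log(2), 3 * math.log(2), 2 * math.log(2)]
print(bits_per_byte(toy_losses, toy_ids, toy_token_bytes))  # approximately 1.0
```

Normalizing by bytes rather than by tokens makes the number comparable across tokenizers with different vocabulary sizes.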
Training (speedrun)
- Script: python -m scripts.tok_train
- Args (defaults): --vocab-size=32768, --max-chars=2_000_000_000, --doc-cap=10_000
- Data: Base data shards via parquets_iter_batched(split="train") (e.g. 8-shard quick + 370-shard full download)
- Report: Tokenizer training stats are logged to the nanochat report (e.g. token_bytes_min/max/mean/std, train_time).
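The two size-related arguments can be pictured as follows. This is a sketch under assumed semantics, since this README only names the flags: --doc-cap is taken to truncate each document to a maximum number of characters, and --max-chars to stop feeding text once the running total reaches the budget. capped_text_iterator is a hypothetical name, not a function from scripts/tok_train.

```python
from typing import Iterable, Iterator

def capped_text_iterator(
    docs: Iterable[str], doc_cap: int, max_chars: int
) -> Iterator[str]:
    """Yield documents truncated to doc_cap chars until max_chars is reached.

    Assumed semantics for illustration only.
    """
    nchars = 0
    for doc in docs:
        text = doc[:doc_cap]      # cap the length of each document
        nchars += len(text)
        yield text
        if nchars >= max_chars:   # stop once the character budget is spent
            return

sample = ["aaaaaa", "bb", "cccc", "dd"]
print(list(capped_text_iterator(sample, doc_cap=3, max_chars=6)))
# ['aaa', 'bb', 'ccc']
```

Under this reading, the cap keeps a few very long documents from dominating the merge statistics, and the budget bounds training time.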
Loading in code
```python
from nanochat.tokenizer import RustBPETokenizer

# From this directory
tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")

# From default base dir (used by get_tokenizer())
# tokenizer = get_tokenizer()  # uses get_base_dir() + "tokenizer"
```
To use this copy as the default, set NANOCHAT_BASE_DIR to a path whose tokenizer subdir contains these files, or symlink ~/.cache/nanochat/tokenizer to models/tokenizer.
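Before relying on either option, it can help to check which directory would be picked up and whether both files are present. The sketch below mirrors the lookup rule described in this README (NANOCHAT_BASE_DIR if set, otherwise ~/.cache/nanochat). Both helper names are hypothetical, and nanochat's own get_base_dir() may handle more cases.

```python
import os

def resolve_tokenizer_dir(environ=None):
    """Return <base_dir>/tokenizer per the rule described in this README.

    Hypothetical helper; pass a dict to test without touching os.environ.
    """
    env = os.environ if environ is None else environ
    base_dir = env.get("NANOCHAT_BASE_DIR")
    if not base_dir:
        base_dir = os.path.join(os.path.expanduser("~"), ".cache", "nanochat")
    return os.path.join(base_dir, "tokenizer")

def missing_tokenizer_files(tokenizer_dir):
    """List the expected files that are absent from tokenizer_dir."""
    expected = ("tokenizer.pkl", "token_bytes.pt")
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(tokenizer_dir, name))
    ]

tokenizer_dir = resolve_tokenizer_dir()
print(tokenizer_dir, missing_tokenizer_files(tokenizer_dir))
```

An empty list means the directory contains both files named in the table above.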