Nanochat tokenizer (this copy)

This directory is a copy of the tokenizer produced by the speedrun (trained with scripts/tok_train). The canonical location used at runtime is $NANOCHAT_BASE_DIR/tokenizer (default: ~/.cache/nanochat/tokenizer/).

Details

Property	Value
Type	BPE (Byte Pair Encoding), GPT-4 style
Implementation	`RustBPETokenizer`: trained with `rustbpe`, inference with `tiktoken`
Vocab size	32,768 (2^15)
Merge count	32,503 merges (vocab_size − num_special_tokens)

Special tokens (8)

Used for document boundaries and chat turn structure:

<|bos|> — beginning of sequence (document start)
<|user_start|>, <|user_end|> — user message boundaries
<|assistant_start|>, <|assistant_end|> — assistant message boundaries
<|python_start|>, <|python_end|> — Python REPL tool call
<|output_start|>, <|output_end|> — Python REPL output

(Defined in nanochat/tokenizer.py as SPECIAL_TOKENS.)

Pre-tokenizer pattern

GPT-4-style regex (with \p{N}{1,2} for numbers instead of GPT-4’s \p{N}{1,3} for smaller vocabs). See SPLIT_PATTERN in nanochat/tokenizer.py.

Files in this directory

File	Description
`tokenizer.pkl`	Pickled `tiktoken.Encoding` (mergeable_ranks, special_tokens, pattern). Load with `RustBPETokenizer.from_directory(this_dir)`.
`token_bytes.pt`	`torch.tensor` of shape `(vocab_size,)`, dtype `int32`: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation.

Training (speedrun)

Script: python -m scripts.tok_train
Args (defaults): --vocab-size=32768, --max-chars=2_000_000_000, --doc-cap=10_000
Data: Base data shards via parquets_iter_batched(split="train") (e.g. 8-shard quick + 370-shard full download)
Report: Tokenizer training stats are logged to the nanochat report (e.g. token_bytes_min/max/mean/std, train_time).

Loading in code

from nanochat.tokenizer import RustBPETokenizer

# From this directory
tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")

# From default base dir (used by get_tokenizer())
# tokenizer = get_tokenizer()  # uses get_base_dir() + "tokenizer"

To use this copy as the default, set NANOCHAT_BASE_DIR to a path whose tokenizer subdir contains these files, or symlink ~/.cache/nanochat/tokenizer to models/tokenizer.