ksjpswaroop's picture
Upload folder using huggingface_hub
50ebd92 verified

Nanochat tokenizer (this copy)

This directory is a copy of the tokenizer produced by the speedrun (trained with scripts/tok_train). The canonical location used at runtime is $NANOCHAT_BASE_DIR/tokenizer (default: ~/.cache/nanochat/tokenizer/).

Details

Property Value
Type BPE (Byte Pair Encoding), GPT-4 style
Implementation RustBPETokenizer: trained with rustbpe, inference with tiktoken
Vocab size 32,768 (2^15)
Merge count 32,503 merges (vocab_size − num_special_tokens)

Special tokens (8)

Used for document boundaries and chat turn structure:

  • <|bos|> — beginning of sequence (document start)
  • <|user_start|>, <|user_end|> — user message boundaries
  • <|assistant_start|>, <|assistant_end|> — assistant message boundaries
  • <|python_start|>, <|python_end|> — Python REPL tool call
  • <|output_start|>, <|output_end|> — Python REPL output

(Defined in nanochat/tokenizer.py as SPECIAL_TOKENS.)

Pre-tokenizer pattern

GPT-4-style regex (with \p{N}{1,2} for numbers instead of GPT-4’s \p{N}{1,3} for smaller vocabs). See SPLIT_PATTERN in nanochat/tokenizer.py.

Files in this directory

File Description
tokenizer.pkl Pickled tiktoken.Encoding (mergeable_ranks, special_tokens, pattern). Load with RustBPETokenizer.from_directory(this_dir).
token_bytes.pt torch.tensor of shape (vocab_size,), dtype int32: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation.

Training (speedrun)

  • Script: python -m scripts.tok_train
  • Args (defaults): --vocab-size=32768, --max-chars=2_000_000_000, --doc-cap=10_000
  • Data: Base data shards via parquets_iter_batched(split="train") (e.g. 8-shard quick + 370-shard full download)
  • Report: Tokenizer training stats are logged to the nanochat report (e.g. token_bytes_min/max/mean/std, train_time).

Loading in code

from nanochat.tokenizer import RustBPETokenizer

# From this directory
tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")

# From default base dir (used by get_tokenizer())
# tokenizer = get_tokenizer()  # uses get_base_dir() + "tokenizer"

To use this copy as the default, set NANOCHAT_BASE_DIR to a path whose tokenizer subdir contains these files, or symlink ~/.cache/nanochat/tokenizer to models/tokenizer.