# Nanochat tokenizer (this copy)

This directory is a copy of the tokenizer produced by the speedrun (trained with `scripts/tok_train`). The canonical location used at runtime is `$NANOCHAT_BASE_DIR/tokenizer` (default: `~/.cache/nanochat/tokenizer/`).

## Details

| Property | Value |
|----------|--------|
| **Type** | BPE (Byte Pair Encoding), GPT-4 style |
| **Implementation** | `RustBPETokenizer`: trained with `rustbpe`, inference with `tiktoken` |
| **Vocab size** | 32,768 (2^15) |
| **Merge count** | 32,503 merges (vocab_size − 256 byte tokens − 9 special tokens) |

### Special tokens (9)

Used for document boundaries and chat turn structure:

- `<|bos|>`: beginning of sequence (document start)
- `<|user_start|>`, `<|user_end|>`: user message boundaries
- `<|assistant_start|>`, `<|assistant_end|>`: assistant message boundaries
- `<|python_start|>`, `<|python_end|>`: Python REPL tool call
- `<|output_start|>`, `<|output_end|>`: Python REPL output

(Defined in `nanochat/tokenizer.py` as `SPECIAL_TOKENS`.)

### Pre-tokenizer pattern

GPT-4-style regex, except that digit runs are split with `\p{N}{1,2}` instead of GPT-4’s `\p{N}{1,3}`, which suits smaller vocabularies. See `SPLIT_PATTERN` in `nanochat/tokenizer.py`.

### Files in this directory

| File | Description |
|------|-------------|
| `tokenizer.pkl` | Pickled `tiktoken.Encoding` (mergeable_ranks, special_tokens, pattern). Load with `RustBPETokenizer.from_directory(this_dir)`. |
| `token_bytes.pt` | `torch.tensor` of shape `(vocab_size,)`, dtype `int32`: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation. |

### Training (speedrun)

- **Script:** `python -m scripts.tok_train`
- **Args (defaults):** `--vocab-size=32768`, `--max-chars=2_000_000_000`, `--doc-cap=10_000`
- **Data:** Base data shards via `parquets_iter_batched(split="train")` (e.g. 8-shard quick + 370-shard full download)
- **Report:** Tokenizer training stats are logged to the nanochat report (e.g. `token_bytes_min/max/mean/std`, `train_time`).
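
As a rough illustration of why `token_bytes.pt` stores a byte length per token id, the sketch below turns per-token cross-entropy losses into a bits-per-byte figure. It is a minimal pure-Python sketch, not the nanochat evaluation code: the function name `bits_per_byte`, the assumption that losses are in nats, and the toy lookup table are all hypothetical. The one detail taken from this README is that special tokens have byte length 0, so they are left out of both the loss sum and the byte count.

```python
import math


def bits_per_byte(token_ids, losses_nats, token_bytes):
    """Convert per-token losses (assumed to be in nats) to bits per byte.

    token_bytes[i] is the byte length of token id i, with 0 marking a
    special token, mirroring the layout described for token_bytes.pt.
    """
    total_nats = 0.0
    total_bytes = 0
    for token_id, loss in zip(token_ids, losses_nats):
        n_bytes = token_bytes[token_id]
        if n_bytes > 0:  # skip special tokens: they represent no text bytes
            total_nats += loss
            total_bytes += n_bytes
    if total_bytes == 0:
        return float("inf")  # nothing but special tokens in the input
    # Divide by ln(2) to go from nats to bits, then normalize by bytes.
    return total_nats / (math.log(2) * total_bytes)


# Hypothetical toy table: id 0 is a special token, id 1 is 1 byte, id 2 is 3 bytes.
toy_token_bytes = [0, 1, 3]
toy_ids = [0, 1, 2, 2]
toy_losses = [5.0, math.log(2), 2 * math.log(2), 3 * math.log(2)]

bpb = bits_per_byte(toy_ids, toy_losses, toy_token_bytes)
print(f"{bpb:.4f}")
```

Normalizing by bytes rather than by tokens keeps the metric comparable across tokenizers with different vocabulary sizes, which is presumably why the table is saved alongside the tokenizer.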
### Loading in code

```python
from nanochat.tokenizer import RustBPETokenizer

# From this directory
tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")

# From default base dir (used by get_tokenizer())
# from nanochat.tokenizer import get_tokenizer
# tokenizer = get_tokenizer()  # uses get_base_dir() + "tokenizer"
```

To use this copy as the default, set `NANOCHAT_BASE_DIR` to a path whose `tokenizer` subdirectory contains these files, or symlink `~/.cache/nanochat/tokenizer` to `models/tokenizer`.
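
Before switching the default, it can help to check which directory would be picked up and whether both files are present. The sketch below follows only the lookup rule stated in this README (`$NANOCHAT_BASE_DIR/tokenizer`, falling back to `~/.cache/nanochat/tokenizer`). The helper names `resolve_tokenizer_dir` and `missing_tokenizer_files` are hypothetical and are not part of nanochat; the real `get_base_dir()` may behave differently.

```python
import os
from pathlib import Path

# File names taken from the "Files in this directory" table.
EXPECTED_FILES = ("tokenizer.pkl", "token_bytes.pt")


def resolve_tokenizer_dir(env=None):
    """Return the tokenizer directory implied by NANOCHAT_BASE_DIR.

    Assumption: an unset or empty variable falls back to ~/.cache/nanochat,
    as described in this README.
    """
    env = os.environ if env is None else env
    base = env.get("NANOCHAT_BASE_DIR")
    if base:
        base_dir = Path(base).expanduser()
    else:
        base_dir = Path.home() / ".cache" / "nanochat"
    return base_dir / "tokenizer"


def missing_tokenizer_files(tokenizer_dir):
    """List the expected files that are absent from tokenizer_dir."""
    root = Path(tokenizer_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    target = resolve_tokenizer_dir()
    missing = missing_tokenizer_files(target)
    print(f"tokenizer dir: {target}")
    print("all files present" if not missing else f"missing: {missing}")
```

An empty `missing` list only confirms that the files exist; it does not verify that they load or that they match the trained tokenizer.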