| # Nanochat tokenizer (this copy) |
|
|
| This directory is a copy of the tokenizer produced by the speedrun (trained with `scripts/tok_train`). The canonical location used at runtime is `$NANOCHAT_BASE_DIR/tokenizer` (default: `~/.cache/nanochat/tokenizer/`). |
|
|
| ## Details |
|
|
| | Property | Value | |
| |----------|--------| |
| | **Type** | BPE (Byte Pair Encoding), GPT-4 style | |
| | **Implementation** | `RustBPETokenizer`: trained with `rustbpe`, inference with `tiktoken` | |
| | **Vocab size** | 32,768 (2^15) | |
| | **Merge count** | 32,503 merges (vocab_size β num_special_tokens) | |
| |
| ### Special tokens (8) |
| |
| Used for document boundaries and chat turn structure: |
| |
| - `<|bos|>` β beginning of sequence (document start) |
| - `<|user_start|>`, `<|user_end|>` β user message boundaries |
| - `<|assistant_start|>`, `<|assistant_end|>` β assistant message boundaries |
| - `<|python_start|>`, `<|python_end|>` β Python REPL tool call |
| - `<|output_start|>`, `<|output_end|>` β Python REPL output |
| |
| (Defined in `nanochat/tokenizer.py` as `SPECIAL_TOKENS`.) |
|
|
| ### Pre-tokenizer pattern |
|
|
| GPT-4-style regex (with `\p{N}{1,2}` for numbers instead of GPT-4βs `\p{N}{1,3}` for smaller vocabs). See `SPLIT_PATTERN` in `nanochat/tokenizer.py`. |
|
|
| ### Files in this directory |
|
|
| | File | Description | |
| |------|-------------| |
| | `tokenizer.pkl` | Pickled `tiktoken.Encoding` (mergeable_ranks, special_tokens, pattern). Load with `RustBPETokenizer.from_directory(this_dir)`. | |
| | `token_bytes.pt` | `torch.tensor` of shape `(vocab_size,)`, dtype `int32`: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation. | |
|
|
| ### Training (speedrun) |
|
|
| - **Script:** `python -m scripts.tok_train` |
| - **Args (defaults):** `--vocab-size=32768`, `--max-chars=2_000_000_000`, `--doc-cap=10_000` |
| - **Data:** Base data shards via `parquets_iter_batched(split="train")` (e.g. 8-shard quick + 370-shard full download) |
| - **Report:** Tokenizer training stats are logged to the nanochat report (e.g. `token_bytes_min/max/mean/std`, `train_time`). |
|
|
| ### Loading in code |
|
|
| ```python |
| from nanochat.tokenizer import RustBPETokenizer |
| |
| # From this directory |
| tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer") |
| |
| # From default base dir (used by get_tokenizer()) |
| # tokenizer = get_tokenizer() # uses get_base_dir() + "tokenizer" |
| ``` |
|
|
| To use this copy as the default, set `NANOCHAT_BASE_DIR` to a path whose `tokenizer` subdir contains these files, or symlink `~/.cache/nanochat/tokenizer` to `models/tokenizer`. |
|
|