# Nanochat tokenizer (this copy)

This directory is a copy of the tokenizer produced by the speedrun (trained with `scripts/tok_train`). The canonical location used at runtime is `$NANOCHAT_BASE_DIR/tokenizer` (default: `~/.cache/nanochat/tokenizer/`).

## Details

| Property | Value |
|----------|--------|
| **Type** | BPE (Byte Pair Encoding), GPT-4 style |
| **Implementation** | `RustBPETokenizer`: trained with `rustbpe`, inference with `tiktoken` |
| **Vocab size** | 32,768 (2^15) |
| **Merge count** | 32,503 merges (vocab_size − num_special_tokens) |

### Special tokens (8)

Used for document boundaries and chat turn structure:

- `<|bos|>` — beginning of sequence (document start)
- `<|user_start|>`, `<|user_end|>` — user message boundaries
- `<|assistant_start|>`, `<|assistant_end|>` — assistant message boundaries
- `<|python_start|>`, `<|python_end|>` — Python REPL tool call
- `<|output_start|>`, `<|output_end|>` — Python REPL output

(Defined in `nanochat/tokenizer.py` as `SPECIAL_TOKENS`.)

### Pre-tokenizer pattern

GPT-4-style regex (with `\p{N}{1,2}` for numbers instead of GPT-4’s `\p{N}{1,3}` for smaller vocabs). See `SPLIT_PATTERN` in `nanochat/tokenizer.py`.

### Files in this directory

| File | Description |
|------|-------------|
| `tokenizer.pkl` | Pickled `tiktoken.Encoding` (mergeable_ranks, special_tokens, pattern). Load with `RustBPETokenizer.from_directory(this_dir)`. |
| `token_bytes.pt` | `torch.tensor` of shape `(vocab_size,)`, dtype `int32`: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation. |

### Training (speedrun)

- **Script:** `python -m scripts.tok_train`
- **Args (defaults):** `--vocab-size=32768`, `--max-chars=2_000_000_000`, `--doc-cap=10_000`
- **Data:** Base data shards via `parquets_iter_batched(split="train")` (e.g. 8-shard quick + 370-shard full download)
- **Report:** Tokenizer training stats are logged to the nanochat report (e.g. `token_bytes_min/max/mean/std`, `train_time`).

### Loading in code

```python
from nanochat.tokenizer import RustBPETokenizer

# From this directory
tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")

# From default base dir (used by get_tokenizer())
# tokenizer = get_tokenizer()  # uses get_base_dir() + "tokenizer"
```

To use this copy as the default, set `NANOCHAT_BASE_DIR` to a path whose `tokenizer` subdir contains these files, or symlink `~/.cache/nanochat/tokenizer` to `models/tokenizer`.