# Nanochat tokenizer (this copy)
This directory is a copy of the tokenizer produced by the speedrun (trained with `scripts/tok_train`). The canonical location used at runtime is `$NANOCHAT_BASE_DIR/tokenizer` (default: `~/.cache/nanochat/tokenizer/`).
## Details
| Property | Value |
|----------|--------|
| **Type** | BPE (Byte Pair Encoding), GPT-4 style |
| **Implementation** | `RustBPETokenizer`: trained with `rustbpe`, inference with `tiktoken` |
| **Vocab size** | 32,768 (2^15) |
| **Merge count** | 32,503 merges (vocab_size − 256 byte tokens − num_special_tokens) |
### Special tokens (9)
Used for document boundaries and chat turn structure:
- `<|bos|>`: beginning of sequence (document start)
- `<|user_start|>`, `<|user_end|>`: user message boundaries
- `<|assistant_start|>`, `<|assistant_end|>`: assistant message boundaries
- `<|python_start|>`, `<|python_end|>`: Python REPL tool call
- `<|output_start|>`, `<|output_end|>`: Python REPL output
(Defined in `nanochat/tokenizer.py` as `SPECIAL_TOKENS`.)
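The merge count reported under Details can be checked by hand. The sketch below assumes the vocabulary is made up of 256 single-byte base tokens, the learned merges, and the special tokens listed above. The list is copied from this README, not imported from `nanochat/tokenizer.py`.

```python
# Special tokens as listed in this README (copied by hand, not imported
# from nanochat/tokenizer.py).
SPECIAL_TOKENS = [
    "<|bos|>",
    "<|user_start|>", "<|user_end|>",
    "<|assistant_start|>", "<|assistant_end|>",
    "<|python_start|>", "<|python_end|>",
    "<|output_start|>", "<|output_end|>",
]

VOCAB_SIZE = 32768     # 2**15
NUM_BYTE_TOKENS = 256  # assumption: one base token per possible byte value

# Whatever is left after the byte tokens and special tokens is learned merges.
num_merges = VOCAB_SIZE - NUM_BYTE_TOKENS - len(SPECIAL_TOKENS)
print(len(SPECIAL_TOKENS), num_merges)  # 9 32503
```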
### Pre-tokenizer pattern
GPT-4-style regex, except that numbers are split with `\p{N}{1,2}` instead of GPT-4's `\p{N}{1,3}`, which suits smaller vocabularies. See `SPLIT_PATTERN` in `nanochat/tokenizer.py`.
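To illustrate the effect of the shorter digit quantifier, here is a minimal sketch that isolates the number branch only. It uses the standard-library `re` module with `\d` as a simplified stand-in for `\p{N}` (which `re` does not support); the real `SPLIT_PATTERN` contains several more alternatives.

```python
import re

# Simplified stand-ins: \d instead of \p{N}, and only the number branch
# of the pattern (not the full SPLIT_PATTERN).
two_digit = re.compile(r"\d{1,2}")    # the variant described in this README
three_digit = re.compile(r"\d{1,3}")  # the GPT-4 variant

text = "1234567"
chunks_2 = two_digit.findall(text)
chunks_3 = three_digit.findall(text)
print(chunks_2)  # ['12', '34', '56', '7']
print(chunks_3)  # ['123', '456', '7']
```

With ASCII digits there are only 110 distinct one- or two-digit chunks, compared with 1,110 chunks of up to three digits, so the shorter quantifier leaves fewer number pieces competing for vocabulary slots.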
### Files in this directory
| File | Description |
|------|-------------|
| `tokenizer.pkl` | Pickled `tiktoken.Encoding` (mergeable_ranks, special_tokens, pattern). Load with `RustBPETokenizer.from_directory(this_dir)`. |
| `token_bytes.pt` | `torch.tensor` of shape `(vocab_size,)`, dtype `int32`: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation. |
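As a rough illustration of how per-token byte lengths feed a bits-per-byte metric, the sketch below uses plain Python lists in place of the tensor. The function name and signature are hypothetical and this is not the evaluation code from nanochat; it assumes per-token losses are given in nats and that tokens with a byte length of 0 (special tokens) are excluded.

```python
import math

def bits_per_byte(losses_nats, target_ids, token_bytes):
    """Hypothetical helper: total loss in bits divided by total target bytes.

    losses_nats: per-token cross-entropy losses, in nats
    target_ids:  the target token id at each position
    token_bytes: byte length per token id (0 for special tokens)
    """
    total_nats = 0.0
    total_bytes = 0
    for loss, token_id in zip(losses_nats, target_ids):
        num_bytes = token_bytes[token_id]
        if num_bytes > 0:  # skip special tokens, which carry no text bytes
            total_nats += loss
            total_bytes += num_bytes
    if total_bytes == 0:
        return float("nan")
    # Convert nats to bits (divide by ln 2), then normalize by byte count.
    return total_nats / (math.log(2) * total_bytes)

# Toy example: id 0 is a special token, id 1 is 1 byte, id 2 is 3 bytes.
toy_token_bytes = [0, 1, 3]
toy_bpb = bits_per_byte([math.log(2)] * 3, [0, 1, 2], toy_token_bytes)
print(toy_bpb)
```

Normalizing by bytes instead of tokens makes the metric comparable across tokenizers with different vocabulary sizes.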
### Training (speedrun)
- **Script:** `python -m scripts.tok_train`
- **Args (defaults):** `--vocab-size=32768`, `--max-chars=2_000_000_000`, `--doc-cap=10_000`
- **Data:** Base data shards via `parquets_iter_batched(split="train")` (e.g. 8-shard quick + 370-shard full download)
- **Report:** Tokenizer training stats are logged to the nanochat report (e.g. `token_bytes_min/max/mean/std`, `train_time`).
### Loading in code
```python
from nanochat.tokenizer import RustBPETokenizer
# From this directory (adjust the path to your own checkout)
tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")
# From the default base dir: get_tokenizer() loads from get_base_dir() + "tokenizer"
# from nanochat.tokenizer import get_tokenizer
# tokenizer = get_tokenizer()
```
To use this copy as the default, set `NANOCHAT_BASE_DIR` to a path whose `tokenizer` subdir contains these files, or symlink `~/.cache/nanochat/tokenizer` to `models/tokenizer`.
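The lookup described above can be sketched as follows. `resolve_tokenizer_dir` is a hypothetical helper written for this README, not a nanochat function; it only mirrors the documented behavior (use `$NANOCHAT_BASE_DIR/tokenizer` if the variable is set, otherwise `~/.cache/nanochat/tokenizer`).

```python
import os

def resolve_tokenizer_dir(env=None):
    """Hypothetical helper: mirrors the lookup described in this README."""
    env = os.environ if env is None else env
    base_dir = env.get("NANOCHAT_BASE_DIR")
    if not base_dir:
        # Documented default when NANOCHAT_BASE_DIR is unset.
        base_dir = os.path.join(os.path.expanduser("~"), ".cache", "nanochat")
    return os.path.join(base_dir, "tokenizer")

# With the variable set, the tokenizer is expected under <base>/tokenizer.
print(resolve_tokenizer_dir({"NANOCHAT_BASE_DIR": "/tmp/nanochat"}))
# Without it, the default cache location is used.
print(resolve_tokenizer_dir({}))
```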