	# Nanochat tokenizer (this copy)

	This directory is a copy of the tokenizer produced by the speedrun (trained with `scripts/tok_train`). The canonical location used at runtime is `$NANOCHAT_BASE_DIR/tokenizer` (default: `~/.cache/nanochat/tokenizer/`).

	## Details

	| Property | Value |
	|----------|--------|
	| Type | BPE (Byte Pair Encoding), GPT-4 style |
	| Implementation | `RustBPETokenizer`: trained with `rustbpe`, inference with `tiktoken` |
	| Vocab size | 32,768 (2^15) |
	| Merge count | 32,503 merges (vocab_size − 256 byte tokens − num_special_tokens) |
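
The merge count follows from the vocabulary budget: a byte-level BPE starts from 256 single-byte tokens, and the special tokens are reserved outside the merge process. A small arithmetic check, with illustrative variable names that are not taken from the nanochat code:

```python
# Vocabulary budget for a byte-level BPE tokenizer.
VOCAB_SIZE = 2 ** 15          # 32,768 total token ids
NUM_BYTE_TOKENS = 256         # one base token per possible byte value
NUM_SPECIAL_TOKENS = 9        # the special tokens listed in this README

num_merges = VOCAB_SIZE - NUM_BYTE_TOKENS - NUM_SPECIAL_TOKENS
print(num_merges)  # 32503
```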

	### Special tokens (9)

	Used for document boundaries and chat turn structure:

	- `<|bos|>`: beginning of sequence (document start)
	- `<|user_start|>`, `<|user_end|>`: user message boundaries
	- `<|assistant_start|>`, `<|assistant_end|>`: assistant message boundaries
	- `<|python_start|>`, `<|python_end|>`: Python REPL tool call
	- `<|output_start|>`, `<|output_end|>`: Python REPL output

	(Defined in `nanochat/tokenizer.py` as `SPECIAL_TOKENS`.)
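
As a rough illustration of how these markers can delimit a conversation, the sketch below renders a two-turn chat as plain text. `render_chat` is a hypothetical helper written for this README, not a nanochat function; the exact layout nanochat uses may differ, and in the real tokenizer each marker is a single token id rather than a substring.

```python
SPECIAL_TOKENS = [
    "<|bos|>",
    "<|user_start|>", "<|user_end|>",
    "<|assistant_start|>", "<|assistant_end|>",
    "<|python_start|>", "<|python_end|>",
    "<|output_start|>", "<|output_end|>",
]

def render_chat(messages):
    """Hypothetical helper: wrap each (role, text) turn in its start/end markers."""
    parts = ["<|bos|>"]  # every document begins with the BOS marker
    for role, text in messages:
        if role not in ("user", "assistant"):
            raise ValueError(f"unsupported role: {role}")
        parts.append(f"<|{role}_start|>{text}<|{role}_end|>")
    return "".join(parts)

rendered = render_chat([("user", "What is 2 + 2?"), ("assistant", "4")])
print(rendered)
```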

	### Pre-tokenizer pattern

	GPT-4-style split regex, with one change: runs of digits are split with `\p{N}{1,2}` instead of GPT-4's `\p{N}{1,3}`, so that a smaller vocabulary spends fewer entries on number chunks. See `SPLIT_PATTERN` in `nanochat/tokenizer.py`.
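
The effect of the digit clause can be seen in isolation. The sketch below covers only that clause, not the full `SPLIT_PATTERN`, and uses `\d` as a stand-in for `\p{N}` because the standard-library `re` module does not support Unicode property classes; `split_digits` is a hypothetical helper.

```python
import re

def split_digits(text, max_run):
    """Split digit runs into chunks of at most `max_run` digits (greedy, left to right)."""
    return re.findall(r"\d{1,%d}" % max_run, text)

gpt4_chunks = split_digits("1234567", 3)
nanochat_chunks = split_digits("1234567", 2)
print(gpt4_chunks)      # ['123', '456', '7']
print(nanochat_chunks)  # ['12', '34', '56', '7']
```

For ASCII digits, chunks of at most two digits give at most 10 + 100 = 110 distinct number pieces, compared with 1,110 when three-digit chunks are allowed.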

	### Files in this directory

	| File | Description |
	|------|-------------|
	| `tokenizer.pkl` | Pickled `tiktoken.Encoding` (mergeable_ranks, special_tokens, pattern). Load with `RustBPETokenizer.from_directory(this_dir)`. |
	| `token_bytes.pt` | `torch.tensor` of shape `(vocab_size,)`, dtype `int32`: byte length per token id (0 for special tokens). Used for bits-per-byte evaluation. |
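
To show why per-token byte lengths are useful, here is a minimal sketch of a bits-per-byte computation. It assumes per-token losses in nats and uses plain Python lists in place of the `token_bytes.pt` tensor; `bits_per_byte` is a hypothetical helper, and the evaluation code in nanochat may differ in its details.

```python
import math

def bits_per_byte(losses_nats, token_ids, token_bytes):
    """Sum the loss over byte-carrying tokens, then normalize by total bytes."""
    total_nats = 0.0
    total_bytes = 0
    for loss, tok in zip(losses_nats, token_ids):
        n = token_bytes[tok]
        if n > 0:  # special tokens have byte length 0 and are skipped
            total_nats += loss
            total_bytes += n
    if total_bytes == 0:
        raise ValueError("no byte-carrying tokens in the batch")
    # Convert nats to bits by dividing by ln(2).
    return total_nats / (math.log(2) * total_bytes)

# Toy vocabulary: id 0 is a special token, ids 1 and 2 cover 1 and 3 bytes.
toy_token_bytes = [0, 1, 3]
bpb = bits_per_byte(
    losses_nats=[5.0, math.log(2), math.log(2)],
    token_ids=[0, 1, 2],
    token_bytes=toy_token_bytes,
)
print(bpb)
```

Normalizing by bytes rather than by tokens makes the metric comparable across tokenizers with different vocabulary sizes.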

	### Training (speedrun)

	- Script: `python -m scripts.tok_train`
	- Args (defaults): `--vocab-size=32768`, `--max-chars=2_000_000_000`, `--doc-cap=10_000`
	- Data: Base data shards via `parquets_iter_batched(split="train")` (e.g. 8-shard quick + 370-shard full download)
	- Report: Tokenizer training stats are logged to the nanochat report (e.g. `token_bytes_min/max/mean/std`, `train_time`).
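
The two size flags can be pictured as a per-document cap plus a global character budget. The sketch below assumes that `--doc-cap` truncates each document and `--max-chars` bounds the total text seen; these semantics are inferred from the flag names, and `capped_text_iterator` is a hypothetical helper rather than the code in `scripts.tok_train`.

```python
def capped_text_iterator(docs, doc_cap, max_chars):
    """Hypothetical helper: truncate each document and stop once the budget is spent."""
    seen = 0
    for doc in docs:
        text = doc[:doc_cap]      # per-document cap
        seen += len(text)
        yield text
        if seen >= max_chars:     # global character budget reached
            return

docs = ["a" * 50, "b" * 5, "c" * 50, "d" * 50]
chunks = list(capped_text_iterator(docs, doc_cap=10, max_chars=25))
print([len(c) for c in chunks])  # [10, 5, 10]
```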

	### Loading in code

	```python
	from nanochat.tokenizer import RustBPETokenizer

	# From this directory (adjust the path to your own checkout)
	tokenizer = RustBPETokenizer.from_directory("/home/piren/nanochat/models/tokenizer")

	# From the default base dir (what get_tokenizer() does):
	# tokenizer = get_tokenizer()  # loads from the "tokenizer" subdir of get_base_dir()
	```

	To use this copy as the default, set `NANOCHAT_BASE_DIR` to a path whose `tokenizer` subdir contains these files, or symlink `~/.cache/nanochat/tokenizer` to `models/tokenizer`.
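
The lookup described above can be sketched as follows. `resolve_tokenizer_dir` is a hypothetical helper that mirrors the documented behavior (environment variable first, then `~/.cache/nanochat`); it is not the nanochat implementation.

```python
import os

def resolve_tokenizer_dir(env=None):
    """Hypothetical helper: return the tokenizer directory for a given environment."""
    env = os.environ if env is None else env
    default_base = os.path.join(os.path.expanduser("~"), ".cache", "nanochat")
    base_dir = env.get("NANOCHAT_BASE_DIR") or default_base
    return os.path.join(base_dir, "tokenizer")

print(resolve_tokenizer_dir({}))
print(resolve_tokenizer_dir({"NANOCHAT_BASE_DIR": "/data/nanochat"}))
```

Passing the environment as an argument keeps the function easy to test without modifying `os.environ`.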