	# Replication Guide

	This guide explains how to replicate the Slayer GPT-style Polish language-model experiment from raw text to a runnable checkpoint.

	The repo contains two related tracks:

	1. `model/ckpt.pt` is a small, runnable GPT checkpoint paired with `tokenizers/polish_bpe_32k.json`.
	2. `training/` contains the larger Slayer H100 training code derived from modded-nanoGPT. Its full optimizer checkpoints are documented but not committed.

	Use the small track for teaching and local demos. Use the H100 track to explain how the larger remote run was structured.

	## 1. What Was Built

	The local model is a GPT-2-style decoder-only Transformer:

	- 12 layers
	- 12 attention heads
	- 768 embedding dimension
	- 1024 token context
	- 32768-token custom Polish byte-level BPE vocabulary
	- bias-free linear/layernorm setup
	- about 136M parameters in the checkpoint state dict
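The 136M figure can be checked by hand. The sketch below assumes a nanoGPT-style layout that the guide does not spell out: learned token and position embeddings, a fused QKV projection, a 4x-wide MLP, weight-only layernorms, and a state dict that stores the output head as its own entry. The function name `gpt_param_counts` is illustrative; `scripts/model.py` remains the authority on the real layout.

```python
def gpt_param_counts(n_layer=12, n_embd=768, block_size=1024, vocab_size=32768):
    """Count weights for an assumed nanoGPT-style, bias-free GPT."""
    wte = vocab_size * n_embd                      # token embedding
    wpe = block_size * n_embd                      # learned position embedding (assumed)
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # fused QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                # up and down projections, 4x width (assumed)
    norms = 2 * n_embd                             # two weight-only layernorms per block
    per_block = attn + mlp + norms
    unique = wte + wpe + n_layer * per_block + n_embd  # last term: final layernorm
    # If the output head is stored as its own state-dict entry, it contributes
    # another vocab_size * n_embd values, whether or not it is tied to wte.
    with_head_entry = unique + vocab_size * n_embd
    return unique, with_head_entry


unique, with_head_entry = gpt_param_counts()
print(f"unique weights:        {unique:,}")
print(f"with separate lm_head: {with_head_entry:,}")
```

Under these assumptions the second number is about 136M, which matches the state-dict figure above, while the first (about 111M) would be the count of distinct weights if the output head shares the token embedding.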

	It is not a fine-tune of OpenAI GPT-2 weights. The important idea is the recipe:

	1. collect Polish text,
	2. train a custom byte-level BPE tokenizer,
	3. tokenize the corpus into token-id shards,
	4. train a causal language model on next-token prediction,
	5. sample and evaluate with the exact same tokenizer.
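Step 4 of the recipe can be made concrete with a toy token stream: each training window is paired with the same window shifted by one position. This is a minimal sketch, not the batching used by either trainer; `next_token_pairs` and the token ids are made up for illustration.

```python
def next_token_pairs(token_ids, block_size):
    """Cut a token stream into (input, target) windows for causal LM training."""
    pairs = []
    # Each window needs block_size inputs plus one extra token for the last target.
    for start in range(0, len(token_ids) - block_size, block_size):
        window = token_ids[start : start + block_size + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs


stream = [0, 11, 12, 13, 0, 21, 22, 23, 24]  # made-up ids; 0 plays the document separator
for x, y in next_token_pairs(stream, block_size=4):
    print(x, "->", y)
```

The target at position `i` is the input at position `i + 1`, so the model is always asked to predict the token that follows what it has seen.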

	## 2. Repository Map

	- `model/ckpt.pt` - runnable model checkpoint.
	- `tokenizers/polish_bpe_32k.json` - tokenizer paired with `model/ckpt.pt`.
	- `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer. Do not use it with `model/ckpt.pt`.
	- `scripts/model.py` - GPT implementation for the local checkpoint.
	- `scripts/sample_mac.py` - generation script.
	- `scripts/knowbench_mac.py` and `scripts/syntaxbench_mac.py` - simple evaluation probes.
	- `examples/prepare_corpus.py` - reference script for tokenizer training and `.bin` shard creation.
	- `training/train_gpt.py` - H100 Slayer training script.
	- `training/run_polish.sh` - remote launch script used on `ssh slayer`.
	- `logs/` and `metadata/` - evidence from the saved local and remote runs.

	## 3. Local Demo

	```bash
	python3 -m venv .venv
	source .venv/bin/activate
	pip install -r requirements.txt
	python scripts/sample_mac.py "Polska jest" 80
	```

	Expected behavior:

	- On Apple Silicon, the sampler uses MPS.
	- On other machines, it falls back to CPU.
	- The model should load `model/ckpt.pt` and `tokenizers/polish_bpe_32k.json` without path changes.

	Run the lightweight probes:

	```bash
	python scripts/syntaxbench_mac.py
	python scripts/knowbench_mac.py
	```

	## 4. Corpus Preparation

	For teaching, start with a plain UTF-8 text corpus:

	```text
	data/raw/doc_0001.txt
	data/raw/doc_0002.txt
	...
	```

	One document per file is easiest to reason about. Remove boilerplate and repair encoding damage, but do not over-normalize the text: keep case, diacritics, and punctuation as they are. The tokenizer is byte-level BPE, so it can represent arbitrary text.
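"Byte-level" means the tokenizer's base alphabet is the 256 possible byte values, so any UTF-8 string has some encoding even before merges are learned. The sketch below only shows the raw bytes behind Polish diacritics; `utf8_byte_view` is an illustrative helper, not part of the repo.

```python
def utf8_byte_view(text):
    """Return each character with the UTF-8 bytes a byte-level tokenizer starts from."""
    return [(ch, list(ch.encode("utf-8"))) for ch in text]


for ch, raw in utf8_byte_view("żółć"):
    print(ch, raw)
```

Each of these four letters occupies two bytes, which is why a tokenizer trained on Polish text tends to learn merges for them early.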

	Recommended minimum for a useful class demo:

	- tokenizer demo: 10 MB to 100 MB of text,
	- tiny GPT training demo: 100 MB to 1 GB,
	- meaningful GPT run: many GB.

	Keep validation data separate before tokenization.
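One way to keep the split stable across reruns is to assign each document by a hash of its filename. This is a sketch under assumptions: the 5% validation share and the helper names are illustrative, and `examples/prepare_corpus.py` may split differently.

```python
import hashlib
from pathlib import Path


def assign_split(doc_name, val_percent=5):
    """Deterministically map a document name to 'train' or 'val'."""
    digest = hashlib.sha256(doc_name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "val" if bucket < val_percent else "train"


def split_corpus(raw_dir, val_percent=5):
    """Group the .txt files under raw_dir by split, before any tokenization."""
    splits = {"train": [], "val": []}
    for path in sorted(Path(raw_dir).glob("*.txt")):
        splits[assign_split(path.name, val_percent)].append(path)
    return splits
```

Because the assignment depends only on the filename, adding new documents later never moves an existing document from one split to the other.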

	## 5. Tokenizer Training

	The model checkpoint in this repo uses:

	- byte-level BPE,
	- vocab size 32768,
	- no Unicode normalizer,
	- `add_prefix_space=False`,
	- `<|endoftext|>` as the document separator / BOS-style token.
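The settings above map onto a short training routine. The sketch assumes the Hugging Face `tokenizers` library, which the `.json` tokenizer files suggest but this guide does not state; `train_byte_level_bpe` is an illustrative name, and the reference implementation is `examples/prepare_corpus.py`.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers


def train_byte_level_bpe(texts, vocab_size=32768):
    """Train a byte-level BPE tokenizer with no normalizer and no prefix space."""
    tokenizer = Tokenizer(models.BPE())
    # No tokenizer.normalizer is set, so text reaches BPE unnormalized.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>"],  # added first, so it receives id 0
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
        show_progress=False,
    )
    tokenizer.train_from_iterator(texts, trainer=trainer)
    return tokenizer


# Tiny in-memory corpus, for illustration only.
corpus = ["Zażółć gęślą jaźń.", "Polska jest krajem w Europie."] * 20
tok = train_byte_level_bpe(corpus, vocab_size=300)
ids = tok.encode("Zażółć gęślą jaźń.").ids
print(len(ids), tok.decode(ids))
```

Seeding `initial_alphabet` with the full byte alphabet keeps every byte representable even when the training text never contains it.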

	Train a compatible tokenizer and create token-id shards with:

	```bash
	python examples/prepare_corpus.py \
	  --raw-dir data/raw \
	  --out-dir data/processed \
	  --vocab-size 32768 \
	  --train-tokenizer
	```

	This writes:

	- `data/processed/tokenizer.json`
	- `data/processed/shards/polish_train_000000.bin`
	- `data/processed/shards/polish_val_000000.bin`

	The `.bin` shards are raw `uint16` token ids. This matters because `training/train_gpt.py` expects token ids to fit under 65536 and loads shards as `torch.uint16`.
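A raw `uint16` shard can be written and read with the standard library alone. This sketch assumes header-less files and little-endian byte order; neither detail is stated here, so check the loader in `training/train_gpt.py` before relying on it. The function names are illustrative.

```python
import sys
from array import array


def write_uint16_shard(path, token_ids):
    """Write token ids as header-less little-endian uint16 values (assumed layout)."""
    if any(t < 0 or t > 0xFFFF for t in token_ids):
        raise ValueError("token id does not fit in uint16")
    buf = array("H", token_ids)
    assert buf.itemsize == 2, "platform 'H' is not 16 bits"
    if sys.byteorder == "big":
        buf.byteswap()  # store little-endian regardless of host
    with open(path, "wb") as f:
        buf.tofile(f)


def read_uint16_shard(path):
    """Read a header-less little-endian uint16 shard back into a list of ints."""
    buf = array("H")
    with open(path, "rb") as f:
        buf.frombytes(f.read())
    if sys.byteorder == "big":
        buf.byteswap()
    return buf.tolist()
```

A shard written this way takes exactly two bytes per token, so dividing the file size by two gives the token count.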

	For a fresh 65k experiment you may use a 65536-token tokenizer, but then the model config and training code must match that vocabulary size. Do not use the 65k tokenizer with the checked-in 32768-vocab checkpoint.
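A quick guard against pairing the wrong tokenizer with a model is to compare vocabulary sizes before loading anything heavy. The sketch assumes the Hugging Face `tokenizers` JSON layout (a `model.vocab` mapping plus an optional `added_tokens` list); both helper names are illustrative.

```python
import json


def tokenizer_vocab_size(tokenizer_json_path):
    """Infer vocabulary size as (largest token id + 1) from a tokenizer JSON file."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)
    ids = list(data["model"]["vocab"].values())
    ids += [tok["id"] for tok in data.get("added_tokens", [])]
    return max(ids) + 1


def check_vocab_match(model_vocab_size, tokenizer_json_path):
    """Raise if the tokenizer and model vocabulary sizes differ."""
    tok_size = tokenizer_vocab_size(tokenizer_json_path)
    if tok_size != model_vocab_size:
        raise ValueError(
            f"tokenizer has {tok_size} ids but the model expects {model_vocab_size}"
        )
    return tok_size
```

For example, `check_vocab_match(32768, "tokenizers/polish_bpe_32k.json")` should pass for the checked-in pair if the file follows the assumed layout, and the 65k tokenizer should be rejected.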

	## 6. Training A Small Local GPT

	This repo includes inference code for the saved checkpoint, not a full clean-room tiny trainer. For teaching, the simplest path is:

	1. Use `examples/prepare_corpus.py` to create tokenizer and shards.
	2. Use an existing nanoGPT trainer or your own minimal causal LM trainer.
	3. Match these model settings for compatibility with `scripts/model.py`:

	```python
	GPTConfig(
	    n_layer=12,
	    n_head=12,
	    n_embd=768,
	    block_size=1024,
	    vocab_size=32768,
	    bias=False,
	    dropout=0.0,
	)
	```

	4. Save checkpoints in this format:

	```python
	torch.save(
	    {
	        "model": model.state_dict(),
	        "model_args": config_dict,
	        "iter": step,
	    },
	    "ckpt.pt",
	)
	```
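Before copying a checkpoint into `model/`, it helps to confirm the dict has the shape shown above. The sketch works on a plain dict, such as the result of `torch.load("ckpt.pt", map_location="cpu")`. Which fields the loader in `scripts/` actually requires is an assumption based on the format and config in this guide, and `validate_checkpoint` is an illustrative name.

```python
REQUIRED_KEYS = ("model", "model_args", "iter")
REQUIRED_ARGS = ("n_layer", "n_head", "n_embd", "block_size", "vocab_size", "bias")


def validate_checkpoint(ckpt):
    """Return a list of problems found in a loaded checkpoint dict (empty if none)."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS if k not in ckpt]
    args = ckpt.get("model_args", {})
    problems += [f"missing model_args field: {a}" for a in REQUIRED_ARGS if a not in args]
    # Attention splits n_embd evenly across heads, so it must divide cleanly.
    if "n_embd" in args and "n_head" in args and args["n_embd"] % args["n_head"] != 0:
        problems.append("n_embd is not divisible by n_head")
    return problems
```

An empty list means the checkpoint at least matches the documented format; it says nothing about whether the weights themselves are good.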

	Then put the checkpoint at `model/ckpt.pt`, the tokenizer at `tokenizers/polish_bpe_32k.json`, and run:

	```bash
	python scripts/sample_mac.py "Dawno temu w Polsce" 100
	```

	## 7. Replicating The Slayer H100 Run

	The remote run used a modified fast GPT trainer under:

	```text
	ssh slayer:/home/ubuntu/modded-nanogpt
	```

	The local copy of the relevant training files is in `training/`.

	Important paths from the run:

	```text
	~/dynaword/shards/polish_train_*.bin
	~/dynaword/shards/polish_val_*.bin
	~/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt
	```

	The H100 trainer expects:

	- CUDA-capable Linux host,
	- PyTorch 2.10-compatible environment,
	- Triton,
	- `kernels`,
	- token-id shards as raw `uint16`,
	- train files matching `~/dynaword/shards/polish_train_*.bin`,
	- validation files matching `~/dynaword/shards/polish_val_*.bin`.
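A preflight check on the shard globs catches missing or truncated files before a GPU job starts. The sketch assumes header-less `uint16` shards, so every file must have an even byte count; `find_uint16_shards` is an illustrative helper and not part of `training/`.

```python
import glob
import os


def find_uint16_shards(pattern):
    """Expand a shard glob and return (path, token_count) pairs."""
    paths = sorted(glob.glob(os.path.expanduser(pattern)))
    if not paths:
        raise FileNotFoundError(f"no shards match {pattern}")
    shards = []
    for path in paths:
        size = os.path.getsize(path)
        if size % 2 != 0:
            raise ValueError(f"{path}: {size} bytes is not a whole number of uint16 values")
        shards.append((path, size // 2))
    return shards
```

Calling it with `"~/dynaword/shards/polish_train_*.bin"` and then with the validation pattern reports how many tokens each split holds under the header-less assumption.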

	Inspect the launch script:

	```bash
	cd training
	sed -n '1,120p' run_polish.sh
	```

	The core launch pattern is:

	```bash
	export TORCHINDUCTOR_CACHE_DIR="$HOME/.cache/torchinductor_polish"
	export TORCHINDUCTOR_FX_GRAPH_CACHE=1
	export TORCHINDUCTOR_AUTOGRAD_CACHE=1
	cd "$HOME/modded-nanogpt"
	.venv/bin/torchrun --standalone --nproc_per_node=1 train_gpt.py
	```

	The saved full training states include model and optimizer state:

	```text
	state_step000500.pt
	state_step001000.pt
	state_step001500.pt
	```

	They are about 3.6 GB each and are not committed. Fetch them only when needed:

	```bash
	mkdir -p training-states
	rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/
	```

	## 8. Evaluation And Sanity Checks

	Always validate these before trusting a run:

	- Tokenizer round trip: encode and decode sample Polish text.
	- Vocabulary compatibility: checkpoint `vocab_size` equals tokenizer vocab size.
	- Loss curve: training loss should decrease steadily, and validation loss should fall with it; a falling training loss alone can mean the model is memorizing a tiny sample.
	- Sample quality: inspect repeated n-grams and broken Unicode.
	- Validation loss: keep validation shards separate from training shards.
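The sample-quality check can be partly automated. The sketch below measures how much of a generated token sequence consists of repeated n-grams and flags the Unicode replacement character, which usually appears when a sample ends or breaks in the middle of a multi-byte sequence. Both helpers are illustrative; any threshold for "too repetitive" is left to the reader.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-gram occurrences whose n-gram appears more than once."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c for c in counts.values() if c > 1)
    return repeated / len(ngrams)


def has_replacement_char(text):
    """True if the text contains U+FFFD, a sign of broken UTF-8 byte sequences."""
    return "\ufffd" in text
```

A value near 0 means almost every n-gram is unique, and a value near 1 means the sample is looping.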

	The included metadata shows the local training loss dropping from about `10.54` at step 0 to about `4.63` at step 500; later probe rows are recorded in `metadata/traj.csv`.
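The step-0 value is a useful sanity check in itself. A model that spreads probability evenly over 32768 tokens scores a cross-entropy of ln(32768), so a freshly initialized model should start near that number. The sketch assumes the logged loss is mean cross-entropy in nats, which is the PyTorch default but is not stated in the metadata.

```python
import math


def uniform_loss(vocab_size):
    """Cross-entropy in nats of a model that spreads probability evenly over the vocab."""
    return math.log(vocab_size)


def perplexity(loss_nats):
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)


print(f"uniform baseline for 32768 tokens: {uniform_loss(32768):.2f}")
print(f"perplexity at loss 4.63: {perplexity(4.63):.0f}")
```

A starting loss far above the uniform baseline usually points at an initialization or tokenizer mismatch rather than at the data.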

	## 9. Common Mistakes

	- Mixing tokenizer files. A model trained with `polish_bpe_32k.json` must be sampled with that tokenizer.
	- Saving only weights but losing `model_args`. The loader needs architecture parameters.
	- Tokenizing train and validation together. Split first, tokenize second.
	- Using `int32` shards with the Slayer trainer. Its loader is built around raw `uint16` token ids.
	- Treating full optimizer checkpoints as deployable model artifacts. For inference, export a model-only checkpoint when possible.
	- Teaching from the H100 script first. Start with the local checkpoint and tokenizer, then show the larger training script as the scaled version.

	## 10. Suggested Lesson Flow

	1. Show `scripts/sample_mac.py` generating from the saved model.
	2. Open `metadata/artifact_manifest.json` and explain the model/tokenizer pairing.
	3. Train a toy tokenizer on a small Polish corpus with `examples/prepare_corpus.py`.
	4. Inspect tokenization of Polish words, punctuation, and diacritics.
	5. Explain next-token prediction and the checkpoint format.
	6. Show how `training/train_gpt.py` scales the same idea to H100 training.
	7. End with failure modes: wrong tokenizer, data leakage, repeated text, and no validation split.