# Replication Guide
This guide explains how to replicate the Slayer GPT-style Polish language-model experiment from raw text to a runnable checkpoint.
The repo contains two related tracks:
1. `model/ckpt.pt` is a small, runnable GPT checkpoint paired with `tokenizers/polish_bpe_32k.json`.
2. `training/` contains the larger Slayer H100 training code derived from modded-nanoGPT. Its full optimizer checkpoints are documented but not committed.
Use the small track for teaching and local demos. Use the H100 track to explain how the larger remote run was structured.
## 1. What Was Built
The local model is a GPT-2-style decoder-only Transformer:
- 12 layers
- 12 attention heads
- 768 embedding dimension
- 1024 token context
- 32768-token custom Polish byte-level BPE vocabulary
- no bias terms in the linear or LayerNorm layers
- about 136M parameters in the checkpoint state dict
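The 136M figure can be reproduced with a little arithmetic. The sketch below assumes a nanoGPT-style layout that this guide does not spell out: learned position embeddings, a 4x MLP expansion, weight-only LayerNorm, and an output-head weight stored in the state dict alongside the token embedding. The function name and its `count_lm_head` flag are illustrative, not part of the repo.

```python
def gpt_state_dict_params(n_layer=12, n_embd=768, block_size=1024,
                          vocab_size=32768, count_lm_head=True):
    """Count parameters for a bias-free GPT-2-style model (assumed layout)."""
    wte = vocab_size * n_embd                    # token embedding
    wpe = block_size * n_embd                    # learned position embedding (assumption)
    attn = n_embd * 3 * n_embd + n_embd * n_embd # fused QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)              # up and down projections, 4x width (assumption)
    norms = 2 * n_embd                           # two weight-only LayerNorms per block
    per_block = attn + mlp + norms
    total = wte + wpe + n_layer * per_block + n_embd  # + final LayerNorm
    if count_lm_head:
        # Assumption: the state dict also holds a separate output-head entry.
        total += vocab_size * n_embd
    return total


print(gpt_state_dict_params())                     # 136071936
print(gpt_state_dict_params(count_lm_head=False))  # 110906112
```

Under these assumptions the state dict holds about 136M values, while the number of unique parameters would be closer to 111M if the output head shares its weights with the embedding.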
It is not a fine-tune of OpenAI GPT-2 weights. The important idea is the recipe:
1. collect Polish text,
2. train a custom byte-level BPE tokenizer,
3. tokenize the corpus into token-id shards,
4. train a causal language model on next-token prediction,
5. sample and evaluate with the exact same tokenizer.
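Step 4 is easier to see on a toy token stream. This is a minimal sketch of how next-token training pairs are formed, not the repo's data loader; `next_token_pairs` is a hypothetical helper.

```python
def next_token_pairs(tokens, block_size):
    """Yield (inputs, targets) windows where targets are inputs shifted by one."""
    for start in range(0, len(tokens) - block_size, block_size):
        x = tokens[start : start + block_size]
        y = tokens[start + 1 : start + block_size + 1]
        yield x, y


stream = [5, 17, 9, 42, 8, 3, 11]
for x, y in next_token_pairs(stream, block_size=3):
    print(x, "->", y)
# [5, 17, 9] -> [17, 9, 42]
# [42, 8, 3] -> [8, 3, 11]
```

At every position the model sees the tokens up to that point and is scored on the token that follows, which is why the targets are simply the inputs shifted left by one.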
## 2. Repository Map
- `model/ckpt.pt` - runnable model checkpoint.
- `tokenizers/polish_bpe_32k.json` - tokenizer paired with `model/ckpt.pt`.
- `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer. Do not use it with `model/ckpt.pt`.
- `scripts/model.py` - GPT implementation for the local checkpoint.
- `scripts/sample_mac.py` - generation script.
- `scripts/knowbench_mac.py` and `scripts/syntaxbench_mac.py` - simple evaluation probes.
- `examples/prepare_corpus.py` - reference script for tokenizer training and `.bin` shard creation.
- `training/train_gpt.py` - H100 Slayer training script.
- `training/run_polish.sh` - launch script used on the remote `slayer` host (reached with `ssh slayer`).
- `logs/` and `metadata/` - evidence from the saved local and remote runs.
## 3. Local Demo
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80
```
Expected behavior:
- On Apple Silicon, the sampler uses MPS.
- On other machines, it falls back to CPU.
- The model should load `model/ckpt.pt` and `tokenizers/polish_bpe_32k.json` without path changes.
Run the lightweight probes:
```bash
python scripts/syntaxbench_mac.py
python scripts/knowbench_mac.py
```
## 4. Corpus Preparation
For teaching, start with a plain UTF-8 text corpus:
```text
data/raw/doc_0001.txt
data/raw/doc_0002.txt
...
```
One document per file is easiest to reason about. Clean the text enough to remove boilerplate and encoding damage, but do not over-normalize the language. The tokenizer here is byte-level BPE, so it can represent arbitrary text.
Recommended minimum for a useful class demo:
- tokenizer demo: 10 MB to 100 MB of text,
- tiny GPT training demo: 100 MB to 1 GB,
- meaningful GPT run: many GB.
Keep validation data separate before tokenization.
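One way to keep the split stable across reruns is to assign each document by a hash of its file name. The sketch below is an illustration under that assumption, not what `examples/prepare_corpus.py` necessarily does; `is_validation`, `split_corpus`, and `val_percent` are hypothetical names.

```python
import hashlib
from pathlib import Path


def is_validation(path, val_percent=1):
    """Assign a document to validation based on a hash of its file name."""
    digest = hashlib.sha256(Path(path).name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < val_percent


def split_corpus(raw_dir, val_percent=1):
    """Return (train_paths, val_paths) for all .txt files in raw_dir."""
    train, val = [], []
    for path in sorted(Path(raw_dir).glob("*.txt")):
        (val if is_validation(path, val_percent) else train).append(path)
    return train, val
```

Because the assignment depends only on the file name, adding new documents later never moves an existing document from training into validation.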
## 5. Tokenizer Training
The model checkpoint in this repo uses:
- byte-level BPE,
- vocab size 32768,
- no Unicode normalizer,
- `add_prefix_space=False`,
- `<|endoftext|>` as the document separator / BOS-style token.
Train a compatible tokenizer and create token-id shards with:
```bash
python examples/prepare_corpus.py \
  --raw-dir data/raw \
  --out-dir data/processed \
  --vocab-size 32768 \
  --train-tokenizer
```
This writes:
- `data/processed/tokenizer.json`
- `data/processed/shards/polish_train_000000.bin`
- `data/processed/shards/polish_val_000000.bin`
The `.bin` shards are raw `uint16` token ids. This matters because `training/train_gpt.py` loads shards as `torch.uint16`, so every token id must be below 65536.
For a fresh 65k experiment you may use a 65536-token tokenizer, but then the model config and training code must match that vocabulary size. Do not use the 65k tokenizer with the checked-in 32768-vocab checkpoint.
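The `uint16` constraint can be made concrete with a small writer and reader. This sketch assumes a headerless, little-endian layout and places the `<|endoftext|>` id before each document; check `examples/prepare_corpus.py` for the layout the repo actually writes. `write_shard`, `read_shard`, and `eot_id` are hypothetical names.

```python
import sys
from array import array


def write_shard(path, docs, eot_id):
    """Write token ids as raw uint16, with eot_id before each document."""
    ids = array("H")  # unsigned 16-bit integers
    for doc in docs:
        for tok in [eot_id, *doc]:
            if not 0 <= tok < 65536:
                raise ValueError(f"token id {tok} does not fit in uint16")
            ids.append(tok)
    if sys.byteorder == "big":
        ids.byteswap()  # store little-endian regardless of host (assumption)
    with open(path, "wb") as f:
        ids.tofile(f)
    return len(ids)


def read_shard(path):
    """Read a raw uint16 shard back into a list of token ids."""
    ids = array("H")
    with open(path, "rb") as f:
        ids.frombytes(f.read())
    if sys.byteorder == "big":
        ids.byteswap()
    return ids.tolist()
```

Each token costs two bytes on disk, and both a 32768-entry and a 65536-entry vocabulary fit because the largest id is at most 65535.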
## 6. Training A Small Local GPT
This repo includes inference code for the saved checkpoint, not a full clean-room tiny trainer. For teaching, the simplest path is:
1. Use `examples/prepare_corpus.py` to create tokenizer and shards.
2. Use an existing nanoGPT trainer or your own minimal causal LM trainer.
3. Match these model settings for compatibility with `scripts/model.py`:
```python
GPTConfig(
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
    vocab_size=32768,
    bias=False,
    dropout=0.0,
)
```
4. Save checkpoints in this format:
```python
torch.save(
    {
        "model": model.state_dict(),
        "model_args": config_dict,
        "iter": step,
    },
    "ckpt.pt",
)
```
Then put the checkpoint at `model/ckpt.pt`, the tokenizer at `tokenizers/polish_bpe_32k.json`, and run:
```bash
python scripts/sample_mac.py "Dawno temu w Polsce" 100
```
## 7. Replicating The Slayer H100 Run
The remote run used a modified fast GPT trainer under:
```text
ssh slayer:/home/ubuntu/modded-nanogpt
```
The local copy of the relevant training files is in `training/`.
Important paths from the run:
```text
~/dynaword/shards/polish_train_*.bin
~/dynaword/shards/polish_val_*.bin
~/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt
```
The H100 trainer expects:
- CUDA-capable Linux host,
- PyTorch 2.10-compatible environment,
- Triton,
- `kernels`,
- token-id shards as raw `uint16`,
- train files matching `~/dynaword/shards/polish_train_*.bin`,
- validation files matching `~/dynaword/shards/polish_val_*.bin`.
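Before launching, it can save time to confirm that the shard globs actually match files. The check below is a hedged preflight sketch, not part of `training/`; `check_shards` is a hypothetical helper, and the even-size test only reflects that the data is stored as two-byte ids.

```python
import glob
import os


def check_shards(pattern):
    """Return matching shard paths, failing early on common problems."""
    paths = sorted(glob.glob(os.path.expanduser(pattern)))
    if not paths:
        raise FileNotFoundError(f"no shards match {pattern}")
    for path in paths:
        size = os.path.getsize(path)
        if size == 0 or size % 2 != 0:
            raise ValueError(
                f"{path}: {size} bytes is not a whole number of uint16 ids"
            )
    return paths
```

For the paths in this guide, the calls would be `check_shards("~/dynaword/shards/polish_train_*.bin")` and `check_shards("~/dynaword/shards/polish_val_*.bin")`.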
Inspect the launch script:
```bash
cd training
sed -n '1,120p' run_polish.sh
```
The core launch pattern is:
```bash
export TORCHINDUCTOR_CACHE_DIR="$HOME/.cache/torchinductor_polish"
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
export TORCHINDUCTOR_AUTOGRAD_CACHE=1
cd "$HOME/modded-nanogpt"
.venv/bin/torchrun --standalone --nproc_per_node=1 train_gpt.py
```
The saved full training states include model and optimizer state:
```text
state_step000500.pt
state_step001000.pt
state_step001500.pt
```
They are about 3.6 GB each and are not committed. Fetch them only when needed:
```bash
mkdir -p training-states
rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/
```
## 8. Evaluation And Sanity Checks
Always validate these before trusting a run:
- Tokenizer round trip: encode and decode sample Polish text.
- Vocabulary compatibility: checkpoint `vocab_size` equals tokenizer vocab size.
- Loss curve: training loss should decrease smoothly, and the drop should not come from memorizing a tiny sample.
- Sample quality: inspect repeated n-grams and broken Unicode.
- Validation loss: keep validation shards separate from training shards.
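The sample-quality check can be turned into a rough number. The metric below is one simple illustration, not what the repo's probe scripts compute; `repeated_ngram_fraction` is a hypothetical helper that works on any sequence of token ids or words.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=4):
    """Fraction of n-grams in `tokens` that duplicate an earlier n-gram."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


looping = "a b c a b c a b c".split()
print(repeated_ngram_fraction(looping, n=3))  # 4 of 7 trigrams are repeats
```

A value near zero suggests varied text, while a value that climbs toward one usually means the sampler is stuck in a loop.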
The included metadata shows the local training loss dropping from about `10.54` at step 0 to around `4.63` at step 500, with later probe rows in `metadata/traj.csv`.
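Those two numbers can be put in context with a short calculation. A model that guesses uniformly over the vocabulary has a cross-entropy of ln(V), so a step-0 loss near that value is what a fresh random initialization should give. This is a standard sanity check rather than something the logs state; the helper names are illustrative.

```python
import math


def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over the vocabulary."""
    return math.log(vocab_size)


def perplexity(loss):
    """Convert a per-token cross-entropy in nats to perplexity."""
    return math.exp(loss)


print(f"{uniform_loss(32768):.2f}")  # 10.40
print(f"{perplexity(4.63):.1f}")     # 102.5
```

The reported `10.54` at step 0 sits close to the uniform baseline of about 10.40, and a loss of `4.63` corresponds to a perplexity of roughly 100, compared with 32768 for a uniform guess.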
## 9. Common Mistakes
- Mixing tokenizer files. A model trained with `polish_bpe_32k.json` must be sampled with that tokenizer.
- Saving only weights but losing `model_args`. The loader needs architecture parameters.
- Tokenizing train and validation together. Split first, tokenize second.
- Using `int32` shards with the Slayer trainer. Its loader is built around raw `uint16` token ids.
- Treating full optimizer checkpoints as deployable model artifacts. For inference, export a model-only checkpoint when possible.
- Teaching from the H100 script first. Start with the local checkpoint and tokenizer, then show the larger training script as the scaled version.
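Two of these mistakes can be caught in code when exporting a checkpoint. The sketch below works on a plain dictionary such as the one `torch.load` returns for the local checkpoint format shown earlier. It assumes `model_args` carries a `vocab_size` entry and that anything else in the dictionary, such as optimizer state, is safe to drop. The Slayer state files may use a different layout, and `export_model_only` is a hypothetical helper.

```python
REQUIRED_KEYS = ("model", "model_args")
KEEP_KEYS = ("model", "model_args", "iter")


def export_model_only(state, tokenizer_vocab_size):
    """Validate a checkpoint dict and return a copy without training-only state."""
    missing = [key for key in REQUIRED_KEYS if key not in state]
    if missing:
        raise KeyError(f"checkpoint is missing {missing}")
    # Assumption: model_args stores the vocabulary size under "vocab_size".
    vocab = state["model_args"].get("vocab_size")
    if vocab != tokenizer_vocab_size:
        raise ValueError(
            f"checkpoint vocab_size={vocab} but tokenizer has {tokenizer_vocab_size}"
        )
    return {key: state[key] for key in KEEP_KEYS if key in state}
```

In practice this would sit between `torch.load(path, map_location="cpu")` and `torch.save(..., "model/ckpt.pt")`, with the tokenizer size taken from the tokenizer file that will ship with the model.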
## 10. Suggested Lesson Flow
1. Show `scripts/sample_mac.py` generating from the saved model.
2. Open `metadata/artifact_manifest.json` and explain the model/tokenizer pairing.
3. Train a toy tokenizer on a small Polish corpus with `examples/prepare_corpus.py`.
4. Inspect tokenization of Polish words, punctuation, and diacritics.
5. Explain next-token prediction and the checkpoint format.
6. Show how `training/train_gpt.py` scales the same idea to H100 training.
7. End with failure modes: wrong tokenizer, data leakage, repeated text, and no validation split.