| # Replication Guide |
|
|
| This guide explains how to replicate the Slayer GPT-style Polish language-model experiment from raw text to a runnable checkpoint. |
|
|
| The repo contains two related tracks: |
|
|
| 1. `model/ckpt.pt` is a small, runnable GPT checkpoint paired with `tokenizers/polish_bpe_32k.json`. |
| 2. `training/` contains the larger Slayer H100 training code derived from modded-nanoGPT. Its full optimizer checkpoints are documented but not committed. |
|
|
| Use the small track for teaching and local demos. Use the H100 track to explain how the larger remote run was structured. |
|
|
| ## 1. What Was Built |
|
|
| The local model is a GPT-2-style decoder-only Transformer: |
|
|
| - 12 layers |
| - 12 attention heads |
| - 768 embedding dimension |
| - 1024 token context |
| - 32768-token custom Polish byte-level BPE vocabulary |
| - bias-free linear and LayerNorm layers |
| - about 136M parameters in the checkpoint state dict |
|
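The parameter figure can be reproduced with a little arithmetic. The sketch below assumes a nanoGPT-style block layout (fused QKV projection, 4x MLP expansion, learned position embeddings) and assumes the state dict stores the token embedding and the output head as separate entries. This guide confirms neither assumption, so check `scripts/model.py` before relying on the numbers.

```python
# Rough parameter count for the configuration listed above.
# Assumes a nanoGPT-style layout; verify against scripts/model.py.

def count_params(n_layer=12, n_embd=768, block_size=1024, vocab_size=32768):
    wte = vocab_size * n_embd                      # token embedding
    wpe = block_size * n_embd                      # learned position embedding
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # fused QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                # up- and down-projection, 4x expansion
    norms = 2 * n_embd                             # two bias-free LayerNorm weights per block
    blocks = n_layer * (attn + mlp + norms)
    ln_f = n_embd                                  # final LayerNorm weight
    lm_head = vocab_size * n_embd                  # output head, same shape as wte
    unique = wte + wpe + blocks + ln_f             # if wte and lm_head share one tensor
    return {"unique_if_tied": unique, "state_dict_entries": unique + lm_head}

counts = count_params()
for name, value in counts.items():
    print(f"{name}: {value / 1e6:.1f}M")
```

Under these assumptions the count is about 110.9M unique weights, or about 136.1M when the output head is counted as its own entry, which is consistent with the "about 136M" figure.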
|
| It is not a fine-tune of OpenAI GPT-2 weights. The important idea is the recipe: |
|
|
| 1. collect Polish text, |
| 2. train a custom byte-level BPE tokenizer, |
| 3. tokenize the corpus into token-id shards, |
| 4. train a causal language model on next-token prediction, |
| 5. sample and evaluate with the exact same tokenizer. |
|
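Step 4 of the recipe can be illustrated without any model code. This minimal sketch shows how one token stream is cut into input and target windows that differ by a single position, and how the loss for one prediction is computed. The token ids, the helper names, and the 4-token vocabulary are made up for illustration.

```python
import math

def next_token_pairs(token_ids, block_size):
    """Split one token stream into (input, target) windows shifted by one position."""
    pairs = []
    for start in range(0, len(token_ids) - block_size, block_size):
        window = token_ids[start : start + block_size + 1]
        pairs.append((window[:-1], window[1:]))
    return pairs

def cross_entropy(probs, target):
    """Negative log-probability the model assigned to the correct next token."""
    return -math.log(probs[target])

stream = [5, 9, 2, 7, 7, 3, 1, 8, 4]   # toy token ids, not real tokenizer output
pairs = next_token_pairs(stream, block_size=4)
for inputs, targets in pairs:
    print(inputs, "->", targets)

# A model that spreads probability evenly over a 4-token vocabulary.
uniform = [0.25, 0.25, 0.25, 0.25]
loss = cross_entropy(uniform, target=2)
print(round(loss, 4))
```

Each target window is the input window moved one token to the right, so every position in the context contributes one next-token prediction to the loss.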
|
| ## 2. Repository Map |
|
|
| - `model/ckpt.pt` - runnable model checkpoint. |
| - `tokenizers/polish_bpe_32k.json` - tokenizer paired with `model/ckpt.pt`. |
| - `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer. Do not use it with `model/ckpt.pt`. |
| - `scripts/model.py` - GPT implementation for the local checkpoint. |
| - `scripts/sample_mac.py` - generation script. |
| - `scripts/knowbench_mac.py` and `scripts/syntaxbench_mac.py` - simple evaluation probes. |
| - `examples/prepare_corpus.py` - reference script for tokenizer training and `.bin` shard creation. |
| - `training/train_gpt.py` - H100 Slayer training script. |
| - `training/run_polish.sh` - launch script used on the remote `slayer` host (reached with `ssh slayer`). |
| - `logs/` and `metadata/` - evidence from the saved local and remote runs. |
|
|
| ## 3. Local Demo |
|
|
| ```bash |
| python3 -m venv .venv |
| source .venv/bin/activate |
| pip install -r requirements.txt |
| python scripts/sample_mac.py "Polska jest" 80 |
| ``` |
|
|
| Expected behavior: |
|
|
| - On Apple Silicon, the sampler uses MPS. |
| - On other machines, it falls back to CPU. |
| - The model should load `model/ckpt.pt` and `tokenizers/polish_bpe_32k.json` without path changes. |
|
|
| Run the lightweight probes: |
|
|
| ```bash |
| python scripts/syntaxbench_mac.py |
| python scripts/knowbench_mac.py |
| ``` |
|
|
| ## 4. Corpus Preparation |
|
|
| For teaching, start with a plain UTF-8 text corpus: |
|
|
| ```text |
| data/raw/doc_0001.txt |
| data/raw/doc_0002.txt |
| ... |
| ``` |
|
|
| One document per file is easiest to reason about. Clean the text enough to remove boilerplate and encoding damage, but do not over-normalize the language. The tokenizer here is byte-level BPE, so it can represent arbitrary text, including diacritics and unusual punctuation. |
|
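The reason a byte-level tokenizer can represent arbitrary text is that its base alphabet is the 256 possible byte values. The sketch below only shows that base layer, the UTF-8 bytes of a Polish pangram, before any BPE merges. The ids produced by the trained tokenizer will differ, because merges and the byte-to-symbol mapping are not modeled here.

```python
text = "Zażółć gęślą jaźń"

raw = text.encode("utf-8")
base_ids = list(raw)          # every value is in range(256)

print(len(text), "characters ->", len(base_ids), "bytes")
print(base_ids[:6])           # 'Z', 'a', then two bytes each for 'ż' and 'ó'

# Decoding the same bytes restores the text exactly, diacritics included.
restored = bytes(base_ids).decode("utf-8")
print(restored == text)
```

Polish diacritics take two bytes each in UTF-8, which is part of why a tokenizer trained on Polish text pays off: frequent multi-byte sequences get merged into single tokens.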
|
| Recommended minimum corpus size, by goal: |
|
|
| - tokenizer demo: 10 MB to 100 MB of text, |
| - tiny GPT training demo: 100 MB to 1 GB, |
| - meaningful GPT run: many GB. |
|
|
| Split off the validation documents before tokenization, so that no validation text ends up in the training shards. |
|
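One way to make the split reproducible is to assign each document to a split from a hash of its file name. This is a sketch under assumptions: `assign_split`, `split_corpus`, and `val_percent` are hypothetical names, and `examples/prepare_corpus.py` may split the corpus differently.

```python
import hashlib
from pathlib import Path

def assign_split(path, val_percent=1):
    """Deterministically map a document to 'train' or 'val' from its file name."""
    digest = hashlib.sha256(Path(path).name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return "val" if bucket < val_percent else "train"

def split_corpus(raw_dir, val_percent=1):
    """Group every .txt file in raw_dir by its assigned split."""
    splits = {"train": [], "val": []}
    for path in sorted(Path(raw_dir).glob("*.txt")):
        splits[assign_split(path, val_percent)].append(path)
    return splits

print(assign_split("data/raw/doc_0001.txt"))
```

Because the assignment depends only on the file name, adding new documents later never moves an existing document from one split to the other.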
|
| ## 5. Tokenizer Training |
|
|
| The model checkpoint in this repo uses: |
|
|
| - byte-level BPE, |
| - vocab size 32768, |
| - no Unicode normalizer, |
| - `add_prefix_space=False`, |
| - `<|endoftext|>` as the document separator / BOS-style token. |
|
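The role of the separator token can be sketched as follows: documents are concatenated into one long id stream with the separator marking each boundary. Placing the separator before each document is an assumption based on the "BOS-style" description above. The encoder below is a stand-in that uses raw UTF-8 bytes, and id 256 is a hypothetical separator id, not the id used by `polish_bpe_32k.json`.

```python
def build_stream(documents, encode, eot_id):
    """Concatenate documents into one id stream, each preceded by the separator."""
    stream = []
    for doc in documents:
        stream.append(eot_id)
        stream.extend(encode(doc))
    return stream

def toy_encode(text):
    """Stand-in encoder: raw UTF-8 bytes instead of real BPE token ids."""
    return list(text.encode("utf-8"))

TOY_EOT = 256   # hypothetical separator id, outside the byte range

stream = build_stream(["Ala ma kota.", "Kot ma Alę."], toy_encode, TOY_EOT)
print(stream.count(TOY_EOT), "documents,", len(stream), "ids")
```

With a real tokenizer, `encode` would return BPE ids and `eot_id` would be whatever id the tokenizer assigns to `<|endoftext|>`.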
|
| Train a compatible tokenizer and create token-id shards with: |
|
|
| ```bash |
| python examples/prepare_corpus.py \ |
|   --raw-dir data/raw \ |
|   --out-dir data/processed \ |
|   --vocab-size 32768 \ |
|   --train-tokenizer |
| ``` |
|
|
| This writes: |
|
|
| - `data/processed/tokenizer.json` |
| - `data/processed/shards/polish_train_000000.bin` |
| - `data/processed/shards/polish_val_000000.bin` |
|
|
| The `.bin` shards are raw `uint16` token ids. This matters because `training/train_gpt.py` loads shards as `torch.uint16`, so every token id must be below 65536. |
|
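A minimal sketch of the shard format, using only the standard library, may help when debugging. It assumes the files are headerless and little-endian. Whether the shards carry a header is not stated in this guide, so check the loader in `training/train_gpt.py` before using it on real files. `write_shard` and `read_shard` are hypothetical helper names.

```python
import sys
import tempfile
from array import array
from pathlib import Path

def write_shard(path, token_ids):
    """Write ids as raw little-endian uint16 with no header (assumed layout)."""
    if any(not 0 <= t < 65536 for t in token_ids):
        raise ValueError("token id does not fit in uint16")
    data = array("H", token_ids)
    if sys.byteorder == "big":
        data.byteswap()
    Path(path).write_bytes(data.tobytes())

def read_shard(path):
    """Read a headerless uint16 shard back into a list of ints."""
    data = array("H")
    data.frombytes(Path(path).read_bytes())
    if sys.byteorder == "big":
        data.byteswap()
    return data.tolist()

with tempfile.TemporaryDirectory() as tmp:
    shard = Path(tmp) / "toy_train_000000.bin"
    write_shard(shard, [0, 1, 32767, 65535])
    size = shard.stat().st_size      # two bytes per token
    ids = read_shard(shard)
print(size, ids)
```

The range check is the important part: an id of 65536 or above cannot be stored, which is why the vocabulary size and the shard dtype have to be chosen together.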
|
| For a fresh 65k experiment you may use a 65536-token tokenizer, but then the model config and training code must match that vocabulary size. Do not use the 65k tokenizer with the checked-in 32768-vocab checkpoint. |
|
|
| ## 6. Training A Small Local GPT |
|
|
| This repo includes inference code for the saved checkpoint, not a full clean-room tiny trainer. For teaching, the simplest path is: |
|
|
| 1. Use `examples/prepare_corpus.py` to create tokenizer and shards. |
| 2. Use an existing nanoGPT trainer or your own minimal causal LM trainer. |
| 3. Match these model settings for compatibility with `scripts/model.py`: |
|
|
| ```python |
| GPTConfig( |
|     n_layer=12, |
|     n_head=12, |
|     n_embd=768, |
|     block_size=1024, |
|     vocab_size=32768, |
|     bias=False, |
|     dropout=0.0, |
| ) |
| ``` |
|
|
| 4. Save checkpoints in this format: |
|
|
| ```python |
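| import torch |
|
| # `model`, `config_dict`, and `step` come from your training loop. |
| torch.save( |
|     { |
|         "model": model.state_dict(),   # weights only, no optimizer state |
|         "model_args": config_dict,     # architecture settings the loader needs |
|         "iter": step,                  # training step at save time |
|     }, |
|     "ckpt.pt", |
| ) |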
| ``` |
|
|
| Then put the checkpoint at `model/ckpt.pt`, the tokenizer at `tokenizers/polish_bpe_32k.json`, and run: |
|
|
| ```bash |
| python scripts/sample_mac.py "Dawno temu w Polsce" 100 |
| ``` |
|
|
| ## 7. Replicating The Slayer H100 Run |
|
|
| The remote run used a modified fast GPT trainer under: |
|
|
| ```text |
| slayer:/home/ubuntu/modded-nanogpt |
| ``` |
|
|
| The local copy of the relevant training files is in `training/`. |
|
|
| Important paths from the run: |
|
|
| ```text |
| ~/dynaword/shards/polish_train_*.bin |
| ~/dynaword/shards/polish_val_*.bin |
| ~/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt |
| ``` |
|
|
| The H100 trainer expects: |
|
|
| - CUDA-capable Linux host, |
| - PyTorch 2.10-compatible environment, |
| - Triton, |
| - `kernels`, |
| - token-id shards as raw `uint16`, |
| - train files matching `~/dynaword/shards/polish_train_*.bin`, |
| - validation files matching `~/dynaword/shards/polish_val_*.bin`. |
|
|
| To read the launch script: |
|
|
| ```bash |
| cd training |
| sed -n '1,120p' run_polish.sh |
| ``` |
|
|
| The core launch pattern is: |
|
|
| ```bash |
| export TORCHINDUCTOR_CACHE_DIR="$HOME/.cache/torchinductor_polish" |
| export TORCHINDUCTOR_FX_GRAPH_CACHE=1 |
| export TORCHINDUCTOR_AUTOGRAD_CACHE=1 |
| cd "$HOME/modded-nanogpt" |
| .venv/bin/torchrun --standalone --nproc_per_node=1 train_gpt.py |
| ``` |
|
|
| The saved full training states include model and optimizer state: |
|
|
| ```text |
| state_step000500.pt |
| state_step001000.pt |
| state_step001500.pt |
| ``` |
|
|
| They are about 3.6 GB each and are not committed. Fetch them only when needed: |
|
|
| ```bash |
| mkdir -p training-states |
| rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/ |
| ``` |
|
|
| ## 8. Evaluation And Sanity Checks |
|
|
| Always validate these before trusting a run: |
|
|
| - Tokenizer round trip: encode and decode sample Polish text. |
| - Vocabulary compatibility: checkpoint `vocab_size` equals tokenizer vocab size. |
| - Loss curve: training loss should decrease steadily, and validation loss should fall with it. A falling training loss alone can mean the model is only memorizing a tiny sample. |
| - Sample quality: inspect repeated n-grams and broken Unicode. |
| - Validation loss: keep validation shards separate from training shards. |
|
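The first two checks in the list above can be automated. The sketch below is written against plain `encode` and `decode` callables so that it runs without any files. The UTF-8 byte codec is a stand-in for the real tokenizer, and the helper names are hypothetical.

```python
def check_round_trip(encode, decode, samples):
    """Return the samples that do not survive encode -> decode unchanged."""
    return [s for s in samples if decode(encode(s)) != s]

def check_vocab(model_args, tokenizer_vocab_size):
    """Raise if the checkpoint and tokenizer disagree on vocabulary size."""
    expected = model_args["vocab_size"]
    if expected != tokenizer_vocab_size:
        raise ValueError(
            f"checkpoint expects {expected} tokens, tokenizer has {tokenizer_vocab_size}"
        )

# Stand-in codec so the sketch runs on its own: plain UTF-8 bytes.
def toy_encode(text):
    return list(text.encode("utf-8"))

def toy_decode(ids):
    return bytes(ids).decode("utf-8")

failures = check_round_trip(toy_encode, toy_decode, ["Zażółć gęślą jaźń", "Łódź, 2024 r."])
print("round-trip failures:", failures)

check_vocab({"vocab_size": 32768}, 32768)   # passes silently when the sizes match
```

If the tokenizer JSON is in the Hugging Face `tokenizers` format, which this guide does not state explicitly, the callables could be `lambda s: tok.encode(s).ids` and `tok.decode` for `tok = Tokenizer.from_file("tokenizers/polish_bpe_32k.json")`, with the size taken from `tok.get_vocab_size()`.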
|
| The included metadata shows the local training loss dropping from about `10.54` at step 0 to around `4.63` at step 500, with later probe rows in `metadata/traj.csv`. |
|
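A quick way to judge the step-0 number is to compare it with the loss of a model that guesses uniformly over the vocabulary, which is the natural logarithm of the vocabulary size. The sketch assumes the logged loss is cross-entropy in nats, the usual PyTorch convention. The guide does not state the unit.

```python
import math

vocab_size = 32768
uniform_loss = math.log(vocab_size)   # loss of a uniform guess over the vocabulary
print(f"uniform baseline: {uniform_loss:.2f} nats")

# Loss values quoted from the metadata above; perplexity is exp(loss).
for step, loss in [(0, 10.54), (500, 4.63)]:
    print(f"step {step}: loss {loss:.2f}, perplexity {math.exp(loss):.0f}")
```

The baseline is about 10.40, so a first logged loss of about 10.54 is what an untrained model should produce. A starting value far from the baseline is worth investigating before trusting the rest of the curve.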
|
| ## 9. Common Mistakes |
|
|
| - Mixing tokenizer files. A model trained with `polish_bpe_32k.json` must be sampled with that tokenizer. |
| - Saving only weights but losing `model_args`. The loader needs architecture parameters. |
| - Tokenizing train and validation together. Split first, tokenize second. |
| - Using `int32` shards with the Slayer trainer. Its loader is built around raw `uint16` token ids. |
| - Treating full optimizer checkpoints as deployable model artifacts. For inference, export a model-only checkpoint when possible. |
| - Teaching from the H100 script first. Start with the local checkpoint and tokenizer, then show the larger training script as the scaled version. |
|
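Exporting a model-only checkpoint amounts to copying the weights and the small metadata entries into a new dict and leaving the optimizer state behind. The sketch below works on plain dicts so it runs without PyTorch. The key names inside the Slayer state files are not documented in this guide, so `"optimizer"` and the `model_key` default are assumptions to verify against a real file.

```python
def export_model_only(state, model_key="model", keep=("model_args", "iter")):
    """Return a new dict with the weights and small metadata, dropping everything else."""
    if model_key not in state:
        raise KeyError(f"no '{model_key}' entry; found keys: {sorted(state)}")
    slim = {"model": state[model_key]}
    for key in keep:
        if key in state:
            slim[key] = state[key]
    return slim

# Toy stand-in for a loaded training state; real values would be tensors.
full_state = {
    "model": {"wte.weight": [0.1, 0.2]},
    "optimizer": {"exp_avg": [0.0, 0.0]},   # hypothetical key name
    "model_args": {"vocab_size": 32768},
    "iter": 1500,
}
slim = export_model_only(full_state)
print(sorted(slim))
```

With PyTorch, the input would come from `torch.load(path, map_location="cpu")` and the result would be written with `torch.save(slim, "ckpt.pt")`. Whether the exported weights load into `scripts/model.py` depends on the two architectures matching, which this guide does not state.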
|
| ## 10. Suggested Lesson Flow |
|
|
| 1. Show `scripts/sample_mac.py` generating from the saved model. |
| 2. Open `metadata/artifact_manifest.json` and explain the model/tokenizer pairing. |
| 3. Train a toy tokenizer on a small Polish corpus with `examples/prepare_corpus.py`. |
| 4. Inspect tokenization of Polish words, punctuation, and diacritics. |
| 5. Explain next-token prediction and the checkpoint format. |
| 6. Show how `training/train_gpt.py` scales the same idea to H100 training. |
| 7. End with failure modes: wrong tokenizer, data leakage, repeated text, and no validation split. |
|
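For the "repeated text" failure mode, a rough numeric signal can complement reading samples by eye. This sketch measures how many n-grams in a sample repeat an earlier one and flags Unicode replacement characters. The function names and the example sentences are illustrative. It splits on whitespace for simplicity, and passing token ids works the same way.

```python
from collections import Counter

def repeated_ngram_fraction(tokens, n=3):
    """Share of n-grams that repeat an n-gram seen earlier in the sample."""
    ngrams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)

def has_broken_unicode(text):
    """True if decoding left U+FFFD replacement characters in the sample."""
    return "\ufffd" in text

looping = "w Polsce jest w Polsce jest w Polsce jest".split()
varied = "Dawno temu w Polsce żył sobie pewien stary król".split()
print(repeated_ngram_fraction(looping), repeated_ngram_fraction(varied))
print(has_broken_unicode("Zażółć"), has_broken_unicode("Za\ufffdółć"))
```

A sample stuck in a loop scores well above zero, while ordinary prose of the same length usually scores at or near zero. Any threshold is a judgment call to tune on real samples.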
|
|
|