Replication Guide

This guide explains how to replicate the Slayer GPT-style Polish language-model experiment from raw text to a runnable checkpoint.

The repo contains two related tracks:

model/ckpt.pt is a small, runnable GPT checkpoint paired with tokenizers/polish_bpe_32k.json.
training/ contains the larger Slayer H100 training code derived from modded-nanoGPT. Its full optimizer checkpoints are documented but not committed.

Use the small track for teaching and local demos. Use the H100 track to explain how the larger remote run was structured.

1. What Was Built

The local model is a GPT-2-style decoder-only Transformer:

12 layers
12 attention heads
768 embedding dimension
1024 token context
32768-token custom Polish byte-level BPE vocabulary
no bias terms in the linear or LayerNorm layers
about 136M parameters in the checkpoint state dict
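The "about 136M" figure can be reproduced by hand. The sketch below assumes a nanoGPT-style layout (learned position embeddings, a 4x-wide MLP, two LayerNorms per block plus a final one, and an lm_head that shares its weights with the token embedding but still appears as its own state-dict entry). These layout details are assumptions and were not read from scripts/model.py.

```python
def count_params(n_layer=12, n_embd=768, block_size=1024, vocab_size=32768):
    """Count weights for an assumed nanoGPT-style, bias-free GPT."""
    wte = vocab_size * n_embd        # token embedding
    wpe = block_size * n_embd        # learned position embedding (assumed)
    attn = 4 * n_embd * n_embd       # q, k, v projections plus the output projection
    mlp = 8 * n_embd * n_embd        # up-projection to 4 * n_embd and back (assumed)
    norms = 2 * n_embd               # two LayerNorm weight vectors per block, no bias
    blocks = n_layer * (attn + mlp + norms)
    final_norm = n_embd
    unique = wte + wpe + blocks + final_norm
    # A tied lm_head shares storage with wte but is listed separately in state_dict().
    state_dict_total = unique + wte
    return unique, state_dict_total


unique, listed = count_params()
print(f"unique: {unique / 1e6:.1f}M, state dict entries: {listed / 1e6:.1f}M")
# unique: 110.9M, state dict entries: 136.1M
```

Under these assumptions the model has roughly 111M distinct weights, and the 136M figure comes from counting the shared embedding matrix twice.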

It is not a fine-tune of OpenAI GPT-2 weights. The important idea is the recipe:

collect Polish text,
train a custom byte-level BPE tokenizer,
tokenize the corpus into token-id shards,
train a causal language model on next-token prediction,
sample and evaluate with the exact same tokenizer.
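Next-token prediction in the recipe above means that each training example is a window of token ids, and the target is the same window shifted by one position. A minimal sketch with made-up ids (`EOT` and `make_example` are hypothetical names, not part of this repo):

```python
EOT = 0  # hypothetical id for <|endoftext|>; the real id comes from the tokenizer


def make_example(ids, start, block_size):
    """Return (inputs, targets) where targets are the inputs shifted left by one."""
    window = ids[start : start + block_size + 1]
    if len(window) < block_size + 1:
        raise ValueError("not enough tokens for a full window")
    return window[:-1], window[1:]


# Two short "documents" joined into one stream, each starting with the separator.
stream = [EOT, 17, 42, 99, EOT, 5, 8, 13]
x, y = make_example(stream, start=0, block_size=4)
print(x)  # [0, 17, 42, 99]
print(y)  # [17, 42, 99, 0]
```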

2. Repository Map

model/ckpt.pt - runnable model checkpoint.
tokenizers/polish_bpe_32k.json - tokenizer paired with model/ckpt.pt.
tokenizers/rxlm_polish_bpe_65k.json - separate later 65k custom tokenizer. Do not use it with model/ckpt.pt.
scripts/model.py - GPT implementation for the local checkpoint.
scripts/sample_mac.py - generation script.
scripts/knowbench_mac.py and scripts/syntaxbench_mac.py - simple evaluation probes.
examples/prepare_corpus.py - reference script for tokenizer training and .bin shard creation.
training/train_gpt.py - H100 Slayer training script.
training/run_polish.sh - launch script used on the remote host (SSH alias slayer).
logs/ and metadata/ - evidence from the saved local and remote runs.

3. Local Demo

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80

Expected behavior:

On Apple Silicon, the sampler uses MPS.
On other machines, it falls back to CPU.
The model should load model/ckpt.pt and tokenizers/polish_bpe_32k.json without path changes.
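The MPS-or-CPU behavior can be pictured with a small helper like the one below. This is a sketch of the general pattern, not a copy of scripts/sample_mac.py, and `pick_device` is a hypothetical name.

```python
def pick_device():
    """Return "mps" when PyTorch reports a usable Metal backend, else "cpu"."""
    try:
        import torch  # imported lazily so the helper also runs where PyTorch is absent
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # missing in very old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


print(pick_device())
```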

Run the lightweight probes:

python scripts/syntaxbench_mac.py
python scripts/knowbench_mac.py

4. Corpus Preparation

For teaching, start with a plain UTF-8 text corpus:

data/raw/doc_0001.txt
data/raw/doc_0002.txt
...

One document per file is easiest to reason about. Clean the text enough to remove boilerplate and encoding damage, but do not over-normalize the language. The tokenizer here is byte-level BPE, so it can represent arbitrary text.
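"Byte-level" means the tokenizer's base alphabet is the 256 possible byte values, so any UTF-8 string, including Polish diacritics, can always be broken down into known symbols. The short example below shows the underlying bytes for one word. It uses only the Python standard library and does not involve the tokenizer itself.

```python
word = "żółć"
raw = word.encode("utf-8")

print(len(word), len(raw))        # 4 8
print([f"{b:02x}" for b in raw])  # ['c5', 'bc', 'c3', 'b3', 'c5', '82', 'c4', '87']

# Every byte is one of 256 values, and decoding restores the original text.
assert all(0 <= b < 256 for b in raw)
assert raw.decode("utf-8") == word
```

Each of the four characters takes two bytes. BPE merges learned from Polish text are what turn frequent byte pairs like these back into single tokens.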

Recommended minimum corpus sizes:

tokenizer demo: 10 MB to 100 MB of text,
tiny GPT training demo: 100 MB to 1 GB,
meaningful GPT run: many GB.

Split off the validation documents before tokenization, and keep them separate from then on.
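One simple way to make the split reproducible is to assign each document to a split based on a hash of its file name. The sketch below is an illustration only. `is_validation`, `split_documents`, and `val_percent` are hypothetical names, and examples/prepare_corpus.py may split differently.

```python
import hashlib
from pathlib import Path


def is_validation(path, val_percent=1):
    """Assign a document to validation based on a hash of its file name."""
    digest = hashlib.sha256(Path(path).name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < val_percent


def split_documents(raw_dir, val_percent=1):
    """Return (train_files, val_files) for all .txt files in raw_dir."""
    files = sorted(Path(raw_dir).glob("*.txt"))
    val = [p for p in files if is_validation(p, val_percent)]
    train = [p for p in files if not is_validation(p, val_percent)]
    return train, val
```

Because the assignment depends only on the file name, re-running the script or adding new documents never moves an existing document from one split to the other.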

5. Tokenizer Training

The model checkpoint in this repo uses:

byte-level BPE,
vocab size 32768,
no Unicode normalizer,
add_prefix_space=False,
<|endoftext|> as the document separator / BOS-style token.

Train a compatible tokenizer and create token-id shards with:

python examples/prepare_corpus.py \
  --raw-dir data/raw \
  --out-dir data/processed \
  --vocab-size 32768 \
  --train-tokenizer

This writes:

data/processed/tokenizer.json
data/processed/shards/polish_train_000000.bin
data/processed/shards/polish_val_000000.bin
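For readers who want to see how the tokenizer settings listed at the start of this section map onto code, here is a minimal sketch using the Hugging Face `tokenizers` library. It is an assumption that examples/prepare_corpus.py works along these lines, and `train_byte_bpe` is a hypothetical helper name.

```python
def train_byte_bpe(files, out_path, vocab_size=32768):
    """Train a byte-level BPE tokenizer with no normalizer and no prefix space."""
    # Requires `pip install tokenizers`; imported here to keep the dependency explicit.
    from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

    tokenizer = Tokenizer(models.BPE())
    # No tokenizer.normalizer is set, so text is not Unicode-normalized.
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
    tokenizer.decoder = decoders.ByteLevel()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=["<|endoftext|>"],
        # Seed the vocabulary with all 256 byte symbols so every input is representable.
        initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),
    )
    tokenizer.train([str(f) for f in files], trainer)
    tokenizer.save(str(out_path))
    return tokenizer
```

A quick round-trip check after training is `tok.decode(tok.encode("Zażółć gęślą jaźń").ids)`, which should return the input text unchanged.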

The .bin shards are raw uint16 token ids. This matters because training/train_gpt.py expects token ids to fit under 65536 and loads shards as torch.uint16.
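The uint16 constraint can be demonstrated with the standard library alone. The sketch below assumes headerless files in native byte order. Confirm the actual layout (for example, whether a header precedes the ids) against examples/prepare_corpus.py and training/train_gpt.py before relying on it. `write_shard` and `read_shard` are hypothetical names.

```python
from array import array


def write_shard(path, token_ids):
    """Write token ids as headerless native-endian uint16 values."""
    buf = array("H")  # "H" = unsigned short
    assert buf.itemsize == 2, "platform unsigned short is not 16 bits"
    buf.extend(token_ids)  # raises OverflowError for ids above 65535
    with open(path, "wb") as f:
        buf.tofile(f)


def read_shard(path):
    """Read a headerless uint16 shard back into a list of ints."""
    buf = array("H")
    with open(path, "rb") as f:
        buf.frombytes(f.read())
    return buf.tolist()
```

With a 32768-token vocabulary the largest id is 32767, well inside the uint16 range. A 65536-token vocabulary uses the range completely, with 65535 as the largest id.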

For a fresh 65k experiment you may use a 65536-token tokenizer, but then the model config and training code must match that vocabulary size. Do not use the 65k tokenizer with the checked-in 32768-vocab checkpoint.

6. Training A Small Local GPT

This repo includes inference code for the saved checkpoint, not a complete standalone trainer for small models. For teaching, the simplest path is:

Use examples/prepare_corpus.py to create tokenizer and shards.
Use an existing nanoGPT trainer or your own minimal causal LM trainer.
Match these model settings for compatibility with scripts/model.py:

GPTConfig(
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
    vocab_size=32768,
    bias=False,
    dropout=0.0,
)

Save checkpoints in this format:

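import torch

# model, config_dict, and step come from your training loop.
torch.save(
    {
        "model": model.state_dict(),  # weights only, no optimizer state
        "model_args": config_dict,    # architecture parameters needed to rebuild the model
        "iter": step,                 # training step at save time
    },
    "ckpt.pt",
)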

Then put the checkpoint at model/ckpt.pt, the tokenizer at tokenizers/polish_bpe_32k.json, and run:

python scripts/sample_mac.py "Dawno temu w Polsce" 100

7. Replicating The Slayer H100 Run

The remote run used a modified fast GPT trainer under:

slayer:/home/ubuntu/modded-nanogpt

The local copy of the relevant training files is in training/.

Important paths from the run:

~/dynaword/shards/polish_train_*.bin
~/dynaword/shards/polish_val_*.bin
~/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt

The H100 trainer expects:

CUDA-capable Linux host,
PyTorch 2.10-compatible environment,
Triton,
kernels,
token-id shards as raw uint16,
train files matching ~/dynaword/shards/polish_train_*.bin,
validation files matching ~/dynaword/shards/polish_val_*.bin.
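Before launching a long run, it can help to confirm that the shard files the trainer expects are actually present. The following preflight sketch checks only the file-name patterns and byte counts listed above. `preflight` is a hypothetical helper, and the token counts it reports assume headerless uint16 files.

```python
from pathlib import Path


def preflight(shard_dir):
    """Check that train and validation shards exist and hold whole uint16 values."""
    shard_dir = Path(shard_dir).expanduser()
    report = {}
    for split in ("train", "val"):
        files = sorted(shard_dir.glob(f"polish_{split}_*.bin"))
        if not files:
            raise FileNotFoundError(f"no polish_{split}_*.bin under {shard_dir}")
        sizes = [f.stat().st_size for f in files]
        odd = [f.name for f, s in zip(files, sizes) if s % 2]
        if odd:
            raise ValueError(f"odd byte count, not whole uint16 values: {odd}")
        report[split] = sum(sizes) // 2  # token count if the files are headerless
    return report
```

On the training host this would be called as `preflight("~/dynaword/shards")`.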

Inspect the launch script:

cd training
sed -n '1,120p' run_polish.sh

The core launch pattern is:

export TORCHINDUCTOR_CACHE_DIR="$HOME/.cache/torchinductor_polish"
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
export TORCHINDUCTOR_AUTOGRAD_CACHE=1
cd "$HOME/modded-nanogpt"
.venv/bin/torchrun --standalone --nproc_per_node=1 train_gpt.py

The saved full training states include model and optimizer state:

state_step000500.pt
state_step001000.pt
state_step001500.pt

They are about 3.6 GB each and are not committed. Fetch them only when needed:

mkdir -p training-states
rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/

8. Evaluation And Sanity Checks

Always validate these before trusting a run:

Tokenizer round trip: encode and decode sample Polish text.
Vocabulary compatibility: checkpoint vocab_size equals tokenizer vocab size.
Loss curve: training loss should decrease smoothly, and the improvement should not come only from memorizing a tiny sample.
Sample quality: inspect repeated n-grams and broken Unicode.
Validation loss: keep validation shards separate from training shards.
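The sample-quality check can be made slightly more objective with two small helpers: one that measures how often n-grams repeat, and one that looks for the Unicode replacement character. Both are illustrative sketches with hypothetical names, not part of the repo's probe scripts.

```python
from collections import Counter


def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-gram positions that repeat an n-gram seen earlier."""
    grams = [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        return 0.0
    counts = Counter(grams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(grams)


def has_broken_unicode(text):
    """U+FFFD appears when decoded bytes were not valid UTF-8."""
    return "\ufffd" in text


sample = "ala ma kota ala ma kota ala ma kota".split()
print(round(repeated_ngram_fraction(sample, n=3), 2))  # 0.57
print(has_broken_unicode(b"\xc5".decode("utf-8", errors="replace")))  # True
```

With a byte-level tokenizer, a replacement character typically shows up when generation stops in the middle of a multi-byte character, so a single one at the very end of a sample is less worrying than several in the middle.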

The included metadata shows the local training loss dropping from about 10.54 at step 0 to around 4.63 at step 500, with later probe rows in metadata/traj.csv.
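These numbers can be sanity-checked with a little arithmetic. Assuming the logged loss is mean cross-entropy in nats (an assumption, though it is the usual convention in PyTorch trainers), a model that guesses uniformly over the vocabulary scores ln(vocab_size), and exp(loss) gives the perplexity.

```python
import math

vocab_size = 32768
uniform_loss = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
print(f"{uniform_loss:.2f}")         # 10.40


def perplexity(loss_nats):
    """Convert a mean cross-entropy loss in nats to perplexity."""
    return math.exp(loss_nats)


print(f"{perplexity(4.63):.1f}")     # 102.5
```

A step-0 loss of about 10.54 is therefore close to what an untrained model should produce, and a loss of 4.63 corresponds to a perplexity of roughly 102, far below the 32768 of a uniform guess.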

9. Common Mistakes

Mixing tokenizer files. A model trained with polish_bpe_32k.json must be sampled with that tokenizer.
Saving only weights but losing model_args. The loader needs architecture parameters.
Tokenizing train and validation together. Split first, tokenize second.
Using int32 shards with the Slayer trainer. Its loader is built around raw uint16 token ids.
Treating full optimizer checkpoints as deployable model artifacts. For inference, export a model-only checkpoint when possible.
Teaching from the H100 script first. Start with the local checkpoint and tokenizer, then show the larger training script as the scaled version.
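The first two mistakes in this list can be caught with a short check before sampling. The sketch below assumes the tokenizer file follows the Hugging Face `tokenizers` JSON layout (a `model.vocab` mapping from token to id, plus an `added_tokens` list) and that the checkpoint stores `vocab_size` inside `model_args`. Function names are hypothetical.

```python
import json


def tokenizer_vocab_size(tokenizer_json_path):
    """Largest token id + 1, read from a Hugging Face tokenizers JSON file."""
    with open(tokenizer_json_path, encoding="utf-8") as f:
        data = json.load(f)
    ids = list(data["model"]["vocab"].values())
    ids += [t["id"] for t in data.get("added_tokens", [])]
    return max(ids) + 1


def check_pairing(model_args, tokenizer_json_path):
    """Raise if the checkpoint's vocab_size and the tokenizer disagree."""
    if "vocab_size" not in model_args:
        raise KeyError("checkpoint has no model_args['vocab_size']")
    expected = model_args["vocab_size"]
    actual = tokenizer_vocab_size(tokenizer_json_path)
    if expected != actual:
        raise ValueError(f"checkpoint expects {expected} tokens, tokenizer has {actual}")
```

A matching size does not prove that the tokenizer is the one used in training, since two different tokenizers can share a vocabulary size. It does catch the 32k versus 65k mix-up immediately.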

10. Suggested Lesson Flow

Show scripts/sample_mac.py generating from the saved model.
Open metadata/artifact_manifest.json and explain the model/tokenizer pairing.
Train a toy tokenizer on a small Polish corpus with examples/prepare_corpus.py.
Inspect tokenization of Polish words, punctuation, and diacritics.
Explain next-token prediction and the checkpoint format.
Show how training/train_gpt.py scales the same idea to H100 training.
End with failure modes: wrong tokenizer, data leakage, repeated text, and no validation split.