# Replication Guide

This guide explains how to replicate the Slayer GPT-style Polish language-model experiment from raw text to a runnable checkpoint.

The repo contains two related tracks:

1. `model/ckpt.pt` is a small, runnable GPT checkpoint paired with `tokenizers/polish_bpe_32k.json`.
2. `training/` contains the larger Slayer H100 training code derived from modded-nanoGPT. Its full optimizer checkpoints are documented but not committed.

Use the small track for teaching and local demos. Use the H100 track to explain how the larger remote run was structured.

## 1. What Was Built

The local model is a GPT-2-style decoder-only Transformer:

- 12 layers
- 12 attention heads
- 768 embedding dimension
- 1024 token context
- 32768-token custom Polish byte-level BPE vocabulary
- bias-free linear/layernorm setup
- about 136M parameters in the checkpoint state dict
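
The parameter figure can be sanity-checked with a back-of-envelope count. The sketch below assumes a standard GPT-2 block layout (fused QKV projection, 4x MLP expansion, learned positional embeddings) and assumes the state dict lists the output head as its own entry even if it is tied to the token embedding. Neither assumption is stated for this checkpoint, and `count_gpt_params` is a hypothetical helper, not part of the repo.

```python
def count_gpt_params(n_layer, n_embd, block_size, vocab_size, count_head_separately=True):
    """Count parameters of a bias-free GPT-2-style decoder.

    Assumes a 4x MLP expansion and learned positional embeddings; these are
    the usual GPT-2 choices, not confirmed details of this checkpoint.
    """
    wte = vocab_size * n_embd                      # token embedding
    wpe = block_size * n_embd                      # learned positional embedding
    attn = n_embd * 3 * n_embd + n_embd * n_embd   # fused QKV + output projection
    mlp = 2 * (n_embd * 4 * n_embd)                # up- and down-projection
    norms = 2 * n_embd                             # two LayerNorm weights, no bias
    blocks = n_layer * (attn + mlp + norms)
    final_norm = n_embd
    unique = wte + wpe + blocks + final_norm
    # A tied head shares the embedding tensor, but a state dict can list it again.
    lm_head = wte if count_head_separately else 0
    return {"unique": unique, "state_dict_total": unique + lm_head}


counts = count_gpt_params(n_layer=12, n_embd=768, block_size=1024, vocab_size=32768)
print(counts)
```

Under these assumptions the state-dict total is 136,071,936, which is consistent with the "about 136M" figure, while roughly 111M of those parameters are unique if the head is tied.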

It is not a fine-tune of OpenAI GPT-2 weights. The important idea is the recipe:

1. collect Polish text,
2. train a custom byte-level BPE tokenizer,
3. tokenize the corpus into token-id shards,
4. train a causal language model on next-token prediction,
5. sample and evaluate with the exact same tokenizer.
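
Step 4 of the recipe, next-token prediction, comes down to pairing each window of token ids with the same window shifted by one position. A minimal sketch with made-up token ids follows; `next_token_pairs` is a hypothetical helper, and a real trainer would sample windows from the shards rather than walk them in order.

```python
def next_token_pairs(token_ids, block_size):
    """Yield (inputs, targets) windows where targets are inputs shifted by one."""
    # Stop early enough that every target window is full length.
    for start in range(0, len(token_ids) - block_size, block_size):
        inputs = token_ids[start : start + block_size]
        targets = token_ids[start + 1 : start + block_size + 1]
        yield inputs, targets


# Hypothetical token ids; a real run would read them from a token-id shard.
toy_ids = [5, 17, 42, 8, 99, 3, 61]
pairs = list(next_token_pairs(toy_ids, block_size=3))
for inputs, targets in pairs:
    print(inputs, "->", targets)
```

At every position the model is trained to predict `targets[i]` from `inputs[: i + 1]`, which is why the tokenizer used for sampling must be the one used to build the shards.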

## 2. Repository Map

- `model/ckpt.pt` - runnable model checkpoint.
- `tokenizers/polish_bpe_32k.json` - tokenizer paired with `model/ckpt.pt`.
- `tokenizers/rxlm_polish_bpe_65k.json` - separate later 65k custom tokenizer. Do not use it with `model/ckpt.pt`.
- `scripts/model.py` - GPT implementation for the local checkpoint.
- `scripts/sample_mac.py` - generation script.
- `scripts/knowbench_mac.py` and `scripts/syntaxbench_mac.py` - simple evaluation probes.
- `examples/prepare_corpus.py` - reference script for tokenizer training and `.bin` shard creation.
- `training/train_gpt.py` - H100 Slayer training script.
- `training/run_polish.sh` - remote launch script used on `ssh slayer`.
- `logs/` and `metadata/` - evidence from the saved local and remote runs.

## 3. Local Demo

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
python scripts/sample_mac.py "Polska jest" 80
```

Expected behavior:

- On Apple Silicon, the sampler uses MPS.
- On other machines, it falls back to CPU.
- The model should load `model/ckpt.pt` and `tokenizers/polish_bpe_32k.json` without path changes.

Run the lightweight probes:

```bash
python scripts/syntaxbench_mac.py
python scripts/knowbench_mac.py
```

## 4. Corpus Preparation

For teaching, start with a plain UTF-8 text corpus:

```text
data/raw/doc_0001.txt
data/raw/doc_0002.txt
...
```

One document per file is easiest to reason about. Clean the text enough to remove boilerplate and encoding damage, but do not over-normalize the language. The tokenizer here is byte-level BPE, so it can represent arbitrary text.
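
The reason a byte-level tokenizer can represent arbitrary text is that any string reduces to UTF-8 bytes, and there are only 256 possible byte values to cover. The sketch below shows just that fallback idea on a Polish pangram; it does not reproduce the merge rules of the actual tokenizer.

```python
text = "Zażółć gęślą jaźń"

# Every character, including Polish diacritics, becomes one or more bytes.
raw = text.encode("utf-8")
byte_ids = list(raw)

# The byte sequence decodes back to the original string without loss.
restored = bytes(byte_ids).decode("utf-8")

print(len(text), "characters ->", len(byte_ids), "bytes")
print(restored == text)
```

BPE merges then combine frequent byte sequences into larger tokens, so common Polish words cost far fewer tokens than their byte count.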

Recommended minimum for a useful class demo:

- tokenizer demo: 10 MB to 100 MB of text,
- tiny GPT training demo: 100 MB to 1 GB,
- meaningful GPT run: many GB.

Split off the validation documents before tokenization, so that no validation text ends up in the training shards.
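
One way to make the train/validation split reproducible is to assign each file to a split from a hash of its name, so the assignment does not depend on file order or on a random seed. This is a sketch under assumptions: `split_documents` and the 5% default are hypothetical, and `examples/prepare_corpus.py` may split differently.

```python
import hashlib
from pathlib import Path


def is_validation(path, val_percent=5):
    """Map a file name to a stable bucket in 0..99 and compare to the threshold."""
    digest = hashlib.sha256(Path(path).name.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < val_percent


def split_documents(paths, val_percent=5):
    """Return (train_paths, val_paths) with no overlap between the two."""
    train, val = [], []
    for path in sorted(str(p) for p in paths):
        (val if is_validation(path, val_percent) else train).append(path)
    return train, val


docs = [f"data/raw/doc_{i:04d}.txt" for i in range(1, 201)]
train_docs, val_docs = split_documents(docs)
print(len(train_docs), "train,", len(val_docs), "validation")
```

Because the assignment depends only on the file name, adding new documents later never moves an existing document from one split to the other.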

## 5. Tokenizer Training

The model checkpoint in this repo uses:

- byte-level BPE,
- vocab size 32768,
- no Unicode normalizer,
- `add_prefix_space=False`,
- `<|endoftext|>` as the document separator / BOS-style token.

Train a compatible tokenizer and create token-id shards with:

```bash
python examples/prepare_corpus.py \
  --raw-dir data/raw \
  --out-dir data/processed \
  --vocab-size 32768 \
  --train-tokenizer
```

This writes:

- `data/processed/tokenizer.json`
- `data/processed/shards/polish_train_000000.bin`
- `data/processed/shards/polish_val_000000.bin`

The `.bin` shards are raw `uint16` token ids. This matters because `training/train_gpt.py` loads shards as `torch.uint16`, so every token id must be below 65536.
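
To make the shard format concrete, here is a minimal stdlib sketch that writes and reads token ids as little-endian `uint16`. It assumes the file has no header, as "raw" suggests; check `examples/prepare_corpus.py` for the layout the repo actually uses. `write_shard` and `read_shard` are hypothetical names.

```python
import struct

MAX_UINT16_VOCAB = 65536  # ids must be in 0..65535 to fit in two bytes


def write_shard(path, token_ids):
    """Write token ids as headerless little-endian uint16 values."""
    ids = list(token_ids)
    if any(not 0 <= t < MAX_UINT16_VOCAB for t in ids):
        raise ValueError("token id does not fit in uint16")
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(ids)}H", *ids))


def read_shard(path):
    """Read a headerless uint16 shard back into a list of ints."""
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % 2:
        raise ValueError("shard size is not a multiple of 2 bytes")
    return list(struct.unpack(f"<{len(data) // 2}H", data))
```

Each token costs exactly two bytes, so a shard's token count is its file size divided by two. For a real corpus a NumPy `uint16` array is the more practical container; `struct` is used here only to keep the sketch dependency-free.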

For a fresh 65k experiment you may use a 65536-token tokenizer, but then the model config and training code must match that vocabulary size. Do not use the 65k tokenizer with the checked-in 32768-vocab checkpoint.

## 6. Training A Small Local GPT

This repo includes inference code for the saved checkpoint, not a full clean-room tiny trainer. For teaching, the simplest path is:

1. Use `examples/prepare_corpus.py` to create tokenizer and shards.
2. Use an existing nanoGPT trainer or your own minimal causal LM trainer.
3. Match these model settings for compatibility with `scripts/model.py`:

```python
GPTConfig(
    n_layer=12,
    n_head=12,
    n_embd=768,
    block_size=1024,
    vocab_size=32768,
    bias=False,
    dropout=0.0,
)
```

4. Save checkpoints in this format:

```python
torch.save(
    {
        "model": model.state_dict(),
        "model_args": config_dict,
        "iter": step,
    },
    "ckpt.pt",
)
```
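
Before copying a checkpoint into place, a quick structural check can catch a missing `model_args` or an inconsistent head count. The sketch below works on the plain dict that `torch.load` would return; `checkpoint_problems` is a hypothetical helper, and the required keys simply mirror the config and save format shown above rather than what `scripts/model.py` is confirmed to read.

```python
REQUIRED_TOP_LEVEL = ("model", "model_args", "iter")
# Assumption: the loader needs the same fields as the GPTConfig shown above.
REQUIRED_MODEL_ARGS = (
    "n_layer", "n_head", "n_embd", "block_size", "vocab_size", "bias", "dropout",
)


def checkpoint_problems(ckpt):
    """Return a list of human-readable problems; an empty list means it looks fine."""
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP_LEVEL if k not in ckpt]
    args = ckpt.get("model_args") or {}
    problems += [f"missing model_args key: {k}" for k in REQUIRED_MODEL_ARGS if k not in args]
    if "n_embd" in args and "n_head" in args and args["n_embd"] % args["n_head"] != 0:
        problems.append("n_embd is not divisible by n_head")
    return problems


example = {
    "model": {},  # stands in for model.state_dict()
    "model_args": {
        "n_layer": 12, "n_head": 12, "n_embd": 768, "block_size": 1024,
        "vocab_size": 32768, "bias": False, "dropout": 0.0,
    },
    "iter": 500,
}
print(checkpoint_problems(example))
```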

Then put the checkpoint at `model/ckpt.pt`, the tokenizer at `tokenizers/polish_bpe_32k.json`, and run:

```bash
python scripts/sample_mac.py "Dawno temu w Polsce" 100
```

## 7. Replicating The Slayer H100 Run

The remote run used a modified fast GPT trainer under:

```text
ssh slayer:/home/ubuntu/modded-nanogpt
```

The local copy of the relevant training files is in `training/`.

Important paths from the run:

```text
~/dynaword/shards/polish_train_*.bin
~/dynaword/shards/polish_val_*.bin
~/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt
```

The H100 trainer expects:

- CUDA-capable Linux host,
- PyTorch 2.10-compatible environment,
- Triton,
- `kernels`,
- token-id shards as raw `uint16`,
- train files matching `~/dynaword/shards/polish_train_*.bin`,
- validation files matching `~/dynaword/shards/polish_val_*.bin`.
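
A small preflight check can confirm that the shard requirements above are met before launching a long run. This is a sketch, not part of the repo: `find_shards` is a hypothetical helper, and the size test only assumes two-byte tokens.

```python
import glob
import os


def find_shards(shard_dir, split):
    """Return sorted shard paths for 'train' or 'val', failing loudly if none exist."""
    pattern = os.path.join(os.path.expanduser(shard_dir), f"polish_{split}_*.bin")
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no shards match {pattern}")
    for path in paths:
        # uint16 tokens are two bytes each, so an odd size signals a damaged file.
        if os.path.getsize(path) % 2:
            raise ValueError(f"{path} has an odd byte count")
    return paths
```

On the training host this would be called as `find_shards("~/dynaword/shards", "train")` and `find_shards("~/dynaword/shards", "val")`.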

To inspect the launch script:

```bash
cd training
sed -n '1,120p' run_polish.sh
```

The core launch pattern is:

```bash
export TORCHINDUCTOR_CACHE_DIR="$HOME/.cache/torchinductor_polish"
export TORCHINDUCTOR_FX_GRAPH_CACHE=1
export TORCHINDUCTOR_AUTOGRAD_CACHE=1
cd "$HOME/modded-nanogpt"
.venv/bin/torchrun --standalone --nproc_per_node=1 train_gpt.py
```

The saved full training states include model and optimizer state:

```text
state_step000500.pt
state_step001000.pt
state_step001500.pt
```

They are about 3.6 GB each and are not committed. Fetch them only when needed:

```bash
mkdir -p training-states
rsync -avP slayer:/home/ubuntu/modded-nanogpt/logs/f143baf2-dbfc-406e-b13b-a9fcc166b31b/state_step001500.pt training-states/
```

## 8. Evaluation And Sanity Checks

Always validate these before trusting a run:

- Tokenizer round trip: encode and decode sample Polish text.
- Vocabulary compatibility: checkpoint `vocab_size` equals tokenizer vocab size.
- Loss curve: training loss should decrease smoothly, and the decrease should not come only from memorizing a tiny sample.
- Sample quality: inspect repeated n-grams and broken Unicode.
- Validation loss: keep validation shards separate from training shards.

The included metadata shows the local training loss dropping from about `10.54` at step 0 to around `4.63` at step 500, with later probe rows in `metadata/traj.csv`.
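
Two of the checks above, vocabulary compatibility and repeated n-grams, are easy to automate. The sketch below assumes the tokenizer file follows the Hugging Face `tokenizers` JSON layout (`model.vocab` plus `added_tokens`), which is not confirmed for this repo; both function names are hypothetical.

```python
def tokenizer_vocab_size(tokenizer_json):
    """Highest token id + 1, read from a parsed `tokenizers`-style JSON dict."""
    ids = list(tokenizer_json["model"]["vocab"].values())
    ids += [tok["id"] for tok in tokenizer_json.get("added_tokens", [])]
    return max(ids) + 1


def repeated_ngram_fraction(token_ids, n=4):
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 means no repeats)."""
    ngrams = [tuple(token_ids[i : i + n]) for i in range(len(token_ids) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


# Toy stand-ins; real use would json.load the tokenizer file and read
# vocab_size from the checkpoint's model_args.
toy_tokenizer = {
    "model": {"vocab": {"a": 1, "b": 2}},
    "added_tokens": [{"id": 0, "content": "<|endoftext|>"}],
}
toy_model_args = {"vocab_size": 3}

print(tokenizer_vocab_size(toy_tokenizer) == toy_model_args["vocab_size"])
print(repeated_ngram_fraction([1, 2, 1, 2, 1, 2], n=2))
```

A repetition fraction that climbs sharply on generated samples is a hint that the model is looping; what counts as too high depends on the text and on `n`, so compare against the same measure on held-out validation text.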

## 9. Common Mistakes

- Mixing tokenizer files. A model trained with `polish_bpe_32k.json` must be sampled with that tokenizer.
- Saving only weights but losing `model_args`. The loader needs architecture parameters.
- Tokenizing train and validation together. Split first, tokenize second.
- Using `int32` shards with the Slayer trainer. Its loader is built around raw `uint16` token ids.
- Treating full optimizer checkpoints as deployable model artifacts. For inference, export a model-only checkpoint when possible.
- Teaching from the H100 script first. Start with the local checkpoint and tokenizer, then show the larger training script as the scaled version.

## 10. Suggested Lesson Flow

1. Show `scripts/sample_mac.py` generating from the saved model.
2. Open `metadata/artifact_manifest.json` and explain the model/tokenizer pairing.
3. Train a toy tokenizer on a small Polish corpus with `examples/prepare_corpus.py`.
4. Inspect tokenization of Polish words, punctuation, and diacritics.
5. Explain next-token prediction and the checkpoint format.
6. Show how `training/train_gpt.py` scales the same idea to H100 training.
7. End with failure modes: wrong tokenizer, data leakage, repeated text, and no validation split.