thomas-schweich committed on
Commit d189f4f · 1 Parent(s): 3436086

Rewrite CLAUDE.md with operational guidance for training, eval, RunPod, and monitoring

Add concrete commands for pretraining, adapter training, and evaluation.
Add RunPod operations section with Docker build, pod lifecycle, GPU cost
table, and safety rules. Add monitoring section emphasizing write-to-disk
pattern for /loop pre-approval.

Files changed (1):
  1. CLAUDE.md +232 -60
CLAUDE.md CHANGED
@@ -15,6 +15,7 @@ pawn/
  │ ├── trainer.py # Pretraining loop
  │ ├── gpu.py # GPU auto-detection (compile/AMP/SDPA backend)
  │ ├── logging.py # MetricsLogger (JSONL output)
  │ ├── adapters/ # Bottleneck, LoRA, FiLM, sparse, hybrid
  │ ├── eval_suite/ # Probes, generation tests, diagnostics, lichess eval
  │ └── dashboard/ # Solara training dashboard (metrics, charts, runner)
@@ -32,30 +33,32 @@ This is a uv workspace. The root project is the `pawn` Python package; `engine/`
  # Build the Rust chess engine (required before anything else)
  cd engine && uv run --with maturin maturin develop --release && cd ..

- # Install Python deps (dev tools like pytest are in base dependencies):
  uv sync --extra rocm # AMD (ROCm 7.1)
  uv sync --extra cu128 # NVIDIA (CUDA 12.8)

  # Run tests
  uv run pytest tests/

- # Pretrain from scratch
  uv run python scripts/train.py --variant base --local-checkpoints
  ```

  ## Engine (`engine/`)

  **Single source of truth** for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.

  - Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
  - PyO3 bindings expose `chess_engine` module to Python
- - Key functions: `generate_random_games()`, `parse_pgn_file()`, `compute_legal_token_masks_sparse()`, `extract_board_states()`, `export_move_vocabulary()`

- ## Model (`pawn/`)

  ### Architecture
- - Decoder-only transformer, next-token prediction over 4278 tokens
- - Token vocabulary: 1 PAD + 4096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
  - Factored embeddings: `src_embed[s] + dst_embed[d] + promo_embed[p]`
  - Sequence format: `[outcome] [ply_1] ... [ply_N] [PAD] ... [PAD]` (256 tokens)
@@ -65,60 +68,105 @@ uv run python scripts/train.py --variant base --local-checkpoints
  - `CLMConfig.large()`: d=640, 10 layers, 8 heads, ~68.4M params
  - `CLMConfig.toy()`: d=64, 2 layers, for tests only

- ### Key Patterns

- - **Sparse logit projection**: `forward_hidden()` returns `(B,T,d_model)`, then only loss-masked positions project through `lm_head` -- avoids full `(B,T,V)` materialization
- - **Legal mask via Rust**: `LegalMaskBuilder` replays games in Rust, returns sparse indices scattered into a pre-allocated GPU buffer
- - **DataLoader workers**: Must use `multiprocessing_context='spawn'` -- the engine uses rayon, and fork after rayon init causes deadlocks
- - **GPU auto-detection**: `pawn.gpu.configure_gpu()` selects compile/AMP/SDPA settings. ROCm uses MATH SDPA backend (flash attention backward has stride issues with torch.compile)
- - **SDPA backend global**: `pawn.model.SDPA_BACKEND` is set by `apply_gpu_config()` and used in `Attention.forward()` via `sdpa_kernel()` context

- ## Adapters (`pawn/adapters/`)

- All adapters freeze the backbone and initialize to identity (zero-init or gamma=1, beta=0):
- - `bottleneck.py` -- Houlsby-style down/up MLP, best parameter efficiency below ~1M params
- - `lora.py` -- Low-rank attention projection adapters
- - `film.py` -- Feature-wise Linear Modulation (lightest, ~17K params)
- - `sparse.py` -- Random binary mask on frozen weights
- - `hybrid.py` -- LoRA + FiLM combined

- ## Scripts (`scripts/`)

- - `train.py` -- Pretrain from scratch (`--variant small|base|large|toy`)
- - `train_all.py` -- Train small/base/large simultaneously on shared data batches. Supports `--run-evals` for automatic post-training probes, diagnostics, and Lichess eval, and `--publish-results` to push eval results to HF.
- - `train_bottleneck.py`, `train_film.py`, `train_lora.py`, `train_sparse.py`, `train_hybrid.py` -- Adapter behavioral cloning on Lichess PGN
- - `train_tiny.py` -- Standalone tiny transformer baseline (no frozen backbone)
- - `eval_accuracy.py` -- MAIA-compatible evaluation (per-phase, per-ply accuracy)
- - `eval_probes.py` -- Run linear probes on all checkpoints
- - `export_hf_repo.py` -- Convert training run to HuggingFace repo format (safetensors + metrics)

- All training scripts require `--hf-repo REPO` or `--local-checkpoints`.

- ## Deploy (`deploy/`)

- Docker-based deployment to Runpod GPU VMs:
- - `Dockerfile` -- Multi-target build: `interactive` (SSH+Jupyter, default) and `runner` (auto-stop)
- - `entrypoint-run.sh` -- Runner entrypoint, pulls from HF via `PAWN_MODEL` env var
- - `sync.sh` -- Pull latest checkpoints/metrics from HuggingFace submodules
- - `pod.sh` -- Pod lifecycle (create/start/stop/delete/ssh)

- Code lives at `/opt/pawn` on pods (outside the `/workspace` volume mount).

- ## Dashboard (`pawn/dashboard/`)

- Solara-based training dashboard. Reads `metrics.jsonl` files, no dependency on training packages.

  ```bash
- uv sync --extra dashboard
- python -m pawn.dashboard --log-dir logs
  ```

- - `metrics.py` -- Load runs, parse JSONL, detect run type from config record
- - `charts.py` -- Plotly chart builders (loss, accuracy, LR, GPU, adapter-specific diagnostics)
- - `sol.py` -- Solara components: RunSelector, ConfigSummary, MetricsCharts, Runner, Dashboard
- - `__main__.py` -- CLI entry point, sets `PAWN_LOG_DIR` env var, launches `solara run`

- Auto-detects run type from config fields (`run_type`, `formulation`, `pgn_file`). Dashboard requires restart for code changes (no hot reload).

  ## Checkpoints
@@ -169,28 +217,152 @@ the training loop checks it between steps, saves a checkpoint, pushes to HF, and
  **Never rsync checkpoint files from running pods.** Checkpoints are pushed to HuggingFace
  from the trainer. Pull via `deploy/sync.sh` (submodule update).

- ## Logs

- Training metrics in `logs/` (gitignored). Each run gets a timestamped directory with `metrics.jsonl`.

- ## Runpod Pod Management

- ### Setup

- - Docker image: multi-target build in `Dockerfile`
- - `interactive` (default) — SSH + Jupyter, stays alive
- - `runner` — executes command then exits (pod auto-stops)
- - Build: `docker build --target runner --build-arg GIT_HASH=$(git rev-parse HEAD) ...`

- ### Required Configuration

  - **Always attach a network volume.** Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
- - **Set `HF_TOKEN` as a pod environment variable** for automatic HuggingFace authentication.
- - Set `PAWN_MODEL=thomas-schweich/pawn-base` env var in the runner to auto-pull a checkpoint on startup.

- ### Lifecycle

- - Create: `runpodctl pod create --name pawn-exp --gpu-id "NVIDIA RTX A5000" --image thomasschweich/pawn:<tag> --volume-in-gb 75 --ports "8888/http,22/tcp"`
- - Stop: `runpodctl pod stop <ID>` — sends SIGTERM → trainer saves and pushes before exiting
- - **Never `runpodctl pod delete` while training is running** — data loss risk
- - Monitor: pull HF submodule (`deploy/sync.sh`) and read `metrics.jsonl`
  │ ├── trainer.py # Pretraining loop
  │ ├── gpu.py # GPU auto-detection (compile/AMP/SDPA backend)
  │ ├── logging.py # MetricsLogger (JSONL output)
+ │ ├── checkpoint.py # Atomic save/load, .complete sentinel, HF push
  │ ├── adapters/ # Bottleneck, LoRA, FiLM, sparse, hybrid
  │ ├── eval_suite/ # Probes, generation tests, diagnostics, lichess eval
  │ └── dashboard/ # Solara training dashboard (metrics, charts, runner)
 
  # Build the Rust chess engine (required before anything else)
  cd engine && uv run --with maturin maturin develop --release && cd ..

+ # Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies):
  uv sync --extra rocm # AMD (ROCm 7.1)
  uv sync --extra cu128 # NVIDIA (CUDA 12.8)

  # Run tests
  uv run pytest tests/

+ # Pretrain from scratch (local dev)
  uv run python scripts/train.py --variant base --local-checkpoints
  ```

+ The only extras are GPU backends (`rocm` or `cu128`). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies.
+
  ## Engine (`engine/`)

  **Single source of truth** for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.

  - Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
  - PyO3 bindings expose `chess_engine` module to Python
+ - Key functions: `generate_random_games()`, `parse_pgn_file()`, `compute_legal_token_masks_sparse()`, `extract_board_states()`, `export_move_vocabulary()`, `compute_accuracy_ceiling()`

+ ## Model

  ### Architecture
+ - Decoder-only transformer, next-token prediction over 4,278 tokens
+ - Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
  - Factored embeddings: `src_embed[s] + dst_embed[d] + promo_embed[p]`
  - Sequence format: `[outcome] [ply_1] ... [ply_N] [PAD] ... [PAD]` (256 tokens)
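As a sanity check on the vocabulary arithmetic above, the block layout can be expressed directly. The ordering of the blocks and the `src * 64 + dst` indexing are assumptions for illustration, not the tokenizer's confirmed layout:

```python
# Vocabulary blocks: 1 PAD + 4096 grid moves (64 src x 64 dst)
# + 176 promotions + 5 outcome tokens = 4278 total.
N_PAD, N_GRID, N_PROMO, N_OUTCOME = 1, 64 * 64, 176, 5
VOCAB_SIZE = N_PAD + N_GRID + N_PROMO + N_OUTCOME

def split_grid_token(token_id: int) -> tuple[int, int]:
    """Decompose a grid-move token into (src_square, dst_square).

    Assumes grid tokens follow PAD and are laid out row-major
    (src * 64 + dst); the real tokenizer may order things differently.
    """
    if not N_PAD <= token_id < N_PAD + N_GRID:
        raise ValueError("not a grid move token")
    return divmod(token_id - N_PAD, 64)
```

This decomposition is also what makes the factored `src_embed[s] + dst_embed[d] + promo_embed[p]` embedding possible: each token maps to a small tuple of indices rather than one row in a 4,278-row table.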
 
 
  - `CLMConfig.large()`: d=640, 10 layers, 8 heads, ~68.4M params
  - `CLMConfig.toy()`: d=64, 2 layers, for tests only

+ ## Training

+ All training scripts require one of `--hf-repo REPO_ID` or `--local-checkpoints` (mutually exclusive). Use `--local-checkpoints` for local dev; use `--hf-repo` for any run where you need durable checkpoints.

+ ### Pretraining

+ ```bash
+ # Single model
+ uv run python scripts/train.py --variant base --local-checkpoints

+ # All three variants simultaneously (shared data batches, sequential GPU)
+ uv run python scripts/train_all.py --local-checkpoints

+ # Resume from checkpoint
+ uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints
+ ```

+ **`scripts/train.py`** key args:
+ - `--variant {small|base|large|toy}` — model size (default: base)
+ - `--resume PATH` — resume from checkpoint directory
+ - `--total-steps N` — training steps (default: 100,000)
+ - `--batch-size N` — batch size (default: 256)
+ - `--discard-ply-limit` — only train on naturally-ended games (no ply-limit truncation)
+ - Architecture overrides: `--d-model`, `--n-layers`, `--n-heads`, `--d-ff`, `--lr`, `--weight-decay`, `--warmup-steps`
 
+ **`scripts/train_all.py`** additional args:
+ - `--shm-checkpoints` — write checkpoints to `/dev/shm` (requires `--hf-repo`; volatile)
+ - `--run-evals` — auto-run probes + diagnostics after training completes
+ - `--publish-results` — push eval results to HF
+ - `--patience N` — per-model early-stopping patience (eval intervals without improvement)

+ ### Adapter Training

+ All adapter scripts require `--checkpoint PATH` (pretrained weights) and `--pgn PATH` (Lichess PGN file). They freeze the backbone and train only adapter parameters.
+
+ ```bash
+ # Example: train a LoRA adapter on Lichess 1800-1900 games
+ uv run python scripts/train_lora.py \
+     --checkpoint checkpoints/pawn-base \
+     --pgn data/lichess_1800_1900.pgn \
+     --lora-rank 4 --lr 3e-4 --local-checkpoints
+ ```
+
+ | Script | Adapter | Key args | Typical params |
+ |--------|---------|----------|----------------|
+ | `train_bottleneck.py` | Houlsby MLP | `--bottleneck-dim 8` | ~131K |
+ | `train_lora.py` | Low-rank attention | `--lora-rank 4 --lora-targets qkvo` | ~65K |
+ | `train_film.py` | Channel-wise affine | `--no-output-film` | ~17K |
+ | `train_sparse.py` | Binary mask | `--density 0.01 --sparse-targets qkvo` | ~503K-2.7M |
+ | `train_hybrid.py` | LoRA + FiLM | `--lora-rank 4 --film-lr 1e-3` | ~65K |
+ | `train_tiny.py` | None (from scratch) | `--d-model 84 --n-layers 2` | ~524K |
+
+ Common adapter args: `--epochs 50`, `--batch-size 64`, `--lr 3e-4`, `--patience 10`, `--val-every 1`, `--max-games 12000`, `--min-ply 10`
 
+ ### Common CLI Patterns

+ - `--sdpa-math` — force MATH SDPA backend (required for ROCm + torch.compile)
+ - `--no-compile` — disable torch.compile
+ - `--no-amp` — disable mixed precision
+ - `--num-workers N` — DataLoader workers (default: 8 for adapters, 4 for pretraining)
+ - `--device {cuda|cpu}` — device selection
+ - `--wandb` — enable Weights & Biases logging
+
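The workers behind `--num-workers` must use the spawn start method: forking after the Rust engine's rayon pool initializes can deadlock. A stdlib sketch of the context that `multiprocessing_context='spawn'` selects:

```python
import multiprocessing as mp

# Fork copies the parent's rayon thread state and can deadlock;
# spawn starts a fresh interpreter per worker instead.
spawn_ctx = mp.get_context("spawn")
```

With torch, the equivalent is `DataLoader(..., num_workers=4, multiprocessing_context='spawn')` (or passing `spawn_ctx` itself).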
+ ## Evaluation & Metrics

+ ### Linear Probes

  ```bash
+ uv run python scripts/eval_probes.py --log-dir logs --device cuda
+ ```

+ Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: `--n-games 4096`, `--n-val-games 1024`, `--n-epochs 20`, `--run RUN_NAME` (specific run).

+ ### Move Prediction Accuracy

+ ```bash
+ uv run python scripts/eval_accuracy.py \
+     --checkpoint checkpoints/pawn-base \
+     --pgn data/lichess_1800_1900.pgn \
+     --adapter-checkpoint logs/run_*/checkpoints/best
+ ```

+ MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eval-ply 10`, `--max-games 50000`, `--per-ply`.

+ ### Theoretical Accuracy Ceilings

+ ```bash
+ uv run python scripts/compute_theoretical_ceiling.py
  ```

+ Computes upper bounds on top-1 accuracy for random games: unconditional (E[1/N_legal] = 6.43%), naive-conditioned (1-ply filter = 6.44%), MCTS-conditioned (32 rollouts = 7.92%). CPU-intensive.
+
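The unconditional ceiling is just the expected reciprocal of the legal-move count: against uniformly random play, any predictor matches the played move with probability at most 1/N_legal per position. A minimal estimator over pre-counted positions (the counts below are made up; the real script derives them from generated games):

```python
def unconditional_ceiling(legal_counts: list[int]) -> float:
    """Mean of 1/N_legal over positions: the best possible top-1
    accuracy against a uniformly random mover."""
    return sum(1.0 / n for n in legal_counts) / len(legal_counts)

# Toy data: two positions with 20 legal moves, one with 40.
estimate = unconditional_ceiling([20, 20, 40])
```

The conditioned variants tighten this by reweighting positions whose games survive a filter; the estimator shape stays the same.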
+ ### Export to HuggingFace

+ ```bash
+ uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS
+ ```
+
+ Converts a training run to HuggingFace repo format (safetensors + metrics). Finds the best checkpoint by val loss.
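Selecting the best checkpoint by validation loss is a one-liner over parsed metrics records; the field names here are illustrative, not the exporter's actual schema:

```python
# Hypothetical eval records, as might be parsed from metrics.jsonl.
records = [
    {"step": 10_000, "val_loss": 2.41},
    {"step": 20_000, "val_loss": 2.18},
    {"step": 30_000, "val_loss": 2.25},
]
best = min(records, key=lambda r: r["val_loss"])
```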

  ## Checkpoints
  **Never rsync checkpoint files from running pods.** Checkpoints are pushed to HuggingFace
  from the trainer. Pull via `deploy/sync.sh` (submodule update).
+ ## RunPod Operations

+ ### Docker Build & Push

+ ```bash
+ # Build runner target (auto-stop after training completes)
+ docker build --platform linux/amd64 \
+     --build-arg GIT_HASH=$(git rev-parse HEAD) \
+     --build-arg GIT_TAG=$(git tag --points-at HEAD) \
+     --target runner \
+     -t thomasschweich/pawn:latest-runner .
+
+ # Build interactive target (SSH + Jupyter, stays alive)
+ docker build --platform linux/amd64 \
+     --build-arg GIT_HASH=$(git rev-parse HEAD) \
+     --target interactive \
+     -t thomasschweich/pawn:latest .
+
+ docker push thomasschweich/pawn:latest-runner
+ ```
+
+ Code lives at `/opt/pawn` on pods (outside the `/workspace` volume mount).
+
+ ### Pod Lifecycle
+
+ Use `deploy/pod.sh` for all pod management. Requires `runpodctl` (`wget -qO- cli.runpod.net | sudo bash`).
+
+ ```bash
+ # Create a pod
+ bash deploy/pod.sh create myexp --gpu h100

+ # SSH into it
+ bash deploy/pod.sh ssh myexp

+ # Launch training
+ bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant}

+ # Stop (preserves volume, stops billing)
+ bash deploy/pod.sh stop myexp

+ # Delete (destroys everything)
+ bash deploy/pod.sh delete myexp
+ ```
+
+ GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`. Pod configs are cached in `~/.config/pawn/pods/<name>.env`.
+
+ ### GPU Selection
+
+ Benchmarks from pretraining 3 models concurrently (`train_all.py`, batch=256):
+
+ | GPU | VRAM | $/hr | Step time | 100K-step cost | Notes |
+ |-----|------|------|-----------|----------------|-------|
+ | B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest |
+ | H200 SXM | 141GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance |
+ | RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable |
+ | A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall |
+ | L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap |
+ | RTX 5090/4090/3090 | 24-32GB | — | OOM | — | Insufficient VRAM for 3 models |

+ Total cost is remarkably consistent (~$30-39) across viable GPUs, so the real tradeoff is wall-clock time, not price. Single-model training fits on 24GB GPUs.
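The cost column follows directly from price and step time; a quick sanity-check helper using figures from the table above:

```python
def run_cost_usd(price_per_hr: float, sec_per_step: float, steps: int = 100_000) -> float:
    """Total cost of a fixed-length run: hourly price times wall-clock hours."""
    return price_per_hr * sec_per_step * steps / 3600.0

h200_cost = run_cost_usd(3.59, 0.34)  # H200 SXM row: ~$34
```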
+
+ ### Required Pod Configuration

  - **Always attach a network volume.** Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
+ - **Set `HF_TOKEN` as a pod environment variable** for automatic HuggingFace authentication. The entrypoint persists it to `~/.cache/huggingface/token`.
+ - `PAWN_MODEL=thomas-schweich/pawn-base` — auto-pull a checkpoint on startup (runner target).
+ - `PAWN_CMD` — training command to execute (alternative to Docker CMD args).
+
+ ### Pod Safety
+
+ - Stop pods with `runpodctl pod stop` or `bash deploy/pod.sh stop` — sends SIGTERM; the trainer saves and pushes before exiting.
+ - **Never `runpodctl pod delete` while training is running** — data loss risk.
+ - **Never `kill -9` training processes** — use SIGTERM (plain `kill`), which triggers graceful shutdown.
+ - **Never rsync checkpoint files from running pods** — pull via the HF submodule instead.
+
+ ## Monitoring Training Progress
+
+ ### Key Principle: Write Scripts to Disk for Pre-Approval
+
+ When setting up recurring monitoring, **always write the monitoring script to a file first** so the user can review and pre-approve it. This avoids repeated permission prompts when `/loop` fires.
+
+ **Pattern:**
+ 1. Write a bash script to disk (e.g., `scripts/check_my_run.sh`)
+ 2. User reviews and approves the script
+ 3. Schedule with `/loop 15m bash scripts/check_my_run.sh`
+
+ **Example monitoring script:**
+
+ ```bash
+ #!/usr/bin/env bash
+ # scripts/check_my_run.sh — monitor a specific training run
+ set -euo pipefail
+ bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID>
+ ```
+
+ Or for local-only monitoring:
+
+ ```bash
+ #!/usr/bin/env bash
+ set -euo pipefail
+ bash /home/tas/pawn/scripts/check_progress.sh --sync
+ ```
+
+ ### Available Monitoring Tools
+
+ | Tool | What it does |
+ |------|--------------|
+ | `scripts/monitor_training.sh [POD_ID]` | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches |
+ | `scripts/check_progress.sh [--sync]` | Show progress from local `logs/` and HF submodules; `--sync` pulls submodules first |
+ | `deploy/sync.sh [name]` | Pull latest checkpoints/metrics from HuggingFace submodules |
+ | `python -m pawn.dashboard --log-dir logs` | Solara web dashboard with interactive charts |
+
+ ### Dashboard
+
+ ```bash
+ python -m pawn.dashboard --log-dir logs
+ ```
+
+ Reads `metrics.jsonl` files with no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires a restart for code changes (no hot reload).
+
+ ## Logs
+
+ Training metrics live in `logs/` (gitignored). Each run gets a timestamped directory with `metrics.jsonl` and a random slug (e.g., `run_20260325_140000_zesty-osprey/`).
+
+ `MetricsLogger` (`pawn/logging.py`) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug.
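One JSON object per line means any run can be loaded with the stdlib alone; a minimal reader (the field names in the sample are illustrative):

```python
import io
import json

def read_metrics(fp) -> list[dict]:
    """Parse a metrics.jsonl stream: one JSON record per non-blank line."""
    return [json.loads(line) for line in fp if line.strip()]

sample = io.StringIO('{"step": 1, "loss": 2.51}\n{"step": 2, "loss": 2.38}\n')
records = read_metrics(sample)
```

In practice you would pass `open("logs/<run>/metrics.jsonl")` instead of the in-memory sample; this line-at-a-time format is also why the dashboard can tail a live run.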
+
+ ## Hyperparameter Sweeps
+
+ Optuna integration via `pawn/sweep.py` and `scripts/sweep.py`:
+
+ ```bash
+ uv run python scripts/sweep.py \
+     --adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \
+     --total-steps 20000 --pruner hyperband \
+     --checkpoint checkpoints/pawn-base --pgn data/lichess_1800_1900.pgn \
+     --local-checkpoints
+ ```
+
+ Supports all adapter types plus architecture search. GPU affinity assigns `CUDA_VISIBLE_DEVICES = trial.number % n_gpus`. SQLite-backed study persistence. Pruner options: `hyperband`, `median`, `none`.
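The round-robin affinity rule is simple enough to state directly (a sketch of the scheme, not the sweep module's code):

```python
def gpu_for_trial(trial_number: int, n_gpus: int) -> str:
    """CUDA_VISIBLE_DEVICES value for a trial: round-robin over GPUs."""
    return str(trial_number % n_gpus)

# Trials 0..3 on 2 GPUs alternate devices.
assignment = [gpu_for_trial(t, 2) for t in range(4)]
```

With `--n-jobs` equal to `--n-gpus`, this keeps concurrent trials on distinct devices.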
+
+ ## Key Patterns & Gotchas
+
+ - **DataLoader workers must use `multiprocessing_context='spawn'`** — the Rust engine uses rayon, and fork after rayon init causes deadlocks.
+ - **`SDPA_BACKEND` must be set before `torch.compile()`** — compiled code captures the backend at trace time. `apply_gpu_config()` handles this.
+ - **ROCm flash attention bug**: with `torch.compile`, flash attention backward has stride issues. Use `--sdpa-math` to force the MATH SDPA backend.
+ - **Sparse logit projection**: `forward_hidden()` returns `(B,T,d_model)`, then only loss-masked positions project through `lm_head` — avoids full `(B,T,V)` materialization.
+ - **Legal mask via Rust**: `LegalMaskBuilder` replays games in Rust, returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense).
+ - **GPU auto-detection**: `pawn.gpu.configure_gpu()` selects compile/AMP/SDPA settings. `apply_gpu_config()` applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile.
+ - **Factored embeddings**: each move token decomposes into `src_embed[s] + dst_embed[d] + promo_embed[p]`, reducing embedding parameters by ~32x.
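The sparse-logit-projection bullet reduces to "index first, project second". A dependency-free toy, with a tiny stand-in for the real `lm_head` linear layer:

```python
def project_masked_positions(hidden, loss_mask, lm_head):
    """Project only loss-masked positions through the head.

    hidden: T hidden vectors; loss_mask: T bools. Skipping unmasked
    positions is what avoids materializing a full (T, V) logit tensor.
    """
    return {t: lm_head(h) for t, (h, m) in enumerate(zip(hidden, loss_mask)) if m}

def toy_head(h):
    # Stand-in head over a vocab of 3: fixed linear functions of sum(h).
    s = sum(h)
    return [s, 2 * s, -s]

logits = project_masked_positions([[1.0], [2.0], [3.0]], [True, False, True], toy_head)
```

In the real model the same idea applies per batch: gather the masked `(N, d_model)` rows, multiply by the `lm_head` weight once, and scatter losses back, so padding positions never touch the 4,278-wide output.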