thomas-schweich/pretraining-ablations / CLAUDE.md

thomas-schweich

2 months ago


	# PAWN (Playstyle-Agnostic World-model Network for Chess)

	A causal transformer trained on random chess games, designed as a testbed for finetuning and augmentation methods at small scales. Apache 2.0.

	## Repository Structure

	```
	pawn/
	├── engine/                 # Rust chess engine with PyO3 bindings (via shakmaty)
	├── pawn/                   # Core Python package
	│   ├── config.py           # CLMConfig (small/base/large), TrainingConfig
	│   ├── model.py            # PAWNCLM transformer (RMSNorm, SwiGLU, RoPE, factored embeddings)
	│   ├── data.py             # On-the-fly random game data pipeline
	│   ├── lichess_data.py     # Lichess PGN data pipeline + legal mask computation
	│   ├── trainer.py          # Pretraining loop
	│   ├── gpu.py              # GPU auto-detection (compile/AMP/SDPA backend)
	│   ├── logging.py          # MetricsLogger (JSONL output)
	│   ├── checkpoint.py       # Atomic save/load, .complete sentinel, HF push
	│   ├── adapters/           # Bottleneck, LoRA, FiLM, sparse, hybrid
	│   ├── eval_suite/         # Probes, generation tests, diagnostics, lichess eval
	│   └── dashboard/          # Solara training dashboard (metrics, charts, runner)
	├── scripts/                # Training and evaluation entry points
	├── tests/                  # Unit tests
	├── deploy/                 # Runpod deployment scripts
	└── docs/                   # Architecture, training, adapter docs
	```

	## Building

	This is a uv workspace. The root project is the `pawn` Python package; `engine/` is the sole workspace member.

	```bash
	# Build the Rust chess engine (required before anything else)
	cd engine && uv run --with maturin maturin develop --release && cd ..

	# Install Python deps (dev tools like pytest, seaborn, solara are in base dependencies):
	uv sync --extra rocm # AMD (ROCm 7.1)
	uv sync --extra cu128 # NVIDIA (CUDA 12.8)

	# Run tests
	uv run pytest tests/

	# Pretrain from scratch (local dev)
	uv run python scripts/train.py --variant base --local-checkpoints
	```

	The only extras are GPU backends (`rocm` or `cu128`). Everything else (pytest, solara, optuna, seaborn, etc.) is in base dependencies. PyTorch lives in the extras because uv can't resolve CPU/CUDA/ROCm from a single lockfile — always specify `--extra rocm` or `--extra cu128`.

	GPU requirement: `configure_gpu()` (called by every training and eval script) raises `RuntimeError` if no CUDA/ROCm GPU is detected. This prevents accidentally running GPU workloads on CPU, which is almost always a mistake. The environment variable `PAWN_ALLOW_CPU=1` overrides this check as a last resort for the rare case where CPU execution is genuinely intended (e.g. a lightweight backfill script). Unit tests do not call `configure_gpu()` and run fine on CPU without the override.

	## Engine (`engine/`)

	Single source of truth for all chess logic. All game simulation, move generation, legality checks, tokenization, PGN parsing, and board state extraction happen in Rust. No Python chess libraries.

	- Uses rayon for parallel game generation (~43K games/sec, 150M+/hr)
	- PyO3 bindings expose `chess_engine` module to Python
	- Key functions: `generate_random_games()`, `parse_pgn_file()`, `compute_legal_token_masks_sparse()`, `extract_board_states()`, `export_move_vocabulary()`, `compute_accuracy_ceiling()`

	## Model

	### Architecture
	- Decoder-only transformer, next-token prediction over 4,278 tokens
	- Token vocabulary: 1 PAD + 4,096 grid (64x64 src/dst) + 176 promotions + 5 outcomes
	- Factored embeddings: `src_embed[s] + dst_embed[d] + promo_embed[p]`
	- Sequence format: `[outcome] [ply_1] ... [ply_N] [PAD] ... [PAD]` (256 tokens)
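
The vocabulary size can be checked by counting. The sketch below assumes (the text above does not say this) that the 176 promotion tokens are the geometrically possible promotion moves, a pawn stepping onto the last rank either straight ahead or by a diagonal capture, multiplied by four promotion pieces. The function name is illustrative and is not part of `chess_engine`.

```python
def count_promotion_moves() -> int:
    """Count (src, dst) square pairs on which a pawn can promote, for both colors."""
    pairs = 0
    for src_file in range(8):
        # A pawn promotes by moving straight ahead or capturing onto an adjacent file.
        for dst_file in (src_file - 1, src_file, src_file + 1):
            if 0 <= dst_file < 8:
                pairs += 1
    return pairs * 2  # white (rank 7 -> 8) and black (rank 2 -> 1)


PAD_TOKENS = 1
GRID_TOKENS = 64 * 64                            # every (src, dst) square pair
PROMOTION_TOKENS = count_promotion_moves() * 4   # queen, rook, bishop, knight
OUTCOME_TOKENS = 5

VOCAB_SIZE = PAD_TOKENS + GRID_TOKENS + PROMOTION_TOKENS + OUTCOME_TOKENS
print(VOCAB_SIZE)  # 4278
```

Under this assumption the count comes to 44 promotion moves and 176 promotion tokens, which matches the totals listed above.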

	### Variants
	- `CLMConfig.small()`: d=256, 8 layers, 4 heads, ~9.5M params
	- `CLMConfig.base()`: d=512, 8 layers, 8 heads, ~35.8M params (default)
	- `CLMConfig.large()`: d=640, 10 layers, 8 heads, ~68.4M params
	- `CLMConfig.toy()`: d=64, 2 layers, for tests only

	## Training

	All training scripts require one of `--hf-repo REPO_ID` or `--local-checkpoints` (mutually exclusive). Use `--local-checkpoints` for local dev; use `--hf-repo` for any run where you need durable checkpoints.
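
The "exactly one of" rule maps naturally onto an `argparse` mutually exclusive group. The scripts' actual parsers are not shown here, so treat this as a minimal sketch of one conventional way to enforce the rule; `build_parser` is a hypothetical helper name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="checkpoint storage flags (illustrative)")
    # Exactly one storage mode must be chosen; argparse rejects zero or both.
    storage = parser.add_mutually_exclusive_group(required=True)
    storage.add_argument("--hf-repo", metavar="REPO_ID", help="push checkpoints to HuggingFace")
    storage.add_argument("--local-checkpoints", action="store_true", help="save locally only")
    return parser


args = build_parser().parse_args(["--local-checkpoints"])
print(args.hf_repo, args.local_checkpoints)  # None True
```

Passing neither flag, or both, makes `parse_args` exit with a usage error before any training work starts.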

	### Pretraining

	```bash
	# Single model
	uv run python scripts/train.py --variant base --local-checkpoints

	# All three variants simultaneously (shared data batches, sequential GPU)
	uv run python scripts/train_all.py --local-checkpoints

	# Resume from checkpoint
	uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 --local-checkpoints
	```

	`scripts/train.py` key args:
	- `--variant {small|base|large|toy}` — model size (default: base)
	- `--resume PATH` — resume from checkpoint directory
	- `--total-steps N` — training steps (default: 100,000)
	- `--batch-size N` — batch size (default: 256)
	- `--discard-ply-limit` — only train on naturally-ended games (no ply-limit truncation)
	- Architecture overrides: `--d-model`, `--n-layers`, `--n-heads`, `--d-ff`, `--lr`, `--weight-decay`, `--warmup-steps`

	`scripts/train_all.py` additional args:
	- `--shm-checkpoints` — write checkpoints to `/dev/shm` (requires `--hf-repo`, volatile)
	- `--run-evals` — auto-run probes + diagnostics after training completes
	- `--publish-results` — push eval results to HF
	- `--patience N` — per-model early stopping patience (eval intervals without improvement)

	### Adapter Training

	All adapter scripts require `--checkpoint PATH` (pretrained weights) and `--pgn PATH` (Lichess PGN file). They freeze the backbone and train only adapter parameters.

	```bash
	# Example: train a LoRA adapter on Lichess 1800-1900 games
	uv run python scripts/train_lora.py \
	    --checkpoint thomas-schweich/pawn-base \
	    --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
	    --lora-rank 4 --lr 3e-4 --local-checkpoints
	```

	| Script | Adapter | Key args | Typical params |
	|--------|---------|----------|----------------|
	| `train_bottleneck.py` | Houlsby MLP | `--bottleneck-dim 8` | ~131K |
	| `train_lora.py` | Low-rank attention | `--lora-rank 4 --lora-targets qkvo` | ~65K |
	| `train_film.py` | Channel-wise affine | `--no-output-film` | ~17K |
	| `train_sparse.py` | Binary mask | `--density 0.01 --sparse-targets qkvo` | ~503K-2.7M |
	| `train_hybrid.py` | LoRA + FiLM | `--lora-rank 4 --film-lr 1e-3` | ~65K |
	| `train_tiny.py` | None (from scratch) | `--d-model 84 --n-layers 2` | ~524K |

	Common adapter args: `--epochs 50`, `--batch-size 64`, `--lr 3e-4`, `--patience 10`, `--val-every 1`, `--max-games 12000`, `--min-ply 10`

	### Common CLI Patterns

	- `--sdpa-math` — force MATH SDPA backend (required for ROCm + torch.compile)
	- `--no-compile` — disable torch.compile
	- `--no-amp` — disable mixed precision
	- `--num-workers N` — DataLoader workers (default: 8 for adapters, 4 for pretraining)
	- `--device {cuda|cpu}` — device selection
	- `--wandb` — enable Weights & Biases logging

	## Evaluation & Metrics

	### Linear Probes

	```bash
	uv run python scripts/eval_probes.py --log-dir logs --device cuda
	```

	Trains linear probes on frozen hidden states to measure internal representations (piece type, check status, castling rights, material count, game phase, etc.). Args: `--n-games 4096`, `--n-val-games 1024`, `--n-epochs 20`, `--run RUN_NAME` (specific run).

	### Move Prediction Accuracy

	```bash
	uv run python scripts/eval_accuracy.py \
	    --checkpoint thomas-schweich/pawn-base \
	    --pgn thomas-schweich/pawn-lichess-full --elo-min 1800 --elo-max 1900 \
	    --adapter-checkpoint logs/run_*/checkpoints/best
	```

	MAIA-compatible evaluation with per-phase and per-ply accuracy. Args: `--min-eval-ply 10`, `--max-games 50000`, `--per-ply`.

	### Theoretical Accuracy Ceilings

	```bash
	uv run python scripts/compute_theoretical_ceiling.py
	```

	Computes theoretical accuracy ceilings for random games via Monte Carlo rollouts: unconditional (E[1/N_legal]), naive-conditioned (1-ply filter), and MC-conditioned (Bayes-optimal with outcome knowledge). Reports a bias bracket (naive vs split-half corrected estimates) and bootstrap 95% CIs clustered by game. CPU-intensive.
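
The unconditional term is the simplest of the three: if a move is drawn uniformly from `N_legal` legal moves, no predictor can beat `1/N_legal` at that position, so the ceiling is the mean of that quantity over positions. The sketch below illustrates only this term; it is not the script's implementation, and the input counts are illustrative.

```python
from statistics import mean


def unconditional_ceiling(legal_move_counts: list[int]) -> float:
    """Mean of 1/N_legal over positions: the best expected top-1 accuracy
    against a player who picks uniformly among N_legal legal moves."""
    return mean(1.0 / n for n in legal_move_counts)


# Illustrative input only: three positions with 20, 20, and 40 legal moves.
print(round(unconditional_ceiling([20, 20, 40]), 4))  # 0.0417
```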

	### Export to HuggingFace

	```bash
	uv run python scripts/export_hf_repo.py --run-dir logs/run_YYYYMMDD_HHMMSS
	```

	Converts a training run to HuggingFace repo format (safetensors + metrics). Finds best checkpoint by val loss.

	## Checkpoints

	Pre-trained weights are hosted on HuggingFace and loaded directly by repo ID:
	- `thomas-schweich/pawn-small` — 9.5M params, `CLMConfig.small()`
	- `thomas-schweich/pawn-base` — 35.8M params, `CLMConfig.base()`
	- `thomas-schweich/pawn-large` — 68.4M params, `CLMConfig.large()`

	All scripts accept HF repo IDs for `--checkpoint` (e.g. `--checkpoint thomas-schweich/pawn-base`). Weights are downloaded and cached automatically via `huggingface_hub`.

	### Checkpoint Format (safetensors)

	Checkpoints are directories, not single files:
	```
	step_00065000/
	├── model.safetensors # model weights
	├── optimizer.safetensors # flattened optimizer state
	├── training_state.json # step, scheduler, scaler, RNG (base64)
	├── config.json # model + training config
	└── .complete # SHA-256 hashes of all files (integrity sentinel)
	```

	Central module: `pawn/checkpoint.py`. All save/load goes through this module.
	Legacy `.pt` files are still loadable (backward compatible).

	### Checkpoint Storage Modes

	All training scripts require one of:
	- `--hf-repo REPO_ID` — push checkpoints to a HuggingFace branch as they're written (durable)
	- `--local-checkpoints` — save locally only (for development without an HF account)

	HF mode creates a `run/{run_id}` branch. HF pushes happen in background threads (one per model slot) so training is not blocked by uploads. Squash-merge into main when satisfied.

	Optional: `--shm-checkpoints` writes checkpoints to `/dev/shm` (RAM-backed filesystem, instant writes). Requires `--hf-repo` since `/dev/shm` is volatile. Old checkpoints are cleaned up after successful HF push, keeping only the latest and the best (by val loss) for post-training evals.

	### Data Integrity

	Every checkpoint write is atomic: files are written to a `.tmp` directory, then renamed.
	The `.complete` sentinel contains SHA-256 hashes of every file in the checkpoint.
	Hashes are always verified on load — no exceptions.

	- `IncompleteCheckpointError` — raised when `.complete` sentinel is missing
	- `CheckpointIntegrityError` — raised when any hash mismatches
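
The write-then-rename pattern and the hash check can be sketched with the standard library alone. This is not the code in `pawn/checkpoint.py`: the exception classes are redefined locally for illustration, and the sentinel layout (a JSON object mapping file name to SHA-256 digest) is an assumption.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


class IncompleteCheckpointError(Exception):
    """Raised when the .complete sentinel is missing."""


class CheckpointIntegrityError(Exception):
    """Raised when a file's hash does not match the sentinel."""


def _sha256(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()


def save_checkpoint(final_dir: Path, files: dict[str, bytes]) -> None:
    """Write everything into a sibling .tmp directory, then rename into place."""
    tmp_dir = final_dir.with_name(final_dir.name + ".tmp")
    tmp_dir.mkdir(parents=True)
    for name, payload in files.items():
        (tmp_dir / name).write_bytes(payload)
    hashes = {name: _sha256(tmp_dir / name) for name in files}
    # Assumed sentinel layout: JSON object mapping file name -> SHA-256 hex digest.
    (tmp_dir / ".complete").write_text(json.dumps(hashes))
    os.rename(tmp_dir, final_dir)  # atomic on POSIX when both paths share a filesystem


def verify_checkpoint(ckpt_dir: Path) -> None:
    """Raise if the sentinel is missing or any recorded hash does not match."""
    sentinel = ckpt_dir / ".complete"
    if not sentinel.exists():
        raise IncompleteCheckpointError(ckpt_dir)
    for name, expected in json.loads(sentinel.read_text()).items():
        if _sha256(ckpt_dir / name) != expected:
            raise CheckpointIntegrityError(name)


with tempfile.TemporaryDirectory() as root:
    ckpt = Path(root) / "step_00000001"
    save_checkpoint(ckpt, {"model.safetensors": b"weights"})
    verify_checkpoint(ckpt)  # passes silently
```

Because the sentinel is written last and the directory only appears under its final name after the rename, a reader sees either a complete checkpoint or none at all.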

	Never use `kill -9` on training processes. SIGTERM is handled gracefully: a flag is set,
	the training loop checks it between steps, saves a checkpoint, pushes to HF, and exits cleanly.
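
The flag-based shutdown described above looks roughly like the following. It is a sketch, not the trainer's code: `run_step` and `save_checkpoint` are hypothetical callbacks standing in for the real step function and the save-and-push logic.

```python
import signal

stop_requested = False


def _on_sigterm(signum, frame):
    # Only record the request; the training loop decides when it is safe to stop.
    global stop_requested
    stop_requested = True


signal.signal(signal.SIGTERM, _on_sigterm)


def train(total_steps, run_step, save_checkpoint):
    """Run steps until done or until SIGTERM is seen; returns the last completed step."""
    step = 0
    while step < total_steps:
        run_step(step)
        step += 1
        if stop_requested:  # checked between steps, never mid-step
            save_checkpoint(step)
            break
    return step
```

`kill -9` sends SIGKILL, which cannot be caught, so the handler never runs and no final checkpoint is written.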

	Never rsync checkpoint files from running pods. Checkpoints are pushed to HuggingFace
	from the trainer. Load via HF repo ID (e.g. `--checkpoint thomas-schweich/pawn-base`).

	## RunPod Operations

	### Docker Image

	A single Docker image (`thomasschweich/pawn:latest`) is automatically built and pushed to Docker Hub by CI on every merge to main. No manual builds needed.

	The image is based on `runpod/pytorch` (CUDA + SSH + Jupyter) with all Python deps pre-installed. Code lives at `/opt/pawn` on pods. SSH in and run experiments directly.

	To build locally (rarely needed):
	```bash
	docker build --platform linux/amd64 \
	    --build-arg GIT_HASH=$(git rev-parse HEAD) \
	    -t thomasschweich/pawn:latest .
	```

	### Pod Lifecycle

	Use `deploy/pod.sh` for all pod management. Requires `runpodctl` (`wget -qO- cli.runpod.net | sudo bash`).

	```bash
	# Create a pod
	bash deploy/pod.sh create myexp --gpu h100

	# SSH into it
	bash deploy/pod.sh ssh myexp

	# Launch training
	bash deploy/pod.sh launch myexp scripts/train_all.py --hf-repo thomas-schweich/pawn-{variant}

	# Stop (preserves volume, stops billing)
	bash deploy/pod.sh stop myexp

	# Delete (destroys everything)
	bash deploy/pod.sh delete myexp
	```

	GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`. Pod configs are cached in `~/.config/pawn/pods/<name>.env`.

	### GPU Selection

	Benchmarks from pretraining 3 models concurrently (`train_all.py`, batch=256):

	| GPU | VRAM | $/hr | Step time | 100K cost | Notes |
	|-----|------|------|-----------|-----------|-------|
	| B200 | 192GB | $4.99 | 0.28s | ~$39 | Fastest |
	| H200 SXM | 141GB | $3.59 | 0.34s | ~$34 | Best wall-clock/cost balance |
	| RTX PRO 6000 | 48GB | $1.89 | 0.62s | ~$33 | Cheapest viable |
	| A100 PCIe | 80GB | $1.39 | 0.79s | ~$30 | Cheapest overall |
	| L40S | 48GB | $0.86 | 1.37s | ~$33 | Slow but cheap |
	| RTX 5090/4090/3090 | 24-32GB | — | OOM | — | Insufficient VRAM for 3 models |

	Total cost is remarkably consistent ($30-39) across viable GPUs, so the real trade-off is wall-clock time rather than price: a faster GPU finishes sooner for roughly the same total spend. Single-model training fits on 24GB GPUs.
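
The "100K cost" column follows directly from the hourly rate and the step time. A short check, using only the figures in the table (the function name is illustrative):

```python
# (GPU, $/hr, seconds per step) copied from the benchmark table.
BENCHMARKS = [
    ("B200", 4.99, 0.28),
    ("H200 SXM", 3.59, 0.34),
    ("RTX PRO 6000", 1.89, 0.62),
    ("A100 PCIe", 1.39, 0.79),
    ("L40S", 0.86, 1.37),
]


def run_cost(dollars_per_hour: float, step_seconds: float, steps: int = 100_000) -> tuple[float, float]:
    """Return (wall-clock hours, total dollars) for a run of `steps` steps."""
    hours = steps * step_seconds / 3600
    return hours, hours * dollars_per_hour


for name, rate, step_s in BENCHMARKS:
    hours, cost = run_cost(rate, step_s)
    print(f"{name:<13} {hours:5.1f} h  ${cost:5.2f}")
```

Wall-clock time spans roughly 8 to 38 hours across these GPUs while the computed cost stays between about $30 and $39.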

	### Required Pod Configuration

	- Always attach a network volume. Checkpoints write to disk during atomic rename and HF push. Ephemeral container disk is lost on pod termination.
	- Set `HF_TOKEN` as a pod environment variable for automatic HuggingFace authentication. The entrypoint persists it to `~/.cache/huggingface/token`.
	- `PAWN_MODEL=thomas-schweich/pawn-base` — auto-pull a checkpoint on startup (runner target).
	- `PAWN_CMD` — training command to execute (alternative to Docker CMD args).

	### Pod Safety

	- Stop pods with `runpodctl pod stop` or `bash deploy/pod.sh stop` — sends SIGTERM, trainer saves and pushes before exiting.
	- Never `runpodctl pod delete` while training is running — data loss risk.
	- Never `kill -9` training processes — use SIGTERM (plain `kill`), which triggers graceful shutdown.
	- Never rsync checkpoint files from running pods — load via HF repo ID instead.

	### HuggingFace Bucket Backups

	Use HF buckets (`hf://buckets/...`) to back up experiment data from pods. Buckets are not datasets or repos — they use the `sync_bucket` API, not `upload_file` with `repo_type`.

	Key constraint: upload bandwidth from pods is ~1.8 MB/s. Checkpoints are ~430MB each (143MB model + 287MB optimizer). A full training run with 5K-step checkpoint intervals produces ~17GB per trial. Sync selectively.
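
To see why selective syncing matters, the upload time can be estimated from the figures above. The sketch assumes a 200K-step run (as in the ablation task later in this document), which gives the 40 checkpoints behind the ~17GB figure.

```python
UPLOAD_MB_PER_S = 1.8   # observed pod upload bandwidth
CHECKPOINT_MB = 430     # 143 MB model + 287 MB optimizer


def upload_minutes(size_mb: float, mb_per_s: float = UPLOAD_MB_PER_S) -> float:
    """Minutes needed to upload `size_mb` megabytes at `mb_per_s`."""
    return size_mb / mb_per_s / 60


# Assumption: 200K steps saved every 5K steps -> 40 checkpoints per trial.
checkpoints_per_run = 200_000 // 5_000
full_run_mb = checkpoints_per_run * CHECKPOINT_MB   # 17,200 MB, about 17 GB

print(f"one checkpoint: {upload_minutes(CHECKPOINT_MB):.1f} min")    # 4.0 min
print(f"full run:       {upload_minutes(full_run_mb) / 60:.1f} h")   # 2.7 h
```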

	During training — sync metrics only (instant):
	```python
	import os
	from huggingface_hub import HfApi

	# Read the token from the environment (export HF_TOKEN before running)
	api = HfApi(token=os.environ["HF_TOKEN"])
	api.sync_bucket(
	    source="/workspace/logs",
	    dest="hf://buckets/OWNER/BUCKET/logs",
	    exclude=["/checkpoints/"],
	)
	```

	For individual files (lab notes, transcripts) — stage in a temp dir:
	```python
	import tempfile, shutil, os
	# `api` is the HfApi client created in the metrics-sync snippet
	with tempfile.TemporaryDirectory() as td:
	    shutil.copy("/workspace/runs/lab-notes.md", os.path.join(td, "lab-notes.md"))
	    api.sync_bucket(source=td, dest="hf://buckets/OWNER/BUCKET")
	```

	`upload_file(..., repo_type="bucket")` does not work — buckets are not a valid repo type for that API. Always use `sync_bucket`.

	After training — sync only best/final checkpoints:
	```python
	api.sync_bucket(
	    source="/workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST",
	    dest="hf://buckets/OWNER/BUCKET/checkpoints/trial_XXXX/best",
	)
	```

	HF_TOKEN: Stored in `/opt/pawn/.env` on pods. Source it or `export` before calling the API.

	## Monitoring Training Progress

	### Key Principle: Write Scripts to Disk for Pre-Approval

	When setting up recurring monitoring, always write the monitoring script to a file first so the user can review and pre-approve it. This avoids repeated permission prompts when `/loop` fires.

	Pattern:
	1. Write a bash script to disk (e.g., `scripts/check_my_run.sh`)
	2. User reviews and approves the script
	3. Schedule with `/loop 15m bash scripts/check_my_run.sh`

	Example monitoring script:

	```bash
	#!/usr/bin/env bash
	# scripts/check_my_run.sh — monitor a specific training run
	set -euo pipefail
	bash /home/tas/pawn/scripts/monitor_training.sh <POD_ID>
	```

	Or for local-only monitoring:

	```bash
	#!/usr/bin/env bash
	set -euo pipefail
	bash /home/tas/pawn/scripts/check_progress.sh --sync
	```

	### Available Monitoring Tools

	| Tool | What it does |
	|------|-------------|
	| `scripts/monitor_training.sh [POD_ID]` | SSH to pod, sync metrics via rsync, show per-variant step/loss/acc/ETA, check HF checkpoint branches |
	| `scripts/check_progress.sh [LOG_DIR]` | Show progress from local `logs/` directory |
	| `python -m pawn.dashboard --log-dir logs` | Solara web dashboard with interactive charts |

	### Dashboard

	```bash
	python -m pawn.dashboard --log-dir logs
	```

	Reads `metrics.jsonl` files, no dependency on training packages. Auto-detects run type from config fields. Shows loss curves, accuracy, LR schedules, GPU utilization, patience clocks, and adapter-specific diagnostics. Requires restart for code changes (no hot reload).

	## Logs

	Training metrics in `logs/` (gitignored). Each run gets a timestamped directory with `metrics.jsonl` and a random slug (e.g., `run_20260325_140000_zesty-osprey/`).

	`MetricsLogger` (`pawn/logging.py`) writes one JSON object per line. Every record includes timestamp, step, elapsed time, and memory stats. Config records include hostname, git hash, git tag, and run slug.
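
Because each line is an independent JSON object, a `metrics.jsonl` file can be read while training is still running. The helpers below are a sketch with hypothetical names, not part of `pawn/logging.py`; the exact record keys are not listed above, so any key passed to `latest` is an assumption about the schema.

```python
import json
from pathlib import Path


def read_metrics(path: Path) -> list[dict]:
    """Parse a JSONL file, skipping blank lines and a possibly half-written last line."""
    records = []
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # the trainer may still be writing this line
    return records


def latest(records: list[dict], key: str):
    """Most recent record that contains `key`, or None if no record has it."""
    for record in reversed(records):
        if key in record:
            return record
    return None
```

For example, `latest(read_metrics(path), "val_loss")` would return the newest evaluation record, provided the logger uses a key with that name.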

	## Hyperparameter Sweeps

	Optuna integration via `pawn/sweep.py` and `scripts/sweep.py`:

	```bash
	uv run python scripts/sweep.py \
	    --adapter lora --n-trials 30 --n-jobs 2 --n-gpus 2 \
	    --total-steps 20000 --pruner hyperband \
	    --checkpoint thomas-schweich/pawn-base --pgn thomas-schweich/pawn-lichess-full \
	    --local-checkpoints
	```

	Supports all adapter types + architecture search. GPU affinity assigns `CUDA_VISIBLE_DEVICES = trial.number % n_gpus`. SQLite-backed study persistence. Pruner options: `hyperband`, `median`, `none`.
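
The GPU affinity rule is a plain round-robin over trial numbers. A minimal sketch, independent of Optuna (the trial number is passed as an integer, and both function names are hypothetical):

```python
import os


def gpu_for_trial(trial_number: int, n_gpus: int) -> int:
    """Round-robin assignment: trial k lands on GPU k mod n_gpus."""
    return trial_number % n_gpus


def trial_env(trial_number: int, n_gpus: int) -> dict[str, str]:
    """Environment for a trial subprocess, pinned to a single visible GPU."""
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_for_trial(trial_number, n_gpus))
    return env


print([gpu_for_trial(k, 2) for k in range(5)])  # [0, 1, 0, 1, 0]
```

Note that a rule based only on the trial number does not look at which GPU is free, so two concurrent trials can share a GPU if trials finish out of order.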

	## Key Patterns & Gotchas

	- DataLoader workers must use `multiprocessing_context='spawn'` — the Rust engine uses rayon, and fork after rayon init causes deadlocks.
	- `SDPA_BACKEND` must be set before `torch.compile()` — compiled code captures the backend at trace time. `apply_gpu_config()` handles this.
	- ROCm works: The only known ROCm issue is a stride mismatch in flash attention backward when combined with `torch.compile` + AMP. The workaround is `--sdpa-math` (use the MATH SDPA backend instead of flash), which `configure_gpu()` applies automatically on AMD GPUs. Everything else — training, eval, adapters, data loading — works identically on ROCm and CUDA. Do not assume bugs are ROCm-specific. Every other time something has failed on AMD it turned out to be a bug in our code (wrong torch version installed, stale lockfile, missing dependency, etc.), not a ROCm issue.
	- Sparse logit projection: `forward_hidden()` returns `(B,T,d_model)`, then only loss-masked positions project through `lm_head` — avoids full `(B,T,V)` materialization.
	- Legal mask via Rust: `LegalMaskBuilder` replays games in Rust, returns sparse indices (~2 MB) scattered into a pre-allocated GPU buffer (vs ~70 MB dense).
	- GPU auto-detection: `pawn.gpu.configure_gpu()` selects compile/AMP/SDPA settings. `apply_gpu_config()` applies them. NVIDIA uses flash attention + compile; AMD uses MATH SDPA + compile. Both paths are tested and production-validated.
	- Factored embeddings: each move token decomposes into `src_embed[s] + dst_embed[d] + promo_embed[p]`, reducing embedding parameters by ~32x.
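
Two of the figures in this list can be reproduced with back-of-envelope arithmetic. Both calculations rest on assumptions that the list does not state: that `promo_embed` has 5 rows (no promotion plus four pieces), and that a dense legal mask costs one byte per entry at the adapter default batch size of 64.

```python
VOCAB_SIZE = 4278
D_MODEL = 512  # CLMConfig.base()

# Assumption: promo_embed has 5 rows (no promotion + queen, rook, bishop, knight).
FACTORED_ROWS = 64 + 64 + 5

full_params = VOCAB_SIZE * D_MODEL          # one embedding row per token
factored_params = FACTORED_ROWS * D_MODEL   # src + dst + promo tables
ratio = full_params / factored_params
print(f"{ratio:.1f}x fewer embedding parameters")  # 32.2x

# Assumption: one byte per (batch, position, token) entry,
# batch size 64 and 256-token sequences.
dense_mask_mb = 64 * 256 * VOCAB_SIZE / 1e6
print(f"dense legal mask: {dense_mask_mb:.0f} MB")  # 70 MB
```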


	---

	# Current task

	# Pretraining Ablations: mate-boost, no-outcome, discard-ply-limit

	You are running three pretraining ablation experiments on PAWN-Base. Each ablation modifies one aspect of the random game generation to understand its effect on the learned world model.

	You are on a RunPod with 3x RTX 6000 Ada (48GB VRAM each, $0.77/hr each = $2.31/hr total). Use the pawn-lab MCP server and the manage-pod skill workflow.

	As your first action, set up the check-in crons per the manage-pod skill, then proceed.

	## The Three Ablations

	| Ablation | Config | Hypothesis |
	|----------|--------|------------|
	| mate-boost | `mate_boost: 1.0` | Always taking mate-in-1 produces shorter, more decisive games (~134 avg ply vs ~238). More checkmate patterns, fewer aimless endgames. |
	| no-outcome | `no_outcome_token: true` | Stripping the outcome token forces the model to infer game result from moves alone. Tests whether outcome conditioning helps or hurts. |
	| discard-ply-limit | `discard_ply_limit: true` | Only naturally-ended games (no truncation at 255 plies). All games have meaningful endings. |

	There is no separate baseline run — the existing PAWN-Base checkpoint (`thomas-schweich/pawn-base`) trained at 100K steps is the baseline reference.

	## Unified Config

	All training uses the unified `scripts/train.py` with Pydantic `RunConfig`. Call `lab_schema` to see all available fields. Configs are JSON dicts passed to `lab_launch`.

	Base config shared by all ablations:
	```json
	{
	  "run_type": "pretrain",
	  "variant": "base",
	  "total_steps": 200000,
	  "batch_size": 256,
	  "lr": 3e-4,
	  "local_checkpoints": true,
	  "amp_dtype": "bfloat16",
	  "eval_interval": 5000,
	  "num_workers": 4
	}
	```

	Each ablation adds its flag on top. Example for mate-boost:
	```json
	{
	  "run_type": "pretrain",
	  "variant": "base",
	  "mate_boost": 1.0,
	  "total_steps": 200000,
	  "batch_size": 256,
	  "lr": 3e-4,
	  "local_checkpoints": true,
	  "amp_dtype": "bfloat16",
	  "eval_interval": 5000,
	  "num_workers": 4
	}
	```

	## Procedure

	### Phase 1: Learning Rate Exploration (~1 hour)

	For each of the 3 ablations, launch 3 trials with different learning rates using `pause_after_steps`:

	```
	lab_launch(config={
	  "run_type": "pretrain",
	  "variant": "base",
	  "mate_boost": 1.0,           # ← ablation flag
	  "total_steps": 200000,
	  "pause_after_steps": 5000,   # ← pause for comparison
	  "batch_size": 256,
	  "lr": 1e-4,                  # ← vary this: 1e-4, 3e-4, 1e-3
	  "local_checkpoints": true,
	  "amp_dtype": "bfloat16",
	  "no_compile": true,          # ← skip compile for short exploration
	  "eval_interval": 2500,
	  "num_workers": 4
	})
	```

	That's 9 trials total (3 LRs × 3 ablations). With 3 GPUs, run 3 trials concurrently (one per GPU). Run each ablation's 3 LRs sequentially on its GPU, or interleave — the lab runner assigns GPUs automatically. Each pauses at 5K steps. Use `no_compile: true` to skip the 15-30 min compile overhead for these short runs.

	After all 9 pause, call `lab_results` to compare val_loss at 5K steps. Pick the best LR for each ablation.

	### Phase 2: Full Training (~22 hours)

	Resume the best LR for each ablation to 200K steps with torch.compile enabled:

	```
	lab_resume(trial_id=BEST_MATE_BOOST) # no pause → runs to total_steps
	lab_resume(trial_id=BEST_NO_OUTCOME)
	lab_resume(trial_id=BEST_DISCARD_PLY)
	```

	`lab_resume` clears `pause_after_steps` by default, so the resumed trials run to completion at `total_steps=200000`. The resumed trials pick up from the 5K-step checkpoint with optimizer state intact.

	Important: Do NOT pass `no_compile: true` for the resumed runs. Let torch.compile run — the 15-30 min compile overhead amortizes over 195K remaining steps.

	With 1 model per GPU (no sharing), expect ~0.4s/step → ~22 hours for 195K steps. All 3 run in parallel. Total cost: ~$51.
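
The time and cost estimate is straightforward arithmetic on the figures above; the same helper can be reused during check-ins to refresh the ETA from the current step. The function name is illustrative.

```python
def remaining_hours(steps_left: int, seconds_per_step: float) -> float:
    """Wall-clock hours needed to finish `steps_left` steps."""
    return steps_left * seconds_per_step / 3600


POD_DOLLARS_PER_HOUR = 3 * 0.77  # three GPUs billed together

hours = remaining_hours(195_000, 0.4)
cost = hours * POD_DOLLARS_PER_HOUR
print(f"{hours:.1f} h, ${cost:.0f}")  # 21.7 h, $50
```

The three runs execute in parallel, so the wall-clock time is that of a single run; rounding up to 22 hours gives the ~$51 quoted above.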

	### Phase 3: Evaluation

	After all 3 runs converge, compare:
	- Final val_loss and accuracy vs baseline (PAWN-Base at 100K steps)
	- Learning curves (loss vs step for all 3, plotted together)
	- Per-ply accuracy via `scripts/eval_accuracy.py` if time permits

	## Monitoring

	Use the manage-pod skill check-in pattern:
	- Mini check-in (every 5 min): `lab_events` — launch next trial if GPU idle
	- Full check-in (hourly): `lab_status` + `lab_log` for each running trial + update lab notes

	During Phase 1 (9 short trials), use 5-min check-ins — trials finish fast.
	During Phase 2 (3 long runs), switch to hourly after confirming stability.

	### What to watch for

	- NaN loss: Kill immediately. Try lr/3. Use `lab_log` to check.
	- discard-ply-limit throughput: This ablation discards ~60% of generated games, so the data pipeline must generate roughly 2.5x as many games per batch. If step times are much slower than the other two, increase `num_workers` so data generation keeps up with the GPU.
	- mate-boost game lengths: Check `lab_log` — mate-boost games are ~134 plies vs ~238 normal. Each batch processes more games, which may cause higher memory usage. If OOM, reduce batch_size to 128.
	- GPU assignment: The lab runner assigns one trial per GPU via CUDA_VISIBLE_DEVICES. If a trial fails and a GPU sits idle, launch a replacement promptly.

	## Cost Budget

	| Phase | Trials | Time | Cost |
	|-------|--------|------|------|
	| LR exploration | 9 × 5K steps | ~1h | ~$2 |
	| Full training | 3 × 195K steps (parallel) | ~22h | ~$51 |
	| Total | | ~23h | ~$53 |

	Set cost tracking: `lab_set_cost(cost_per_hour=2.31)`

	## Lab Notes

	Write experiment state to `runs/lab-notes.md` (survives context compaction via PostCompact hook). Include:
	- Phase 1 results table (LR × ablation → val_loss at 5K steps)
	- Which LR was selected for each ablation and why
	- Phase 2 progress (step, loss, ETA per ablation)
	- Any anomalies or adjustments made

	## Current State (as of 2026-04-03 19:30 UTC)

	### Phase 2: Full Training (RUNNING)

	Three ablation runs training to 200K steps, torch.compile ON, resumed from 5K:

	| Trial | GPU | Ablation | LR | Step (~19:00) | train_loss | acc |
	|-------|-----|----------|-----|---------------|-----------|------|
	| 24 | 0 | mate-boost | 1e-3 | ~120K | 3.19 | 6.8% |
	| 25 | 1 | no-outcome | 3e-4 (no warmup) | ~116K | 3.09 | 7.0% |
	| 26 | 2 | discard-ply-limit | 1e-3 | ~119K | 3.18 | 7.5% |

	- ETA: ~2026-04-04 04:30 UTC (~9h remaining)
	- Hourly cron monitors via `lab_events` + `lab_log`
	- Baseline ref: PAWN-Base val_loss ~3.06, acc ~6.9% at 100K steps

	### Concurrent: Accuracy Ceiling Computation (RUNNING)

	Running `scripts/compute_theoretical_ceiling.py` on spare CPUs (RAYON_NUM_THREADS=30) while GPUs train:

	```bash
	RAYON_NUM_THREADS=30 PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python \
	    scripts/compute_theoretical_ceiling.py \
	    --n-games 5000 --rollouts 1024 --sample-rate 0.075 \
	    --model-accuracy 0.069 --output /workspace/data/theoretical_ceiling_1024.json
	```

	- 1024 rollouts/move (8x previous), expected to narrow bias bracket from 0.66pp to ~0.23pp
	- Processes in 10 batches of 500 games, printing intermediate estimates
	- ETA: ~3-4 hours (finishes well before training)
	- Output: `/workspace/data/theoretical_ceiling_1024.json`
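
The notes above do not say how the ~0.23pp expectation was derived. It is consistent with the bracket shrinking in proportion to one over the square root of the rollout count, which is the assumption made in this sketch.

```python
import math


def scaled_bracket(bracket_pp: float, rollout_factor: float) -> float:
    """Bracket width after multiplying the rollout count, assuming ~1/sqrt(n) scaling."""
    return bracket_pp / math.sqrt(rollout_factor)


print(f"{scaled_bracket(0.66, 8):.2f} pp")  # 0.23 pp
```

If the measured bracket at 1024 rollouts comes out far from this value, the square-root assumption does not describe the estimator's bias and should be revisited.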

	### Bug Fixes Applied This Session

	- `pause_after_steps`: Was missing from `trainer.py` (only in adapter training). Added to `TrainingConfig` and `trainer.py` train loop. Wired through `scripts/train.py`.
	- `lab_resume` checkpoint discovery: Was looking for `checkpoints/best` or `checkpoints/final` only. Pretraining uses `step_XXXXXXXX` naming. Added fallback to pick highest `step_*` dir.

	### HF Bucket Backup

	All experiment data syncs to `hf://buckets/thomas-schweich/pretraining-ablations`:
	- Metrics/configs synced every ~4 hours (excluding checkpoints)
	- 100K-step checkpoints for all 3 ablations uploaded
	- Lab notes + chat transcript synced periodically
	- HF_TOKEN in `/opt/pawn/.env`
	- See "HuggingFace Bucket Backups" section above for API patterns

	### Lab Notes

	Detailed log at `/workspace/runs/lab-notes.md` (symlinked from `/opt/pawn/runs/lab-notes.md`). Survives context compaction via PostCompact hook. Read this first when resuming.

	## Post-Training Procedure

	When each training run completes (in any order):

	1. Upload best checkpoint to HF bucket in the background (see bucket sync patterns above)
	2. Run linear probes on the best checkpoint in the background:
	```bash
	PAWN_ALLOW_CPU=1 /opt/pawn/.venv/bin/python scripts/eval_probes.py \
	--checkpoint /workspace/logs/trial_XXXX/run_.../checkpoints/step_BEST \
	--log-dir /workspace/eval_probes/trial_XXXX --device cuda:N
	```
	Use the GPU that just freed up. Run in background so other completions aren't blocked.
	3. After the last probe run is kicked off, draft the final report comparing all 3 ablations + baseline:
	- val_loss and accuracy curves
	- Probe results (piece type, check, castling, material, game phase)
	- Accuracy ceiling comparison
	- Key findings and surprises
	4. Update the report with probe results as they finish
	5. Upload everything to HF bucket: final checkpoints, probe results, report, lab notes

	## Success Criteria

	1. All three ablations train to 200K steps without divergence
	2. Clear comparison of val_loss and accuracy across ablations
	3. At least one ablation shows meaningful difference from baseline (better or worse — both are informative)
