# Training Guide
## Prerequisites
- **Rust** (stable) -- required to build the chess engine native extension
- **uv** -- Python package manager ([install](https://docs.astral.sh/uv/getting-started/installation/))
- **GPU** with ROCm (AMD) or CUDA (NVIDIA). CPU works only for `--variant toy`
## Installation
```bash
# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..
# Install Python dependencies
uv sync --extra rocm # AMD GPUs (ROCm)
uv sync --extra cu128 # NVIDIA GPUs (CUDA 12.8)
```
Verify the install:
```bash
uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
## Pretraining from Scratch
PAWN pretrains on random chess games generated on the fly by the Rust engine. No external datasets are needed.
```bash
uv run python scripts/train.py --variant base
```
### Model variants
| Variant | Params | d_model | Layers | Heads | d_ff |
|---------|--------|---------|--------|-------|------|
| `small` | ~9.5M | 256 | 8 | 4 | 1024 |
| `base` | ~36M | 512 | 8 | 8 | 2048 |
| `large` | ~68M | 640 | 10 | 8 | 2560 |
| `toy` | tiny | 64 | 2 | 4 | 256 |
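As a rough sanity check, the parameter counts scale with the table's dimensions. The sketch below counts only the transformer-block weights (attention projections plus FFN matrices) using a generic transformer estimate, not PAWN's exact architecture; it ignores embeddings, output heads, norms, and biases, so it deliberately undercounts the totals above:

```python
def transformer_block_params(d_model: int, n_layers: int, d_ff: int) -> int:
    """Approximate weight count of the transformer blocks alone."""
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff      # up- and down-projection matrices
    return n_layers * (attn + ffn)

variants = {"small": (256, 8, 1024), "base": (512, 8, 2048), "large": (640, 10, 2560)}
for name, (d, n, f) in variants.items():
    print(f"{name}: ~{transformer_block_params(d, n, f) / 1e6:.1f}M block params")
```

This yields ~6.3M, ~25.2M, and ~49.2M for the block weights alone; the remaining gap to the table's totals is roughly the embedding and head parameters.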
### Default training configuration
- **Total steps**: 100,000
- **Batch size**: 256
- **Optimizer**: [AdamW](https://arxiv.org/abs/1711.05101) (Loshchilov & Hutter, 2017) (lr=3e-4, weight_decay=0.01)
- **LR schedule**: [cosine decay](https://arxiv.org/abs/1608.03983) (Loshchilov & Hutter, 2016) with 1,000-step warmup
- **Mixed precision**: fp16 [AMP](https://arxiv.org/abs/1710.03740) (Micikevicius et al., 2017) (auto-detected)
- **Checkpoints**: saved every 5,000 steps to `checkpoints/`
- **Eval**: every 500 steps on 512 held-out random games
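The schedule above (linear warmup for the first 1,000 steps, then cosine decay over the rest) can be written in a few lines; the zero LR floor is an assumption, as the actual implementation lives in `scripts/train.py`:

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1000, total=100_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500))      # mid-warmup: half of base_lr
print(lr_at(1000))     # peak: base_lr
print(lr_at(100_000))  # final step: min_lr
```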
### Common overrides
```bash
# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000
# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000
# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4
# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
```
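Gradient accumulation sums gradients over `accumulation_steps` micro-batches before each optimizer update, so memory usage stays at the micro-batch size while the update sees the full effective batch. A scalar toy (not the real loop in `scripts/train.py`) showing the equivalence:

```python
def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

w, lr, accum = 0.0, 0.01, 4
micro_batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Accumulate: sum per-micro-batch gradients, divide by accum, then one step.
g = sum(grad(w, x, y) for x, y in micro_batches) / accum
w_accum = w - lr * g

# Equivalent single step on the full effective batch (mean gradient).
g_full = sum(grad(w, x, y) for x, y in micro_batches) / len(micro_batches)
assert w - lr * g_full == w_accum
```

In a PyTorch loop the same idea applies per parameter tensor: call `backward()` on `loss / accumulation_steps` for each micro-batch and `optimizer.step()` once per accumulation cycle.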
## Adapter Training (Behavioral Cloning)
Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.
### Requirements
1. A pretrained PAWN checkpoint (from pretraining above)
2. A Lichess PGN file filtered to an Elo band
Download standard rated game archives from the [Lichess open database](https://database.lichess.org/) and filter them to your target Elo band. The downloads are zstd-compressed (`.pgn.zst`), so decompress them first (e.g. with `zstd -d`); the scripts expect a single `.pgn` file.
### Available adapters
| Adapter | Script | Key flag |
|--------------|-----------------------------|----------------------|
| Bottleneck | `scripts/train_bottleneck.py` | `--bottleneck-dim 8` |
| FiLM | `scripts/train_film.py` | |
| LoRA | `scripts/train_lora.py` | |
| Sparse | `scripts/train_sparse.py` | |
| Hybrid | `scripts/train_hybrid.py` | |
There is also `scripts/train_tiny.py` for a standalone small transformer baseline (no frozen backbone).
### Example: bottleneck adapter
```bash
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--bottleneck-dim 32 \
--lr 1e-4
```
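Under the hood, a bottleneck adapter is a small residual module: project the frozen backbone's activations down to `bottleneck_dim`, apply a nonlinearity, and project back up. A NumPy sketch (the ReLU and the initialization scheme are assumptions; PAWN's actual module lives in `scripts/train_bottleneck.py` and its imports):

```python
import numpy as np

def bottleneck_adapter(h, w_down, w_up):
    """Residual bottleneck: down-project, nonlinearity, up-project, add.

    h:      (seq_len, d_model) activations from the frozen backbone
    w_down: (d_model, bottleneck_dim) trainable down-projection
    w_up:   (bottleneck_dim, d_model) trainable up-projection
    """
    z = np.maximum(h @ w_down, 0.0)  # ReLU in the low-dim bottleneck
    return h + z @ w_up              # residual: output = input + adapter delta

d_model, dim = 512, 32
h = np.random.randn(10, d_model)
# Zero-initializing w_up makes the adapter an exact identity at step 0,
# so training starts from the pretrained backbone's behavior.
out = bottleneck_adapter(h, np.random.randn(d_model, dim) * 0.02, np.zeros((dim, d_model)))
assert np.allclose(out, h)
```

Only `w_down`/`w_up` are updated (about 2 × 512 × 32 ≈ 33K weights per adapter at `--bottleneck-dim 32`); the backbone's weights stay frozen.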
### Adapter training defaults
- **Epochs**: 50 (with early stopping, patience=10)
- **Batch size**: 64
- **Optimizer**: AdamW (lr=3e-4)
- **LR schedule**: cosine with 5% warmup
- **Min ply**: 10 (games shorter than 10 plies are skipped)
- **Max games**: 12,000 train + 2,000 validation
- **Legal masking**: move legality enforced via the Rust engine at every position
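Legal masking means illegal moves receive exactly zero probability: their logits are set to negative infinity before the softmax. A sketch in plain Python (in the real pipeline the `legal` vector comes from the Rust engine's move generator):

```python
import math

def mask_illegal(logits, legal):
    """Replace logits of illegal moves with -inf so softmax zeroes them."""
    return [x if ok else float("-inf") for x, ok in zip(logits, legal)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 3.0, 0.5]
legal = [True, False, True, False]
probs = softmax(mask_illegal(logits, legal))
# probs[1] and probs[3] are exactly 0.0; the rest renormalize to sum to 1.
```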
### Resuming adapter training
```bash
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--resume logs/bottleneck_20260315_120000/checkpoints/best.pt
```
### Selective layer placement
Adapters can target specific layers or sublayer positions:
```bash
# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--no-adapt-attn --adapter-layers 4,5,6,7
```
Use `--attn-layers` / `--ffn-layers` for independent control of which layers get attention vs FFN adapters.
## Cloud Deployment (Runpod)
The `deploy/` directory provides scripts for managing GPU pods.
### Pod lifecycle with `pod.sh`
```bash
bash deploy/pod.sh create myexp --gpu a5000 # Create a pod
bash deploy/pod.sh deploy myexp # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base # Run training
bash deploy/pod.sh ssh myexp # SSH in
bash deploy/pod.sh stop myexp # Stop (preserves volume)
```
GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`.
### Manual deployment
If you prefer to deploy manually:
```bash
# 1. Build deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/
# 2. Transfer to pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/
# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'
```
`setup.sh` handles: Rust installation, uv installation, building the chess engine, `uv sync --extra cu128`, and decompressing any zstd-compressed PGN data.
## GPU Auto-Detection
The `pawn.gpu` module auto-detects your GPU and configures:
- **torch.compile**: enabled on CUDA, uses inductor backend
- **AMP**: fp16 automatic mixed precision on CUDA
- **SDPA backend**: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)
No manual flags are needed in most cases. Override with `--no-compile`, `--no-amp`, or `--sdpa-math` if needed.
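The decision logic reduces to two standard PyTorch signals: `torch.cuda.is_available()` (which is `True` on both CUDA and ROCm builds) and `torch.version.hip` (a version string on ROCm builds, `None` otherwise). A simplified, framework-free sketch of the selection; the config keys here are illustrative, not `pawn.gpu`'s actual API:

```python
def select_gpu_config(cuda_available, hip_version):
    """Mirror of the auto-detection rules described above.

    cuda_available: torch.cuda.is_available()
    hip_version:    torch.version.hip (str on ROCm builds, None otherwise)
    """
    cfg = {"compile": False, "amp": False, "sdpa": "math"}
    if cuda_available:
        cfg["compile"] = True  # torch.compile with the inductor backend
        cfg["amp"] = True      # fp16 autocast
        # Flash attention on NVIDIA; MATH backend on AMD (see note above).
        cfg["sdpa"] = "math" if hip_version is not None else "flash"
    return cfg
```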
## Monitoring
All training scripts log metrics to JSONL files in `logs/`. Each run creates a timestamped directory (e.g., `logs/bottleneck_20260315_120000/metrics.jsonl`).
Every log record includes:
- Training metrics (loss, accuracy, learning rate)
- System resource stats (RAM, GPU VRAM peak/current)
- Timestamps and elapsed time
The JSONL format is one JSON object per line, readable with standard tools:
```bash
# Watch live training progress
tail -f logs/*/metrics.jsonl | jq .
```
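For offline analysis, the same files parse with a few lines of Python; `"loss"` below is one of the metric fields listed above, but the full schema is whatever the training script emits:

```python
import json

def last_record(path):
    """Return the most recent metrics record from a JSONL log."""
    last = None
    with open(path) as f:
        for line in f:
            if line.strip():
                last = json.loads(line)  # one JSON object per line
    return last

# record = last_record("logs/bottleneck_20260315_120000/metrics.jsonl")
# print(record["loss"])
```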