# Training Guide
## Prerequisites
- **Rust** (stable) -- required to build the chess engine native extension
- **uv** -- Python package manager ([install](https://docs.astral.sh/uv/getting-started/installation/))
- **GPU** with ROCm (AMD) or CUDA (NVIDIA). CPU works only for `--variant toy`
## Installation
```bash
# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..
# Install Python dependencies
uv sync --extra rocm # AMD GPUs (ROCm)
uv sync --extra cu128 # NVIDIA GPUs (CUDA 12.8)
```
Verify the install:
```bash
uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
## Pretraining from Scratch
PAWN pretrains on random chess games generated on the fly by the Rust engine. No external datasets are needed.
```bash
uv run python scripts/train.py --variant base
```
### Model variants
| Variant | Params | d_model | Layers | Heads | d_ff |
|---------|--------|---------|--------|-------|------|
| `small` | ~9.5M | 256 | 8 | 4 | 1024 |
| `base` | ~36M | 512 | 8 | 8 | 2048 |
| `large` | ~68M | 640 | 10 | 8 | 2560 |
| `toy` | tiny | 64 | 2 | 4 | 256 |
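As a rough sanity check, the parameter counts scale with the table's dimensions. The sketch below counts only the transformer-block weights (attention projections plus FFN matrices) using a generic transformer estimate, not PAWN's exact architecture; it ignores embeddings, output heads, norms, and biases, so it deliberately undercounts the totals above:

```python
def transformer_block_params(d_model: int, n_layers: int, d_ff: int) -> int:
    """Approximate weight count of the transformer blocks alone."""
    attn = 4 * d_model * d_model  # Q, K, V, and output projections
    ffn = 2 * d_model * d_ff      # up- and down-projection matrices
    return n_layers * (attn + ffn)

variants = {"small": (256, 8, 1024), "base": (512, 8, 2048), "large": (640, 10, 2560)}
for name, (d, n, f) in variants.items():
    print(f"{name}: ~{transformer_block_params(d, n, f) / 1e6:.1f}M block params")
```

This yields ~6.3M, ~25.2M, and ~49.2M for the block weights alone; the remaining gap to the table's totals is roughly the embedding and head parameters.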
### Default training configuration
- **Total steps**: 100,000
- **Batch size**: 256
- **Optimizer**: [AdamW](https://arxiv.org/abs/1711.05101) (Loshchilov & Hutter, 2017) (lr=3e-4, weight_decay=0.01)
- **LR schedule**: [cosine decay](https://arxiv.org/abs/1608.03983) (Loshchilov & Hutter, 2016) with 1,000-step warmup
- **Mixed precision**: fp16 [AMP](https://arxiv.org/abs/1710.03740) (Micikevicius et al., 2017) (auto-detected)
- **Checkpoints**: saved every 5,000 steps to `checkpoints/`
- **Eval**: every 500 steps on 512 held-out random games
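The schedule above (linear warmup for the first 1,000 steps, then cosine decay over the rest) can be written in a few lines; the zero LR floor is an assumption, as the actual implementation lives in `scripts/train.py`:

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1000, total=100_000, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(500))      # mid-warmup: half of base_lr
print(lr_at(1000))     # peak: base_lr
print(lr_at(100_000))  # final step: min_lr
```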
### Common overrides
```bash
# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000
# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000
# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4
# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
```
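Gradient accumulation sums gradients over `accumulation_steps` micro-batches before each optimizer update, so memory usage stays at the micro-batch size while the update sees the full effective batch. A scalar toy (not the real loop in `scripts/train.py`) showing the equivalence:

```python
def grad(w, x, y):
    """Gradient of the squared error (w*x - y)**2 with respect to w."""
    return 2 * (w * x - y) * x

w, lr, accum = 0.0, 0.01, 4
micro_batches = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]

# Accumulate: sum per-micro-batch gradients, divide by accum, then one step.
g = sum(grad(w, x, y) for x, y in micro_batches) / accum
w_accum = w - lr * g

# Equivalent single step on the full effective batch (mean gradient).
g_full = sum(grad(w, x, y) for x, y in micro_batches) / len(micro_batches)
assert w - lr * g_full == w_accum
```

In a PyTorch loop the same idea applies per parameter tensor: call `backward()` on `loss / accumulation_steps` for each micro-batch and `optimizer.step()` once per accumulation cycle.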
## Adapter Training (Behavioral Cloning)
Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.
### Requirements
1. A pretrained PAWN checkpoint (from pretraining above)
2. A Lichess PGN file filtered to an Elo band
Download standard rated game archives from the [Lichess open database](https://database.lichess.org/) and filter them to your target Elo band. The downloads are zstd-compressed (`.pgn.zst`), so decompress them first (e.g. with `zstd -d`); the scripts expect a single `.pgn` file.
### Available adapters
| Adapter | Script | Key flag |
|--------------|-----------------------------|----------------------|
| Bottleneck | `scripts/train_bottleneck.py` | `--bottleneck-dim 8` |
| FiLM | `scripts/train_film.py` | |
| LoRA | `scripts/train_lora.py` | |
| Sparse | `scripts/train_sparse.py` | |
| Hybrid | `scripts/train_hybrid.py` | |
There is also `scripts/train_tiny.py` for a standalone small transformer baseline (no frozen backbone).
### Example: bottleneck adapter
```bash
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--bottleneck-dim 32 \
--lr 1e-4
```
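Under the hood, a bottleneck adapter is a small residual module: project the frozen backbone's activations down to `bottleneck_dim`, apply a nonlinearity, and project back up. A NumPy sketch (the ReLU and the initialization scheme are assumptions; PAWN's actual module lives in `scripts/train_bottleneck.py` and its imports):

```python
import numpy as np

def bottleneck_adapter(h, w_down, w_up):
    """Residual bottleneck: down-project, nonlinearity, up-project, add.

    h:      (seq_len, d_model) activations from the frozen backbone
    w_down: (d_model, bottleneck_dim) trainable down-projection
    w_up:   (bottleneck_dim, d_model) trainable up-projection
    """
    z = np.maximum(h @ w_down, 0.0)  # ReLU in the low-dim bottleneck
    return h + z @ w_up              # residual: output = input + adapter delta

d_model, dim = 512, 32
h = np.random.randn(10, d_model)
# Zero-initializing w_up makes the adapter an exact identity at step 0,
# so training starts from the pretrained backbone's behavior.
out = bottleneck_adapter(h, np.random.randn(d_model, dim) * 0.02, np.zeros((dim, d_model)))
assert np.allclose(out, h)
```

Only `w_down`/`w_up` are updated (about 2 × 512 × 32 ≈ 33K weights per adapter at `--bottleneck-dim 32`); the backbone's weights stay frozen.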
### Adapter training defaults
- **Epochs**: 50 (with early stopping, patience=10)
- **Batch size**: 64
- **Optimizer**: AdamW (lr=3e-4)
- **LR schedule**: cosine with 5% warmup
- **Min ply**: 10 (games shorter than 10 plies are skipped)
- **Max games**: 12,000 train + 2,000 validation
- **Legal masking**: move legality enforced via the Rust engine at every position
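Legal masking means illegal moves receive exactly zero probability: their logits are set to negative infinity before the softmax. A sketch in plain Python (in the real pipeline the `legal` vector comes from the Rust engine's move generator):

```python
import math

def mask_illegal(logits, legal):
    """Replace logits of illegal moves with -inf so softmax zeroes them."""
    return [x if ok else float("-inf") for x, ok in zip(logits, legal)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 3.0, 0.5]
legal = [True, False, True, False]
probs = softmax(mask_illegal(logits, legal))
# probs[1] and probs[3] are exactly 0.0; the rest renormalize to sum to 1.
```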
### Resuming adapter training
```bash
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--resume logs/bottleneck_20260315_120000/checkpoints/best.pt
```
### Selective layer placement
Adapters can target specific layers or sublayer positions:
```bash
# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--no-adapt-attn --adapter-layers 4,5,6,7
```
Use `--attn-layers` / `--ffn-layers` for independent control of which layers get attention vs FFN adapters.
## Cloud Deployment (Runpod)
The `deploy/` directory provides scripts for managing GPU pods.
### Pod lifecycle with `pod.sh`
```bash
bash deploy/pod.sh create myexp --gpu a5000 # Create a pod
bash deploy/pod.sh deploy myexp # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base # Run training
bash deploy/pod.sh ssh myexp # SSH in
bash deploy/pod.sh stop myexp # Stop (preserves volume)
```
GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`.
### Manual deployment
If you prefer to deploy manually:
```bash
# 1. Build deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/
# 2. Transfer to pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/
# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'
```
`setup.sh` handles: Rust installation, uv installation, building the chess engine, `uv sync --extra cu128`, and decompressing any zstd-compressed PGN data.
## GPU Auto-Detection
The `pawn.gpu` module auto-detects your GPU and configures:
- **torch.compile**: enabled on CUDA, uses inductor backend
- **AMP**: fp16 automatic mixed precision on CUDA
- **SDPA backend**: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)
No manual flags are needed in most cases. Override with `--no-compile`, `--no-amp`, or `--sdpa-math` if needed.
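The decision logic reduces to two standard PyTorch signals: `torch.cuda.is_available()` (which is `True` on both CUDA and ROCm builds) and `torch.version.hip` (a version string on ROCm builds, `None` otherwise). A simplified, framework-free sketch of the selection; the config keys here are illustrative, not `pawn.gpu`'s actual API:

```python
def select_gpu_config(cuda_available, hip_version):
    """Mirror of the auto-detection rules described above.

    cuda_available: torch.cuda.is_available()
    hip_version:    torch.version.hip (str on ROCm builds, None otherwise)
    """
    cfg = {"compile": False, "amp": False, "sdpa": "math"}
    if cuda_available:
        cfg["compile"] = True  # torch.compile with the inductor backend
        cfg["amp"] = True      # fp16 autocast
        # Flash attention on NVIDIA; MATH backend on AMD (see note above).
        cfg["sdpa"] = "math" if hip_version is not None else "flash"
    return cfg
```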
## Monitoring
All training scripts log metrics to JSONL files in `logs/`. Each run creates a timestamped directory (e.g., `logs/bottleneck_20260315_120000/metrics.jsonl`).
Every log record includes:
- Training metrics (loss, accuracy, learning rate)
- System resource stats (RAM, GPU VRAM peak/current)
- Timestamps and elapsed time
The JSONL format is one JSON object per line, readable with standard tools:
```bash
# Watch live training progress
tail -f logs/*/metrics.jsonl | jq .
```
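For offline analysis, the same files parse with a few lines of Python; `"loss"` below is one of the metric fields listed above, but the full schema is whatever the training script emits:

```python
import json

def last_record(path):
    """Return the most recent metrics record from a JSONL log."""
    last = None
    with open(path) as f:
        for line in f:
            if line.strip():
                last = json.loads(line)  # one JSON object per line
    return last

# record = last_record("logs/bottleneck_20260315_120000/metrics.jsonl")
# print(record["loss"])
```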