| # Training Guide |
|
|
| ## Prerequisites |
|
|
| - **Rust** (stable) -- required to build the chess engine native extension |
| - **uv** -- Python package manager ([install](https://docs.astral.sh/uv/getting-started/installation/)) |
| - **GPU** with ROCm (AMD) or CUDA (NVIDIA). CPU works only for `--variant toy` |
|
|
| ## Installation |
|
|
| ```bash |
| # Build the chess engine (one-time, or after engine/ changes) |
| cd engine && uv run --with maturin maturin develop --release && cd .. |
| |
| # Install Python dependencies |
| uv sync --extra rocm # AMD GPUs (ROCm) |
| uv sync --extra cu128 # NVIDIA GPUs (CUDA 12.8) |
| ``` |
|
|
| Verify the install: |
|
|
| ```bash |
| uv run python -c "import chess_engine; print('engine OK')" |
| uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')" |
| ``` |
|
|
| ## Pretraining from Scratch |
|
|
| PAWN pretrains on random chess games generated on-the-fly by the Rust engine. No external datasets are needed. |
|
|
| ```bash |
| uv run python scripts/train.py --variant base |
| ``` |
|
|
| ### Model variants |
|
|
| | Variant | Params | d_model | Layers | Heads | d_ff | |
| |---------|--------|---------|--------|-------|------| |
| | `small` | ~9.5M | 256 | 8 | 4 | 1024 | |
| | `base` | ~36M | 512 | 8 | 8 | 2048 | |
| | `large` | ~68M | 640 | 10 | 8 | 2560 | |
| | `toy` | tiny | 64 | 2 | 4 | 256 | |
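The parameter counts can be sanity-checked with a rough transformer formula (attention projections ≈ 4·d_model², FFN ≈ 2·d_model·d_ff per layer). This is a back-of-the-envelope sketch, not the project's actual counting code; embeddings, layer norms, and output heads make up the remainder of each total:

```python
def block_params(d_model, n_layers, d_ff):
    """Rough transformer-block weight count: 4*d_model^2 for the
    Q/K/V/output projections plus 2*d_model*d_ff for the FFN, per
    layer. Embeddings, norms, and heads are excluded."""
    attn = 4 * d_model * d_model
    ffn = 2 * d_model * d_ff
    return n_layers * (attn + ffn)

print(block_params(512, 8, 2048))  # base: ~25M of the ~36M total
```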
|
|
| ### Default training configuration |
|
|
| - **Total steps**: 100,000 |
| - **Batch size**: 256 |
- **Optimizer**: [AdamW](https://arxiv.org/abs/1711.05101) (Loshchilov & Hutter, 2017) with lr=3e-4, weight_decay=0.01
- **LR schedule**: [cosine decay](https://arxiv.org/abs/1608.03983) (Loshchilov & Hutter, 2016) with a 1,000-step warmup
- **Mixed precision**: fp16 [AMP](https://arxiv.org/abs/1710.03740) (Micikevicius et al., 2017), auto-detected
| - **Checkpoints**: saved every 5,000 steps to `checkpoints/` |
| - **Eval**: every 500 steps on 512 held-out random games |
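The warmup-plus-cosine schedule can be written down directly. This is a sketch of the policy described above (linear warmup to the peak lr, cosine decay toward zero), not necessarily the exact code in `scripts/train.py`:

```python
import math

def lr_at(step, base_lr=3e-4, warmup=1000, total=100_000, min_lr=0.0):
    """Linear warmup to base_lr over `warmup` steps, then cosine
    decay from base_lr down to min_lr by `total` steps."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```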
| |
| ### Common overrides |
| |
| ```bash |
| # Resume from a checkpoint |
| uv run python scripts/train.py --variant base --resume checkpoints/step_00050000 |
|
|
| # Custom batch size and step count |
| uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000 |
|
|
| # Gradient accumulation (effective batch = batch_size * accumulation_steps) |
| uv run python scripts/train.py --variant base --accumulation-steps 4 |
|
|
| # Enable W&B logging |
| uv run python scripts/train.py --variant base --wandb |
| ``` |
| |
| ## Adapter Training (Behavioral Cloning) |
| |
| Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves. |
| |
| ### Requirements |
| |
| 1. A pretrained PAWN checkpoint (from pretraining above) |
| 2. A Lichess PGN file filtered to an Elo band |
| |
Download standard rated game archives from the [Lichess open database](https://database.lichess.org/) and filter them to your target Elo band. The scripts expect a single `.pgn` file.
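If your archive is not already restricted to a band, a stdlib-only filter over the PGN tag pairs works. This is a sketch (Lichess archives ship zstd-compressed, so decompress first; function and regex here are illustrative, not project code):

```python
import re

ELO_TAG = re.compile(r'\[(?:White|Black)Elo "(\d+)"\]')

def filter_elo_band(pgn_text, lo=1800, hi=1900):
    """Keep only games where both players' Elo falls in [lo, hi).
    Splits on the [Event ...] tag that opens each PGN game."""
    kept = []
    for game in re.split(r'(?m)^(?=\[Event )', pgn_text):
        elos = [int(e) for e in ELO_TAG.findall(game)]
        if len(elos) == 2 and all(lo <= e < hi for e in elos):
            kept.append(game)
    return "".join(kept)
```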
| |
| ### Available adapters |
| |
| | Adapter | Script | Key flag | |
| |--------------|-----------------------------|----------------------| |
| | Bottleneck | `scripts/train_bottleneck.py` | `--bottleneck-dim 8` | |
| | FiLM | `scripts/train_film.py` | | |
| | LoRA | `scripts/train_lora.py` | | |
| | Sparse | `scripts/train_sparse.py` | | |
| | Hybrid | `scripts/train_hybrid.py` | | |
| |
| There is also `scripts/train_tiny.py` for a standalone small transformer baseline (no frozen backbone). |
| |
| ### Example: bottleneck adapter |
| |
| ```bash |
| uv run python scripts/train_bottleneck.py \ |
| --checkpoint checkpoints/pawn-base.pt \ |
| --pgn data/lichess_1800_1900.pgn \ |
| --bottleneck-dim 32 \ |
| --lr 1e-4 |
| ``` |
| |
| ### Adapter training defaults |
| |
| - **Epochs**: 50 (with early stopping, patience=10) |
| - **Batch size**: 64 |
| - **Optimizer**: AdamW (lr=3e-4) |
| - **LR schedule**: cosine with 5% warmup |
| - **Min ply**: 10 (games shorter than 10 plies are skipped) |
| - **Max games**: 12,000 train + 2,000 validation |
| - **Legal masking**: move legality enforced via the Rust engine at every position |
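Legal masking amounts to forcing zero probability on illegal moves before the softmax. A minimal pure-Python sketch of the idea (in training this is applied to logit tensors, with the legality mask supplied by the Rust engine):

```python
import math

def masked_softmax(logits, legal):
    """Softmax with illegal moves zeroed: their logits are set to
    -inf, so exp() maps them to exactly 0 probability."""
    masked = [x if ok else float("-inf") for x, ok in zip(logits, legal)]
    m = max(masked)
    exps = [math.exp(x - m) for x in masked]
    z = sum(exps)
    return [e / z for e in exps]
```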
| |
| ### Resuming adapter training |
| |
| ```bash |
| uv run python scripts/train_bottleneck.py \ |
| --checkpoint checkpoints/pawn-base.pt \ |
| --pgn data/lichess_1800_1900.pgn \ |
| --resume logs/bottleneck_20260315_120000/checkpoints/best.pt |
| ``` |
| |
| ### Selective layer placement |
|
|
| Adapters can target specific layers or sublayer positions: |
|
|
| ```bash |
| # Only FFN adapters on layers 4-7 |
| uv run python scripts/train_bottleneck.py \ |
| --checkpoint checkpoints/pawn-base.pt \ |
| --pgn data/lichess_1800_1900.pgn \ |
| --no-adapt-attn --adapter-layers 4,5,6,7 |
| ``` |
|
|
| Use `--attn-layers` / `--ffn-layers` for independent control of which layers get attention vs FFN adapters. |
|
|
| ## Cloud Deployment (Runpod) |
|
|
| The `deploy/` directory provides scripts for managing GPU pods. |
|
|
| ### Pod lifecycle with `pod.sh` |
|
|
| ```bash |
| bash deploy/pod.sh create myexp --gpu a5000 # Create a pod |
| bash deploy/pod.sh deploy myexp # Build + transfer + setup |
| bash deploy/pod.sh launch myexp scripts/train.py --variant base # Run training |
| bash deploy/pod.sh ssh myexp # SSH in |
| bash deploy/pod.sh stop myexp # Stop (preserves volume) |
| ``` |
|
|
| GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`. |
|
|
| ### Manual deployment |
|
|
| If you prefer to deploy manually: |
|
|
| ```bash |
| # 1. Build deploy package locally |
| bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/ |
| |
| # 2. Transfer to pod |
| rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/ |
| |
| # 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps) |
| ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh' |
| ``` |
|
|
| `setup.sh` handles: Rust installation, uv installation, building the chess engine, `uv sync --extra cu128`, and decompressing any zstd-compressed PGN data. |
|
|
| ## GPU Auto-Detection |
|
|
| The `pawn.gpu` module auto-detects your GPU and configures: |
|
|
| - **torch.compile**: enabled on CUDA, uses inductor backend |
| - **AMP**: fp16 automatic mixed precision on CUDA |
| - **SDPA backend**: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile) |
|
|
| No manual flags are needed in most cases. Override with `--no-compile`, `--no-amp`, or `--sdpa-math` if needed. |
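The policy above reduces to a small decision table. This sketch mirrors the description (it is not the actual `pawn.gpu` source):

```python
def gpu_settings(cuda_available, is_rocm):
    """Auto-detection policy as described above: compile + fp16 AMP
    on any CUDA device; flash SDPA on NVIDIA, math SDPA on AMD/ROCm
    (flash backward has stride mismatches under torch.compile)."""
    if not cuda_available:
        return {"compile": False, "amp": False, "sdpa": "math"}  # CPU, e.g. toy runs
    return {"compile": True, "amp": True,
            "sdpa": "math" if is_rocm else "flash"}
```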
|
|
| ## Monitoring |
|
|
| All training scripts log metrics to JSONL files in `logs/`. Each run creates a timestamped directory (e.g., `logs/bottleneck_20260315_120000/metrics.jsonl`). |
|
|
| Every log record includes: |
|
|
| - Training metrics (loss, accuracy, learning rate) |
| - System resource stats (RAM, GPU VRAM peak/current) |
| - Timestamps and elapsed time |
|
|
| The JSONL format is one JSON object per line, readable with standard tools: |
|
|
| ```bash |
# Watch live training progress (jq pretty-prints each JSON line as it arrives;
# `python -m json.tool` would block waiting for a single complete document)
tail -f logs/*/metrics.jsonl | jq .
| ``` |
|
|