# Training Guide
## Prerequisites
- **Rust** (stable) -- required to build the chess engine native extension
- **uv** -- Python package manager ([install](https://docs.astral.sh/uv/getting-started/installation/))
- **GPU** with ROCm (AMD) or CUDA (NVIDIA). CPU works only for `--variant toy`
## Installation
```bash
# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..
# Install Python dependencies
uv sync --extra rocm # AMD GPUs (ROCm)
uv sync --extra cu128 # NVIDIA GPUs (CUDA 12.8)
```
Verify the install:
```bash
uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```
## Pretraining from Scratch
PAWN pretrains on random chess games generated on-the-fly by the Rust engine. No external datasets are needed.
```bash
uv run python scripts/train.py --variant base
```
### Model variants
| Variant | Params | d_model | Layers | Heads | d_ff |
|---------|--------|---------|--------|-------|------|
| `small` | ~9.5M | 256 | 8 | 4 | 1024 |
| `base` | ~36M | 512 | 8 | 8 | 2048 |
| `large` | ~68M | 640 | 10 | 8 | 2560 |
| `toy` | tiny | 64 | 2 | 4 | 256 |
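As a rough sanity check on the table, a standard transformer block has about 4·d_model² attention parameters plus 2·d_model·d_ff feed-forward parameters. A back-of-the-envelope sketch (it omits embeddings, norms, and the output head, so it undershoots the totals above; the remainder is embedding/head weights):

```python
def approx_params(d_model: int, n_layers: int, d_ff: int) -> int:
    """Rough count of core transformer-block parameters:
    attention projections (Q, K, V, out) ~ 4 * d_model^2,
    feed-forward ~ 2 * d_model * d_ff, summed over layers.
    Embeddings, norms, and the output head are omitted."""
    per_layer = 4 * d_model**2 + 2 * d_model * d_ff
    return n_layers * per_layer

# base variant: d_model=512, 8 layers, d_ff=2048
print(f"base ≈ {approx_params(512, 8, 2048) / 1e6:.1f}M")  # core blocks only
```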
### Default training configuration
- **Total steps**: 100,000
- **Batch size**: 256
- **Optimizer**: [AdamW](https://arxiv.org/abs/1711.05101) (Loshchilov & Hutter, 2017) (lr=3e-4, weight_decay=0.01)
- **LR schedule**: [cosine decay](https://arxiv.org/abs/1608.03983) (Loshchilov & Hutter, 2016) with 1,000-step warmup
- **Mixed precision**: fp16 [AMP](https://arxiv.org/abs/1710.03740) (Micikevicius et al., 2017) (auto-detected)
- **Checkpoints**: saved every 5,000 steps to `checkpoints/`
- **Eval**: every 500 steps on 512 held-out random games
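The LR schedule is easy to reason about in isolation. A minimal sketch of linear warmup plus cosine decay using the defaults above (this assumes decay to zero, as in the SGDR-style schedule; the trainer's exact floor may differ):

```python
import math

def lr_at(step: int, base_lr: float = 3e-4,
          warmup: int = 1_000, total: int = 100_000) -> float:
    """Linear warmup to base_lr over `warmup` steps,
    then cosine decay toward zero over the remaining steps."""
    if step < warmup:
        return base_lr * step / warmup
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(500))      # mid-warmup: half of base_lr
print(lr_at(1_000))    # peak: base_lr
print(lr_at(100_000))  # ~0 at the end of training
```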
### Common overrides
```bash
# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000
# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000
# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4
# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
```
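The gradient-accumulation override works by summing scaled micro-batch gradients before each optimizer step. A toy, framework-free check that the 1/accumulation_steps scaling reproduces the large-batch gradient exactly (scalar model y = w·x with squared error; all names and values here are illustrative):

```python
def grad(w, x, y):
    # d/dw of (w*x - y)^2
    return 2 * (w * x - y) * x

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0), (4.0, 7.0)]
w = 0.5

# Large batch: mean gradient over all four samples
big = sum(grad(w, x, y) for x, y in data) / len(data)

# Accumulation: two micro-batches of two, each mean gradient
# scaled by 1/accumulation_steps before summing
acc = 0.0
for micro in (data[:2], data[2:]):
    g = sum(grad(w, x, y) for x, y in micro) / len(micro)
    acc += g / 2  # accumulation_steps = 2

print(big == acc)  # True: same update as one large batch
```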
## Adapter Training (Behavioral Cloning)
Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.
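The idea can be sketched in a few lines (pure Python, framework-free; the dimensions, initialisation, and names below are illustrative, not PAWN's actual modules): a frozen layer's output is passed through a small down-project → ReLU → up-project path and added back residually.

```python
def matvec(m, v):
    return [sum(r * x for r, x in zip(row, v)) for row in m]

d_model, bottleneck = 4, 2
# Down/up projections. The up-projection is zero-initialised so the
# adapter starts as an identity map -- a common adapter trick that
# leaves the frozen backbone's behaviour unchanged at step 0.
W_down = [[0.1 * (i + j) for j in range(d_model)] for i in range(bottleneck)]
W_up = [[0.0] * bottleneck for _ in range(d_model)]

def adapter(h):
    z = [max(0.0, x) for x in matvec(W_down, h)]          # down-project + ReLU
    return [hi + ui for hi, ui in zip(h, matvec(W_up, z))]  # residual add

h = [1.0, -2.0, 0.5, 3.0]
print(adapter(h) == h)  # True at init: zero up-projection => identity
```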
### Requirements
1. A pretrained PAWN checkpoint (from pretraining above)
2. A Lichess PGN file filtered to an Elo band
Download standard rated game archives from the [Lichess open database](https://database.lichess.org/), filtered to your target Elo band. The scripts expect a single `.pgn` file.
### Available adapters
| Adapter | Script | Key flag |
|--------------|-----------------------------|----------------------|
| Bottleneck | `scripts/train_bottleneck.py` | `--bottleneck-dim 8` |
| FiLM | `scripts/train_film.py` | |
| LoRA | `scripts/train_lora.py` | |
| Sparse | `scripts/train_sparse.py` | |
| Hybrid | `scripts/train_hybrid.py` | |
There is also `scripts/train_tiny.py` for a standalone small transformer baseline (no frozen backbone).
### Example: bottleneck adapter
```bash
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--bottleneck-dim 32 \
--lr 1e-4
```
### Adapter training defaults
- **Epochs**: 50 (with early stopping, patience=10)
- **Batch size**: 64
- **Optimizer**: AdamW (lr=3e-4)
- **LR schedule**: cosine with 5% warmup
- **Min ply**: 10 (games shorter than 10 plies are skipped)
- **Max games**: 12,000 train + 2,000 validation
- **Legal masking**: move legality enforced via the Rust engine at every position
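Legal masking works by forcing the logits of illegal moves to negative infinity before the softmax, so they receive exactly zero probability. A framework-free sketch (the logits and legality flags here are made up; in the real pipeline legality comes from the Rust engine):

```python
import math

def masked_probs(logits, legal):
    """Softmax over logits with illegal moves masked to -inf."""
    masked = [l if ok else float("-inf") for l, ok in zip(logits, legal)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in masked]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]
legal = [True, False, True, False]  # engine says only moves 0 and 2 are legal
probs = masked_probs(logits, legal)
print(probs[1] == 0.0 and probs[3] == 0.0)  # illegal moves get zero mass
```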
### Resuming adapter training
```bash
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--resume logs/bottleneck_20260315_120000/checkpoints/best.pt
```
### Selective layer placement
Adapters can target specific layers or sublayer positions:
```bash
# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
--checkpoint checkpoints/pawn-base.pt \
--pgn data/lichess_1800_1900.pgn \
--no-adapt-attn --adapter-layers 4,5,6,7
```
Use `--attn-layers` / `--ffn-layers` for independent control of which layers get attention vs FFN adapters.
## Cloud Deployment (Runpod)
The `deploy/` directory provides scripts for managing GPU pods.
### Pod lifecycle with `pod.sh`
```bash
bash deploy/pod.sh create myexp --gpu a5000 # Create a pod
bash deploy/pod.sh deploy myexp # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base # Run training
bash deploy/pod.sh ssh myexp # SSH in
bash deploy/pod.sh stop myexp # Stop (preserves volume)
```
GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`.
### Manual deployment
If you prefer to deploy manually:
```bash
# 1. Build deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/
# 2. Transfer to pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/
# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'
```
`setup.sh` handles: Rust installation, uv installation, building the chess engine, `uv sync --extra cu128`, and decompressing any zstd-compressed PGN data.
## GPU Auto-Detection
The `pawn.gpu` module auto-detects your GPU and configures:
- **torch.compile**: enabled on CUDA, uses inductor backend
- **AMP**: fp16 automatic mixed precision on CUDA
- **SDPA backend**: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)
No manual flags are needed in most cases. Override with `--no-compile`, `--no-amp`, or `--sdpa-math` if needed.
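The decision logic amounts to a small lookup, roughly like this illustrative sketch (not the actual `pawn.gpu` code):

```python
def gpu_config(has_cuda: bool, is_amd: bool) -> dict:
    """Illustrative auto-detection policy: CPU gets no compile/AMP;
    CUDA devices get compile + fp16 AMP; AMD falls back to the MATH
    SDPA backend because ROCm's flash-attention backward clashes
    with torch.compile."""
    if not has_cuda:
        return {"compile": False, "amp": False, "sdpa": "math"}
    return {
        "compile": True,  # inductor backend
        "amp": True,      # fp16 autocast
        "sdpa": "math" if is_amd else "flash",
    }

print(gpu_config(True, False)["sdpa"])  # flash on NVIDIA
```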
## Monitoring
All training scripts log metrics to JSONL files in `logs/`. Each run creates a timestamped directory (e.g., `logs/bottleneck_20260315_120000/metrics.jsonl`).
Every log record includes:
- Training metrics (loss, accuracy, learning rate)
- System resource stats (RAM, GPU VRAM peak/current)
- Timestamps and elapsed time
The JSONL format is one JSON object per line, readable with standard tools:
```bash
# Watch live training progress (--json-lines handles one JSON object
# per line; without it, json.tool rejects multi-object input)
tail -f logs/*/metrics.jsonl | python -m json.tool --json-lines
```
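For post-hoc analysis, the same files are trivial to parse in Python. A toy round-trip (the field names `step` and `loss` are assumptions about the log schema; adjust to the keys your runs actually write):

```python
import json
import os
import tempfile

# Write a tiny example metrics file, one JSON object per line
records = [{"step": 100, "loss": 2.31}, {"step": 200, "loss": 1.87}]
path = os.path.join(tempfile.mkdtemp(), "metrics.jsonl")
with open(path, "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Read it back and report the latest loss
with open(path) as f:
    rows = [json.loads(line) for line in f if line.strip()]

print(rows[-1]["loss"])  # 1.87
```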