# Training Guide

## Prerequisites

- **Rust** (stable) -- required to build the chess engine native extension
- **uv** -- Python package manager ([install](https://docs.astral.sh/uv/getting-started/installation/))
- **GPU** with ROCm (AMD) or CUDA (NVIDIA). CPU works only for `--variant toy`

## Installation

```bash
# Build the chess engine (one-time, or after engine/ changes)
cd engine && uv run --with maturin maturin develop --release && cd ..

# Install Python dependencies
uv sync --extra rocm    # AMD GPUs (ROCm)
uv sync --extra cu128   # NVIDIA GPUs (CUDA 12.8)
```

Verify the install:

```bash
uv run python -c "import chess_engine; print('engine OK')"
uv run python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}')"
```

## Pretraining from Scratch

PAWN pretrains on random chess games generated on the fly by the Rust engine; no external datasets are needed.

```bash
uv run python scripts/train.py --variant base
```

### Model variants

| Variant | Params | d_model | Layers | Heads | d_ff |
|---------|--------|---------|--------|-------|------|
| `small` | ~9.5M  | 256     | 8      | 4     | 1024 |
| `base`  | ~36M   | 512     | 8      | 8     | 2048 |
| `large` | ~68M   | 640     | 10     | 8     | 2560 |
| `toy`   | tiny   | 64      | 2      | 4     | 256  |
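
The variant rows above can be captured as a small config mapping. This is an illustrative sketch; the actual config object and field names inside `pawn` may differ:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VariantConfig:
    """Transformer hyperparameters for one model variant (illustrative names)."""
    d_model: int
    n_layers: int
    n_heads: int
    d_ff: int

# Values taken from the variants table above.
VARIANTS = {
    "toy":   VariantConfig(d_model=64,  n_layers=2,  n_heads=4, d_ff=256),
    "small": VariantConfig(d_model=256, n_layers=8,  n_heads=4, d_ff=1024),
    "base":  VariantConfig(d_model=512, n_layers=8,  n_heads=8, d_ff=2048),
    "large": VariantConfig(d_model=640, n_layers=10, n_heads=8, d_ff=2560),
}
```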

### Default training configuration

- **Total steps**: 100,000
- **Batch size**: 256
- **Optimizer**: [AdamW](https://arxiv.org/abs/1711.05101) (Loshchilov & Hutter, 2017) with lr=3e-4, weight_decay=0.01
- **LR schedule**: [cosine decay](https://arxiv.org/abs/1608.03983) (Loshchilov & Hutter, 2016) with 1,000-step warmup
- **Mixed precision**: fp16 [AMP](https://arxiv.org/abs/1710.03740) (Micikevicius et al., 2017), auto-detected
- **Checkpoints**: saved every 5,000 steps to `checkpoints/`
- **Eval**: every 500 steps on 512 held-out random games
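
The warmup-plus-cosine schedule can be sketched as a pure function of the step count (a sketch of the standard formula; the training script may implement it differently, e.g. with a nonzero floor):

```python
import math

def lr_at_step(step, total_steps=100_000, warmup_steps=1_000,
               base_lr=3e-4, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The rate ramps up linearly over the first 1,000 steps, peaks at 3e-4, and follows a half-cosine down to the minimum by the final step.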

### Common overrides

```bash
# Resume from a checkpoint
uv run python scripts/train.py --variant base --resume checkpoints/step_00050000

# Custom batch size and step count
uv run python scripts/train.py --variant base --batch-size 128 --total-steps 200000

# Gradient accumulation (effective batch = batch_size * accumulation_steps)
uv run python scripts/train.py --variant base --accumulation-steps 4

# Enable W&B logging
uv run python scripts/train.py --variant base --wandb
```
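
Gradient accumulation scales each micro-batch loss by `1/accumulation_steps`, so the summed gradients match the average gradient of one large batch. A self-contained toy check of that equivalence (plain Python standing in for the actual PyTorch training loop):

```python
def grad(w, x):
    # d/dw of the per-example loss 0.5 * (w - x)^2
    return w - x

w = 0.0
accumulation_steps = 4
micro_batches = [1.0, 2.0, 3.0, 4.0]  # one example per micro-batch

# Accumulate scaled gradients, as the training loop does before optimizer.step().
acc = 0.0
for x in micro_batches:
    acc += grad(w, x) / accumulation_steps

# Full-batch gradient: average of the per-example gradients.
full = sum(grad(w, x) for x in micro_batches) / len(micro_batches)
assert abs(acc - full) < 1e-12  # identical update to one big batch
```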

## Adapter Training (Behavioral Cloning)

Adapter training freezes the pretrained PAWN backbone and trains lightweight adapter modules on Lichess games to predict human moves.
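
A bottleneck adapter is a small residual MLP inserted into the frozen network; only its two projections are trained. A numpy sketch with illustrative shapes and init (the actual modules in `scripts/train_bottleneck.py` may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def bottleneck_adapter(h, W_down, W_up):
    """Residual bottleneck adapter: h + up(relu(down(h))).

    Only W_down/W_up are trained; the backbone producing h stays frozen.
    """
    z = np.maximum(h @ W_down, 0.0)  # down-project d_model -> bottleneck_dim
    return h + z @ W_up              # up-project back and add the residual

d_model, bottleneck_dim = 512, 32
h = rng.standard_normal((4, d_model))                     # frozen-backbone activations
W_down = rng.standard_normal((d_model, bottleneck_dim)) * 0.02
W_up = np.zeros((bottleneck_dim, d_model))                # zero init: adapter starts as identity

out = bottleneck_adapter(h, W_down, W_up)
assert np.allclose(out, h)  # zero-initialized adapter is a no-op
```

Zero-initializing the up-projection means training starts from the pretrained model's exact behavior, which is a common choice for adapter methods.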

### Requirements

1. A pretrained PAWN checkpoint (from pretraining above)
2. A Lichess PGN file filtered to an Elo band

Download standard rated game archives from the [Lichess open database](https://database.lichess.org/) and filter them to your target Elo band. The scripts expect a single `.pgn` file.
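
The monthly Lichess dumps are not split by rating, so a filtering pass is usually needed. A minimal stdlib sketch that keeps games by their `WhiteElo`/`BlackElo` headers (illustrative; for multi-gigabyte archives a dedicated tool such as `pgn-extract` will be much faster):

```python
import re

ELO_RE = re.compile(r'\[(WhiteElo|BlackElo) "(\d+)"\]')

def filter_pgn(text, lo=1800, hi=1900):
    """Keep games whose WhiteElo and BlackElo both fall in [lo, hi].

    Splits on the [Event ...] tag that starts each game; assumes a
    well-formed export with both Elo headers present.
    """
    kept = []
    for game in re.split(r'(?=\[Event )', text):
        elos = [int(m.group(2)) for m in ELO_RE.finditer(game)]
        if len(elos) == 2 and all(lo <= e <= hi for e in elos):
            kept.append(game)
    return "".join(kept)
```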

### Available adapters

| Adapter      | Script                      | Key flag             |
|--------------|-----------------------------|----------------------|
| Bottleneck   | `scripts/train_bottleneck.py` | `--bottleneck-dim 8` |
| FiLM         | `scripts/train_film.py`     |                      |
| LoRA         | `scripts/train_lora.py`     |                      |
| Sparse       | `scripts/train_sparse.py`   |                      |
| Hybrid       | `scripts/train_hybrid.py`   |                      |

There is also `scripts/train_tiny.py` for a standalone small transformer baseline (no frozen backbone).
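
As one concrete example of the families above, a LoRA-style adapter leaves a frozen weight `W` untouched and adds a trained low-rank correction `BA`. A numpy sketch with illustrative shapes (not the code in `scripts/train_lora.py`):

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, rank = 512, 8
W = rng.standard_normal((d_model, d_model))      # frozen pretrained weight
A = rng.standard_normal((rank, d_model)) * 0.02  # trained down-projection
B = np.zeros((d_model, rank))                    # trained up-projection, zero init

def lora_forward(x):
    # Frozen path plus low-rank correction; only A and B would receive gradients.
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((4, d_model))
assert np.allclose(lora_forward(x), x @ W.T)  # zero-init B: starts at pretrained behavior
```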

### Example: bottleneck adapter

```bash
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 \
    --lr 1e-4
```

### Adapter training defaults

- **Epochs**: 50 (with early stopping, patience=10)
- **Batch size**: 64
- **Optimizer**: AdamW (lr=3e-4)
- **LR schedule**: cosine with 5% warmup
- **Min ply**: 10 (games shorter than 10 plies are skipped)
- **Max games**: 12,000 train + 2,000 validation
- **Legal masking**: move legality enforced via the Rust engine at every position
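
Legal masking sets illegal-move logits to negative infinity before the softmax, so illegal moves receive exactly zero probability. A numpy sketch of the idea (in the real pipeline the mask comes from the Rust engine, not a hand-written array):

```python
import numpy as np

def masked_softmax(logits, legal_mask):
    """Softmax over legal moves only; illegal entries get probability 0."""
    masked = np.where(legal_mask, logits, -np.inf)
    z = np.exp(masked - masked[legal_mask].max())  # subtract max for stability
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.5, -1.0])
legal = np.array([True, False, True, False])
p = masked_softmax(logits, legal)
assert p[1] == 0.0 and p[3] == 0.0          # illegal moves: zero probability
assert abs(p.sum() - 1.0) < 1e-12           # still a valid distribution
```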

### Resuming adapter training

```bash
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --resume logs/bottleneck_20260315_120000/checkpoints/best.pt
```

### Selective layer placement

Adapters can target specific layers or sublayer positions:

```bash
# Only FFN adapters on layers 4-7
uv run python scripts/train_bottleneck.py \
    --checkpoint checkpoints/pawn-base.pt \
    --pgn data/lichess_1800_1900.pgn \
    --no-adapt-attn --adapter-layers 4,5,6,7
```

Use `--attn-layers` / `--ffn-layers` for independent control of which layers get attention vs FFN adapters.

## Cloud Deployment (Runpod)

The `deploy/` directory provides scripts for managing GPU pods.

### Pod lifecycle with `pod.sh`

```bash
bash deploy/pod.sh create myexp --gpu a5000        # Create a pod
bash deploy/pod.sh deploy myexp                     # Build + transfer + setup
bash deploy/pod.sh launch myexp scripts/train.py --variant base  # Run training
bash deploy/pod.sh ssh myexp                        # SSH in
bash deploy/pod.sh stop myexp                       # Stop (preserves volume)
```

GPU shortcuts: `a5000`, `a40`, `a6000`, `4090`, `5090`, `l40s`, `h100`.

### Manual deployment

If you prefer to deploy manually:

```bash
# 1. Build deploy package locally
bash deploy/build.sh --checkpoint checkpoints/pawn-base.pt --data-dir data/

# 2. Transfer to pod
rsync -avz --progress deploy/pawn-deploy/ root@<pod-ip>:/workspace/pawn/

# 3. Run setup on the pod (installs Rust, uv, builds engine, syncs deps)
ssh root@<pod-ip> 'cd /workspace/pawn && bash deploy/setup.sh'
```

`setup.sh` handles: Rust installation, uv installation, building the chess engine, `uv sync --extra cu128`, and decompressing any zstd-compressed PGN data.

## GPU Auto-Detection

The `pawn.gpu` module auto-detects your GPU and configures:

- **torch.compile**: enabled on CUDA, uses inductor backend
- **AMP**: fp16 automatic mixed precision on CUDA
- **SDPA backend**: flash attention on NVIDIA; MATH backend on AMD (ROCm's flash attention backward has stride mismatches with torch.compile)

No manual flags are needed in most cases. Override with `--no-compile`, `--no-amp`, or `--sdpa-math` if needed.
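
The decision logic amounts to a small table keyed on what the runtime reports. A hedged sketch of that mapping as a pure function (the real `pawn.gpu` module inspects torch directly, and its names and return type may differ):

```python
def autodetect_config(has_cuda, is_rocm):
    """Map detected hardware to the settings described above (illustrative)."""
    if not has_cuda:
        # CPU fallback: no compile, no AMP, default attention path.
        return {"compile": False, "amp": False, "sdpa": "default"}
    return {
        "compile": True,   # torch.compile with the inductor backend
        "amp": True,       # fp16 automatic mixed precision
        # MATH backend on ROCm works around the flash-attention
        # backward stride issue noted above.
        "sdpa": "math" if is_rocm else "flash",
    }
```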

## Monitoring

All training scripts log metrics to JSONL files in `logs/`. Each run creates a timestamped directory (e.g., `logs/bottleneck_20260315_120000/`) containing `metrics.jsonl`.

Every log record includes:

- Training metrics (loss, accuracy, learning rate)
- System resource stats (RAM, GPU VRAM peak/current)
- Timestamps and elapsed time

The JSONL format is one JSON object per line, readable with standard tools:

```bash
# Pretty-print the most recent record
tail -n 1 logs/bottleneck_20260315_120000/metrics.jsonl | python -m json.tool

# Follow live training progress
tail -f logs/bottleneck_20260315_120000/metrics.jsonl
```
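
Because each line is a standalone JSON object, the logs are easy to post-process with the stdlib. For example, pulling the latest record out of a run (field names such as `step` and `loss` are assumptions about the schema):

```python
import json

def last_record(path):
    """Return the most recent metrics record from a JSONL log, or None if empty."""
    last = None
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                last = json.loads(line)
    return last
```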