PAWN-Small
PAWN (Playstyle-Agnostic World-model Network for Chess) is a causal transformer trained on random chess games. It learns legal moves, board state representations, and game dynamics purely from uniformly random legal move sequences -- no strategic play, no hand-crafted features, no external game databases.
This is the small variant (~9.5M parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.
See the GitHub repository for the full source code, training scripts, adapter implementations, and documentation.
All Variants
| Variant | Parameters | Link |
|---|---|---|
| PAWN-Small | ~9.5M | thomas-schweich/pawn-small |
| PAWN (Base) | ~35.8M | thomas-schweich/pawn-base |
| PAWN-Large | ~68.4M | thomas-schweich/pawn-large |
Headline Metrics
PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling -- no model can exceed it. We report accuracy relative to three ceilings to contextualize model performance. For full details, see Accuracy Ceiling Analysis.
| Metric | Value |
|---|---|
| Legal move rate | 99.29% |
| Top-1 accuracy | 6.73% |
Accuracy Ratios
The ceilings below represent the best possible top-1 accuracy under different assumptions about what the model can know. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves.
| Ceiling | Accuracy / Ceiling | Ratio |
|---|---|---|
| Unconditioned (E[1/N_legal] = 6.43%) | 6.73% / 6.43% | 104.7% |
| Naive-conditioned (1-ply filter = 6.44%) | 6.73% / 6.44% | 104.5% |
| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 6.73% / 7.92% | 85.0% |
Unconditioned ceiling (6.43%): The expected accuracy of a predictor that knows only which moves are legal at each position and picks uniformly. A model exceeding this has learned to estimate the number of legal moves and bias predictions toward constrained positions.
Naive-conditioned ceiling (6.44%): An analytical estimate that excludes moves leading to an immediate terminal state inconsistent with the game's actual outcome. This barely exceeds the unconditioned ceiling because immediate terminal states are rare.
Bayes-optimal conditioned ceiling (7.92%): The Monte Carlo estimate of the best achievable accuracy given perfect knowledge of P(outcome | move, history). This is the tightest bound. PAWN's input sequence begins with an outcome token, which leaks information about the game's trajectory. The MCTS ceiling quantifies the maximum benefit of this conditioning.
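A short sketch of why the unconditioned ceiling is E[1/N_legal] rather than 1/E[N_legal]: the distribution of legal-move counts below is made up for illustration (real games have a similar skew, with occasional highly constrained positions), but the inequality it demonstrates is general (Jensen's inequality).

```python
# Toy illustration: the unconditioned ceiling is E[1/N_legal], not
# 1/E[N_legal]. With a skewed distribution of legal-move counts,
# E[1/N] strictly exceeds 1/E[N] (Jensen's inequality), which is why
# biasing predictions toward constrained positions pays off.
import random

random.seed(0)
# Hypothetical legal-move counts: mostly ~30, occasionally forced positions.
counts = [random.choice([2, 3, 20, 30, 35, 40]) for _ in range(10_000)]

mean_n = sum(counts) / len(counts)
ceiling = sum(1 / n for n in counts) / len(counts)  # E[1/N]
print(f"1/E[N] = {1 / mean_n:.4f}")
print(f"E[1/N] = {ceiling:.4f}")  # strictly larger
```

A predictor that knows only the legal move set achieves exactly E[1/N] on average, which is where the 6.43% figure comes from.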
Probe Results
Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features. All probes are single linear layers trained on 2,048 positions and evaluated on 512 held-out positions.
| Probe | Metric | Value (layer 7) | Description |
|---|---|---|---|
| Piece type | Accuracy | 89.1% | Per-square piece type (13 classes x 64 squares) |
| Side to move | Accuracy | 100.0% | Whose turn it is |
| Is check | Accuracy | 94.3% | Whether the side to move is in check |
| Castling rights | Accuracy | 96.5% | KQkq castling availability |
| En passant square | Accuracy | 99.8% | En passant target square (64 + none) |
| Material count | MSE | R²=0.865, MAE=4.87 | Piece counts per type per color |
| Legal move count | MSE | R²=0.307, MAE=7.38 | Number of legal moves available |
| Halfmove clock | MSE | R²=0.133, MAE=3.86 | Plies since last capture or pawn move |
| Game phase | Accuracy | 91.1% | Opening / middlegame / endgame |
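A probe of this kind is just a single linear layer trained on frozen activations. The sketch below uses random tensors as a stand-in for real hidden states and a generic training loop; only the dimensionality (d_model = 256) is taken from the card.

```python
# Minimal linear-probe sketch on frozen hidden states. The hiddens
# and labels here are random stand-ins; in practice they would come
# from the model's layer-7 activations and board-derived labels.
import torch
import torch.nn as nn

d_model, n_classes = 256, 3            # e.g. game phase: opening/middle/end
hiddens = torch.randn(2048, d_model)   # stand-in for frozen layer activations
labels = torch.randint(0, n_classes, (2048,))

probe = nn.Linear(d_model, n_classes)  # the probe is a single linear layer
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(100):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(probe(hiddens), labels)
    loss.backward()
    opt.step()

acc = (probe(hiddens).argmax(-1) == labels).float().mean().item()
```

Because the probe is linear, high accuracy implies the feature is linearly decodable from the frozen representation, not merely computable from it.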
Diagnostic Results
Edge-case diagnostics measure the model's accuracy and legal move rate in specific tactical situations. Positions are extracted from a corpus of random games and evaluated in isolation.
| Category | Positions | Legal Rate | Top-1 Accuracy |
|---|---|---|---|
| In check | 1,000 | 82.4% | -- |
| Double check | 71 | 65.1% | -- |
| Pin restricts movement | 1,000 | 86.2% | -- |
| En passant available | 940 | 97.1% | -- |
| Castling legal (kingside) | 1,000 | 98.8% | -- |
| Castling legal (queenside) | 1,000 | 98.2% | -- |
| Castling blocked by check | 892 | 95.7% | -- |
| Promotion available | 1,000 | 96.2% | -- |
| Checkmate (terminal) | 276 | 66.4% (PAD prob) | -- |
| Stalemate (terminal) | 41 | 53.8% (PAD prob) | -- |
For terminal categories (checkmate, stalemate), the "Legal Rate" column reports the probability the model assigns to the PAD token (i.e., correctly recognizing the game is over). Top-1 accuracy is not applicable at terminal positions.
Architecture
| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| d_model | 256 |
| Layers | 8 |
| Attention heads | 4 |
| Head dimension | 64 |
| d_ff | 1,024 |
| Parameters | ~9.5M |
| Vocabulary | 4,278 tokens |
| Context length | 256 tokens |
| Normalization | Pre-norm RMSNorm |
| FFN | SwiGLU (4x expansion) |
| Positional encoding | Rotary (RoPE, base 10000) |
| Embeddings | Factored (src + dst + promo) |
| Dropout | 0.0 |
The token vocabulary consists of 1 PAD token, 4,096 grid moves (64 source squares x 64 destination squares), 176 promotion moves (44 src/dst pairs x 4 piece types), and 5 outcome tokens (WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE, DRAW_BY_RULE, PLY_LIMIT).
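The vocabulary arithmetic can be checked directly. The token ordering below (PAD first, then grid moves, promotions, outcomes) is an assumption for illustration; the actual mapping is defined in `pawn.config`.

```python
# Sketch of the vocabulary layout described above. The ordering
# (PAD, then grid moves, promotions, outcomes) is assumed for
# illustration; pawn.config defines the real token ids.
PAD = 0
N_GRID = 64 * 64      # 4,096 src x dst grid moves
N_PROMO = 44 * 4      # 176 promotion moves
N_OUTCOME = 5         # outcome tokens

def grid_move_token(src: int, dst: int) -> int:
    """Map a (source, destination) square pair to a token id (assumed layout)."""
    assert 0 <= src < 64 and 0 <= dst < 64
    return 1 + src * 64 + dst

vocab_size = 1 + N_GRID + N_PROMO + N_OUTCOME
print(vocab_size)  # 4278
```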
Each input sequence has the format [outcome] [ply_1] ... [ply_N] [PAD] ... [PAD], where the outcome token is prepended during training so the model can condition on how the game ends. Move embeddings are factored into source square + destination square + promotion piece components, reducing embedding parameters by roughly 32x while providing structural inductive bias.
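The packing format and the embedding saving can both be sketched in a few lines. Token ids below are placeholders, not the real `pawn.config` values, and the promo-table size (5: four pieces plus "none") is an assumption consistent with the vocabulary description.

```python
# Sketch of the input format described above: outcome token first,
# then the game's moves, then PAD out to the 256-token context.
CONTEXT_LEN = 256
PAD_TOKEN = 0

def pack_sequence(outcome_token, move_tokens):
    seq = [outcome_token] + list(move_tokens)
    assert len(seq) <= CONTEXT_LEN, "game longer than context window"
    return seq + [PAD_TOKEN] * (CONTEXT_LEN - len(seq))

seq = pack_sequence(4273, [517, 1042, 2210])  # placeholder token ids

# Back-of-envelope check of the ~32x factored-embedding saving:
# one 4,278-row table vs. src (64) + dst (64) + promo (5) tables.
full, factored = 4278, 64 + 64 + 5
print(f"{full / factored:.1f}x")  # ≈ 32.2x
```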
The model receives no board representation, piece type information, or geometric features. All state tracking is learned internally from move sequences alone.
Training Details
| Parameter | Value |
|---|---|
| Training data | On-the-fly uniformly random legal games (no external dataset) |
| Objective | Next-token cross-entropy (non-padding positions only) |
| Total steps | 100,000 |
| Batch size | 256 |
| Learning rate | 3e-4 (cosine decay with 1,000-step warmup) |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | Mixed (AMP) |
| Max gradient norm | 1.0 |
| Hardware | NVIDIA H200 |
| Chess engine | Rust (shakmaty + rayon), ~43K games/sec |
Training data is generated on-the-fly by a Rust chess engine that plays uniformly random legal moves. Each batch is a fresh set of games produced from a deterministic seed, so no game is seen twice. The engine runs with rayon parallelism and produces batches fast enough to keep the GPU fully saturated.
Games are retroactively prepended with their actual outcome token. The model is not masked to legal moves during training; it must learn which moves are legal based on the sequence of prior moves.
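The objective in the table above (next-token cross-entropy over non-padding positions only) is commonly implemented by ignoring the PAD token in the loss. This is a generic PyTorch illustration with random tensors, not the repo's training loop.

```python
# Sketch of masked next-token cross-entropy: padding positions are
# excluded from the loss via ignore_index. Shapes are illustrative.
import torch
import torch.nn.functional as F

PAD_TOKEN = 0
logits = torch.randn(2, 255, 4278)           # (B, T-1, vocab) predictions
targets = torch.randint(1, 4278, (2, 255))   # shifted-by-one target tokens
targets[:, 200:] = PAD_TOKEN                 # padded tail of each game

loss = F.cross_entropy(
    logits.reshape(-1, 4278),
    targets.reshape(-1),
    ignore_index=PAD_TOKEN,                  # non-padding positions only
)
```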
Usage
Loading the model
```python
import torch
from safetensors.torch import load_file

from pawn.config import CLMConfig
from pawn.model import PAWNCLM

# Initialize the small config and load pretrained weights
cfg = CLMConfig.small()
model = PAWNCLM(cfg).cuda().eval()
weights = load_file("model.safetensors", device="cuda")
model.load_state_dict(weights)
```
Autoregressive generation
```python
import torch

from pawn.config import WHITE_CHECKMATES, PAD_TOKEN

# Start a game conditioned on white delivering checkmate
input_ids = torch.tensor([[WHITE_CHECKMATES]], device="cuda")
pad_mask = torch.ones(1, 1, dtype=torch.bool, device="cuda")
generated = [WHITE_CHECKMATES]

for _ in range(255):
    logits, _ = model.forward_generate(input_ids, pad_mask)
    next_token = logits[0, -1].argmax().item()
    if next_token == PAD_TOKEN:  # model predicts the game is over
        break
    generated.append(next_token)
    # Feed only the newly generated token on subsequent steps
    input_ids = torch.tensor([[next_token]], device="cuda")
    pad_mask = torch.ones(1, 1, dtype=torch.bool, device="cuda")
```
Extracting hidden states for probing
```python
import torch

from pawn.config import CLMConfig
from pawn.model import PAWNCLM

cfg = CLMConfig.small()
model = PAWNCLM(cfg).cuda().eval()
# ... load weights ...

# input_ids: (B, T) tensor of token IDs
# pad_mask: (B, T) boolean tensor (True = real token)
logits, layer_hiddens = model(input_ids, pad_mask)
# layer_hiddens: list of (B, T, d_model) tensors, one per layer
```
Finetuning with an adapter
```bash
# Install dependencies
cd engine && uv run --with maturin maturin develop --release && cd ..
uv sync --extra cu128 --extra dev

# Train a bottleneck adapter on Lichess games
uv run python scripts/train_bottleneck.py \
    --checkpoint path/to/pawn-small \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
```
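A bottleneck adapter of the kind trained above is a small down/up projection with a residual connection, inserted into the frozen backbone. The module below is a generic sketch, not the repo's implementation; only the dimensions (d_model = 256, bottleneck dim 32) follow the card.

```python
# Minimal bottleneck-adapter sketch: down-project, nonlinearity,
# up-project, residual add. Zero-initializing the up projection
# makes the adapter an identity mapping at the start of training.
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    def __init__(self, d_model=256, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)  # identity at init
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

adapter = BottleneckAdapter()
h = torch.randn(1, 10, 256)
assert torch.equal(adapter(h), h)  # no-op until trained
```

Only the adapter's ~17K parameters are trained per playstyle; the ~9.5M-parameter backbone stays frozen.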
Accuracy Ceiling Analysis
The accuracy ratios reported above are derived from a theoretical analysis of the maximum achievable top-1 accuracy on uniformly random chess games. Since each move is drawn uniformly from the legal move set, there is a hard ceiling that no model -- however large -- can exceed.
The unconditioned ceiling (6.43%) is the average of 1/N_legal across all positions in random games: the best a predictor can do without any context beyond the current position's legal moves. The Bayes-optimal conditioned ceiling (7.92%) accounts for the information leaked by the outcome token at position 0, estimated via Monte Carlo rollouts.
Models that exceed the unconditioned ceiling have learned structure beyond simple move legality. The gap between the unconditioned and MCTS ceilings quantifies the value of outcome conditioning. For most games (ply-limit outcomes, which dominate the distribution), the conditioning boost is small (1.07x). For decisive outcomes (checkmate, stalemate), the boost is substantial (2.6x).
Full methodology and per-outcome breakdowns are available in the accuracy ceiling documentation.
Citation
```bibtex
@software{schweich2025pawn,
  author  = {Schweich, Thomas},
  title   = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
  year    = {2025},
  url     = {https://github.com/thomas-schweich/PAWN},
  license = {Apache-2.0}
}
```
License
Apache 2.0. See LICENSE.