PAWN-Base

PAWN (Playstyle-Agnostic World-model Network for Chess) is a causal transformer trained on random chess games. It learns legal moves, board state representations, and game dynamics purely from uniformly random legal move sequences -- no strategic play, no hand-crafted features, no external game databases.

This is the base (default) variant (~35.8M parameters). PAWN is designed as a frozen backbone for parameter-efficient finetuning into player models with arbitrary playstyles.

See the [GitHub repository](https://github.com/thomas-schweich/PAWN) for full source code, training scripts, adapter implementations, and documentation.

All Variants

| Variant | Parameters | Link |
|---|---|---|
| PAWN-Small | ~9.5M | thomas-schweich/pawn-small |
| PAWN (Base) | ~35.8M | thomas-schweich/pawn-base |
| PAWN-Large | ~68.4M | thomas-schweich/pawn-large |

Headline Metrics

PAWN is trained on uniformly random chess games, so top-1 accuracy has a hard theoretical ceiling -- no model can exceed it. We report accuracy relative to three ceilings to contextualize model performance. For full details, see Accuracy Ceiling Analysis.

| Metric | Value |
|---|---|
| Legal move rate | 99.97% |
| Top-1 accuracy | 6.86% |

Accuracy Ratios

The ceilings below represent the best possible top-1 accuracy under different assumptions about what the model can know. Ratios above 100% on the unconditioned ceiling indicate the model has learned structure beyond simply identifying legal moves.

| Ceiling | Accuracy / Ceiling | Ratio |
|---|---|---|
| Unconditioned (E[1/N_legal] = 6.43%) | 6.86% / 6.43% | 106.7% |
| Naive-conditioned (1-ply filter = 6.44%) | 6.86% / 6.44% | 106.5% |
| Bayes-optimal conditioned (MCTS, 32 rollouts = 7.92%) | 6.86% / 7.92% | 86.6% |

Unconditioned ceiling (6.43%): The expected accuracy of a predictor that knows only which moves are legal at each position and picks uniformly. A model exceeding this has learned to estimate the number of legal moves and bias predictions toward constrained positions.

Naive-conditioned ceiling (6.44%): An analytical estimate that excludes moves leading to an immediate terminal state inconsistent with the game's actual outcome. This barely exceeds the unconditioned ceiling because immediate terminal states are rare.

Bayes-optimal conditioned ceiling (7.92%): The Monte Carlo estimate of the best achievable accuracy given perfect knowledge of P(outcome | move, history). This is the tightest bound. PAWN's input sequence begins with an outcome token, which leaks information about the game's trajectory. The MCTS ceiling quantifies the maximum benefit of this conditioning.
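Concretely, the unconditioned ceiling is just the mean of 1/N_legal across positions. A minimal stdlib sketch, with hypothetical legal-move counts standing in for a real corpus sampled from random games:

```python
import random

def unconditioned_ceiling(legal_move_counts):
    """Best possible top-1 accuracy for a predictor that knows only the
    legal move set: it picks one legal move uniformly, so its expected
    accuracy at a position with n legal moves is 1/n."""
    return sum(1.0 / n for n in legal_move_counts) / len(legal_move_counts)

# Hypothetical counts for illustration; the real 6.43% estimate requires
# sampling positions from uniformly random games with a chess engine.
random.seed(0)
counts = [random.randint(1, 40) for _ in range(100_000)]
ceiling = unconditioned_ceiling(counts)
```

Note the ceiling is dominated by constrained positions: a position with 2 legal moves contributes 0.5 to the average, while one with 40 contributes only 0.025.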

Probe Results

Linear probes trained on frozen hidden states measure how well the model's internal representations encode board-level features. All probes are single linear layers trained on 2,048 positions and evaluated on 512 held-out positions.

| Probe | Result | Description |
|---|---|---|
| Piece type | 89.7% | Per-square piece type (13 classes x 64 squares) |
| Side to move | 100.0% | Whose turn it is |
| Is check | 94.2% | Whether the side to move is in check |
| Castling rights | 96.6% | KQkq castling availability |
| En passant square | 99.7% | En passant target square (64 + none) |
| Material count | R² = 0.861, MAE = 6.08 | Piece counts per type per color |
| Legal move count | R² = 0.379, MAE = 6.85 | Number of legal moves available |
| Halfmove clock | R² = 0.118, MAE = 4.08 | Plies since last capture or pawn move |
| Game phase | 90.7% | Opening / middlegame / endgame |
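Schematically, each probe is a single linear layer trained on frozen hidden states. The sketch below illustrates the setup with synthetic data in place of real hidden states and labels; names and shapes are assumptions, not the repo's actual probe code:

```python
import torch
import torch.nn as nn

d_model = 512
torch.manual_seed(0)

# Stand-in for frozen hidden states at probed positions: (N, d_model).
# In practice these come from the model's forward pass with gradients off.
hiddens = torch.randn(2048, d_model)
labels = torch.randint(0, 2, (2048,))  # e.g. a binary "is check" target

probe = nn.Linear(d_model, 2)  # a single linear layer, no hidden layers
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(50):
    opt.zero_grad()
    loss = loss_fn(probe(hiddens), labels)
    loss.backward()
    opt.step()

acc = (probe(hiddens).argmax(-1) == labels).float().mean().item()
```

Because only the linear layer is trained, probe accuracy measures how linearly decodable each feature is from the frozen representations, not how learnable it is in general.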

Diagnostic Results

Edge-case diagnostics measure the model's accuracy and legal move rate in specific tactical situations. Positions are extracted from a corpus of random games and evaluated in isolation.

| Category | Positions | Legal Rate |
|---|---|---|
| In check | 1,000 | 97.7% |
| Double check | 71 | 91.2% |
| Pin restricts movement | 1,000 | 97.2% |
| En passant available | 940 | 99.2% |
| Castling legal (kingside) | 1,000 | 99.7% |
| Castling legal (queenside) | 1,000 | 99.6% |
| Castling blocked by check | 892 | 99.4% |
| Promotion available | 1,000 | 99.4% |
| Checkmate (terminal) | 276 | 91.2% (PAD prob) |
| Stalemate (terminal) | 41 | 84.2% (PAD prob) |

For terminal positions (checkmate, stalemate), there are no legal moves. The "Legal Rate" column instead reports the probability the model assigns to the PAD token -- i.e., how often it correctly recognizes the game is over.
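A sketch of how such a PAD-probability metric could be computed from the logits at a terminal position (softmax is standard; the vocabulary size and PAD index below are toy assumptions):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

PAD_TOKEN = 0  # assumed index for illustration; see pawn.config for the real value

# Toy logits over a 6-token vocabulary at a terminal position:
# a well-trained model should concentrate its mass on PAD here.
logits = [5.0, 1.0, 0.5, -1.0, 0.2, -0.3]
pad_prob = softmax(logits)[PAD_TOKEN]
```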

Architecture

| Parameter | Value |
|---|---|
| Architecture | Decoder-only transformer |
| d_model | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Head dimension | 64 |
| d_ff | 2,048 |
| Parameters | ~35.8M |
| Vocabulary | 4,278 tokens |
| Context length | 256 tokens |
| Normalization | Pre-norm RMSNorm |
| FFN | SwiGLU (4x expansion) |
| Positional encoding | Rotary (RoPE, base 10000) |
| Embeddings | Factored (src + dst + promo) |
| Dropout | 0.0 |
The token vocabulary consists of 1 PAD token, 4,096 grid moves (64 source squares x 64 destination squares), 176 promotion moves (44 src/dst pairs x 4 piece types), and 5 outcome tokens (WHITE_CHECKMATES, BLACK_CHECKMATES, STALEMATE, DRAW_BY_RULE, PLY_LIMIT).

Each input sequence has the format [outcome] [ply_1] ... [ply_N] [PAD] ... [PAD], where the outcome token is prepended during training so the model can condition on how the game ends. Move embeddings are factored into source square + destination square + promotion piece components, reducing embedding parameters by roughly 32x while providing structural inductive bias.
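The vocabulary arithmetic and the ~32x embedding reduction can be checked directly. The component counts come from the text above; the factored table sizes (64 source squares, 64 destination squares, 4 promotion pieces plus a "no promotion" slot) are an illustrative assumption:

```python
d_model = 512

# Vocabulary components as stated in the model card
pad, grid, promo, outcomes = 1, 64 * 64, 44 * 4, 5
vocab = pad + grid + promo + outcomes  # 4,278 tokens

# Full embedding table vs. factored src + dst + promo tables
full_params = vocab * d_model
factored_params = (64 + 64 + 5) * d_model
ratio = full_params / factored_params  # roughly 32x fewer parameters
```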

The model receives no board representation, piece type information, or geometric features. All state tracking is learned internally from move sequences alone.

Training Details

| Parameter | Value |
|---|---|
| Training data | On-the-fly uniformly random legal games (no external dataset) |
| Objective | Next-token cross-entropy (non-padding positions only) |
| Total steps | 100,000 |
| Batch size | 256 |
| Games seen | 25,600,000 |
| Learning rate | 3e-4 (cosine decay with 1,000-step warmup) |
| Optimizer | AdamW (weight decay 0.01) |
| Precision | Mixed (AMP) |
| Max gradient norm | 1.0 |
| Hardware | NVIDIA H200 |
| Chess engine | Rust (shakmaty + rayon), ~43K games/sec |

Training data is generated on-the-fly by a Rust chess engine that plays uniformly random legal moves. Each batch is a fresh set of games produced from a deterministic seed, so no game is seen twice. The engine runs with rayon parallelism and produces batches fast enough to keep the GPU fully saturated.

Games are retroactively prepended with their actual outcome token. The model is not masked to legal moves during training; it must learn which moves are legal based on the sequence of prior moves.
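The learning-rate schedule above (linear warmup for 1,000 steps, then cosine decay over the remaining steps) can be sketched as follows; whether the real schedule decays to exactly zero is an assumption:

```python
import math

PEAK_LR = 3e-4
WARMUP = 1_000
TOTAL = 100_000

def lr_at(step):
    """Linear warmup to PEAK_LR over WARMUP steps, then cosine decay
    toward 0 at TOTAL steps."""
    if step < WARMUP:
        return PEAK_LR * step / WARMUP
    progress = (step - WARMUP) / (TOTAL - WARMUP)
    return PEAK_LR * 0.5 * (1 + math.cos(math.pi * progress))
```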

Usage

Loading the model

```python
import torch
from safetensors.torch import load_file
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

# Initialize and load weights
cfg = CLMConfig.base()
model = PAWNCLM(cfg).cuda().eval()
weights = load_file("model.safetensors", device="cuda")
model.load_state_dict(weights)
```

Autoregressive generation

```python
import torch
from pawn.config import WHITE_CHECKMATES, PAD_TOKEN

# Start a game conditioned on white delivering checkmate
generated = [WHITE_CHECKMATES]
for _ in range(255):
    # Feed the full token history each step so the model can track state
    input_ids = torch.tensor([generated], device="cuda")
    pad_mask = torch.ones_like(input_ids, dtype=torch.bool)
    logits, _ = model.forward_generate(input_ids, pad_mask)
    # Greedy decode: take the most likely next token
    next_token = logits[0, -1].argmax().item()
    if next_token == PAD_TOKEN:
        break  # model predicts the game is over
    generated.append(next_token)
```

Extracting hidden states for probing

```python
import torch
from pawn.config import CLMConfig
from pawn.model import PAWNCLM

cfg = CLMConfig.base()
model = PAWNCLM(cfg).cuda().eval()
# ... load weights ...

# input_ids: (B, T) tensor of token IDs
# pad_mask: (B, T) boolean tensor (True = real token)
logits, layer_hiddens = model(input_ids, pad_mask)
# layer_hiddens: list of (B, T, d_model) tensors, one per layer
```

Finetuning with an adapter

```bash
# Install dependencies
cd engine && uv run --with maturin maturin develop --release && cd ..
uv sync --extra cu128 --extra dev

# Train a bottleneck adapter on Lichess games
uv run python scripts/train_bottleneck.py \
    --checkpoint path/to/pawn-base \
    --pgn data/lichess_1800_1900.pgn \
    --bottleneck-dim 32 --lr 1e-4 --local-checkpoints
```
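Schematically, a bottleneck adapter inserts a small residual MLP into each frozen transformer block, so only the adapter's parameters are trained. The module below is an illustrative sketch, not the repo's implementation:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project, nonlinearity, up-project, residual connection.
    With d_model=512 and bottleneck_dim=32 this adds ~33K parameters
    per adapter while the backbone stays frozen."""

    def __init__(self, d_model=512, bottleneck_dim=32):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_model)
        # Zero-init the up-projection so the adapter starts as identity
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.nn.functional.gelu(self.down(x)))
```

The zero-initialized up-projection means the finetuned model exactly reproduces the frozen backbone at step 0, a common choice for stable adapter training.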

Accuracy Ceiling Analysis

The accuracy ratios reported above are derived from a theoretical analysis of the maximum achievable top-1 accuracy on uniformly random chess games. Since each move is drawn uniformly from the legal move set, there is a hard ceiling that no model -- however large -- can exceed.

The unconditioned ceiling (6.43%) is the average of 1/N_legal across all positions in random games: the best a predictor can do without any context beyond the current position's legal moves. The Bayes-optimal conditioned ceiling (7.92%) accounts for the information leaked by the outcome token at position 0, estimated via Monte Carlo rollouts.

Models that exceed the unconditioned ceiling have learned structure beyond simple move legality. The gap between the unconditioned and MCTS ceilings quantifies the value of outcome conditioning. For most games (ply-limit outcomes, which dominate the distribution), the conditioning boost is small (1.07x). For decisive outcomes (checkmate, stalemate), the boost is substantial (2.6x).

Full methodology and per-outcome breakdowns are available in the accuracy ceiling documentation.

Citation

```bibtex
@software{schweich2025pawn,
  author = {Schweich, Thomas},
  title = {{PAWN}: Playstyle-Agnostic World-model Network for Chess},
  year = {2025},
  url = {https://github.com/thomas-schweich/PAWN},
  license = {Apache-2.0}
}
```

License

Apache 2.0. See LICENSE.

Evaluation results

All metrics are self-reported on random legal chess games.

| Metric | Value |
|---|---|
| Legal move rate | 99.97% |
| Accuracy / unconditioned ceiling | 1.067 |
| Accuracy / Bayes-optimal ceiling | 0.866 |
| Top-1 accuracy | 0.069 |
| Top-5 accuracy | 0.278 |
| Validation loss | 3.096 |
| Games seen | 25,600,000 |