# Theoretical Accuracy Ceiling
PAWN is trained on uniformly random chess games. Since each move is drawn uniformly from the legal move set, top-1 accuracy has a hard theoretical ceiling: no model, however large, can exceed it.
## Three ceilings
### Unconditional ceiling: E[1/N_legal] = 6.43%
At each position, the move is drawn uniformly from N legal moves. The best a predictor can do without any context is pick one at random: accuracy = 1/N. Averaged over all positions in random games, this gives 6.43%.
A model that exceeds this ceiling has learned something beyond just "which moves are legal": it has learned to estimate the number of legal moves at each position and bias predictions toward positions with fewer options.
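The averaging step can be sketched as follows; the legal-move counts here are hypothetical stand-ins for the counts the real pipeline collects by replaying random games:

```python
from statistics import mean

# Hypothetical legal-move counts at positions sampled from random games
# (the real numbers come from replaying games and counting legal moves).
legal_move_counts = [20, 31, 28, 4, 15, 38, 22, 9, 26, 33]

# The best context-free predictor guesses uniformly, so its accuracy at a
# position with N legal moves is 1/N; the ceiling is the mean over positions.
unconditional_ceiling = mean(1 / n for n in legal_move_counts)
print(f"{unconditional_ceiling:.4f}")  # 0.0686
```

Note that E[1/N] is not 1/E[N]: positions with few legal moves (checks, endgames) pull the ceiling up disproportionately.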
### Naive conditional ceiling: 6.44%
A zero-cost analytical estimate of the outcome-conditioning benefit: at each position, legal moves that lead to an immediate terminal state with a different outcome than the actual game are excluded, and accuracy = 1/(N_legal - N_wrong).
This barely exceeds the unconditional ceiling (a 1.00x boost) because immediate terminal states are rare: most moves at most positions lead to non-terminal continuations, so the filter has almost nothing to exclude.
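A minimal sketch of the 1-ply filter, with hypothetical per-position counts (in the real data N_wrong is almost always 0, which is why the boost is negligible):

```python
from statistics import mean

# Hypothetical (N_legal, N_wrong) pairs, where N_wrong counts legal moves
# that would immediately end the game with a different outcome than the
# one actually observed.
positions = [(20, 0), (31, 0), (28, 0), (15, 0), (22, 0), (26, 1)]

uncond = mean(1 / n for n, _ in positions)
naive = mean(1 / (n - w) for n, w in positions)
print(f"boost: {naive / uncond:.4f}x")  # barely above 1x
```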
### MCTS conditional ceiling: 7.92%
The full Monte Carlo estimate. At each sampled position, every legal move is tried and 32 random continuations are played out to estimate P(outcome | move, history). The Bayes-optimal predictor picks the move most consistent with the known outcome.
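The per-position ceiling implied by this procedure can be sketched as below; the rollout probabilities are hypothetical stand-ins for what the 32 playouts per move would estimate:

```python
# Hypothetical rollout estimates of P(actual outcome | move, history) for a
# position with 4 legal moves (the real script estimates these with 32
# random playouts per legal move).
p_outcome_given_move = [0.25, 0.40, 0.10, 0.25]

# With a uniform prior over legal moves, the posterior on the played move is
# p[m] / sum(p); the Bayes-optimal predictor picks the argmax, so the
# per-position ceiling is the largest posterior.
ceiling_here = max(p_outcome_given_move) / sum(p_outcome_given_move)
print(f"{ceiling_here:.2f}")  # 0.40, versus the unconditional 1/4 = 0.25
```

Averaging this quantity over sampled positions gives the 7.92% figure.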
PAWN's input sequence begins with an outcome token (WHITE_CHECKMATES, STALEMATE, PLY_LIMIT, etc.). This leaks information about the game's trajectory, making some moves more predictable:
- Checkmate games: The final move must deliver checkmate. Knowing this raises the ceiling at the last ply from ~5% to ~14%.
- Ply limit games: Knowing the game lasts 255 plies constrains the move distribution slightly.
- Stalemate games: The final position has no legal moves but isn't check, which strongly constrains late moves.
## Adjusted accuracy
| Metric | Value |
|---|---|
| Unconditional ceiling (E[1/N_legal]) | 6.43% |
| Naive conditional ceiling (1-ply filter) | 6.44% |
| MCTS conditional ceiling (32 rollouts) | 7.92% |
| Conditioning boost (naive) | 1.00x |
| Conditioning boost (MCTS) | 1.23x |
For a model with top-1 accuracy A:
- Adjusted (unconditional) = A / 6.43%: measures how much the model has learned about chess legality. Values > 100% mean it has learned structure beyond just legal moves.
- Adjusted (naive conditional) = A / 6.44%: essentially the same as unconditional; confirms that 1-ply lookahead explains almost none of the outcome conditioning benefit.
- Adjusted (MCTS conditional) = A / 7.92%: measures how close the model is to the Bayes-optimal predictor with perfect outcome knowledge. This is the tighter bound.
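The three definitions above amount to dividing by the corresponding ceiling; a small sketch (the helper name is ours, not the project's):

```python
# Ceilings from the table above, as fractions.
CEILINGS = {"uncond": 0.0643, "naive": 0.0644, "mcts": 0.0792}

def adjusted(top1: float) -> dict[str, float]:
    """Express a model's top-1 accuracy as a fraction of each ceiling."""
    return {name: top1 / c for name, c in CEILINGS.items()}

scores = adjusted(0.069)  # the large/base models' top-1 accuracy
print({k: f"{v:.0%}" for k, v in scores.items()})
# {'uncond': '107%', 'naive': '107%', 'mcts': '87%'}
```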
## Current model results (step ~69K)
| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
|---|---|---|---|---|
| large (68M) | 6.9% | 107% | 107% | 87% |
| base (36M) | 6.9% | 107% | 107% | 87% |
| small (10M) | 6.5% | 101% | 101% | 82% |
All models exceed the unconditional and naive conditional ceilings, confirming they learn chess structure beyond move legality. The large and base models reach 87% of the MCTS conditional ceiling.
## Per-outcome breakdown
| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
|---|---|---|---|---|
| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |
The naive conditional ceiling equals the unconditional ceiling across all outcome types, confirming the 1-ply filter never fires in practice. The MCTS ceiling shows the real conditioning benefit: decisive outcomes (checkmate, stalemate, insufficient material) get a ~2.6x boost, while ply limit games, the vast majority, show only 1.07x because knowing the game goes the distance provides minimal per-move information.
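As a sanity check, the aggregate 7.92% MCTS ceiling should be recoverable as the position-weighted mean of the per-outcome ceilings in the table:

```python
# MCTS ceiling (%) and position count per outcome, from the table above.
per_outcome = {
    "white_checkmated": (13.79, 328),
    "black_checkmated": (13.64, 388),
    "stalemate": (18.67, 125),
    "insufficient_material": (18.61, 256),
    "ply_limit": (6.97, 8618),
}

total = sum(n for _, n in per_outcome.values())
weighted = sum(c * n for c, n in per_outcome.values()) / total
print(f"{weighted:.2f}%")  # 7.92%, matching the aggregate ceiling
```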
## Reproducing
```bash
# Default: 2000 games, 32 rollouts/move, 2% sample rate
uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069

# Higher precision (slower)
uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
```
Results are saved to `data/theoretical_ceiling.json`.
## Caveats
- The MCTS ceiling is an estimate, not exact. With more rollouts and higher sample rates, the estimate improves but computation time increases quadratically.
- The ceiling assumes the model has perfect knowledge of P(outcome | move, history). In practice, the model must learn this from data, so the achievable accuracy for a finite model is somewhat below the ceiling.
- Game length information is implicit in the outcome token (e.g., PLY_LIMIT implies 255 plies). A model could theoretically use position in the sequence to estimate remaining game length, further improving predictions.