# Theoretical Accuracy Ceiling
PAWN is trained on uniformly random chess games. Since each move is drawn
uniformly from the legal move set, top-1 accuracy has a hard theoretical
ceiling: no model, however large, can exceed it.
## Three ceilings
### Unconditional ceiling: E[1/N_legal] = 6.43%
At each position, the move is drawn uniformly from N legal moves. The best
a predictor can do without any context is pick one at random: accuracy = 1/N.
Averaged over all positions in random games, this gives **6.43%**.
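The expectation is just the mean of 1/N over sampled positions. A minimal sketch, using an illustrative list of per-position legal-move counts in place of the real random-game sample:

```python
# Unconditional ceiling: the true move is uniform over N legal moves, so
# the best context-free predictor scores 1/N per position.
# `legal_move_counts` is illustrative data, not the actual sample.
legal_move_counts = [20, 20, 28, 31, 14, 3, 35, 22, 9, 41]

unconditional_ceiling = sum(1 / n for n in legal_move_counts) / len(legal_move_counts)
print(f"{unconditional_ceiling:.4f}")  # mean of 1/N over sampled positions
```

On the real sample of random-game positions this average comes out to 6.43%.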
A model that exceeds this ceiling has learned something beyond just "which
moves are legal": it has learned to estimate the number of legal moves at
each position and bias predictions toward positions with fewer options.
### Naive conditional ceiling: 6.44%
A zero-cost analytical estimate of outcome conditioning. At each position,
legal moves that lead to an immediate terminal state with a *different*
outcome than the actual game are excluded, and accuracy = 1/(N_legal - N_wrong).
This barely exceeds the unconditional ceiling (a 1.00x boost) because
immediate terminal states are rare: most moves at most positions lead to
non-terminal continuations, so the filter has almost nothing to exclude.
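The mechanics of the 1-ply filter can be sketched directly from the formula. The `(N_legal, N_wrong)` pairs below are illustrative; in practice `N_wrong` is almost always 0, which is why the boost is 1.00x:

```python
# Naive conditional ceiling: exclude legal moves whose immediate result
# contradicts the known game outcome, then score 1/(N_legal - N_wrong).
# Tuples are (N_legal, N_wrong), with made-up values for illustration.
positions = [(20, 0), (28, 0), (31, 0), (14, 0), (3, 1), (35, 0)]

naive_ceiling = sum(1 / (n - w) for n, w in positions) / len(positions)
uncond = sum(1 / n for n, _ in positions) / len(positions)
print(f"naive={naive_ceiling:.4f}  unconditional={uncond:.4f}")
```

With all `N_wrong` equal to 0, the two averages coincide exactly, matching the 6.43% vs 6.44% figures above.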
### MCTS conditional ceiling: 7.92%
The full Monte Carlo estimate. At each sampled position, every legal move is
tried and 32 random continuations are played out to estimate
P(outcome | move, history). The Bayes-optimal predictor picks the move most
consistent with the known outcome.
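The Bayes-optimal accuracy at a position follows from the rollout estimates. Since the true move is uniform over legal moves, conditioning on the outcome gives P(move | outcome) = p_m / Σ p_k, and the optimal predictor scores the maximum of that distribution. A sketch with hypothetical rollout counts:

```python
# Bayes-optimal prediction with outcome conditioning (illustrative numbers).
# For each legal move m, rollouts estimate p_m = P(known outcome | move m).

def bayes_optimal_accuracy(rollout_hits, rollouts=32):
    """rollout_hits[m] = how many of the `rollouts` playouts after move m
    ended with the known outcome. Returns the optimal per-position accuracy."""
    probs = [h / rollouts for h in rollout_hits]
    total = sum(probs)
    if total == 0:  # no rollout reproduced the outcome: fall back to uniform
        return 1 / len(rollout_hits)
    return max(probs) / total

# Example: 4 legal moves; one (say, a forced checkmate) always yields the outcome.
print(bayes_optimal_accuracy([32, 2, 1, 1]))  # ≈ 0.889, far above uniform 0.25
```

Averaging this quantity over sampled positions yields the 7.92% figure.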
PAWN's input sequence begins with an outcome token (`WHITE_CHECKMATES`,
`STALEMATE`, `PLY_LIMIT`, etc.). This leaks information about the game's
trajectory, making some moves more predictable:
- **Checkmate games**: The final move must deliver checkmate. Knowing this
raises the ceiling at the last ply from ~5% to ~14%.
- **Ply limit games**: Knowing the game lasts 255 plies constrains the move
distribution slightly.
- **Stalemate games**: The final position has no legal moves but isn't check,
  which is very constraining on late moves.
## Adjusted accuracy
| Metric | Value |
|--------|-------|
| Unconditional ceiling (E[1/N_legal]) | 6.43% |
| Naive conditional ceiling (1-ply filter) | 6.44% |
| MCTS conditional ceiling (32 rollouts) | 7.92% |
| Conditioning boost (naive) | 1.00x |
| Conditioning boost (MCTS) | 1.23x |
For a model with top-1 accuracy A:
- **Adjusted (unconditional)** = A / 6.43%: measures how much the model
  has learned about chess legality. Values > 100% mean it has learned
  structure beyond just legal moves.
- **Adjusted (naive conditional)** = A / 6.44%: essentially the same as
  unconditional; confirms that 1-ply lookahead explains almost none of the
  outcome conditioning benefit.
- **Adjusted (MCTS conditional)** = A / 7.92%: measures how close the
  model is to the Bayes-optimal predictor with perfect outcome knowledge.
  This is the tighter bound.
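The adjusted scores are plain ratios; a one-liner reproduces the table rows below (ceiling values from this document, `adjusted` is a hypothetical helper, not part of the repo's scripts):

```python
# Adjusted accuracy = model top-1 accuracy divided by the relevant ceiling.
CEILINGS = {"unconditional": 0.0643, "naive": 0.0644, "mcts": 0.0792}

def adjusted(top1):
    return {name: top1 / ceiling for name, ceiling in CEILINGS.items()}

scores = adjusted(0.0694)  # large (68M) model's top-1 accuracy
print({k: f"{v:.0%}" for k, v in scores.items()})
```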
### Final model results (100K steps)
| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
|---------|-------|-----------|---------------|--------------|
| large (68M) | 6.94% | 108% | 108% | 88% |
| base (36M) | 6.86% | 107% | 107% | 87% |
| small (10M) | 6.73% | 105% | 105% | 85% |
All models exceed the unconditional and naive conditional ceilings,
confirming they learn chess structure beyond move legality. The large and
base models reach 87-88% of the MCTS conditional ceiling.
## Per-outcome breakdown
| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
|---------|--------|------------|-----------|-----------|
| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |
The naive conditional ceiling equals the unconditional ceiling across all
outcome types; the 1-ply filter never fires in practice. The MCTS ceiling
shows the real conditioning benefit: decisive outcomes (checkmate, stalemate,
insufficient material) get a ~2.6x boost, while ply limit games, the vast
majority, show only 1.07x because knowing the game goes the distance
provides minimal per-move information.
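The boost figures are the ratio of the MCTS conditional ceiling to the unconditional ceiling; checking a few rows of the table above:

```python
# Conditioning boost = MCTS conditional ceiling / unconditional ceiling,
# per outcome type (values taken from the per-outcome table).
rows = {
    "white_checkmated": (0.0526, 0.1379),
    "black_checkmated": (0.0502, 0.1364),
    "ply_limit": (0.0651, 0.0697),
}
boosts = {k: mcts / uncond for k, (uncond, mcts) in rows.items()}
print({k: f"{v:.2f}x" for k, v in boosts.items()})
```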
## Reproducing
```bash
# Default: 2000 games, 32 rollouts/move, 2% sample rate
uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069
# Higher precision (slower)
uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
```
Results are saved to `data/theoretical_ceiling.json`.
## Caveats
- The MCTS ceiling is an estimate, not exact. With more rollouts and higher
  sample rates the estimate improves, but computation time scales with the
  product of the two.
- The ceiling assumes the model has perfect knowledge of P(outcome | move,
history). In practice, the model must learn this from data, so the
achievable accuracy for a finite model is somewhat below the ceiling.
- Game length information is implicit in the outcome token (e.g., PLY_LIMIT
implies 255 plies). A model could theoretically use position in the
sequence to estimate remaining game length, further improving predictions.