# Theoretical Accuracy Ceiling
|
|
PAWN is trained on uniformly random chess games. Since each move is drawn
uniformly from the legal move set, top-1 accuracy has a hard theoretical
ceiling: no model, however large, can exceed it.
|
|
## Three ceilings
|
|
### Unconditional ceiling: E[1/N_legal] = 6.43%

At each position, the move is drawn uniformly from N legal moves. The best
a predictor can do without any context is pick one at random: accuracy = 1/N.
Averaged over all positions in random games, this gives **6.43%**.

A model that exceeds this ceiling has learned something beyond just "which
moves are legal": it has learned to estimate the number of legal moves at
each position and to bias predictions toward positions with fewer options.

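The expectation behind this number can be sketched in a few lines. The
legal-move counts below are illustrative placeholders, not real data from
the random-game corpus:

```python
# Unconditional ceiling: a uniform-random guesser over N legal moves is
# right with probability 1/N, so the ceiling is E[1/N_legal] over positions.

def unconditional_ceiling(legal_move_counts):
    """Mean of 1/N over the sampled positions."""
    return sum(1 / n for n in legal_move_counts) / len(legal_move_counts)

# Illustrative counts only (real values come from positions sampled out of
# random games): a mix of open positions and cramped ones.
counts = [20, 31, 28, 4, 35, 12, 22, 2, 30, 9]
print(f"{unconditional_ceiling(counts):.2%}")  # -> 11.70% on this toy sample
```

Because 1/N is convex, Jensen's inequality makes E[1/N] larger than
1/E[N]: a few cramped positions (checks, near-stalemates) raise the
ceiling more than open positions lower it.
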
### Naive conditional ceiling: 6.44%

A zero-cost analytical estimate of outcome conditioning. At each position,
legal moves that lead to an immediate terminal state with a *different*
outcome than the actual game are excluded, and accuracy = 1/(N_legal - N_wrong).

This barely exceeds the unconditional ceiling (a 1.00x boost) because
immediate terminal states are rare: most moves at most positions lead to
non-terminal continuations, so the filter has almost nothing to exclude.

### MCTS conditional ceiling: 7.92%

The full Monte Carlo estimate. At each sampled position, every legal move is
tried and 32 random continuations are played out to estimate
P(outcome | move, history). The Bayes-optimal predictor picks the move most
consistent with the known outcome.

PAWN's input sequence begins with an outcome token (`WHITE_CHECKMATES`,
`STALEMATE`, `PLY_LIMIT`, etc.). This leaks information about the game's
trajectory, making some moves more predictable:
|
|
- **Checkmate games**: The final move must deliver checkmate. Knowing this
  raises the ceiling at the last ply from ~5% to ~14%.
- **Ply limit games**: Knowing the game lasts 255 plies constrains the move
  distribution slightly.
- **Stalemate games**: The final position has no legal moves but is not
  check, which is very constraining on late moves.
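
The rollout estimator itself is easy to demonstrate on a toy stand-in for
chess: a random +1/-1 walk whose "outcome token" is the sign of the final
sum. Everything below is a made-up toy, not the logic of
`scripts/compute_theoretical_ceiling.py`; it only illustrates the idea of
trying each candidate move, playing random continuations to estimate
P(outcome | move), and predicting the move most consistent with the known
outcome:

```python
import random

MOVES = (+1, -1)      # toy "legal move" set, uniform like the random games
PLIES = 10            # fixed game length

def outcome(total):
    """Toy outcome token: sign of the final running sum."""
    return "pos" if total > 0 else "neg" if total < 0 else "zero"

def play_rest(rng, start, plies):
    """Finish a game from `start` with uniformly random moves."""
    return start + sum(rng.choice(MOVES) for _ in range(plies))

def bayes_predict(rng, known_outcome, rollouts=64):
    """Try each first move, roll out random continuations, and pick the
    move whose rollouts most often reproduce the known outcome
    (ties, e.g. for the 'zero' outcome, break arbitrarily)."""
    def hits(move):
        return sum(
            outcome(play_rest(rng, move, PLIES - 1)) == known_outcome
            for _ in range(rollouts)
        )
    return max(MOVES, key=hits)

rng = random.Random(0)
games, correct = 2000, 0
for _ in range(games):
    first = rng.choice(MOVES)
    final = play_rest(rng, first, PLIES - 1)
    correct += bayes_predict(rng, outcome(final)) == first
print(f"outcome-conditioned accuracy: {correct / games:.1%} vs 50.0% unconditional")
```

With two moves the unconditional ceiling is exactly 1/2; conditioning on
the outcome lifts accuracy well above it, for the same reason knowing
`WHITE_CHECKMATES` makes late moves in a chess game more predictable.
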
|
|
## Adjusted accuracy
|
|
| Metric | Value |
|--------|-------|
| Unconditional ceiling (E[1/N_legal]) | 6.43% |
| Naive conditional ceiling (1-ply filter) | 6.44% |
| MCTS conditional ceiling (32 rollouts) | 7.92% |
| Conditioning boost (naive) | 1.00x |
| Conditioning boost (MCTS) | 1.23x |

For a model with top-1 accuracy A:

- **Adjusted (unconditional)** = A / 6.43%. Measures how much the model
  has learned about chess legality; values above 100% mean it has learned
  structure beyond just legal moves.
- **Adjusted (naive conditional)** = A / 6.44%. Essentially the same as
  unconditional; confirms that 1-ply lookahead explains almost none of the
  outcome conditioning benefit.
- **Adjusted (MCTS conditional)** = A / 7.92%. Measures how close the
  model is to the Bayes-optimal predictor with perfect outcome knowledge.
  This is the tighter bound.

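As a sanity check, these ratios reproduce the large-model row of the
results table below:

```python
# Adjusted accuracy = raw top-1 divided by each ceiling; values are for
# the large (68M) model at 100K steps.
A = 0.0694                      # model top-1 accuracy
ceilings = {
    "unconditional": 0.0643,
    "naive conditional": 0.0644,
    "MCTS conditional": 0.0792,
}
for name, c in ceilings.items():
    print(f"vs {name}: {A / c:.0%}")  # -> 108%, 108%, 88%
```
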
### Final model results (100K steps)

| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
|---------|-------|-----------|---------------|--------------|
| large (68M) | 6.94% | 108% | 108% | 88% |
| base (36M) | 6.86% | 107% | 107% | 87% |
| small (10M) | 6.73% | 105% | 105% | 85% |

All models exceed the unconditional and naive conditional ceilings,
confirming they learn chess structure beyond move legality. The large and
base models reach 87-88% of the MCTS conditional ceiling.

## Per-outcome breakdown

| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
|---------|--------|------------|-----------|-----------|
| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |

The naive conditional ceiling equals the unconditional ceiling across all
outcome types: the 1-ply filter never fires in practice. The MCTS ceiling
shows the real conditioning benefit: decisive outcomes (checkmate, stalemate,
insufficient material) get a roughly 2.6x boost, while ply limit games (the
vast majority) show only 1.07x, because knowing the game goes the distance
provides minimal per-move information.
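
The headline ceilings are just position-weighted averages of the
per-outcome rows, which a few lines of arithmetic (values copied from the
table above) can confirm:

```python
# (outcome, uncond %, MCTS cond %, positions), from the table above.
rows = [
    ("White checkmated", 5.26, 13.79, 328),
    ("Black checkmated", 5.02, 13.64, 388),
    ("Stalemate", 7.22, 18.67, 125),
    ("Insufficient material", 7.17, 18.61, 256),
    ("Ply limit", 6.51, 6.97, 8618),
]
total = sum(n for *_, n in rows)
uncond = sum(u * n for _, u, _, n in rows) / total
mcts = sum(m * n for _, _, m, n in rows) / total
print(f"weighted unconditional: {uncond:.2f}%")  # matches the 6.43% headline
print(f"weighted MCTS:          {mcts:.2f}%")    # matches the 7.92% headline
print(f"boost on checkmates:    {13.79 / 5.26:.1f}x")  # the ~2.6x boost
```
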

## Reproducing

```bash
# Default: 2000 games, 32 rollouts/move, 2% sample rate
uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069

# Higher precision (slower)
uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
```

Results are saved to `data/theoretical_ceiling.json`.
|
|
## Caveats
|
|
- The MCTS ceiling is an estimate, not exact. More rollouts and a higher
  sample rate improve the estimate, but computation time grows with the
  product of the two settings.
- The ceiling assumes the predictor has perfect knowledge of P(outcome | move,
  history). In practice, a model must learn this from data, so the
  achievable accuracy for a finite model is somewhat below the ceiling.
- Game length information is implicit in the outcome token (e.g., PLY_LIMIT
  implies 255 plies). A model could theoretically use its position in the
  sequence to estimate remaining game length, further improving predictions.