# Theoretical Accuracy Ceiling

PAWN is trained on uniformly random chess games. Since each move is drawn
uniformly from the legal move set, top-1 accuracy has a hard theoretical
ceiling; no model, however large, can exceed it.

## Three ceilings

### Unconditional ceiling: E[1/N_legal] = 6.43%

At each position, the move is drawn uniformly from N legal moves. The best
a predictor can do without any context is pick one at random: accuracy = 1/N.
Averaged over all positions in random games, this gives **6.43%**.

A model that exceeds this ceiling has learned something beyond just "which
moves are legal": it has learned to estimate the number of legal moves at
each position and bias predictions toward positions with fewer options.
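
The distinction between the true ceiling E[1/N] and the naive estimate 1/E[N] is Jensen's inequality: for any non-constant N, E[1/N] > 1/E[N]. A stdlib-only sketch (the legal-move-count distribution below is invented for illustration; the real 6.43% comes from sampling random games):

```python
import random

# Hypothetical distribution of legal-move counts per position (illustration
# only; the real distribution comes from sampling uniformly random games).
rng = random.Random(0)
legal_counts = [rng.randint(1, 40) for _ in range(100_000)]

e_inv_n = sum(1 / n for n in legal_counts) / len(legal_counts)  # E[1/N]
inv_e_n = len(legal_counts) / sum(legal_counts)                 # 1/E[N]

print(f"E[1/N] = {e_inv_n:.4f}")  # the true per-position average ceiling
print(f"1/E[N] = {inv_e_n:.4f}")  # what averaging N first would suggest
assert e_inv_n > inv_e_n  # Jensen's inequality
```

The gap between the two numbers is exactly the headroom available to a predictor that concentrates its accuracy on positions with few legal moves.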

### Naive conditional ceiling: 6.44%

A zero-cost analytical estimate of outcome conditioning. At each position,
legal moves that lead to an immediate terminal state with a *different*
outcome than the actual game are excluded, and accuracy = 1/(N_legal - N_wrong).

This barely exceeds the unconditional ceiling (1.00x boost) because
immediate terminal states are rare; most moves at most positions lead to
non-terminal continuations, so the filter has almost nothing to exclude.
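
The filter itself is a one-line count per position. A minimal sketch, where `terminal_outcome` is a stand-in for asking the engine whether a move ends the game immediately (names and the toy position are illustrative, not from the real script):

```python
def naive_ceiling(legal_moves, actual_outcome, terminal_outcome):
    """Per-position naive conditional ceiling: 1 / (N_legal - N_wrong).

    terminal_outcome(move) returns the game outcome if the move ends the
    game immediately, else None.  Here it is a stub, not an engine query.
    """
    n_wrong = 0
    for move in legal_moves:
        outcome = terminal_outcome(move)
        if outcome is not None and outcome != actual_outcome:
            n_wrong += 1  # 1-ply lookahead rules this move out
    return 1 / (len(legal_moves) - n_wrong)

# Toy position: 20 legal moves, one delivers immediate checkmate, but the
# actual game ended at the ply limit, so that one move is excluded.
moves = list(range(20))
mate = lambda m: "CHECKMATE" if m == 7 else None
print(naive_ceiling(moves, "PLY_LIMIT", mate))  # 1/19, ~0.0526
```

Because a position like this toy one (a mate-in-one available in a game that did not end) is rare in uniformly random play, the filter almost never changes the denominator.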

### MCTS conditional ceiling: 7.92%

The full Monte Carlo estimate. At each sampled position, every legal move is
tried and 32 random continuations are played out to estimate
P(outcome | move, history). The Bayes-optimal predictor picks the move most
consistent with the known outcome.
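
The estimator reduces to an argmax over Monte Carlo outcome frequencies. A sketch with a stubbed rollout function (the move names and probabilities are invented for illustration; the real script plays out actual random continuations):

```python
import random

def mcts_pick(legal_moves, known_outcome, rollout, n_rollouts=32, seed=0):
    """Pick the move most consistent with the known outcome.

    rollout(move, rng) plays one uniformly random continuation after `move`
    and returns its terminal outcome; here it is a stub, not a real engine.
    """
    rng = random.Random(seed)
    best_move, best_p = None, -1.0
    for move in legal_moves:
        hits = sum(rollout(move, rng) == known_outcome for _ in range(n_rollouts))
        p_outcome = hits / n_rollouts  # estimate of P(outcome | move, history)
        if p_outcome > best_p:
            best_move, best_p = move, p_outcome
    return best_move

def toy_rollout(move, rng):
    # Invented probabilities: one move reaches checkmate far more often.
    p_mate = 0.8 if move == "mate_threat" else 0.1
    return "CHECKMATE" if rng.random() < p_mate else "PLY_LIMIT"

print(mcts_pick(["mate_threat", "quiet_a", "quiet_b"], "CHECKMATE", toy_rollout))
```

With only 32 rollouts the per-move probability estimates are coarse, which is one reason the 7.92% figure is itself an estimate rather than an exact bound.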

PAWN's input sequence begins with an outcome token (`WHITE_CHECKMATES`,
`STALEMATE`, `PLY_LIMIT`, etc.). This leaks information about the game's
trajectory, making some moves more predictable:

- **Checkmate games**: The final move must deliver checkmate. Knowing this
  raises the ceiling at the last ply from ~5% to ~14%.
- **Ply limit games**: Knowing the game lasts 255 plies constrains the move
  distribution slightly.
- **Stalemate games**: The final position has no legal moves but isn't in
  check, which is very constraining on late moves.

## Adjusted accuracy

| Metric | Value |
|--------|-------|
| Unconditional ceiling (E[1/N_legal]) | 6.43% |
| Naive conditional ceiling (1-ply filter) | 6.44% |
| MCTS conditional ceiling (32 rollouts) | 7.92% |
| Conditioning boost (naive) | 1.00x |
| Conditioning boost (MCTS) | 1.23x |

For a model with top-1 accuracy A:

- **Adjusted (unconditional)** = A / 6.43%: measures how much the model
  has learned about chess legality. Values > 100% mean it has learned
  structure beyond just legal moves.
- **Adjusted (naive conditional)** = A / 6.44%: essentially the same as
  unconditional; confirms that 1-ply lookahead explains almost none of the
  outcome conditioning benefit.
- **Adjusted (MCTS conditional)** = A / 7.92%: measures how close the
  model is to the Bayes-optimal predictor with perfect outcome knowledge.
  This is the tighter bound.

### Final model results (100K steps)

| Variant | Top-1 | vs Uncond | vs Naive Cond | vs MCTS Cond |
|---------|-------|-----------|---------------|--------------|
| large (68M) | 6.94% | 108% | 108% | 88% |
| base (36M) | 6.86% | 107% | 107% | 87% |
| small (10M) | 6.73% | 105% | 105% | 85% |

All models exceed the unconditional and naive conditional ceilings,
confirming they learn chess structure beyond move legality. The large and
base models reach 87-88% of the MCTS conditional ceiling.
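
The adjusted columns follow directly from the ceiling table; for example, for the large model's 6.94% top-1:

```python
CEILINGS_PCT = {  # theoretical ceilings from the table above, in percent
    "unconditional": 6.43,
    "naive_conditional": 6.44,
    "mcts_conditional": 7.92,
}

def adjusted(top1_pct):
    """Top-1 accuracy expressed as a percentage of each ceiling."""
    return {name: 100 * top1_pct / c for name, c in CEILINGS_PCT.items()}

for name, pct in adjusted(6.94).items():  # large (68M): 6.94% top-1
    print(f"{name}: {pct:.0f}%")
# unconditional: 108%, naive_conditional: 108%, mcts_conditional: 88%
```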

## Per-outcome breakdown

| Outcome | Uncond | Naive Cond | MCTS Cond | Positions |
|---------|--------|------------|-----------|-----------|
| White checkmated | 5.26% | 5.26% | 13.79% | 328 |
| Black checkmated | 5.02% | 5.02% | 13.64% | 388 |
| Stalemate | 7.22% | 7.22% | 18.67% | 125 |
| Insufficient material | 7.17% | 7.17% | 18.61% | 256 |
| Ply limit | 6.51% | 6.51% | 6.97% | 8,618 |

The naive conditional ceiling equals the unconditional ceiling across all
outcome types; the 1-ply filter never fires in practice. The MCTS ceiling
shows the real conditioning benefit: decisive outcomes (checkmate, stalemate,
insufficient material) get a ~2.6x boost, while ply limit games, the vast
majority of positions, show only 1.07x because knowing the game goes the
distance provides minimal per-move information.
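
The quoted boosts are just the ratio of the two ceiling columns in the table:

```python
# (unconditional %, MCTS conditional %) per outcome, from the table above.
PER_OUTCOME = {
    "white_checkmated": (5.26, 13.79),
    "black_checkmated": (5.02, 13.64),
    "stalemate": (7.22, 18.67),
    "insufficient_material": (7.17, 18.61),
    "ply_limit": (6.51, 6.97),
}

for outcome, (uncond, mcts) in PER_OUTCOME.items():
    print(f"{outcome}: {mcts / uncond:.2f}x conditioning boost")
# decisive outcomes cluster around 2.6x; ply_limit shows only 1.07x
```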

## Reproducing

```bash
# Default: 2000 games, 32 rollouts/move, 2% sample rate
uv run python scripts/compute_theoretical_ceiling.py --model-accuracy 0.069

# Higher precision (slower)
uv run python scripts/compute_theoretical_ceiling.py --n-games 10000 --rollouts 64 --sample-rate 0.05
```

Results are saved to `data/theoretical_ceiling.json`.

## Caveats

- The MCTS ceiling is an estimate, not exact. With more rollouts and higher
  sample rates, the estimate improves but computation time increases
  quadratically.
- The ceiling assumes the model has perfect knowledge of P(outcome | move,
  history). In practice, the model must learn this from data, so the
  achievable accuracy for a finite model is somewhat below the ceiling.
- Game length information is implicit in the outcome token (e.g., PLY_LIMIT
  implies 255 plies). A model could theoretically use position in the
  sequence to estimate remaining game length, further improving predictions.