docs: strengthen Chess960 thesis — why it's the right self-improvement benchmark
docs/why_chess960.md +30 -22
# Why Chess960

## The Problem with Standard Chess

Standard chess engines can be improved by memorizing opening books — thousands of well-known opening sequences that have been optimized over centuries. An RL agent that "improves" a standard chess engine might just be learning to parrot known opening theory rather than developing genuine evaluation ability.

This is exactly the kind of reward hacking we wanted to avoid.

## Why Chess960 Is the Right Benchmark

Chess960 (Fischer Random Chess) keeps the rules of chess identical but randomizes the back-rank piece placement across **960 possible starting positions**. This eliminates:

- Opening book memorization
- Known opening theory exploitation
- Position-specific pattern matching that doesn't generalize
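
The constraint set behind those 960 positions is small enough to sketch. Below is a hypothetical generator (illustrative only, not project code) that produces one legal Chess960 back rank: bishops on opposite-colored squares, king between the rooks.

```python
import random

def chess960_back_rank(seed=None):
    """Generate one of the 960 legal Chess960 back ranks.

    Constraints: the two bishops sit on opposite-colored squares,
    and the king stands somewhere between the two rooks.
    """
    rng = random.Random(seed)
    rank = [None] * 8
    # Bishops: one on an even file, one on an odd file (opposite colors).
    rank[rng.choice(range(0, 8, 2))] = "B"
    rank[rng.choice(range(1, 8, 2))] = "B"
    # Queen and knights go on any of the remaining empty files.
    for piece in "QNN":
        rank[rng.choice([i for i, p in enumerate(rank) if p is None])] = piece
    # The last three empty files (in file order) get rook, king, rook,
    # which automatically places the king between the rooks.
    left, mid, right = [i for i, p in enumerate(rank) if p is None]
    rank[left], rank[mid], rank[right] = "R", "K", "R"
    return "".join(rank)
```

Each call yields one of the 960 placements; the same shuffled rank is mirrored for both sides.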

The engine must evaluate positions it has never seen before based on fundamental chess principles — piece activity, king safety, pawn structure, tactical threats. **If the evaluation code is better, it wins more games. Period.**
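
To make "evaluate from principles" concrete, here is a deliberately minimal material-count evaluator over a FEN string. This is a toy sketch of the kind of baseline such an agent starts from, not the project's evaluation code.

```python
# Toy material-only evaluation; positive scores favor White.
# A real evaluator would add piece activity, king safety,
# pawn structure, and tactical terms on top of this.
PIECE_VALUES = {"p": 1, "n": 3, "b": 3, "r": 5, "q": 9, "k": 0}

def material_eval(fen):
    """Score the board field of a FEN string by material balance."""
    board = fen.split()[0]
    score = 0
    for ch in board:
        if ch.lower() in PIECE_VALUES:
            value = PIECE_VALUES[ch.lower()]
            score += value if ch.isupper() else -value
    return score
```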

## What This Means for Self-Improvement

Chess960 is a cleaner robustness test than standard chess for exactly the reason that matters in RL:

- **You can't game the reward.** There's no shortcut where memorizing patterns gets you a higher score without actually understanding chess positions.
- **Generalization is mandatory.** The engine must perform across 960 different starting setups, not just one canonical opening tree.
- **The signal is real.** Win/loss/draw outcomes on held-out Chess960 positions are ground truth — not proxy metrics, not text quality scores, not human preferences.

## How 0x960 Uses This

The agent doesn't play chess. It writes evaluation code that a chess engine uses to play. The reward comes from whether that code makes the engine win more games on held-out Chess960 positions.
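
The reward computation this implies is simple to state. A minimal sketch, with function names that are hypothetical rather than the environment's actual API:

```python
def match_score(results):
    """Average score of the candidate engine over one match.

    `results` holds per-game outcomes from the candidate's side:
    1.0 for a win, 0.5 for a draw, 0.0 for a loss.
    """
    return sum(results) / len(results)

def reward(results, baseline=0.5):
    """Positive only when the edited evaluation code outperforms
    the unedited engine on the held-out Chess960 games."""
    return match_score(results) - baseline
```

An agent that merely matches the baseline scores 0.0; only genuine improvement is rewarded.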

This is the bridge from the research motivation to the OpenEnv environment design:

- Chess960 provides a clean, non-gameable benchmark
- Bounded code editing provides the action space
- Real match outcomes provide the reward signal
- The agent has to write code that *actually understands chess positions* to improve
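
Put together, one environment step could look like the following schematic. Every name here, including the two stand-in helpers, is hypothetical; this sketches the loop, not the OpenEnv interface.

```python
from dataclasses import dataclass

@dataclass
class StepResult:
    observation: str  # the engine's current evaluation source code
    reward: float     # candidate's match score minus the baseline's
    done: bool

def apply_patch(source: str, patch: str) -> str:
    """Stand-in for the bounded code edit (a real env would validate it)."""
    return source + "\n" + patch

def play_heldout_match(source: str) -> float:
    """Stand-in for real Chess960 games; a real env returns the
    candidate's average score (1 = win, 0.5 = draw, 0 = loss)."""
    return 0.5  # placeholder: candidate exactly matches the baseline

def step(source: str, patch: str) -> StepResult:
    new_source = apply_patch(source, patch)
    score = play_heldout_match(new_source)
    return StepResult(observation=new_source, reward=score - 0.5, done=False)
```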

## Results

The system works. Starting from a basic eval function, the combination of teacher-student policy learning and autonomous Codex swarm search pushed the engine to:

- **+596.5 Elo** vs the search baseline (internal)
- **+221.1 Elo** vs Stockfish 1320 (external anchor)
- **Competitive with Stockfish 1600** in local Chess960 benchmarks
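
For scale, Elo differences map to expected match scores through the standard logistic model. A quick converter using the textbook formula, not code from the project:

```python
import math

def elo_diff(score):
    """Elo difference implied by an average match score in (0, 1)."""
    return 400 * math.log10(score / (1 - score))

def expected_score(elo):
    """Expected match score implied by an Elo difference."""
    return 1 / (1 + 10 ** (-elo / 400))
```

By this model, a +221 Elo edge corresponds to scoring roughly 78% of the available points against that opponent.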

All verified on held-out Chess960 positions that the agent never trained on.