Shannon's Gambit (legacyaravind/shannons-gambit)
A self-improving chess intelligence. This repo holds the served network plus the
checkpoint ladder (ladder.json) and the Inference Endpoint handler. The full
system lives at
github.com/aravinds-kannappan/Chess-Gambit-RL.
What the system is
A multi-agent engine: each position is routed to the method that owns it.
| Agent | Where it plays | Method |
|---|---|---|
| MDP | solved endgames (KRvK, KQvK) | exact Bellman value iteration (optimal) |
| PPO | low-material regime | on-policy actor-critic RL |
| Reward (DQN) | low-material regime | off-policy, potential-based shaping |
| Neural | opening / middlegame | this network: AlphaZero-lite self-play + behavioural cloning |
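The "exact Bellman value iteration" in the MDP row can be illustrated on a toy problem. The sketch below is not the repository's solver: the three-state MDP, its state and action names, and the reward of 1 for delivering mate are all invented for illustration. Only the update rule itself is the standard Bellman optimality backup.

```python
from typing import Dict, List, Tuple

# state -> action -> list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]

# Hypothetical toy MDP: "mate" is terminal (no actions).
TOY_MDP: Transitions = {
    "far": {
        "approach": [(1.0, "near", 0.0)],
        "shuffle": [(1.0, "far", 0.0)],
    },
    "near": {
        "deliver": [(1.0, "mate", 1.0)],
        "retreat": [(1.0, "far", 0.0)],
    },
    "mate": {},
}


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    tol: float = 1e-9) -> Dict[str, float]:
    """Repeat the Bellman optimality backup until values stop changing."""
    values = {state: 0.0 for state in transitions}
    while True:
        delta = 0.0
        for state, actions in transitions.items():
            if not actions:  # terminal state keeps value 0
                continue
            best = max(
                sum(p * (reward + gamma * values[nxt]) for p, nxt, reward in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


values = value_iteration(TOY_MDP)
```

With a discount below 1, states closer to mate receive higher values, so acting greedily on the converged values prefers the shortest win.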
A phase router (agents/router.py) dispatches each move to the right agent.
The network in this repo is the general full-board player and the bootstrap for
self-play; it also serves the policy/value/WDL/rating predictions.
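As a rough illustration of phase routing, the sketch below picks an agent label from the material on the board. It is not the code in agents/router.py: the function names, the returned labels and the six-piece cutoff for the low-material regime are assumptions made for this example. Only the solved endgames (KRvK, KQvK) come from the table above.

```python
SOLVED_ENDGAMES = {"KRvK", "KQvK"}   # from the agent table
LOW_MATERIAL_MAX_PIECES = 6          # assumed cutoff, not taken from the repository
PIECE_ORDER = "KQRBNP"


def material_signature(fen: str) -> str:
    """Build a signature such as 'KRvK' from the piece-placement field of a FEN."""
    placement = fen.split()[0]
    white = sorted((c for c in placement if c.isupper()), key=PIECE_ORDER.index)
    black = sorted((c.upper() for c in placement if c.islower()), key=PIECE_ORDER.index)
    # Put the side with more pieces first so KRvK and KvKR share one key.
    strong, weak = sorted(("".join(white), "".join(black)), key=len, reverse=True)
    return f"{strong}v{weak}"


def route(fen: str) -> str:
    """Return a hypothetical agent label for the position."""
    if material_signature(fen) in SOLVED_ENDGAMES:
        return "mdp"
    piece_count = sum(c.isalpha() for c in fen.split()[0])
    if piece_count <= LOW_MATERIAL_MAX_PIECES:
        return "low_material"  # PPO or the reward (DQN) agent; the split is not specified here
    return "neural"
```

For example, `route("8/8/8/4k3/8/8/8/R3K3 w - - 0 1")` selects the MDP agent, while the starting position falls through to the network.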
Stockfish is the benchmark, never a player
The agents never call Stockfish to choose a move. A separate backend evaluator
(eval/benchmark.py) uses Stockfish only as a calibrated yardstick: it throttles
Stockfish to known Elo bands (UCI_LimitStrength + UCI_Elo, with a Skill Level
fallback below the floor), plays each agent a gauntlet, and fits a calibrated Elo
(Bradley-Terry MLE). It also reports centipawn loss and top-1 agreement. The fitted
rating is the level each agent currently plays at, and it climbs as the agent learns.
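To make the rating fit concrete, here is a minimal sketch of a maximum-likelihood Elo estimate against opponents of known strength. It is not the code in eval/benchmark.py: the function names and the input format are assumptions, and draws are counted as half a point, which is a common simplification. Under the logistic Elo model the likelihood is maximised where the expected score equals the points actually scored, so a bisection over the rating is enough.

```python
from typing import List, Tuple


def expected_score(rating: float, opponent: float) -> float:
    """Logistic Elo expectation of scoring against a single opponent."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))


def fit_elo(results: List[Tuple[float, float, int]],
            lo: float = 0.0, hi: float = 4000.0, iters: int = 60) -> float:
    """Fit one rating from (opponent_elo, points_scored, games_played) rows."""
    total_points = sum(points for _, points, _ in results)
    total_games = sum(games for _, _, games in results)
    if not 0 < total_points < total_games:
        raise ValueError("the estimate is unbounded for a 0% or 100% score")
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        predicted = sum(games * expected_score(mid, opp) for opp, _, games in results)
        if predicted < total_points:
            lo = mid   # rating too low: the model predicts fewer points than scored
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A gauntlet that scores exactly 50% against a single 1500-rated band yields 1500. Perfect or zero scores have no finite estimate, which is one reason to benchmark against several bands.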
The network
Multi-head residual network trained on real Lichess games. Heads: policy (next move), value + win/draw/loss (outcome), and player rating (Elo).
Final supervised training metrics
{
  "loss_policy": 0.2169,
  "loss_value": 0.0305,
  "loss_wdl": 0.0295,
  "loss_rating": 0.0312,
  "policy_acc": 0.966,
  "wdl_acc": 0.9903,
  "rating_mae_elo": 21.1,
  "epoch": 15
}
Input / output
- Input: 18x8x8 board planes (see shannons_gambit/data/encode.py).
- Output: policy logits over 4672 moves, scalar value in [-1, 1], WDL logits, standardised rating.
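The raw head outputs need a little post-processing before they are readable. The sketch below shows one way this might look; it is not taken from handler.py. It assumes the 4672 policy entries follow the AlphaZero-style 8 x 8 x 73 layout (from-square times move plane), that the WDL logits are ordered win, draw, loss, and that the rating head was standardised with the placeholder mean and standard deviation shown. All names and constants other than 4672 are hypothetical.

```python
import math
from typing import Dict, List, Tuple

NUM_MOVES = 4672                          # from the output description above
PLANES_PER_SQUARE = 73                    # assumption: 64 squares x 73 move planes
RATING_MEAN, RATING_STD = 1500.0, 350.0   # hypothetical standardisation constants


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def split_policy_index(index: int) -> Tuple[int, int]:
    """Map a flat policy index to (from_square, move_plane) under the assumed layout."""
    if not 0 <= index < NUM_MOVES:
        raise ValueError(f"policy index {index} is out of range")
    return divmod(index, PLANES_PER_SQUARE)


def decode_heads(policy_logits: List[float], wdl_logits: List[float],
                 value: float, rating_z: float) -> Dict[str, object]:
    """Turn raw head outputs into a readable summary."""
    best_index = max(range(len(policy_logits)), key=policy_logits.__getitem__)
    win, draw, loss = softmax(wdl_logits)   # assumed order: win, draw, loss
    return {
        "best_move": split_policy_index(best_index),
        "wdl": {"win": win, "draw": draw, "loss": loss},
        "value": max(-1.0, min(1.0, value)),  # clamp to the documented range
        "rating_elo": RATING_MEAN + RATING_STD * rating_z,
    }
```

A real decoder would also mask illegal moves before taking the argmax; that step is omitted here because it depends on the move encoding in the repository.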
How it is served
- HF Space (Docker + FastAPI): trains continuously by self-play and serves /move, /predict, /watch-move, /ladder, plus /calibrate (Stockfish-assessed Elo). New generations are versioned back to this repo so the ladder survives restarts.
- Inference Endpoint: handler.py here loads model.pt and returns best move, WDL, value and rating for a FEN.
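import requests

response = requests.post(
    "https://<your-endpoint>.endpoints.huggingface.cloud",
    headers={"Authorization": "Bearer <HF_TOKEN>"},
    json={"inputs": {"fen": "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"}},
    timeout=30,  # avoid hanging if the endpoint is cold or unreachable
)
response.raise_for_status()  # surface HTTP errors instead of parsing an error body
result = response.json()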
Honest limitations
- The MDP, PPO and reward agents are endgame specialists (validated against the exactly-solved table); the network carries the opening and middlegame.
- On a free CPU, self-play is slow and the Elo ladder grows over hours; GPU bursts accelerate it. Any Elo is only meaningful once anchored by the Stockfish benchmark.
License: Apache-2.0.