docs: expand architecture doc with full search stack and training pipeline details
docs/architecture.md (+100 -61)
# Architecture

## System Overview

0x960 is a complete self-improvement system with four tightly integrated components, built in ~20 hours at the OpenEnv Hackathon. The system delivered **+596.5 Elo** in internal engine strength and made the engine **competitive with Stockfish 1600** in Chess960.

## Core Components

### 1. Chess960 Engine (`src/zero960/engine/`)

A purpose-built Chess960 engine with a competitive classical search stack:

**Search (`search.py`):**
- Alpha-beta negamax with quiescence search (including in-check evasion handling)
- Transposition table, persisted and reused across moves
- Principal Variation Search (PVS), giving a 40% speed improvement on later plies
- Null-move pruning for non-check, non-endgame nodes
- Late Move Reductions (LMR) for quiet moves at depth >= 3
- Aspiration windows at root, seeded from the persistent transposition table
- Killer moves + history heuristic move ordering
- Selective root depth extensions for openings, checks, and endgames
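
Most of these features layer onto a single alpha-beta negamax loop. As a minimal sketch of that core, run on a toy game tree rather than chess positions (node names, scores, and the bare-dict transposition table are illustrative; a real TT entry in `search.py` would also carry depth and bound type):

```python
import math

# Toy game tree: "root" is to move for the side we score for; each ply
# alternates sides, so child scores are negated on the way up.
TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
LEAF_SCORES = {"a1": 3, "a2": 5, "b1": 6, "b2": 1}  # from the leaf's side to move

def negamax(node, alpha, beta, tt):
    if node in tt:
        return tt[node]                  # transposition-table hit
    if node in LEAF_SCORES:
        return LEAF_SCORES[node]
    best = -math.inf
    for child in TREE[node]:
        # Negate the child's score: the opponent is to move there.
        score = -negamax(child, -beta, -alpha, tt)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:                # beta cutoff: this line is refuted
            break
    tt[node] = best
    return best

value = negamax("root", -math.inf, math.inf, {})
```

PVS, null-move pruning, and LMR are all refinements of this loop: they re-search with narrower windows or reduced depth and only fall back to the full-window search when the cheap probe fails.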

**Evaluation (`default_eval.py` + swarm champion):**
- Piece values, pawn structure (doubled, isolated, passed, connected, chains)
- Piece mobility, center control, rook file activity
- King safety, castling rights, bishop pair bonus
- Development scoring, phase-aware transitions
- Specialized hooks: structure, tactical, activity, pawn-endgame, initiative
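
To make the term structure concrete, here is a minimal sketch of how such handcrafted terms compose into one score. All names and weights are invented for the example; they are not the tuned numbers in `default_eval.py`:

```python
# Illustrative material values (centipawns); the king is unscored material.
PIECE_VALUES = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900}

def side_score(pieces, pawn_files):
    """pieces: list like ['P', 'B', 'B']; pawn_files: one file letter per pawn."""
    score = sum(PIECE_VALUES.get(p, 0) for p in pieces)
    if pieces.count("B") >= 2:
        score += 30                                   # bishop pair bonus (made up)
    doubled = sum(pawn_files.count(f) - 1 for f in set(pawn_files))
    score -= 15 * doubled                             # doubled-pawn penalty (made up)
    return score

def evaluate(white, black):
    """Positive means better for White; each side is (pieces, pawn_files)."""
    return side_score(*white) - side_score(*black)
```

The real evaluation adds many more terms (mobility, king safety, phase weighting), but each is this same shape: a feature extracted from the position times a tuned weight, summed per side.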

### 2. Episode Runtime (`src/zero960/runtime/`)

The bounded-action environment that makes this an RL task:

**Action Space:**

| Action | Purpose | Reward Signal |
|--------|---------|---------------|
| `read_file` | Inspect current eval code | Penalty if redundant |
| `write_file` | Submit a bounded replacement | Bonus for valid, changed writes |
| `run_static_eval` | Quick position sanity check | Penalty if repeated |
| `run_match` | Full head-to-head match | Bonus for testing after a write |
| `finish` | Declare done | Bonus only if the engine improved |
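
A hypothetical episode transcript in this action space might look as follows. The field names (`action`, `args`) and argument shapes are illustrative; the real action models live in `src/zero960_env/`:

```python
import json

# One plausible write -> test -> finish episode, expressed as structured actions.
episode = [
    {"action": "read_file", "args": {}},
    {"action": "write_file", "args": {"content": "def evaluate(board): ..."}},
    {"action": "run_match", "args": {"games": 4}},
    {"action": "finish", "args": {}},
]

# Each step is serialized to JSON before it crosses the environment boundary.
payloads = [json.dumps(step) for step in episode]
```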

**Key Design Decisions:**
- Invalid writes are rolled back instantly, so broken code never poisons an episode
- Reward is grounded in actual match outcomes, not proxy text metrics
- Workflow hints guide the policy toward `write → test → finish`
- Observations include the current `eval.py` contents, action history, remaining budget, and match feedback
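
The instant-rollback rule can be sketched as a validity gate in front of the write: a candidate must at least compile before it replaces the current source, otherwise the previous version is kept. The class and the gate below are illustrative, not the actual runtime code (which also validates behavior through matches):

```python
class EvalFile:
    """Holds the current eval source; rejects writes that do not compile."""

    def __init__(self, source):
        self.source = source

    def write(self, candidate):
        try:
            compile(candidate, "<eval.py>", "exec")   # cheap validity gate
        except SyntaxError:
            return False                              # rolled back: state untouched
        self.source = candidate
        return True
```

Because the rollback leaves the file exactly as it was, a bad write costs the agent only its reward penalty, never its working engine.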

### 3. OpenEnv Integration (`src/zero960_env/`)

An OpenEnv 0.2.1 compliant wrapper with:
- FastAPI server with WebSocket support
- Structured action/observation models extending the OpenEnv base types
- `openenv.yaml` manifest for HF Spaces deployment
- Docker-ready image for production deployment

### 4. Training & Optimization (`train/`)

Four improvement paths that compound on each other:

**Path 1: Teacher Distillation** (`codex_distill.py`)
- A GPT-5.4 teacher generates bounded-action trajectories via the ACP runtime
- Constrained to the same JSON action schema as the student
- Collected 35 successful episodes / 105 clean SFT rows
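
The trajectory-to-SFT-rows step can be sketched as follows: each (observation, teacher action) step becomes one prompt/completion pair, with prior actions folded into the prompt. Field names are illustrative, not the `codex_distill.py` schema:

```python
def to_sft_rows(trajectory):
    """Flatten one teacher trajectory into supervised prompt/completion rows."""
    rows, history = [], []
    for step in trajectory:
        # The prompt shows the actions taken so far plus the fresh observation.
        prompt = "\n".join(history + [step["observation"]])
        rows.append({"prompt": prompt, "completion": step["action"]})
        history.append(step["action"])
    return rows
```

One successful episode with three actions yields three rows, consistent with the 35 episodes / 105 rows figure above.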

**Path 2: Student SFT** (`sft_student.py`)
- Distills teacher traces into Qwen 3.5-0.8B
- 98.76% token accuracy after 5 minutes on an H100
- The student goes from -2.1 reward (never writes code) to +1.0 (runs the full engineering loop)

**Path 3: GRPO Refinement** (`minimal_trl_openenv.py`)
- HF TRL GRPO over the bounded OpenEnv environment
- Environment-grounded RL: structured multi-step tool use, not text completion
- Three modes: handcrafted demo, single inference, full training
- Also ran Qwen 3.5-9B QLoRA GRPO as a scaling probe on an H100
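
The group-relative step at the heart of GRPO can be sketched in a few lines: rewards from a group of rollouts off the same prompt are normalized within the group, so each trajectory is scored against its siblings rather than by a learned value network. Shapes and naming here are illustrative, not the TRL internals:

```python
import statistics

def group_advantages(rewards):
    """Normalize a group's episode rewards to zero mean, unit deviation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0   # guard all-equal groups
    return [(r - mean) / std for r in rewards]
```

This is why teacher distillation matters so much: if every rollout in the group earns the same failing reward, the advantages are all zero and GRPO has no gradient to follow.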

**Path 4: Codex Agent Swarm** (`codex_swarm.py`)
- Over a dozen autonomous Codex agents across multiple rounds
- 5 specialized worker roles targeting different chess knowledge domains
- Champion/challenger tournament with staged screening
- Dual surface: eval-only and search-only editing modes
- 4 eval champions promoted; search-surface promotions active
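
The staged champion/challenger screening can be sketched as: a cheap static screen filters candidates, survivors play a full match against the reigning champion, and promotion requires a clear win rate. The thresholds and both callbacks below are hypothetical stand-ins for the swarm's real gates:

```python
def screen_and_promote(champion, challengers, quick_score, match_win_rate,
                       screen_floor=0.0, promote_at=0.55):
    """Return the champion after one tournament round of staged screening."""
    for challenger in challengers:
        if quick_score(challenger) < screen_floor:
            continue                          # failed the cheap static screen
        if match_win_rate(challenger, champion) >= promote_at:
            champion = challenger             # challenger takes the crown
    return champion
```

The two-stage design keeps expensive full matches reserved for candidates that already look plausible.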

**Benchmark Suite:**
- `benchmark_eval.py` — eval-vs-eval on held-out Chess960 positions
- `benchmark_engine.py` — full engine-vs-engine (each side owns its own search + eval)
- `benchmark_uci.py` — UCI anchor comparison against Stockfish
- `benchmark_league.py` — league self-play against the engine's own champion history
- `build_dashboard.py` — static HTML dashboard with progression charts and Stockfish anchor bars
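
These benchmarks report Elo differences like the one quoted in the overview. Under the standard logistic rating model, a match score converts to a rating gap as follows (`score` is wins plus half of draws, divided by games played):

```python
import math

def elo_diff(score):
    """Elo rating gap implied by a match score in (0, 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))
```

For example, a 75% score corresponds to roughly a 191-point rating gap; a perfect score maps to an unbounded gap, which is why anchors like Stockfish at fixed strength are needed for calibration.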

## Training Strategy

The critical insight that shaped this architecture:

> Raw RL fails at this task because base models never discover the core engineering workflow. GRPO can't optimize a policy that doesn't explore the right actions. **Teacher distillation solves action discovery; RL refines an already-competent policy.**

Order of operations:
1. Teacher distillation → behavioral competence
2. Student SFT → workflow reliability
3. GRPO refinement → reward optimization
4. Codex swarm → autonomous engine improvement (runs in parallel)

## Deployment

| Target | Status |
|--------|--------|
| HF Spaces | OpenEnv 0.2.1 compliant, Docker-ready |
| Northflank H100 | Heavy training + large benchmarks |
| Local dev | Fastest iteration loop for environment + prompt work |