qtzx06 committed on
Commit 7109aa9 · 1 Parent(s): 5a8e942

docs: expand architecture doc with full search stack and training pipeline details

Files changed (1)
  1. docs/architecture.md +100 -61
docs/architecture.md CHANGED
@@ -1,71 +1,110 @@
  # Architecture
 
- ## Core Shape
-
- 0x960 has four moving parts:
-
- 1. `src/zero960/engine/`
-    A minimal Chess960 engine with fixed search and one narrow editable surface: `eval.py`.
- 2. `src/zero960/runtime/`
-    The episode runtime that owns workspace resets, bounded actions, reward shaping, and match scoring.
- 3. `src/zero960_env/`
-    The OpenEnv wrapper and WebSocket client/server layer.
- 4. `train/`
-    Distillation and RL entrypoints that operate on the same bounded action schema.
-
- ## Action Space
-
- The policy only gets structured actions:
-
- - `read_file`
- - `write_file`
- - `run_static_eval`
- - `run_match`
- - `finish`
-
- The full repo is not editable. The policy can only modify `eval.py` inside a fresh workspace.
-
- ## Observation Shape
-
- Each observation includes:
-
- - the task instruction
- - the current `eval.py` contents
- - recent action history
- - remaining steps
- - last match score
- - workflow hints and suggested next actions
-
- The current file contents are already visible in the observation, so the intended high-reward loop is:
-
- `write_file -> run_match -> finish`
-
- ## Reward Design
-
- Reward is match-score-based with explicit shaping around the edit loop:
-
- - positive signal for valid changed writes
- - positive signal for explicit `run_match` after a write
- - penalties for repeated `run_static_eval`, redundant `read_file`, and finishing without a meaningful edit/test cycle
- - invalid writes are rolled back immediately
-
- This keeps the environment learnable while still grounding the main score in downstream engine strength.
 
  ## Training Strategy
 
- Current order of operations:
 
- 1. teacher distillation
-    Use a strong coding model such as Codex/GPT-5.4 to generate successful bounded-action trajectories.
- 2. student fine-tuning
-    Fine-tune a smaller open model on those trajectories.
- 3. RL refinement
-    Use GRPO or a similar method only after the student already knows the workflow.
 
- This is the main shift from the earlier RL-first plan. The hard part has been action discovery, not just optimization.
 
  ## Deployment
 
- - HF Spaces: public OpenEnv artifact
- - Northflank H100: practical heavy training and debugging box
- - local dev: fastest loop for environment and prompt iteration
  # Architecture
 
+ ## System Overview
+
+ 0x960 is a complete self-improvement system with four tightly integrated components, built in ~20 hours at the OpenEnv Hackathon. The system produced **+596.5 Elo** in internal engine strength and made the engine **competitive with Stockfish 1600** in Chess960.
+
+ ## Core Components
+
+ ### 1. Chess960 Engine (`src/zero960/engine/`)
+
+ A purpose-built Chess960 engine with a competitive classical search stack:
+
+ **Search (`search.py`):**
+ - Alpha-beta negamax with quiescence search (in-check evasion handling)
+ - Transposition table with persistent TT reuse across moves
+ - Principal Variation Search (PVS) — 40% speed improvement on later plies
+ - Null-move pruning for non-check, non-endgame nodes
+ - Late Move Reductions (LMR) for quiet moves at depth >= 3
+ - Aspiration windows at root, seeded from persistent TT
+ - Killer moves + history heuristic ordering
+ - Selective root depth extensions for openings, checks, and endgames
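The core of that search stack can be sketched in miniature. This is not the project's `search.py`: it is a minimal illustrative negamax with alpha-beta pruning over an abstract game tree, where `children` and `evaluate` are hypothetical callbacks standing in for move generation and the eval surface (no TT, PVS, or LMR; those layer on top of this loop).

```python
# Minimal negamax with alpha-beta pruning. Scores are always from the
# side-to-move's perspective, so a child's score is negated at the parent.

def negamax(state, depth, alpha, beta, children, evaluate):
    moves = children(state)
    if depth == 0 or not moves:
        # Leaf node: static evaluation from the side-to-move's perspective.
        return evaluate(state)
    best = float("-inf")
    for child in moves:
        score = -negamax(child, depth - 1, -beta, -alpha, children, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # Beta cutoff: the opponent will never allow this line.
    return best

# Toy depth-1 check: leaf scores are from the opponent's perspective,
# so the root prefers the move that is worst for the opponent ("b").
tree = {"root": ["a", "b", "c"]}
leaf_scores = {"a": 3, "b": -5, "c": 2}
best = negamax("root", 1, float("-inf"), float("inf"),
               lambda s: tree.get(s, []), lambda s: leaf_scores[s])
```

The quiescence, TT, and move-ordering features listed above all refine this same recursion: better ordering makes the `alpha >= beta` cutoff fire earlier.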
+
+ **Evaluation (`default_eval.py` + swarm champion):**
+ - Piece values, pawn structure (doubled, isolated, passed, connected, chains)
+ - Piece mobility, center control, rook file activity
+ - King safety, castling rights, bishop pair bonus
+ - Development scoring, phase-aware transitions
+ - Specialized hooks: structure, tactical, activity, pawn-endgame, initiative
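To make the feature list concrete, here is a toy sketch of a material term with a bishop-pair bonus. It is NOT the project's `default_eval.py`; the board encoding (a flat list of piece letters, uppercase for the side to move) is a hypothetical stand-in for the real representation.

```python
# Toy material + bishop-pair evaluation, in centipawns, from the
# side-to-move's perspective. Illustrative only, not default_eval.py.

PIECE_VALUES = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900, "K": 0}

def material_eval(pieces):
    score = 0
    for p in pieces:
        value = PIECE_VALUES[p.upper()]
        score += value if p.isupper() else -value
    # Bishop-pair bonus: two bishops cover both square colors.
    if sum(p == "B" for p in pieces) >= 2:
        score += 30
    if sum(p == "b" for p in pieces) >= 2:
        score -= 30
    return score

# Side to move has king + two bishops + pawn vs. king + knight.
score = material_eval(["K", "B", "B", "P", "k", "n"])
```

The real evaluation layers the other listed features (mobility, king safety, phase-aware weights) on top of the same additive scheme.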
+
+ ### 2. Episode Runtime (`src/zero960/runtime/`)
+
+ The bounded action environment that makes this an RL task:
+
+ **Action Space:**
+
+ | Action | Purpose | Reward Signal |
+ |--------|---------|--------------|
+ | `read_file` | Inspect current eval code | Penalty if redundant |
+ | `write_file` | Submit bounded replacement | Bonus for valid changed writes |
+ | `run_static_eval` | Quick position sanity check | Penalty if repeated |
+ | `run_match` | Full head-to-head match | Bonus for testing after write |
+ | `finish` | Declare done | Bonus only if engine improved |
+
+ **Key Design Decisions:**
+ - Invalid writes are rolled back instantly; broken code never poisons the episode
+ - Reward is grounded in actual match outcomes, not proxy text metrics
+ - Workflow hints guide the policy toward `write → test → finish`
+ - Observations include current `eval.py` contents, action history, remaining budget, and match feedback
+
+ ### 3. OpenEnv Integration (`src/zero960_env/`)
+
+ An OpenEnv 0.2.1-compliant wrapper with:
+ - FastAPI server with WebSocket support
+ - Structured action/observation models extending OpenEnv base types
+ - `openenv.yaml` manifest for HF Spaces deployment
+ - Docker-ready for production deployment
+
+ ### 4. Training & Optimization (`train/`)
+
+ Four improvement paths that compound on each other:
+
+ **Path 1: Teacher Distillation** (`codex_distill.py`)
+ - GPT-5.4 teacher generates bounded-action trajectories via ACP runtime
+ - Constrained to the same JSON action schema as the student
+ - Collected 35 successful episodes / 105 clean SFT rows
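A distilled trajectory becomes several SFT rows, one per teacher action (the reported 105 rows over 35 episodes averages three per episode). A minimal sketch of that flattening; the dict keys are assumptions, not the exact `codex_distill.py` output format.

```python
# Sketch: flatten one bounded-action trajectory into per-step SFT rows.
# Each row pairs the observation at step t with the teacher's action,
# serialized so the student learns to emit the JSON action schema.
import json

def trajectory_to_rows(steps):
    rows = []
    for obs, action in steps:
        rows.append({
            "prompt": obs,
            "completion": json.dumps(action),
        })
    return rows

# Hypothetical three-step teacher episode: write, test, finish.
steps = [
    ("eval.py v0 | score 0.31", {"name": "write_file", "content": "..."}),
    ("eval.py v1 | score 0.31", {"name": "run_match"}),
    ("eval.py v1 | score 0.58", {"name": "finish"}),
]
rows = trajectory_to_rows(steps)
```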
+
+ **Path 2: Student SFT** (`sft_student.py`)
+ - Distills teacher traces into Qwen 3.5-0.8B
+ - 98.76% token accuracy, 5 minutes on H100
+ - Student goes from -2.1 reward (never writes code) to +1.0 (full engineering loop)
+
+ **Path 3: GRPO Refinement** (`minimal_trl_openenv.py`)
+ - HF TRL GRPO over the bounded OpenEnv environment
+ - Environment-grounded RL: structured multi-step tool use, not text completion
+ - Three modes: handcrafted demo, single inference, full training
+ - Also ran Qwen 3.5-9B QLoRA GRPO as a scaling probe on H100
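GRPO's core mechanism is computing advantages relative to a group of rollouts from the same prompt instead of a learned value baseline. A minimal sketch of that group-relative normalization (not TRL's actual internals); the reward values echo the -2.1/+1.0 episode rewards mentioned above.

```python
# Sketch of GRPO's group-relative advantage: sample several rollouts per
# prompt, then normalize each rollout's reward against the group's mean
# and standard deviation. Replaces a learned value baseline.
import statistics

def group_advantages(rewards, eps=1e-4):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # eps guards against a zero-variance group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one episode; only the first completed the
# write -> test -> finish loop, so only it gets positive advantage.
advs = group_advantages([1.0, -2.1, -2.1, -2.1])
```

This is why distillation has to come first: if every rollout in the group fails identically, the advantages are all near zero and there is no gradient signal to follow.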
+
+ **Path 4: Codex Agent Swarm** (`codex_swarm.py`)
+ - Over a dozen autonomous Codex agents across multiple rounds
+ - 5 specialized worker roles targeting different chess knowledge domains
+ - Champion/challenger tournament with staged screening
+ - Dual surface: eval-only and search-only editing modes
+ - 4 eval champions promoted, search-surface promotions active
+
+ **Benchmark Suite:**
+ - `benchmark_eval.py` — eval-vs-eval on held-out Chess960 positions
+ - `benchmark_engine.py` — full engine-vs-engine (each side owns its own search + eval)
+ - `benchmark_uci.py` — UCI anchor comparison against Stockfish
+ - `benchmark_league.py` — league self-play against own champion history
+ - `build_dashboard.py` — static HTML dashboard with progression charts and Stockfish anchor bars
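Match scores from these benchmarks translate to Elo differences via the standard logistic model. A generic sketch of that conversion (what anchor-comparison dashboards typically compute, not necessarily `build_dashboard.py`'s exact code):

```python
# Convert a match score (wins + draws/2, out of `games`) into an Elo
# difference using the standard logistic model. Generic sketch, not
# necessarily build_dashboard.py's implementation.
import math

def elo_diff(score, games):
    s = score / games
    s = min(max(s, 1e-6), 1 - 1e-6)  # clamp: 0% or 100% would map to infinite Elo
    return -400.0 * math.log10(1.0 / s - 1.0)

# A 75% score over 100 games corresponds to roughly +191 Elo.
diff = elo_diff(75, 100)
```

The same formula, run against a fixed Stockfish anchor, is how internal Elo gains like the +596.5 figure get grounded in an external scale.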
 
  ## Training Strategy
 
+ The critical insight that shaped this architecture:
 
+ > Raw RL fails at this task because base models never discover the core engineering workflow. GRPO can't optimize a policy that doesn't explore the right actions. **Teacher distillation solves action discovery; RL refines an already-competent policy.**
 
+ Order of operations:
+ 1. Teacher distillation → behavioral competence
+ 2. Student SFT → workflow reliability
+ 3. GRPO refinement → reward optimization
+ 4. Codex swarm → autonomous engine improvement (runs in parallel)
 
  ## Deployment
 
+ | Target | Status |
+ |--------|--------|
+ | HF Spaces | OpenEnv 0.2.1 compliant, Docker-ready |
+ | Northflank H100 | Heavy training + large benchmarks |
+ | Local dev | Fastest iteration loop for environment + prompt work |