qtzx06 committed on
Commit 7109aa9 · 1 Parent(s): 5a8e942

docs: expand architecture doc with full search stack and training pipeline details

Files changed (1)
  1. docs/architecture.md +100 -61
docs/architecture.md CHANGED
@@ -1,71 +1,110 @@
  # Architecture
 
- ## Core Shape
-
- 0x960 has four moving parts:
-
- 1. `src/zero960/engine/`
-    A minimal Chess960 engine with fixed search and one narrow editable surface: `eval.py`.
- 2. `src/zero960/runtime/`
-    The episode runtime that owns workspace resets, bounded actions, reward shaping, and match scoring.
- 3. `src/zero960_env/`
-    The OpenEnv wrapper and WebSocket client/server layer.
- 4. `train/`
-    Distillation and RL entrypoints that operate on the same bounded action schema.
-
- ## Action Space
-
- The policy only gets structured actions:
-
- - `read_file`
- - `write_file`
- - `run_static_eval`
- - `run_match`
- - `finish`
-
- The full repo is not editable. The policy can only modify `eval.py` inside a fresh workspace.
-
- ## Observation Shape
-
- Each observation includes:
-
- - the task instruction
- - the current `eval.py` contents
- - recent action history
- - remaining steps
- - last match score
- - workflow hints and suggested next actions
-
- The current file contents are already visible in the observation, so the intended high-reward loop is:
-
- `write_file -> run_match -> finish`
-
- ## Reward Design
-
- Reward is match-score-based with explicit shaping around the edit loop:
-
- - positive signal for valid changed writes
- - positive signal for explicit `run_match` after a write
- - penalties for repeated `run_static_eval`, redundant `read_file`, and finishing without a meaningful edit/test cycle
- - invalid writes are rolled back immediately
-
- This keeps the environment learnable while still grounding the main score in downstream engine strength.
 
  ## Training Strategy
 
- Current order of operations:
 
- 1. teacher distillation
-    Use a strong coding model such as Codex/GPT-5.4 to generate successful bounded-action trajectories.
- 2. student fine-tuning
-    Fine-tune a smaller open model on those trajectories.
- 3. RL refinement
-    Use GRPO or a similar method only after the student already knows the workflow.
 
- This is the main shift from the earlier RL-first plan. The hard part has been action discovery, not just optimization.
 
  ## Deployment
 
- - HF Spaces: public OpenEnv artifact
- - Northflank H100: practical heavy training and debugging box
- - local dev: fastest loop for environment and prompt iteration
  # Architecture
 
+ ## System Overview
+
+ 0x960 is a complete self-improvement system with four tightly integrated components, built in ~20 hours at the OpenEnv Hackathon. The system produced **+596.5 Elo** in internal engine strength and made the engine **competitive with Stockfish 1600** in Chess960.
+
+ ## Core Components
+
+ ### 1. Chess960 Engine (`src/zero960/engine/`)
+
+ A purpose-built Chess960 engine with a competitive classical search stack:
+
+ **Search (`search.py`):**
+ - Alpha-beta negamax with quiescence search (in-check evasion handling)
+ - Transposition table with persistent TT reuse across moves
+ - Principal Variation Search (PVS) — 40% speed improvement on later plies
+ - Null-move pruning for non-check, non-endgame nodes
+ - Late Move Reductions (LMR) for quiet moves at depth >= 3
+ - Aspiration windows at root, seeded from persistent TT
+ - Killer moves + history heuristic ordering
+ - Selective root depth extensions for openings, checks, and endgames
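The core of that search stack can be sketched in miniature. This is not the project's `search.py`: it is a minimal illustrative negamax with alpha-beta pruning over an abstract game tree, where `children` and `evaluate` are hypothetical callbacks standing in for move generation and the eval surface (no TT, PVS, or LMR; those layer on top of this loop).

```python
# Minimal negamax with alpha-beta pruning. Scores are always from the
# side-to-move's perspective, so a child's score is negated at the parent.

def negamax(state, depth, alpha, beta, children, evaluate):
    moves = children(state)
    if depth == 0 or not moves:
        # Leaf node: static evaluation from the side-to-move's perspective.
        return evaluate(state)
    best = float("-inf")
    for child in moves:
        score = -negamax(child, depth - 1, -beta, -alpha, children, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # Beta cutoff: the opponent will never allow this line.
    return best

# Toy depth-1 check: leaf scores are from the opponent's perspective,
# so the root prefers the move that is worst for the opponent ("b").
tree = {"root": ["a", "b", "c"]}
leaf_scores = {"a": 3, "b": -5, "c": 2}
best = negamax("root", 1, float("-inf"), float("inf"),
               lambda s: tree.get(s, []), lambda s: leaf_scores[s])
```

The quiescence, TT, and move-ordering features listed above all refine this same recursion: better ordering makes the `alpha >= beta` cutoff fire earlier.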
+
+ **Evaluation (`default_eval.py` + swarm champion):**
+ - Piece values, pawn structure (doubled, isolated, passed, connected, chains)
+ - Piece mobility, center control, rook file activity
+ - King safety, castling rights, bishop pair bonus
+ - Development scoring, phase-aware transitions
+ - Specialized hooks: structure, tactical, activity, pawn-endgame, initiative
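To make the feature list concrete, here is a toy sketch of a material term with a bishop-pair bonus. It is NOT the project's `default_eval.py`; the board encoding (a flat list of piece letters, uppercase for the side to move) is a hypothetical stand-in for the real representation.

```python
# Toy material + bishop-pair evaluation, in centipawns, from the
# side-to-move's perspective. Illustrative only, not default_eval.py.

PIECE_VALUES = {"P": 100, "N": 320, "B": 330, "R": 500, "Q": 900, "K": 0}

def material_eval(pieces):
    score = 0
    for p in pieces:
        value = PIECE_VALUES[p.upper()]
        score += value if p.isupper() else -value
    # Bishop-pair bonus: two bishops cover both square colors.
    if sum(p == "B" for p in pieces) >= 2:
        score += 30
    if sum(p == "b" for p in pieces) >= 2:
        score -= 30
    return score

# Side to move has king + two bishops + pawn vs. king + knight.
score = material_eval(["K", "B", "B", "P", "k", "n"])
```

The real evaluation layers the other listed features (mobility, king safety, phase-aware weights) on top of the same additive scheme.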
+
+ ### 2. Episode Runtime (`src/zero960/runtime/`)
+
+ The bounded action environment that makes this an RL task:
+
+ **Action Space:**
+
+ | Action | Purpose | Reward Signal |
+ |--------|---------|--------------|
+ | `read_file` | Inspect current eval code | Penalty if redundant |
+ | `write_file` | Submit bounded replacement | Bonus for valid changed writes |
+ | `run_static_eval` | Quick position sanity check | Penalty if repeated |
+ | `run_match` | Full head-to-head match | Bonus for testing after write |
+ | `finish` | Declare done | Bonus only if engine improved |
+
+ **Key Design Decisions:**
+ - Invalid writes are rolled back instantly; broken code never poisons the episode
+ - Reward is grounded in actual match outcomes, not proxy text metrics
+ - Workflow hints guide the policy toward `write → test → finish`
+ - Observations include current `eval.py` contents, action history, remaining budget, and match feedback
+
+ ### 3. OpenEnv Integration (`src/zero960_env/`)
+
+ An OpenEnv 0.2.1-compliant wrapper with:
+ - FastAPI server with WebSocket support
+ - Structured action/observation models extending OpenEnv base types
+ - `openenv.yaml` manifest for HF Spaces deployment
+ - Docker-ready for production deployment
+
+ ### 4. Training & Optimization (`train/`)
+
+ Four improvement paths that compound on each other:
+
+ **Path 1: Teacher Distillation** (`codex_distill.py`)
+ - GPT-5.4 teacher generates bounded-action trajectories via ACP runtime
+ - Constrained to the same JSON action schema as the student
+ - Collected 35 successful episodes / 105 clean SFT rows
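A distilled trajectory becomes several SFT rows, one per teacher action (the reported 105 rows over 35 episodes averages three per episode). A minimal sketch of that flattening; the dict keys are assumptions, not the exact `codex_distill.py` output format.

```python
# Sketch: flatten one bounded-action trajectory into per-step SFT rows.
# Each row pairs the observation at step t with the teacher's action,
# serialized so the student learns to emit the JSON action schema.
import json

def trajectory_to_rows(steps):
    rows = []
    for obs, action in steps:
        rows.append({
            "prompt": obs,
            "completion": json.dumps(action),
        })
    return rows

# Hypothetical three-step teacher episode: write, test, finish.
steps = [
    ("eval.py v0 | score 0.31", {"name": "write_file", "content": "..."}),
    ("eval.py v1 | score 0.31", {"name": "run_match"}),
    ("eval.py v1 | score 0.58", {"name": "finish"}),
]
rows = trajectory_to_rows(steps)
```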
+
+ **Path 2: Student SFT** (`sft_student.py`)
+ - Distills teacher traces into Qwen 3.5-0.8B
+ - 98.76% token accuracy, 5 minutes on H100
+ - Student goes from -2.1 reward (never writes code) to +1.0 (full engineering loop)
+
+ **Path 3: GRPO Refinement** (`minimal_trl_openenv.py`)
+ - HF TRL GRPO over the bounded OpenEnv environment
+ - Environment-grounded RL: structured multi-step tool use, not text completion
+ - Three modes: handcrafted demo, single inference, full training
+ - Also ran Qwen 3.5-9B QLoRA GRPO as a scaling probe on H100
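GRPO's core mechanism is computing advantages relative to a group of rollouts from the same prompt instead of a learned value baseline. A minimal sketch of that group-relative normalization (not TRL's actual internals); the reward values echo the -2.1/+1.0 episode rewards mentioned above.

```python
# Sketch of GRPO's group-relative advantage: sample several rollouts per
# prompt, then normalize each rollout's reward against the group's mean
# and standard deviation. Replaces a learned value baseline.
import statistics

def group_advantages(rewards, eps=1e-4):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # eps guards against a zero-variance group
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one episode; only the first completed the
# write -> test -> finish loop, so only it gets positive advantage.
advs = group_advantages([1.0, -2.1, -2.1, -2.1])
```

This is why distillation has to come first: if every rollout in the group fails identically, the advantages are all near zero and there is no gradient signal to follow.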
+
+ **Path 4: Codex Agent Swarm** (`codex_swarm.py`)
+ - Over a dozen autonomous Codex agents across multiple rounds
+ - 5 specialized worker roles targeting different chess knowledge domains
+ - Champion/challenger tournament with staged screening
+ - Dual surface: eval-only and search-only editing modes
+ - 4 eval champions promoted, search-surface promotions active
+
+ **Benchmark Suite:**
+ - `benchmark_eval.py` — eval-vs-eval on held-out Chess960 positions
+ - `benchmark_engine.py` — full engine-vs-engine (each side owns its own search + eval)
+ - `benchmark_uci.py` — UCI anchor comparison against Stockfish
+ - `benchmark_league.py` — league self-play against own champion history
+ - `build_dashboard.py` — static HTML dashboard with progression charts and Stockfish anchor bars
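Match scores from these benchmarks translate to Elo differences via the standard logistic model. A generic sketch of that conversion (what anchor-comparison dashboards typically compute, not necessarily `build_dashboard.py`'s exact code):

```python
# Convert a match score (wins + draws/2, out of `games`) into an Elo
# difference using the standard logistic model. Generic sketch, not
# necessarily build_dashboard.py's implementation.
import math

def elo_diff(score, games):
    s = score / games
    s = min(max(s, 1e-6), 1 - 1e-6)  # clamp: 0% or 100% would map to infinite Elo
    return -400.0 * math.log10(1.0 / s - 1.0)

# A 75% score over 100 games corresponds to roughly +191 Elo.
diff = elo_diff(75, 100)
```

The same formula, run against a fixed Stockfish anchor, is how internal Elo gains like the +596.5 figure get grounded in an external scale.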
 
  ## Training Strategy
 
+ The critical insight that shaped this architecture:
 
+ > Raw RL fails at this task because base models never discover the core engineering workflow. GRPO can't optimize a policy that doesn't explore the right actions. **Teacher distillation solves action discovery; RL refines an already-competent policy.**
 
+ Order of operations:
+ 1. Teacher distillation → behavioral competence
+ 2. Student SFT → workflow reliability
+ 3. GRPO refinement → reward optimization
+ 4. Codex swarm → autonomous engine improvement (runs in parallel)
 
  ## Deployment
 
+ | Target | Status |
+ |--------|--------|
+ | HF Spaces | OpenEnv 0.2.1 compliant, Docker-ready |
+ | Northflank H100 | Heavy training + large benchmarks |
+ | Local dev | Fastest iteration loop for environment + prompt work |