Spaces:

qtzx06
/

0x960

Sleeping

App Files Files Community

0x960 / docs /architecture.md

qtzx06

docs: expand architecture doc with full search stack and training pipeline details

7109aa9 about 1 month ago

preview code

raw

history blame contribute delete

4.94 kB

Architecture

System Overview

0x960 is a complete self-improvement system with four tightly integrated components, built in ~20 hours at the OpenEnv Hackathon. The system produced +596.5 Elo in internal engine strength and pushed the engine to competitive with Stockfish 1600 in Chess960.

Core Components

1. Chess960 Engine (`src/zero960/engine/`)

A purpose-built Chess960 engine with a competitive classical search stack:

Search (search.py):

Alpha-beta negamax with quiescence search (in-check evasion handling)
Transposition table with persistent TT reuse across moves
Principal Variation Search (PVS) — 40% speed improvement on later plies
Null-move pruning for non-check, non-endgame nodes
Late Move Reductions (LMR) for quiet moves at depth >= 3
Aspiration windows at root, seeded from persistent TT
Killer moves + history heuristic ordering
Selective root depth extensions for openings, checks, and endgames

Evaluation (default_eval.py + swarm champion):

Piece values, pawn structure (doubled, isolated, passed, connected, chains)
Piece mobility, center control, rook file activity
King safety, castling rights, bishop pair bonus
Development scoring, phase-aware transitions
Specialized hooks: structure, tactical, activity, pawn-endgame, initiative

2. Episode Runtime (`src/zero960/runtime/`)

The bounded action environment that makes this an RL task:

Action Space:

Action	Purpose	Reward Signal
`read_file`	Inspect current eval code	Penalty if redundant
`write_file`	Submit bounded replacement	Bonus for valid changed writes
`run_static_eval`	Quick position sanity check	Penalty if repeated
`run_match`	Full head-to-head match	Bonus for testing after write
`finish`	Declare done	Bonus only if engine improved

Key Design Decisions:

Invalid writes are rolled back instantly — broken code never poisons the episode
Reward is grounded in actual match outcomes, not proxy text metrics
Workflow hints guide the policy toward write → test → finish
Observations include current eval.py contents, action history, remaining budget, and match feedback

3. OpenEnv Integration (`src/zero960_env/`)

OpenEnv 0.2.1 compliant wrapper with:

FastAPI server with WebSocket support
Structured action/observation models extending OpenEnv base types
openenv.yaml manifest for HF Spaces deployment
Docker-ready for production deployment

4. Training & Optimization (`train/`)

Four improvement paths that compound on each other:

Path 1: Teacher Distillation (codex_distill.py)

GPT-5.4 teacher generates bounded-action trajectories via ACP runtime
Constrained to same JSON action schema as student
Collected 35 successful episodes / 105 clean SFT rows

Path 2: Student SFT (sft_student.py)

Distills teacher traces into Qwen 3.5-0.8B
98.76% token accuracy, 5 minutes on H100
Student goes from -2.1 reward (never writes code) to +1.0 (full engineering loop)

Path 3: GRPO Refinement (minimal_trl_openenv.py)

HF TRL GRPO over bounded OpenEnv environment
Environment-grounded RL: structured multi-step tool use, not text completion
Three modes: handcrafted demo, single inference, full training
Also ran Qwen 3.5-9B QLoRA GRPO as scaling probe on H100

Path 4: Codex Agent Swarm (codex_swarm.py)

Over a dozen autonomous Codex agents across multiple rounds
5 specialized worker roles targeting different chess knowledge domains
Champion/challenger tournament with staged screening
Dual surface: eval-only and search-only editing modes
4 eval champions promoted, search-surface promotions active

Benchmark Suite:

benchmark_eval.py — eval-vs-eval on held-out Chess960 positions
benchmark_engine.py — full engine-vs-engine (each side owns its own search + eval)
benchmark_uci.py — UCI anchor comparison against Stockfish
benchmark_league.py — league self-play against own champion history
build_dashboard.py — static HTML dashboard with progression charts and Stockfish anchor bars

Training Strategy

The critical insight that shaped this architecture:

Raw RL fails at this task because base models never discover the core engineering workflow. GRPO can't optimize a policy that doesn't explore the right actions. Teacher distillation solves action discovery; RL refines an already-competent policy.

Order of operations:

Teacher distillation → behavioral competence
Student SFT → workflow reliability
GRPO refinement → reward optimization
Codex swarm → autonomous engine improvement (runs in parallel)

Deployment

Target	Status
HF Spaces	OpenEnv 0.2.1 compliant, Docker-ready
Northflank H100	Heavy training + large benchmarks
Local dev	Fastest iteration loop for environment + prompt work

Architecture

System Overview

Core Components

1. Chess960 Engine (src/zero960/engine/)

2. Episode Runtime (src/zero960/runtime/)

3. OpenEnv Integration (src/zero960_env/)