Demo Script

30-Second Version

0x960 is an OpenEnv self-improvement environment where an AI learns to engineer chess engines, not play chess. It gets a bounded coding workspace, edits engine evaluation code, tests changes with real matches, and is rewarded only when the engine actually gets stronger. We went from a base model that never wrote a single line of code (reward: -2.1) to a distilled student that reliably executes the full engineering loop (reward: +1.0). In parallel, our autonomous Codex agent swarm raised engine strength by +596.5 Elo over the internal search baseline and beat the Stockfish 1320 anchor by +221.1 Elo, making the engine competitive with Stockfish 1600 in local Chess960 benchmarks.

One-Minute Outline

1. Opening (10s)

0x960 is a bounded self-improvement environment for Chess960 engine engineering, built on OpenEnv 0.2.1.

We turned engine engineering into a trainable RL task: inspect code, edit it, test it against a baseline, and get rewarded only when the engine is measurably stronger.
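The "rewarded only when measurably stronger" rule can be sketched as a small function. This is a minimal illustration, not the environment's actual code: `MatchResult`, `episode_reward`, the `margin` threshold, and the specific reward values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchResult:
    """Outcome of a candidate-vs-baseline match (hypothetical container)."""
    wins: int
    draws: int
    losses: int

    @property
    def score(self) -> float:
        """Fraction of points won by the candidate; a draw counts as half."""
        games = self.wins + self.draws + self.losses
        return (self.wins + 0.5 * self.draws) / games if games else 0.5


def episode_reward(result: Optional[MatchResult], margin: float = 0.02) -> float:
    """Illustrative reward: positive only for a measured strength gain.

    The constants here are placeholders, not the project's real values.
    """
    if result is None:
        return -1.0  # the agent finished without ever running a match
    if result.score > 0.5 + margin:
        return 1.0   # the edited engine beat the baseline
    return 0.0       # the edit was tested but did not help
```

The point of the gate is that reading files or writing code earns nothing by itself; only a match result above the baseline does.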

2. Why Chess960 (10s)

Chess960 randomizes the starting position across 960 setups. No opening books, no memorized lines. The engine has to actually understand chess positions to improve; you can't game the reward by memorizing patterns.
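To make the 960 setups concrete, here is a sketch that builds a back rank from a position number using the widely used Scharnagl numbering, in which position 518 is the classical setup. Whether 0x960 indexes positions this way is an assumption; the function name is illustrative.

```python
from itertools import combinations

# The 10 ways to place two knights on five free squares, in table order.
KNIGHT_PAIRS = list(combinations(range(5), 2))


def chess960_back_rank(index: int) -> str:
    """Return White's back rank (files a to h) for a Scharnagl number 0..959."""
    if not 0 <= index < 960:
        raise ValueError("index must be in 0..959")
    rank = [None] * 8

    index, light = divmod(index, 4)
    rank[2 * light + 1] = "B"          # light-squared bishop: b, d, f or h
    index, dark = divmod(index, 4)
    rank[2 * dark] = "B"               # dark-squared bishop: a, c, e or g

    index, queen = divmod(index, 6)
    free = [i for i, piece in enumerate(rank) if piece is None]
    rank[free[queen]] = "Q"            # queen on one of six free squares

    free = [i for i, piece in enumerate(rank) if piece is None]
    for slot in KNIGHT_PAIRS[index]:   # index is now in 0..9
        rank[free[slot]] = "N"

    free = [i for i, piece in enumerate(rank) if piece is None]
    for square, piece in zip(free, "RKR"):
        rank[square] = piece           # king always lands between the rooks
    return "".join(rank)
```

Sampling `index` uniformly from `range(960)` gives each legal setup equal probability, which is what removes the value of memorized openings.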

3. The Problem We Solved (15s)

When we dropped Qwen 3.5 into this environment, it scored -2.1 reward: it never once attempted to write code. It just read files and quit. Raw GRPO RL couldn't fix this because the policy never sampled a write action, so there were no successful rollouts to reinforce.
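A small sketch shows why GRPO stalls in this situation. It uses the standard group-relative advantage (reward minus group mean, divided by group standard deviation) in simplified form; the function name and the `eps` value are illustrative and not taken from the project or from TRL's internals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each rollout's reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Every rollout fails in the same way, so no rollout stands out from the group.
stuck = group_relative_advantages([-2.1] * 8)

# One rollout that completes the loop is enough to create a learning signal.
mixed = group_relative_advantages([-2.1, -2.1, -2.1, 1.0])
```

When all rewards in a group are identical, every advantage is zero and the policy gradient vanishes. Distillation first puts successful trajectories within reach of the student's sampling, so later GRPO groups contain the reward variance the method needs.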

Our breakthrough: teacher-student distillation first, RL second.

A GPT-5.4 teacher generates successful bounded-action trajectories via the ACP runtime
A Qwen 3.5-0.8B student learns the workflow through SFT (98.76% token accuracy in 5 minutes on an H100)
TRL GRPO refines the student on real match reward
We also ran Qwen 3.5-9B QLoRA GRPO as a scaling probe on the Northflank H100

After distillation, the student reaches reward +1.0 and reliably executes the write_file → run_match → finish sequence.

4. The Codex Agent Swarm (15s)

In parallel, we built an autonomous Codex agent swarm: over a dozen agents across multiple rounds, each specializing in a different area of chess knowledge (king safety, tactics, pawn structure, piece activity, initiative).

The swarm runs in a champion/challenger tournament format: every patch is benchmarked on held-out Chess960 positions, and only verified winners are promoted. Four eval champions were promoted through the gate. The swarm also edits search heuristics directly.
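The promotion gate can be sketched roughly as follows. This is an assumption-laden illustration: `Candidate`, `promote`, and the `threshold` value are hypothetical, and the real gate may use different statistics or more games.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A challenger patch and its held-out result (hypothetical structure)."""
    name: str
    heldout_score: float  # score vs. the current champion, in 0..1


def promote(champion: str, challengers: list[Candidate],
            threshold: float = 0.55) -> str:
    """Return the new champion's name, or keep the old one.

    A challenger is promoted only if it is the best of its round AND
    clears the threshold on held-out positions.
    """
    best = max(challengers, key=lambda c: c.heldout_score, default=None)
    if best is not None and best.heldout_score >= threshold:
        return best.name
    return champion


round_one = [Candidate("king-safety", 0.58), Candidate("pawn-structure", 0.51)]
new_champion = promote("baseline", round_one)
```

Requiring a margin above 0.5 rather than any win guards against promoting a patch that got lucky over a small number of games.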

5. Results (10s)

+596.5 Elo internal gain (vs search baseline)
+221.1 Elo vs Stockfish 1320 anchor
Competitive with Stockfish 1600 in local Chess960 benchmarks
Engine went from bare negamax to PVS (principal variation search) + TT (transposition table) + null-move pruning + LMR (late move reductions) + aspiration windows
Full benchmark suite: eval-vs-eval, engine-vs-engine, UCI anchors, league self-play, static dashboard
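For readers who want to relate the Elo figures above to match scores, the standard logistic Elo model converts between the two. This sketch only shows that textbook conversion; how the project's benchmark suite actually estimates Elo is not stated here, and the function names are illustrative.

```python
import math


def elo_diff_from_score(score: float) -> float:
    """Elo difference implied by a match score strictly between 0 and 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


def expected_score(elo_diff: float) -> float:
    """Expected score for the stronger side at a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
```

Under this model, a +221.1 Elo edge corresponds to an expected score of roughly 0.78, and +596.5 Elo to roughly 0.97.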

All built in ~20 hours at the hackathon. Two parallel self-improvement loops, policy learning and autonomous engine search, feed the same engine, and every claim is backed by held-out match results.