qtzx06 committed on
Commit
eac9d9f
·
1 Parent(s): b0b9657

feat: finalize swarm tooling and submission artifacts

.gitignore CHANGED
@@ -5,6 +5,8 @@ __pycache__/
5
  build/
6
  dist/
7
  outputs/
8
  *.egg-info/
9
  *.pyc
10
-
 
5
  build/
6
  dist/
7
  outputs/
8
+ zero960_logs/
9
+ zero960_grpo_output/
10
+ zero960_grpo_final/
11
  *.egg-info/
12
  *.pyc
 
README.md CHANGED
@@ -1,48 +1,252 @@
1
  # 0x960
2
 
3
- 0x960 is an OpenEnv-oriented environment where a model improves a minimal Chess960 engine by editing a bounded evaluation file and getting rewarded by match outcomes.
4
 
5
- ## background
6
 
7
- Chess960 is a strong benchmark for generalization because the rules of chess stay the same while the starting position changes across 960 legal configurations. That removes much of the opening-book structure that standard chess systems can exploit and puts more pressure on transferable positional reasoning and search.
8
 
9
- Recent engine and research results make this useful for our setting. Classical search engines such as Stockfish remain extremely strong in Chess960, while several neural and RL-heavy systems lose more relative strength than they do in standard chess. Recent work also shows that transformer chess models trained on standard chess suffer noticeable drops on Chess960 positions, which suggests that high in-distribution performance can still rely on brittle configuration-specific pattern matching.
10
 
11
- 0x960 turns that observation into an OpenEnv task. Instead of asking a model to output chess moves directly, we ask it to improve a minimal Chess960 engine through bounded code edits. The model reads files, edits the evaluation logic, runs checks, and gets rewarded by whether the edited engine performs better against a baseline.
 
12
 
13
- ## why chess960
14
 
15
- - Chess960 is a controlled distribution shift: the rules are unchanged, but the initial conditions vary.
16
- - That makes it a cleaner test of robustness than standard chess alone.
17
- - The agent is not rewarded for imitation or move prediction; it is rewarded for improving a real system.
18
- - This makes the environment a better fit for OpenEnv than a direct gameplay benchmark because it requires multi-step tool use, debugging, and iterative refinement.
19
 
20
- ## repo layout
21
 
22
- - `docs/`: concept, architecture, and scope docs
23
- - `src/zero960/`: shared engine and episode runtime logic
24
- - `src/zero960_env/`: OpenEnv-facing models, server, and client
25
- - `train/`: minimal TRL/Colab-oriented training entrypoints
 
26
 
27
- ## current status
28
 
29
- This repo currently contains a thin but functional skeleton:
30
 
31
- - minimal Chess960 engine core
32
- - workspace-based bounded file editing runtime
33
- - OpenEnv wrapper scaffold
34
- - minimal TRL rollout stub
35
 
36
- ## next steps
37
 
38
- 1. tighten the engine and reward harness
39
- 2. validate the OpenEnv app structure against `0.2.1`
40
- 3. add a small Colab notebook around the training stub
41
- 4. deploy the server to HF Spaces
42
 
43
- ## supporting docs
44
 
45
- - `docs/why_chess960.md`: short research framing for judges and README reuse
46
- - `docs/demo-script.md`: one-minute demo outline
47
- - `docs/process.md`: chronological build log for demo storytelling and judging
48
- - `docs/agent-log-instruction.md`: reusable instruction snippet for coding agents
 
1
  # 0x960
2
 
3
+ 0x960 is an OpenEnv environment where a model improves a minimal Chess960 engine by editing a bounded `eval.py` file and getting rewarded by match outcomes.
4
 
5
+ The core task is not "play chess." The task is "act like a bounded engine engineer": inspect the current evaluation logic, edit it, test it, and decide when to finish.
6
 
7
+ ## Current Direction
8
 
9
+ The repo currently supports two training paths:
10
 
11
+ - teacher distillation first: collect high-quality bounded-action trajectories from Codex or another strong coding agent, then fine-tune a smaller open model on those traces
12
+ - RL refinement second: use the OpenEnv reward loop to sharpen a student model that already knows the `write_file -> run_match -> finish` workflow
13
 
14
+ This ordering is deliberate. The main failure mode so far has not been raw model size; it has been weak action priors. Base models tend to spam `run_static_eval` or `finish` instead of discovering code edits. Distillation fixes that faster than asking GRPO to invent the workflow from scratch.
15
 
16
+ There is also a complementary outer-loop path: use multiple local Codex workers to iterate directly on the engine, benchmark every patch, keep only Elo-positive changes, and then distill the best traces back into an open student. See [Codex Swarm Plan](docs/codex-swarm-plan.md).
 
 
 
17
 
18
+ ## Repo Layout
19
 
20
+ - `src/zero960/`: engine, workspace, and episode runtime
21
+ - `src/zero960_env/`: OpenEnv server, models, and client
22
+ - `train/minimal_trl_openenv.py`: handcrafted demo, inference loop, and GRPO training entrypoint
23
+ - `train/codex_distill.py`: Codex teacher rollout collector and SFT sample exporter
24
+ - `docs/`: concise project docs and process log
25
 
26
+ ## Local Smoke Test
27
 
28
+ Start the OpenEnv server:
29
 
30
+ ```sh
31
+ uv run python -m uvicorn zero960_env.server.app:app --host 127.0.0.1 --port 8000
32
+ ```
 
33
 
34
+ Run the bounded-action demo:
35
 
36
+ ```sh
37
+ uv run python -m train.minimal_trl_openenv --mode handcrafted --base-url http://127.0.0.1:8000
38
+ ```
 
39
 
40
+ ## Codex Teacher Distillation
41
 
42
+ Prerequisites:
43
+
44
+ - Codex CLI installed and logged in
45
+ - local OpenEnv server running
46
+
47
+ Collect teacher rollouts and export SFT-ready samples:
48
+
49
+ ```sh
50
+ uv run python -m train.codex_distill \
51
+ --base-url http://127.0.0.1:8000 \
52
+ --model gpt-5.4 \
53
+ --episodes 20
54
+ ```
55
+
56
+ Outputs go to `outputs/codex_distill/`:
57
+
58
+ - `teacher_rollouts_*.jsonl`: raw per-episode teacher traces
59
+ - `sft_samples_*.jsonl`: filtered turn-level chat samples for student fine-tuning
60
+
61
+ ## Student SFT
62
+
63
+ Train a small student on the collected teacher traces:
64
+
65
+ ```sh
66
+ uv run python -m train.sft_student \
67
+ --model Qwen/Qwen3.5-0.8B \
68
+ --output-dir outputs/sft_qwen_0p8b
69
+ ```
70
+
71
+ Dry-run the dataset loader first if you want to verify counts and filtering:
72
+
73
+ ```sh
74
+ uv run python -m train.sft_student --dry-run
75
+ ```
76
+
77
+ The loader validates the assistant action JSON and drops malformed older rows automatically, so the early pre-cleanup SFT dump does not need manual editing.
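
As a rough illustration, the kind of filtering the loader performs looks like this (the function name, row fields, and `"tool"` key are assumptions for this sketch, not the actual `train/sft_student.py` code; only the five bounded tool names come from the environment):

```python
import json

# Hypothetical sketch: keep an SFT row only if its assistant turn parses as a
# JSON action naming one of the bounded tools. Row/field names are assumptions.
VALID_TOOLS = {"read_file", "write_file", "run_static_eval", "run_match", "finish"}

def is_valid_sample(raw_line: str) -> bool:
    try:
        row = json.loads(raw_line)
        action = json.loads(row["assistant"])
    except (json.JSONDecodeError, KeyError, TypeError):
        return False
    return isinstance(action, dict) and action.get("tool") in VALID_TOOLS

good = '{"assistant": "{\\"tool\\": \\"run_match\\"}"}'
bad = '{"assistant": "let me think about the position first"}'
```

A real loader would also validate argument fields, but this captures the drop-malformed-rows behavior described above.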
78
+
79
+ ## Benchmarking Engine Strength
80
+
81
+ Compare two eval files on held-out Chess960 start positions:
82
+
83
+ ```sh
84
+ uv run python -m train.benchmark_eval \
85
+ --candidate-file src/zero960/workspace_template/eval.py \
86
+ --baseline-file src/zero960/engine/default_eval.py \
87
+ --positions 64 \
88
+ --depth 2
89
+ ```
90
+
91
+ This is the metric that matters for "better chess" in this repo. Training reward can teach the workflow, but real strength should be checked with a held-out match score.
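
To translate a held-out match score into an Elo estimate, the standard logistic conversion applies (this is generic Elo math, not a helper shipped in this repo):

```python
import math

def elo_delta_from_score(score: float) -> float:
    """Invert the Elo expected-score formula: 0.5 -> 0, ~0.64 -> ~+100 Elo.

    Only meaningful for scores strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)
```

Scores from a few dozen games are noisy, so treat the resulting Elo delta as a direction indicator rather than a rating.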
92
+
93
+ Benchmark a local eval file against an external UCI engine such as Stockfish:
94
+
95
+ ```sh
96
+ uv run python -m train.benchmark_uci \
97
+ --candidate-file src/zero960/workspace_template/eval.py \
98
+ --engine-command stockfish \
99
+ --engine-option UCI_LimitStrength=true \
100
+ --engine-option UCI_Elo=1320 \
101
+ --positions 32 \
102
+ --candidate-depth 2 \
103
+ --engine-depth 1
104
+ ```
105
+
106
+ This is the cleanest anchor for demo purposes: keep the repo baseline as `0 Elo`, then report how the current champion scores against fixed Stockfish settings under the same Chess960 benchmark.
107
+
108
+ Benchmark two full engine roots so each side uses its own `search.py` plus its own eval file:
109
+
110
+ ```sh
111
+ uv run python -m train.benchmark_engine \
112
+ --candidate-root /tmp/0x960-codex-swarm/worker-1 \
113
+ --baseline-root /Users/qtzx/Desktop/codebase/0x960 \
114
+ --positions 32 \
115
+ --depth 2
116
+ ```
117
+
118
+ Use this when you want to open search heuristics safely. The older eval-only benchmark is still the right promotion gate while workers only edit `eval.py`, but once search changes are allowed, head-to-head must load each side's own `search.py` instead of sharing the live repo implementation.
119
+
120
+ To benchmark a candidate against the original baseline plus accepted swarm champions:
121
+
122
+ ```sh
123
+ uv run python -m train.benchmark_league \
124
+ --candidate-file outputs/codex_swarm/champion_eval.py \
125
+ --positions 16
126
+ ```
127
+
128
+ By default this league includes the original baseline and the most recent accepted swarm snapshots, while skipping any snapshot that is byte-identical to the candidate. This is the simplest self-play-style check for “did the engine improve against its own history, not just one baseline?”
129
+
130
+ To generate a static dashboard with swarm progression, league results, and optional Stockfish anchors:
131
+
132
+ ```sh
133
+ uv run python -m train.build_dashboard --include-stockfish
134
+ ```
135
+
136
+ This writes [index.html](outputs/dashboard/index.html) plus the backing [dashboard_data.json](outputs/dashboard/dashboard_data.json). Open the HTML file locally to inspect accepted champions, internal Elo deltas, league self-play, and anchor bars in one place.
137
+
138
+ To generate submission-ready PNGs for media uploads (score progression + anchor bars), run:
139
+
140
+ ```sh
141
+ python3 scripts/generate_submission_media.py
142
+ ```
143
+
144
+ This writes tracked files under `media/submission/`.
145
+
146
+ To also surface the current search gain against the saved pre-upgrade engine baseline:
147
+
148
+ ```sh
149
+ uv run python -m train.build_dashboard \
150
+ --include-engine-progress \
151
+ --engine-baseline-root /tmp/0x960-search-baseline \
152
+ --include-stockfish
153
+ ```
154
+
155
+ ## Local Codex Swarm
156
+
157
+ Initialize the local champion plus worker sandboxes:
158
+
159
+ ```sh
160
+ uv run python -m train.codex_swarm setup --workers 3
161
+ ```
162
+
163
+ Run one champion/challenger round with Codex workers:
164
+
165
+ ```sh
166
+ uv run python -m train.codex_swarm run \
167
+ --workers 5 \
168
+ --rounds 1 \
169
+ --model gpt-5.3-codex \
170
+ --screen-positions 8 \
171
+ --positions 16 \
172
+ --worker-timeout-sec 180 \
173
+ --max-diff-lines 80
174
+ ```
175
+
176
+ Run a search-focused round that edits only `src/zero960/engine/search.py` and benchmarks full engine roots:
177
+
178
+ ```sh
179
+ uv run python -m train.codex_swarm run \
180
+ --workers 3 \
181
+ --rounds 1 \
182
+ --surface search \
183
+ --model gpt-5.3-codex \
184
+ --screen-positions 4 \
185
+ --positions 8 \
186
+ --worker-timeout-sec 180 \
187
+ --max-diff-lines 100
188
+ ```
189
+
190
+ Dry-run the coordinator without invoking Codex:
191
+
192
+ ```sh
193
+ uv run python -m train.codex_swarm run --workers 3 --rounds 1 --dry-run --serial
194
+ ```
195
+
196
+ Run the swarm in a continuous champion/challenger loop:
197
+
198
+ ```sh
199
+ uv run python -m train.codex_swarm run \
200
+ --workers 5 \
201
+ --continuous \
202
+ --max-stall-rounds 3 \
203
+ --model gpt-5.3-codex \
204
+ --screen-positions 8 \
205
+ --positions 16 \
206
+ --max-diff-lines 80 \
207
+ --worker-timeout-sec 180
208
+ ```
209
+
210
+ The coordinator now rejects overgrown whole-file rewrites by default. Workers are expected to make surgical `eval.py` edits that stay within the `--max-diff-lines` budget; increasing that flag should be a deliberate choice, not the default. Codex workers no longer run the held-out match benchmark themselves. They patch, optionally do one tiny local sanity check, and stop; the coordinator runs an `8`-position screen on every eligible patch and only runs the heavier final benchmark on the best screen winner.
211
+
212
+ For `--surface search`, the coordinator freezes a baseline engine snapshot for the round and uses [benchmark_engine.py](train/benchmark_engine.py) so each side gets its own `search.py` plus its own eval. That is the safe path once workers are allowed to touch search heuristics.
213
+
214
+ The coordinator tries real git worktrees first and falls back to lightweight local clones under `/tmp/0x960-codex-swarm/` when worktree metadata is not writable. Swarm state and accepted challengers are recorded under `outputs/codex_swarm/`. The fast default is now a 3-worker wave, and the coordinator reorders hook lanes each round so empty hooks are targeted first, then simple passthrough hooks, and already-customized hooks last.
215
+
216
+ Each worker now gets a small local research pack before it edits:
217
+
218
+ - `AGENTS.md`, `README.md`, and [Codex Swarm Plan](docs/codex-swarm-plan.md)
219
+ - benchmark scripts in `train/`
220
+ - the current champion snapshot
221
+ - the swarm ledger
222
+ - accepted historical winners under `outputs/codex_swarm/accepted/`
223
+
224
+ The default roles are:
225
+
226
+ - `worker-1`: Structure Researcher
227
+ - `worker-2`: Tactical Safety Researcher
228
+ - `worker-3`: Activity Researcher
229
+ - `worker-4`: Pawn-Endgame Researcher
230
+ - `worker-5`: Initiative Tuner
231
+
232
+ Workers still edit only one file per round. On the default `eval` surface they patch `src/zero960/workspace_template/eval.py`; on the `search` surface they patch `src/zero960/engine/search.py`. In both modes they can inspect the full local research pack to avoid repeating prior winners and to justify their patch against actual benchmark history.
233
+
234
+ To copy the current swarm champion back into the source tree:
235
+
236
+ ```sh
237
+ uv run python -m train.codex_swarm promote
238
+ ```
239
+
240
+ ## Notes
241
+
242
+ - The environment already includes the current `eval.py` contents in each observation.
243
+ - Reward shaping now favors valid edits, explicit `run_match`, and clean `finish`.
244
+ - Invalid writes are rolled back immediately so bad code does not poison the rest of the episode.
245
+
246
+ ## Docs
247
+
248
+ - [Architecture](docs/architecture.md)
249
+ - [Codex Swarm Plan](docs/codex-swarm-plan.md)
250
+ - [Why Chess960](docs/why_chess960.md)
251
+ - [Demo Script](docs/demo-script.md)
252
+ - [Process Log](docs/process.md)
docs/agent-log-instruction.md DELETED
@@ -1,30 +0,0 @@
1
- # agent log instruction
2
-
3
- Use this as a reusable instruction snippet for coding agents working on 0x960.
4
-
5
- ## short snippet
6
-
7
- After each meaningful implementation step, append a short entry to `docs/process.md`.
8
-
9
- Each entry should:
10
-
11
- - use the current timestamp
12
- - summarize what changed in 2-5 factual bullets
13
- - note any important decisions or blockers
14
- - end with a clear next step
15
-
16
- Do not paste large raw command outputs into the log. Summarize them instead.
17
-
18
- ## longer snippet
19
-
20
- You are working in the 0x960 repo. Maintain `docs/process.md` as the project build log.
21
-
22
- Rules:
23
-
24
- - append to the file after each meaningful work block, not after every micro-step
25
- - keep entries concise and factual
26
- - include what changed, why it changed, blockers, and the next step
27
- - prefer evidence summaries over raw terminal dumps
28
- - optimize the log for demo storytelling and judge review
29
-
30
- If you make a product or architecture decision, record it. If a test fails, record the failure briefly and say what remains to fix.
 
docs/architecture.md CHANGED
@@ -1,188 +1,71 @@
1
- # architecture
2
 
3
- ## stack decisions
4
 
5
- These are fixed by the hackathon or by scope discipline:
6
 
7
- - environment interface: OpenEnv `0.2.1`
8
- - deployment target: HF Spaces
9
- - training demo: HF TRL or Unsloth in Colab
10
- - core model class: open-weight OSS model only
11
- - optional infra for real training: Northflank H100
 
 
 
12
 
13
- Closed frontier models are not part of the core training path. If we use them at all, they are comparison-only in the demo layer.
14
 
15
- ## system shape
16
 
17
- The system should have four layers.
 
 
 
 
18
 
19
- ### 1. engine workspace
20
 
21
- A minimal Chess960 engine scaffold with:
22
 
23
- - fixed move generation
24
- - fixed search implementation
25
- - fixed tournament runner
26
- - one narrow editable surface
27
-
28
- Recommended editable surface:
29
-
30
- - `engine/eval.py`
31
-
32
- Optional later extension:
33
-
34
- - `engine/weights.json`
35
-
36
- The whole repo should not be editable by the policy. Narrow edit scope keeps training stable and makes the story legible.
37
-
38
- ### 2. environment runtime
39
-
40
- The environment owns the full episode lifecycle:
41
-
42
- 1. clone a fresh engine workspace
43
- 2. sample one Chess960 start or a small suite of starts
44
- 3. expose bounded actions to the model
45
- 4. execute actions and return observations
46
- 5. after the step budget is exhausted, run matches
47
- 6. compute reward and terminate
48
-
49
- The environment should be written as a normal Python runtime first, then wrapped cleanly for OpenEnv.
50
-
51
- ### 3. reward and evaluation harness
52
-
53
- This layer runs fast matches between the edited engine and baselines.
54
-
55
- It should provide:
56
-
57
- - training reward matches
58
- - held-out evaluation matches
59
- - crash handling
60
- - reproducible position sampling
61
-
62
- ### 4. training loop
63
-
64
- The training loop should use:
65
-
66
- - GRPO or equivalent in TRL/Unsloth
67
- - a rollout function that runs a full episode
68
- - checkpoint logging
69
- - reward curves and crash-rate metrics
70
-
71
- The training loop is minimal by design. The goal is to show the environment can produce a learnable signal, not to max out Elo during the hackathon.
72
-
73
- ## episode contract
74
-
75
- ### observation
76
-
77
- Each step should return a compact structured observation containing:
78
 
79
  - the task instruction
80
- - current editable file contents
81
  - recent action history
82
- - recent command outputs or error messages
83
- - remaining step budget
84
- - start-position metadata
85
-
86
- ### actions
87
-
88
- Start with structured actions, not open shell access.
89
-
90
- - `read_file(path)`
91
- - `write_file(path, content)`
92
- - `run_static_eval()`
93
- - `run_match()`
94
- - `finish()`
95
-
96
- If needed, a restricted shell tool can be added later, but it should not be required for MVP.
97
-
98
- ### termination
99
-
100
- An episode ends when:
101
-
102
- - the agent calls `finish()`
103
- - the step budget is exhausted
104
- - the workspace becomes invalid in a fatal way
105
-
106
- ## reward design
107
-
108
- Default reward for MVP:
109
-
110
- - primary reward: match score against a fixed baseline engine
111
- - penalty: invalid edit, crash, or timeout
112
-
113
- Recommended first-pass formula:
114
-
115
- `reward = score_vs_fixed_baseline - crash_penalty`
116
-
117
- Do not make parent-checkpoint self-play the only reward. If we use it, it should be a secondary signal only.
118
-
119
- ## evaluation protocol
120
-
121
- Training and evaluation must be separated.
122
-
123
- ### training
124
-
125
- - sample Chess960 starts from a training pool
126
- - play a small number of fast games against the fixed baseline
127
- - compute reward
128
-
129
- ### held-out eval
130
-
131
- - separate fixed start-position suite
132
- - fixed baseline configuration
133
- - fixed game count and time control
134
- - run periodically on saved checkpoints
135
-
136
- This is how we avoid fooling ourselves with a rising training reward that does not correspond to stronger engines.
137
-
138
- ## model strategy
139
-
140
- We should optimize for stable tool behavior, not for the largest model possible.
141
-
142
- Recommended order:
143
-
144
- 1. `Qwen3.5-9B`
145
- 2. one backup model with good coding/tool-use behavior
146
- 3. only try larger models if the smaller path is already stable
147
-
148
- Single-H100-safe priority:
149
-
150
- - dense 7B to 14B class models first
151
- - larger MoE models only if integration is already working
152
-
153
- ## speed target
154
-
155
- A good MVP episode should be cheap enough to run many times.
156
-
157
- Target envelope:
158
 
159
- - step budget: `4-8` actions
160
- - match count: very small during training
161
- - episode runtime: ideally under `30s`
162
 
163
- If episodes are too slow, we reduce game count before we add complexity elsewhere.
164
 
165
- ## deployment
166
 
167
- ### HF Spaces
168
 
169
- HF Spaces hosts the OpenEnv environment and provides the submission artifact judges can inspect.
 
 
 
170
 
171
- ### Colab
172
 
173
- Colab provides the minimal public training notebook using TRL or Unsloth.
174
 
175
- ### Northflank
176
 
177
- Northflank is the practical training box if we want a real H100-backed run, but it is not required for the minimal architecture itself.
 
 
 
 
 
178
 
179
- ## deferred work
180
 
181
- These are explicitly outside the MVP:
182
 
183
- - frontier model integrations
184
- - OAuth-based coding agent sessions
185
- - multi-agent swarm variants
186
- - Elo dashboards
187
- - tournament leagues across many checkpoints
188
- - full ACP-like unrestricted workspace tooling
 
1
+ # Architecture
2
 
3
+ ## Core Shape
4
 
5
+ 0x960 has four moving parts:
6
 
7
+ 1. `src/zero960/engine/`
8
+ A minimal Chess960 engine with fixed search and one narrow editable surface: `eval.py`.
9
+ 2. `src/zero960/runtime/`
10
+ The episode runtime that owns workspace resets, bounded actions, reward shaping, and match scoring.
11
+ 3. `src/zero960_env/`
12
+ The OpenEnv wrapper and WebSocket client/server layer.
13
+ 4. `train/`
14
+ Distillation and RL entrypoints that operate on the same bounded action schema.
15
 
16
+ ## Action Space
17
 
18
+ The policy only gets structured actions:
19
 
20
+ - `read_file`
21
+ - `write_file`
22
+ - `run_static_eval`
23
+ - `run_match`
24
+ - `finish`
25
 
26
+ The full repo is not editable. The policy can only modify `eval.py` inside a fresh workspace.
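
As a concrete illustration, a single policy turn could carry payloads shaped like the following (the field names are assumptions for this sketch, not the actual models defined in `src/zero960_env/`):

```python
# Illustrative bounded-action payloads; exact field names are assumptions.
write_action = {
    "tool": "write_file",
    "path": "eval.py",                      # the only editable surface
    "content": "def evaluate(board):\n    return material_balance(board)\n",
}
match_action = {"tool": "run_match"}        # no arguments: baseline and positions are fixed
finish_action = {"tool": "finish"}
```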
27
 
28
+ ## Observation Shape
29
 
30
+ Each observation includes:
 
31
 
32
  - the task instruction
33
+ - the current `eval.py` contents
34
  - recent action history
35
+ - remaining steps
36
+ - last match score
37
+ - workflow hints and suggested next actions
 
38
 
39
+ The current file contents are already visible in the observation, so the intended high-reward loop is:
 
 
40
 
41
+ `write_file -> run_match -> finish`
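
Put together, one observation might look roughly like this (key names are invented for the sketch; the real schema lives in the env models):

```python
# Illustrative observation payload; key names are assumptions, not the real schema.
observation = {
    "instruction": "Improve eval.py so the engine beats the fixed baseline.",
    "eval_py": "def evaluate(board): ...",    # current editable file contents
    "history": ["write_file", "run_match"],   # recent actions
    "steps_remaining": 3,
    "last_match_score": 0.625,
    "hint": "score improved after the edit; finish is now reasonable",
}
```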
42
 
43
+ ## Reward Design
44
 
45
+ Reward is match-score-based with explicit shaping around the edit loop:
46
 
47
+ - positive signal for valid changed writes
48
+ - positive signal for explicit `run_match` after a write
49
+ - penalties for repeated `run_static_eval`, redundant `read_file`, and finishing without a meaningful edit/test cycle
50
+ - invalid writes are rolled back immediately
51
 
52
+ This keeps the environment learnable while still grounding the main score in downstream engine strength.
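
A minimal sketch of that shaping, with invented coefficients (the real weights live in the runtime and will differ):

```python
def shaped_reward(match_score: float, valid_write: bool,
                  ran_match_after_write: bool, redundant_calls: int,
                  invalid_write: bool) -> float:
    """Illustrative shaping; all coefficients are made up for the sketch."""
    reward = match_score                    # primary: held-out match score vs baseline
    if valid_write:
        reward += 0.05                      # bonus for a valid changed write
    if ran_match_after_write:
        reward += 0.05                      # bonus for explicitly testing the edit
    reward -= 0.02 * redundant_calls        # penalize read/static-eval spam
    if invalid_write:
        reward -= 0.5                       # the bad write is also rolled back
    return reward
```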
53
 
54
+ ## Training Strategy
55
 
56
+ Current order of operations:
57
 
58
+ 1. teacher distillation
59
+ Use a strong coding model such as Codex/GPT-5.4 to generate successful bounded-action trajectories.
60
+ 2. student fine-tuning
61
+ Fine-tune a smaller open model on those trajectories.
62
+ 3. RL refinement
63
+ Use GRPO or a similar method only after the student already knows the workflow.
64
 
65
+ This is the main shift from the earlier RL-first plan. The hard part has been action discovery, not just optimization.
66
 
67
+ ## Deployment
68
 
69
+ - HF Spaces: public OpenEnv artifact
70
+ - Northflank H100: practical heavy training and debugging box
71
+ - local dev: fastest loop for environment and prompt iteration
 
 
 
docs/codex-swarm-plan.md ADDED
@@ -0,0 +1,168 @@
1
+ # Codex Swarm Plan
2
+
3
+ This is the current highest-leverage path for building a stronger Chess960 eval engine quickly.
4
+
5
+ ## Goal
6
+
7
+ Use multiple Codex workers as an outer-loop engine lab:
8
+
9
+ - propose changes to `eval.py` and small search heuristics
10
+ - benchmark every candidate against the current champion
11
+ - keep only Elo-positive patches
12
+ - periodically distill the best traces back into a smaller open student
13
+
14
+ The point is not to replace the OpenEnv environment. The point is to use strong coding agents to search the engine-design space faster than raw RL can.
15
+
16
+ ## Why This Path
17
+
18
+ The project has already shown two things:
19
+
20
+ - base small models do not reliably discover the edit loop on their own
21
+ - a distilled student can learn `write_file -> run_match -> finish`
22
+
23
+ That solves workflow compliance, but the submission's central claim has to be engine strength. The next bottleneck is finding better eval/search ideas, not teaching the loop again from scratch.
24
+
25
+ ## Worker Architecture
26
+
27
+ Run one coordinator plus several parallel Codex workers locally by default. Use the H100 only for heavy benchmark or training jobs.
28
+
29
+ - Coordinator:
30
+ - assigns experiment ideas
31
+ - tracks the current champion engine
32
+ - merges only benchmark-positive patches
33
+ - Worker:
34
+ - runs in its own git worktree when possible, with a lightweight local clone fallback when the environment cannot write `.git/worktrees`
35
+ - researches the current champion, accepted history, and benchmark code before editing
36
+ - edits a narrow surface area
37
+ - returns one bounded patch plus a short rationale
38
+ - lets the coordinator run the held-out benchmark and promotion gate
39
+
40
+ The default fast wave should use 3 workers. The coordinator should re-rank hook lanes each round so empty hooks are targeted first, then simple passthrough hooks, and already-customized hooks last. Workers remain specialist researcher-implementers with read-only access to:
41
+
42
+ - `AGENTS.md`, `README.md`, and this plan
43
+ - `train/benchmark_eval.py`, `train/benchmark_league.py`, and `train/benchmark_uci.py`
44
+ - the current champion at `outputs/codex_swarm/champion_eval.py`
45
+ - the promotion ledger at `outputs/codex_swarm/ledger.jsonl`
46
+ - all accepted snapshots under `outputs/codex_swarm/accepted/`
47
+
48
+ The available specialist roles are:
49
+
50
+ - worker 1: Structure Researcher
51
+ king safety and castling structure in Chess960 starts
52
+ - worker 2: Tactical Safety Researcher
53
+ loose-piece pressure, attacked-undefended pieces, and practical safety terms
54
+ - worker 3: Activity Researcher
55
+ piece activity, development, space, and centralization at shallow search depth
56
+ - worker 4: Pawn-Endgame Researcher
57
+ pawn structure, passed pawns, rook files, and simple endgame conversion terms
58
+ - worker 5: Initiative Tuner
59
+ tempo, mobility pressure, queen safety, and initiative terms that convert shallow-search advantages faster
60
+
61
+ After each promotion, the coordinator should automatically deprioritize the lane that just gained custom logic and spend the next short wave on the emptier hooks.
62
+
63
+ There are now two practical swarm surfaces:
64
+
65
+ - `eval` surface:
66
+ workers edit only `src/zero960/workspace_template/eval.py` and benchmark with `train/benchmark_eval.py`
67
+ - `search` surface:
68
+ workers edit only `src/zero960/engine/search.py` and benchmark with `train/benchmark_engine.py` so each side gets its own eval plus its own searcher
69
+
70
+ ## Evaluation Loop
71
+
72
+ Every candidate should go through the same loop:
73
+
74
+ 1. read current engine code and latest benchmark results
75
+ 2. make one bounded patch
76
+ 3. stop quickly and hand the patch back to the coordinator
77
+ 4. let the coordinator run a cheap screen benchmark first
78
+ 5. run a heavier final benchmark only on the best screen winner
79
+ 6. keep only patches that improve held-out score or estimated Elo delta
80
+
81
+ Preferred rule:
82
+
83
+ - no patch is promoted unless it beats the current champion on a fixed held-out benchmark set
84
+ - each worker should make one bounded patch and stop; the coordinator owns held-out benchmarking
85
+ - benchmark in stages: cheap screen on all eligible candidates, heavier final check only for the best screen winner
86
+ - workers that run too long should be timed out rather than left to wander
87
+ - workers should inspect accepted history first so lanes diverge instead of repeating the same patch four times
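
The staged rule above can be sketched as a tiny coordinator routine (names and signatures are invented here; the real gate lives in the `train.codex_swarm` coordinator):

```python
# Illustrative two-stage gate: cheap screen for every candidate, heavy held-out
# benchmark only for the screen winner. Benchmark callables are stand-ins.
def promote_best(candidates, screen_bench, final_bench, champion_score=0.5):
    if not candidates:
        return None
    scored = [(screen_bench(c), c) for c in candidates]
    _, best = max(scored, key=lambda pair: pair[0])
    # Promotion keys off the held-out final benchmark, never the cheap screen.
    if final_bench(best) > champion_score:
        return best
    return None
```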
88
+
89
+ ## Safety Constraints
90
+
91
+ Keep the search legible and hard to game.
92
+
93
+ - edit only engine files, benchmark code, or clearly scoped support code
94
+ - no dependency churn unless explicitly needed
95
+ - no broad repo rewrites
96
+ - benchmark on held-out Chess960 starts, not only the training positions
97
+ - record candidate, benchmark settings, and result for every accepted patch
98
+
99
+ ## Local Setup
100
+
101
+ The default shape is:
102
+
103
+ - install Codex CLI locally
104
+ - log in locally
105
+ - create multiple git worktrees next to the main repo
106
+ - run several Codex workers in parallel from those worktrees
107
+ - keep the main repo as the coordinator / champion branch
108
+
109
+ Useful local pattern:
110
+
111
+ - main repo: champion branch and benchmark history
112
+ - `/tmp/0x960-codex-swarm/worker-1`
113
+ - `/tmp/0x960-codex-swarm/worker-2`
114
+ - `/tmp/0x960-codex-swarm/worker-3`
115
+
116
+ If device auth is used, `codex login --device-auth` will print a one-time URL and code. If API-key auth is easier, `codex login --with-api-key` is also fine.
117
+
118
+ ## Optional H100 Use
119
+
120
+ The H100 is still useful, but not as the primary Codex host.
121
+
122
+ - run large held-out benchmarks there
123
+ - run student SFT there
124
+ - run RL refinement there if needed later
125
+
126
+ This keeps Codex orchestration simple while still using the GPU box where it actually matters.
127
+
128
+ ## Relationship To OpenEnv
129
+
130
+ OpenEnv is still the core environment and submission artifact.
131
+
132
+ This swarm loop is an outer optimization layer:
133
+
134
+ - OpenEnv remains the bounded agent environment
135
+ - teacher traces can still be collected from Codex in the bounded action schema
136
+ - the best engine patches found by the swarm can become:
137
+ - stronger workspace templates
138
+ - better baselines
139
+ - better teacher data
140
+ - better student targets
141
+
142
+ ## Immediate Next Steps
143
+
144
+ 1. Finish local Codex CLI auth.
145
+ 2. Run `uv run python -m train.codex_swarm setup --workers 3`.
146
+ 3. Start with `uv run python -m train.codex_swarm run --workers 3 --rounds 1 --model gpt-5.3-codex --screen-positions 8 --positions 16 --worker-timeout-sec 180 --max-diff-lines 80`.
147
+ 4. Start on the `eval` surface only until the hook lanes are no longer giving clean wins, then open the `search` surface.
148
+ 5. Promote only patches that improve held-out benchmark score.
149
+ 6. Use the H100 only for heavier benchmark or training passes.
150
+ 7. Distill the accepted traces back into the student model after enough wins accumulate.
151
+
152
+ For a longer autonomous loop, use:
153
+
154
+ ```sh
155
+ uv run python -m train.codex_swarm run \
156
+ --workers 5 \
157
+ --continuous \
158
+ --max-stall-rounds 3 \
159
+ --model gpt-5.3-codex \
160
+ --screen-positions 8 \
161
+ --positions 16 \
162
+ --max-diff-lines 80 \
163
+ --worker-timeout-sec 180
164
+ ```
165
+
166
+ The coordinator should stay opinionated about patch size. Recent Codex waves tended to rewrite nearly the whole file even when the actual improvement was a one-line or one-function tweak, so the current default rejects candidates that exceed the `--max-diff-lines` budget.
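As a rough illustration of how such a budget can be computed, the added/deleted line counts can come from stdlib `difflib`; the function name and default below are placeholders, not the coordinator's actual code:

```python
import difflib

def diff_line_budget_ok(champion_src: str, candidate_src: str,
                        max_diff_lines: int = 80) -> bool:
    """Hypothetical sketch of a --max-diff-lines gate: count added and
    deleted lines between the frozen champion snapshot and a candidate,
    and reject any candidate whose total churn exceeds the budget."""
    added = deleted = 0
    for line in difflib.unified_diff(
        champion_src.splitlines(), candidate_src.splitlines(), lineterm=""
    ):
        if line.startswith("+") and not line.startswith("+++"):
            added += 1
        elif line.startswith("-") and not line.startswith("---"):
            deleted += 1
    return added + deleted <= max_diff_lines
```

A one-line tweak passes this gate; a whole-file rewrite fails it regardless of its benchmark score.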
167
+
168
+ This keeps running until interrupted or until several consecutive rounds fail to promote a new champion.
docs/concept.md DELETED
@@ -1,72 +0,0 @@
1
- # 0x960
2
-
3
- ## what this is
4
-
5
- 0x960 is an OpenEnv environment where a model improves a minimal Chess960 engine by making bounded code edits to its evaluation logic.
6
-
7
- The agent does not play chess directly. It reads engine files, edits the eval function, runs checks, and is rewarded by match performance against a fixed baseline.
8
-
9
- ## hackathon fit
10
-
11
- This project is designed to satisfy the OpenEnv hackathon constraints:
12
-
13
- - use OpenEnv `0.2.1`
14
- - deploy the environment on HF Spaces
15
- - provide a minimal training script in Colab using HF TRL or Unsloth
16
-
17
- ## core claim
18
-
19
- The interesting task is not "can an LLM output a good chess move?"
20
-
21
- The interesting task is:
22
-
23
- - can a model operate inside a real coding environment
24
- - make multi-step edits to a live system
25
- - and improve that system under an objective downstream metric
26
-
27
- Chess960 is useful because opening memorization is much less valuable than in standard chess. That makes engine improvement a better fit for an agentic environment than pure next-move prediction.
28
-
29
- ## novelty claim
30
-
31
- We should not claim that tool use is new or that Chess960 benchmarking is new.
32
-
33
- The stronger and more defensible claim is:
34
-
35
- - Chess960 engine evaluation is an existing benchmark domain
36
- - coding agents with tool use are an existing capability pattern
37
- - 0x960 combines them into a self-improvement RL environment where the model modifies engine code and is rewarded by actual engine strength
38
-
39
- ## why this is a good OpenEnv task
40
-
41
- - it is multi-step, not single-step classification
42
- - reward comes from a real external process: engine matches
43
- - the agent interacts with files and commands, not just text
44
- - failure modes are meaningful: bad edits, crashes, invalid code, weak evals
45
-
46
- This aligns best with:
47
-
48
- - Statement 3.1: Professional Tasks
49
- - Statement 4: Self-Improvement
50
-
51
- ## MVP
52
-
53
- The MVP should be intentionally narrow.
54
-
55
- - one minimal Chess960 engine scaffold
56
- - one fixed search implementation
57
- - one narrow editable surface: `eval.py` or `weights.json`
58
- - one fixed baseline opponent
59
- - one held-out evaluation suite
60
- - one training path using OpenEnv + TRL/Unsloth
61
-
62
- ## non-goals for MVP
63
-
64
- - no frontier model dependency in the training loop
65
- - no OAuth or hosted coding-agent integration
66
- - no multi-agent swarm
67
- - no broad repo-wide code editing
68
- - no polished Elo dashboard unless the core loop already works
69
-
70
- ## practical pitch
71
-
72
- "We built an OpenEnv environment where a model learns to be a Chess960 engine engineer, not a chess player. The model uses bounded coding actions to improve an engine's eval function, and reward comes from whether the edited engine actually performs better."
 
docs/demo-script.md CHANGED
@@ -1,31 +1,31 @@
1
- # one-minute demo script
2
 
3
- ## 30-second version
4
 
5
- We built 0x960, an OpenEnv environment where a model learns to be a Chess960 engine engineer, not a chess player. Chess960 is useful because it removes much of the opening memorization that standard chess systems can rely on, making it a stronger test of generalization. In our environment, the model gets a bounded coding workspace, edits the engine's eval function, runs checks, and is rewarded by whether the edited engine actually performs better against a fixed baseline. The training signal comes from real downstream engine strength, not just text imitation or next-move prediction.
6
 
7
- ## full one-minute outline
8
 
9
- ### 1. opening
10
 
11
- 0x960 is an OpenEnv environment for training models to improve a Chess960 engine through bounded code edits.
12
 
13
- ### 2. why this task
14
 
15
- Chess960 keeps the rules of chess the same but randomizes the starting position, so it is a cleaner test of robustness than standard chess alone.
16
 
17
- ### 3. what the model does
18
 
19
- The model does not play chess directly. It reads engine files, edits `eval.py`, runs checks, and decides when to finish.
20
 
21
- ### 4. reward
22
 
23
- After the edit budget is used, the engine plays fast matches against a fixed baseline. Reward is based on match score, with penalties for invalid edits or crashes.
24
 
25
- ### 5. why OpenEnv
26
 
27
- This is a real multi-step tool-use task with files, commands, failures, and downstream evaluation. That makes it a strong fit for Statement 3.1 and Statement 4.
28
 
29
- ### 6. close
30
 
31
- The result is a self-improvement environment where the model learns to engineer a stronger Chess960 system, not just imitate chess moves.
 
1
+ # Demo Script
2
 
3
+ ## 30-Second Version
4
 
5
+ 0x960 is an OpenEnv environment where a model learns to act like a Chess960 engine engineer, not a chess player. The model gets a bounded coding workspace, edits `eval.py`, tests the change with fast matches, and is rewarded by whether the engine actually improves. We found that raw RL alone struggled because base models did not discover the edit loop, so the current path is teacher distillation first and RL refinement second.
6
 
7
+ ## One-Minute Outline
8
 
9
+ ### 1. Opening
10
 
11
+ 0x960 is a bounded self-improvement environment for a minimal Chess960 engine.
12
 
13
+ ### 2. Why Chess960
14
 
15
+ Chess960 keeps the rules of chess fixed while changing the starting position, so it is a cleaner robustness test than standard chess alone.
16
 
17
+ ### 3. What the Agent Does
18
 
19
+ The policy sees the current `eval.py`, writes a bounded replacement, runs a match, and decides when to finish.
20
 
21
+ ### 4. Why Teacher Distillation
22
 
23
+ Base models were not discovering `write_file` reliably, so we added a teacher path: collect successful bounded-action trajectories from a stronger coding agent, fine-tune a smaller open model on those traces, then use RL to refine it.
24
 
25
+ ### 5. Why OpenEnv
26
 
27
+ This is a real multi-step tool-use task with code edits, failures, and downstream evaluation. The reward comes from engine strength, not proxy text metrics.
28
 
29
+ ### 6. Close
30
 
31
+ The result is a self-improvement environment where the model learns a real engineering workflow instead of just outputting moves or text.
docs/open-questions.md DELETED
@@ -1,85 +0,0 @@
1
- # open questions
2
-
3
- ## blockers to resolve first
4
-
5
- 1. **engine skeleton**
6
-
7
- What is the smallest Chess960 engine we can ship quickly while still making eval edits meaningful?
8
-
9
- Default assumption:
10
-
11
- - python move generation
12
- - fixed search
13
- - pluggable eval module
14
-
15
- 2. **OpenEnv integration**
16
-
17
- What is the thinnest wrapper needed to expose the environment through OpenEnv `0.2.1` and still support a multi-step episode?
18
-
19
- We should prefer the simplest compliant implementation over clever abstractions.
20
-
21
- 3. **training loop shape**
22
-
23
- What is the smallest public Colab example that proves the reward loop works with TRL or Unsloth?
24
-
25
- The goal is not large-scale training in Colab. The goal is to show a valid training script and some observable reward signal.
26
-
27
- 4. **baseline and held-out suite**
28
-
29
- We need one fixed training baseline and one fixed held-out evaluation suite.
30
-
31
- If the baseline is too weak, reward saturates. If it is too strong, the policy gets no signal.
32
-
33
- 5. **episode speed**
34
-
35
- How many games can we afford per episode while keeping iteration tight enough to show learning during the hackathon?
36
-
37
- ## defaults unless they fail
38
-
39
- These are no longer open-ended research questions. They are the default implementation choices until proven insufficient.
40
-
41
- 1. **model**
42
-
43
- Start with a small open model in the `7B-14B` range, with `Qwen3.5-9B` as the default first candidate.
44
-
45
- 2. **action space**
46
-
47
- Use structured actions, not unrestricted shell access.
48
-
49
- 3. **editable surface**
50
-
51
- Restrict writes to `eval.py` and optionally `weights.json`.
52
-
53
- 4. **reward**
54
-
55
- Use fixed-baseline match score with a crash penalty.
56
-
57
- 5. **comparison models**
58
-
59
- Do not use frontier closed models in the core training loop.
60
-
61
- ## possible upgrades if time remains
62
-
63
- 1. **parent-checkpoint reward**
64
-
65
- Add score against the previous checkpoint or a small checkpoint pool as an auxiliary curriculum signal, not as the only reward.
66
-
67
- 2. **frontier comparison**
68
-
69
- Run a closed frontier coding agent in the same environment for demo purposes only.
70
-
71
- 3. **visualization**
72
-
73
- Plot reward curves, checkpoint strength, and action traces.
74
-
75
- 4. **league evaluation**
76
-
77
- Run small tournaments among checkpoints to show progression over time.
78
-
79
- ## explicitly deferred
80
-
81
- - multi-agent or swarm architectures
82
- - OAuth integration
83
- - unrestricted ACP-style terminal access
84
- - large-model training beyond what a single H100 can support comfortably
85
- - polished benchmark packaging beyond the hackathon submission
 
docs/process.md CHANGED
@@ -15,6 +15,32 @@ Logging rules:
15
  - include decisions, blockers, and concrete next steps
16
  - summarize command/test results instead of pasting long raw output
17
 
18
  ## 2026-03-07 17:10 PST
19
 
20
  - Read the initial project docs and collapsed the scope from a broad research wishlist into a narrow hackathon MVP.
@@ -103,3 +129,358 @@ Logging rules:
103
  - Also hit vLLM 0.17 vs transformers 5.3 incompatibility (vLLM wants <5.0). Dropped vLLM for now, using native HF generation.
104
  - Training running on Northflank H100 with QLoRA + gradient checkpointing.
105
  - Next: confirm training completes, check reward progression, update docs.
 
15
  - include decisions, blockers, and concrete next steps
16
  - summarize command/test results instead of pasting long raw output
17
 
18
+ ## 2026-03-08 01:05 PST
19
+
20
+ - Upgraded the local Codex swarm prompt in [train/codex_swarm.py](../train/codex_swarm.py) from generic lanes to explicit specialist researcher-implementer roles.
21
+ - Workers now receive a local research pack before patching: `AGENTS.md`, `README.md`, the swarm plan, benchmark scripts, the current champion snapshot, the swarm ledger, and accepted historical winners copied into each worker sandbox.
22
+ - Kept the editable surface narrow at `src/zero960/workspace_template/eval.py` so promotion still measures one variable cleanly, while making accepted history visible so workers can differentiate instead of repeating the same rewrite.
23
+ - Updated [README.md](../README.md) and [docs/codex-swarm-plan.md](./codex-swarm-plan.md) to match the new role-based swarm shape.
24
+
25
+ ## 2026-03-08 01:20 PST
26
+
27
+ - Expanded the default local swarm from 4 to 5 workers by adding an `Initiative Tuner` role in [train/codex_swarm.py](../train/codex_swarm.py).
28
+ - Added continuous swarm mode with `--continuous`, `--max-stall-rounds`, and `--sleep-sec` so the coordinator can keep running promotion rounds until interrupted or until it stalls.
29
+ - Kept promotion eval-focused because the current benchmark path only measures `eval.py` cleanly; search edits still need a separate promotion harness before they should be opened up.
30
+ - Updated [README.md](../README.md) and [docs/codex-swarm-plan.md](./codex-swarm-plan.md) with the 5-worker defaults and the long-running loop command.
31
+
32
+ ## 2026-03-08 01:35 PST
33
+
34
+ - Found and fixed a coordinator bug in [train/codex_swarm.py](../train/codex_swarm.py): worker sandboxes were copying the repo `workspace_template/eval.py` instead of overwriting it with the frozen swarm champion before Codex started.
35
+ - Stopped the invalid live five-worker loop, patched `_sync_worker_snapshot()` to copy `outputs/codex_swarm/champion_eval.py` into each worker's editable `eval.py`, and prepared to restart the loop from a valid champion snapshot.
36
+ - Confirmed this also explains why all five workers initially converged on the same hash despite the new specialist-role prompts.
37
+
38
+ ## 2026-03-08 01:50 PST
39
+
40
+ - Added a separate search-safe benchmark harness in [train/benchmark_engine.py](../train/benchmark_engine.py).
41
+ - This harness benchmarks two full engine roots against each other, loading each side's own `select_move()` from its own `src/zero960/engine/search.py` plus its own eval file, instead of sharing the live repo search module.
42
+ - Kept the main swarm promotion gate unchanged for now; this new harness is the prerequisite for later opening `search.py` edits without corrupting head-to-head comparisons.
43
+
44
  ## 2026-03-07 17:10 PST
45
 
46
  - Read the initial project docs and collapsed the scope from a broad research wishlist into a narrow hackathon MVP.
 
129
  - Also hit vLLM 0.17 vs transformers 5.3 incompatibility (vLLM wants <5.0). Dropped vLLM for now, using native HF generation.
130
  - Training running on Northflank H100 with QLoRA + gradient checkpointing.
131
  - Next: confirm training completes, check reward progression, update docs.
132
+
133
+ ## 2026-03-08 09:30 PST
134
+
135
+ - Inspected rollout logs and confirmed the failure mode was policy-level rather than purely a matter of model size: the agent kept choosing `run_static_eval` or early `finish` and rarely attempted a code edit.
136
+ - Tightened the runtime reward shaping around the intended workflow in `src/zero960/runtime/episode.py`: valid changed writes now get an immediate bonus, explicit `run_match` after a write is rewarded, repeated `run_static_eval` and wasted `read_file` calls are penalized, and finishing without an edit or explicit match is penalized.
137
+ - Changed write handling so `write_file` validates `eval.py` immediately by loading `evaluate(board)` and rolls back invalid edits instead of leaving the episode in a broken workspace.
138
+ - Extended observations with workflow hints and suggested next actions so the policy sees explicit guidance like "write first" and "run_match next" after each step.
139
+ - Reworked `train/minimal_trl_openenv.py` prompt instructions to state that `eval.py` is already visible, show the preferred `write_file -> run_match -> finish` sequence, and reduced completion length from 512 to 256 to bias toward compact JSON outputs.
140
+ - Replaced the brittle regex JSON parser with a brace-balanced extractor so `write_file` actions containing nested braces in Python code are more likely to parse correctly.
141
+ - Next: run fresh `infer` and short GRPO checks to see whether the action distribution shifts from `run_static_eval`/`finish` toward `write_file`.
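A brace-balanced extractor of the kind described above can be sketched as follows (a simplified stand-in, not the trainer's exact parser):

```python
import json

def extract_first_json_object(text: str):
    """Hypothetical sketch of a brace-balanced JSON extractor: find the
    first balanced {...} span, tracking string literals and escapes, so
    nested braces inside Python code payloads don't break parsing."""
    start = text.find("{")
    while start != -1:
        depth = 0
        in_string = False
        escape = False
        for i in range(start, len(text)):
            ch = text[i]
            if in_string:
                if escape:
                    escape = False
                elif ch == "\\":
                    escape = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
            elif ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start : i + 1])
                    except json.JSONDecodeError:
                        break  # unbalanced-looking span; try the next "{"
        start = text.find("{", start + 1)
    return None
```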
142
+
143
+ ## 2026-03-08 11:10 PST
144
+
145
+ - Added `train/codex_distill.py`, a teacher-data collection path that runs Codex through the same bounded Zero960 action schema and writes both raw rollout traces and SFT-ready chat samples.
146
+ - Kept the teacher constrained to one JSON action per turn with a strict output schema, so the collected data matches the student policy interface instead of leaking shell/editor tool use.
147
+ - Simplified the top-level docs around the current strategy: distillation first, RL refinement second.
148
+ - Deleted redundant planning/research docs that mostly restated old RL-first assumptions and rewrote the README to point at the active entrypoints and docs that still matter.
149
+ - Added generated training artifacts to `.gitignore` so local and remote runs stop cluttering the worktree.
150
+ - Next: run a first short Codex teacher collection against the live env and inspect how many traces survive the reward filter.
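For reference, one SFT-ready chat sample in this shape might look like the following (field names are illustrative, not the exact export format):

```python
import json

# Hypothetical sketch of one SFT-ready chat row in the bounded schema:
# the assistant turn is exactly one JSON action, matching the student
# policy interface rather than free-form shell/editor tool use.
sample = {
    "messages": [
        {"role": "system", "content": "Improve eval.py using bounded JSON actions."},
        {"role": "user", "content": "<observation: current eval.py and last match result>"},
        {
            "role": "assistant",
            "content": json.dumps({
                "action": "write_file",
                "path": "eval.py",
                "content": "def evaluate(board):\n    return 0\n",
            }),
        },
    ]
}
```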
151
+
152
+ ## 2026-03-08 21:20 PST
153
+
154
+ - Added `train/sft_student.py`, a minimal student fine-tuning entrypoint that reads the exported `sft_samples_*.jsonl` files, validates the assistant action payloads, drops malformed legacy rows, deduplicates identical chats, and trains with TRL `SFTTrainer`.
155
+ - Kept the dataset conversational and enabled `assistant_only_loss` so the student is optimized on the teacher’s bounded JSON action turn, not on reproducing the prompt text.
156
+ - Added a dry-run mode and basic dataset stats so the repo can inspect the current teacher corpus before launching a real training job.
157
+ - Updated the README with the new student-SFT command, keeping the distill-first flow explicit.
158
+ - Next: run a dry-load smoke test locally, then train a first 0.8B student checkpoint and compare `infer` behavior against the base model.
159
+
160
+ ## 2026-03-08 22:05 PST
161
+
162
+ - Re-read the project docs and aligned the next work with the intended claim: reward should reflect downstream Chess960 engine strength, not only loop compliance.
163
+ - Replaced the toy default eval with a stronger Chess960-safe heuristic in both `src/zero960/engine/default_eval.py` and `src/zero960/workspace_template/eval.py`, adding pawn-structure, mobility, center control, rook-file, king-safety, castling-rights, bishop-pair, and development terms.
164
+ - Added simple move ordering to `src/zero960/engine/search.py` so shallow alpha-beta spends more time on captures, checks, promotions, and castling moves.
165
+ - Updated the deterministic training write path in `train/minimal_trl_openenv.py` to make small valid edits against the new eval constants instead of the old toy eval body.
166
+ - Added `train/benchmark_eval.py` plus a README command so candidate eval files can be compared against a baseline on held-out Chess960 start positions with an estimated Elo delta.
167
+ - Next: run the benchmark on the H100 against saved candidate evals and use that metric, not shaped reward alone, to judge whether future training actually improves play.
168
+
169
+ ## 2026-03-08 22:40 PST
170
+
171
+ - Ran the first remote student SFT job on the Northflank H100 against the merged teacher corpus (`105` clean rows / `35` successful episodes); the job finished cleanly in about `5m 11s`.
172
+ - Final SFT metrics on the remote run were strong for this narrow dataset: train loss `0.2072`, eval loss `0.04192`, eval token accuracy `0.9876`.
173
+ - Compared `infer` behavior on the H100: base `Qwen/Qwen3.5-0.8B` still spammed `run_static_eval` for all six steps and ended at reward `-2.1`, while the SFT checkpoint executed the intended `write_file -> run_match -> finish` loop in three steps for reward `1.0`.
174
+ - Updated `train/minimal_trl_openenv.py` so infer mode can accept a separate tokenizer path when evaluating checkpoints that do not bundle tokenizer files at the model root.
175
+ - Next: run a small batched eval of base vs SFT student across multiple episodes and then decide whether to add GRPO refinement or collect more teacher data first.
176
+
177
+ ## 2026-03-08 23:05 PST
178
+
179
+ - Wrote down the new higher-level direction in `docs/codex-swarm-plan.md`: use multiple Codex workers on the H100 as a champion/challenger engine-iteration loop, benchmark every candidate, and only keep Elo-positive patches.
180
+ - Kept the repo story explicit: OpenEnv remains the core environment and submission artifact, while the Codex swarm acts as an outer optimization layer for discovering stronger engine code and better teacher traces.
181
+ - Updated the README to link this plan so the project direction is visible without digging through chat history.
182
+ - Next: finish Codex CLI auth on the H100, create isolated worker worktrees, and start with a small `eval.py`-only worker swarm before broadening the editable surface.
183
+
184
+ ## 2026-03-08 23:20 PST
185
+
186
+ - Simplified the Codex swarm plan: local Codex workers are now the default orchestration path, while the H100 is treated as an optional heavy-compute box for larger benchmarks and training.
187
+ - Updated `docs/codex-swarm-plan.md` to reflect the practical setup that avoids remote Node/npm bootstrap friction and keeps the worker loop easier to debug.
188
+ - Updated the README wording so the Codex outer loop is clearly described as a local worker swarm rather than an H100-hosted agent farm.
189
+ - Next: finish local Codex auth, create 3 local worker worktrees, and start with an `eval.py`-only champion/challenger loop.
190
+
191
+ ## 2026-03-08 23:40 PST
192
+
193
+ - Added `train/codex_swarm.py`, a runnable local coordinator for the new champion/challenger loop instead of leaving the swarm idea only in docs.
194
+ - The coordinator initializes a champion eval snapshot under `outputs/codex_swarm/champion_eval.py`, spins up worker sandboxes under `/tmp/0x960-codex-swarm/`, and runs one Codex worker per sandbox against the same frozen champion each round.
195
+ - Worker setup now tries `git worktree add` first and falls back to a lightweight local `git clone --shared` when `.git/worktrees` cannot be written, which makes the swarm usable in stricter local sandboxes too.
196
+ - Round execution writes prompts, Codex stdout/stderr, final summaries, and per-worker JSON results under `outputs/codex_swarm/runs/`, then promotes only the best challenger whose held-out score beats the configured threshold.
197
+ - Refactored `train/benchmark_eval.py` into a reusable library surface with `benchmark_eval_files(...)` plus a structured `BenchmarkResult`, so the CLI benchmark and swarm coordinator share the same evaluation logic.
198
+ - Smoke-tested the new entrypoints locally with `py_compile`, `uv run python -m train.codex_swarm setup --workers 2`, `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial`, and `uv run python -m train.codex_swarm status`.
199
+ - Next: run a live local Codex round, inspect the first real challenger diffs and benchmark scores, then decide whether to broaden the editable surface beyond `eval.py`.
200
+
201
+ ## 2026-03-08 23:58 PST
202
+
203
+ - Added `train/benchmark_uci.py`, a separate UCI benchmark entrypoint for anchoring the local eval/search engine against external engines like Stockfish under fixed Chess960 start positions.
204
+ - The new harness loads a local `eval.py`, plays both colors against a UCI engine, and reports wins, draws, losses, score, and an Elo-style delta estimate so the demo can show both relative improvement and an external anchor.
205
+ - Documented the new Stockfish-style benchmark command in `README.md` alongside the existing baseline-vs-candidate benchmark flow.
206
+ - Smoke-tested the new entrypoint locally with `python3 -m py_compile train/benchmark_uci.py` and `uv run python -m train.benchmark_uci --help`.
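An Elo-style delta is typically derived from a match score via the logistic Elo model; a sketch, which may differ from the harness's exact formula:

```python
import math

def elo_delta_from_match(points: float, games: int) -> float:
    """Hypothetical sketch of an Elo-style delta estimate: map the match
    score fraction through the logistic Elo model, clamping away from 0%
    and 100% so small samples never produce infinite deltas."""
    frac = points / games
    frac = min(max(frac, 1.0 / (2 * games)), 1.0 - 1.0 / (2 * games))
    return -400.0 * math.log10(1.0 / frac - 1.0)
```

A 50% score maps to a delta of 0; scores above 50% give a positive delta and below 50% a negative one.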
207
+
208
+ ## 2026-03-09 00:09 PST
209
+
210
+ - Extended `train/benchmark_uci.py` with repeated `--engine-option NAME=VALUE` support so the repo can run calibrated UCI anchors such as `UCI_LimitStrength=true` and `UCI_Elo=1320` instead of only raw depth-based Stockfish tests.
211
+ - Installed `stockfish` locally and ran the first rough external ladder against the best current challenger from the Codex swarm.
212
+ - On a small `4`-position / `8`-game sample, the worker-1 patch scored `4.5/8` against `stockfish` with `UCI_Elo=1320`, then `2.0/8` against both `UCI_Elo=1600` and `UCI_Elo=1800`; this is noisy but enough to bracket the current engine above the weakest anchor and below the stronger ones.
213
+ - Updated the README example to show the calibrated Stockfish option flow rather than only raw `engine-depth`.
214
+
215
+ ## 2026-03-09 00:20 PST
216
+
217
+ - Tightened `train/codex_swarm.py` for faster live rounds instead of just adding more undirected agents: the default worker count is now `4`, each worker gets an explicit heuristic lane, and the coordinator enforces a per-worker timeout.
218
+ - The default lanes now spread the first wave across king safety, loose-piece/tactical pressure, piece activity, and pawn/rook structure so four Codex workers do not all rediscover the same generic positional patch.
219
+ - Updated the worker prompt so each agent is explicitly told to make one bounded patch, run one final benchmark, and stop. This should keep rounds short enough to iterate like a real champion/challenger loop.
220
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to use the 4-worker setup and document the new `--worker-timeout-sec` flow.
221
+ - Smoke-tested the coordinator changes with `python3 -m py_compile train/codex_swarm.py train/benchmark_uci.py train/benchmark_eval.py`.
222
+
223
+ ## 2026-03-09 00:34 PST
224
+
225
+ - Added `train/benchmark_league.py`, a new league-style self-play benchmark that evaluates one candidate against the original baseline plus the accepted swarm champion history instead of only one current baseline.
226
+ - The default league builder pulls from `outputs/codex_swarm/accepted/`, includes the original baseline, skips the candidate itself, and also skips any accepted snapshot whose contents are byte-identical to the candidate so the league does not accidentally include a mirror match.
227
+ - Smoke-tested the new script with `python3 -m py_compile train/benchmark_league.py`, `uv run python -m train.benchmark_league --help`, and a tiny real run at `--positions 4`.
228
+ - That sample run showed the current champion splitting the small league overall: strong against the original baseline, weaker against the older accepted worker-1 snapshot, and neutral overall on the combined pool.
229
+ - Updated the README with the new league benchmark command so the self-play path is visible next to the existing head-to-head and Stockfish anchor commands.
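The pool-building rule can be sketched as follows (hypothetical helper, assuming accepted snapshots are `.py` files):

```python
import hashlib
from pathlib import Path

def build_league_pool(accepted_dir: Path, baseline: Path, candidate: Path) -> list[Path]:
    """Hypothetical sketch of the league pool rule: include the original
    baseline and accepted snapshots, but skip the candidate itself and
    any snapshot byte-identical to it (no mirror matches)."""
    def digest(p: Path) -> str:
        return hashlib.sha256(p.read_bytes()).hexdigest()

    cand = digest(candidate)
    pool = [baseline] if digest(baseline) != cand else []
    for snap in sorted(accepted_dir.glob("*.py")):
        if snap.resolve() != candidate.resolve() and digest(snap) != cand:
            pool.append(snap)
    return pool
```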
230
+
231
+ ## 2026-03-09 00:45 PST
232
+
233
+ - Added `train/build_dashboard.py`, a static dashboard generator that reads the swarm ledger, current champion, accepted history, league benchmark, and optional Stockfish anchors, then writes a self-contained `outputs/dashboard/index.html` plus `outputs/dashboard/dashboard_data.json`.
234
+ - The generated page visualizes accepted-champion progression, internal Elo deltas, recent swarm results, league self-play rows, and Stockfish anchor bars without needing a frontend framework or a running web server.
235
+ - Fixed the default dashboard pool so it skips accepted snapshots that are byte-identical to the current champion instead of showing a misleading mirror match.
236
+ - Smoke-tested the generator with `python3 -m py_compile train/build_dashboard.py`, `uv run python -m train.build_dashboard`, and a full `uv run python -m train.build_dashboard --include-stockfish`.
237
+ - Updated the README with the new dashboard command and output paths so the visualization can be regenerated after each swarm round.
238
+
239
+ ## 2026-03-09 00:09 PST
240
+
241
+ - Tightened `train/codex_swarm.py` so worker prompts now explicitly require surgical edits via `apply_patch` and call out a hard diff-size budget instead of loosely asking for “small” changes.
242
+ - Added `--max-diff-lines` to the swarm CLI, defaulting to `80`, and recorded added/deleted diff counts in each worker result so whole-file rewrites are visible in the ledger.
243
+ - Promotion/acceptance now requires both `benchmark.score > min_score` and `diff_lines_added + diff_lines_deleted <= max_diff_lines`, which stops noisy 250-line rewrites from winning by a tiny margin.
244
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to use the new `--max-diff-lines 80` flag in the standard swarm commands.
245
+ - Smoke-tested the coordinator change with `python3 -m py_compile train/codex_swarm.py` and `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial --max-diff-lines 40`.
246
+
247
+ ## 2026-03-09 00:09 PST
248
+
249
+ - Checked the official Codex docs and aligned `train/codex_swarm.py` to the best practices that are actually supported by the installed CLI on this box.
250
+ - Added a per-worker root `AGENTS.override.md` so the “surgical patch only / no whole-file rewrite / one probe max / one final benchmark” constraints live in Codex’s native instruction channel instead of only in the prompt body.
251
+ - Kept workers on sandboxed automatic execution, but disabled web search with `-c 'web_search="disabled"'` so the swarm stays local and reproducible.
252
+ - Switched worker prompts from giant argv strings to stdin (`codex exec ... -`), which keeps process listings readable and avoids shoving long prompts into the command line.
253
+ - Enabled `--ephemeral` and `--json` for worker execs so automation runs stay stateless and stdout captures machine-readable Codex events for debugging.
254
+ - Verified that `npm install -g @openai/codex@latest` still resolves to `@openai/codex@0.111.0`; this box is already on the newest npm-published CLI, and that version does not support the newer `--ask-for-approval` flag from the docs.
255
+ - Smoke-tested the updated coordinator with `python3 -m py_compile train/codex_swarm.py` and `uv run python -m train.codex_swarm run --workers 1 --rounds 1 --dry-run --serial --max-diff-lines 40`.
256
+
257
+ ## 2026-03-09 00:22 PST
258
+
259
+ - Tightened the swarm loop again after seeing five workers converge on near-identical 250-300 line rewrites without finishing promptly.
260
+ - Changed `train/codex_swarm.py` so Codex workers no longer run `train.benchmark_eval` themselves. They now research, patch `eval.py`, optionally do one tiny local sanity check, and stop.
261
+ - Moved the expensive held-out benchmark fully into the coordinator path and made the coordinator skip benchmarking entirely when a worker exceeds the `--max-diff-lines` budget.
262
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to reflect the new control flow: Codex proposes, coordinator benchmarks, promotion stays centralized.
263
+ - Smoke-tested the refactor with `python3 -m py_compile train/codex_swarm.py` and `uv run python -m train.codex_swarm run --workers 1 --rounds 1 --dry-run --serial --max-diff-lines 40`.
264
+
265
+ ## 2026-03-09 00:31 PST
266
+
267
+ - Switched the swarm default model from `gpt-5.4` to `gpt-5.3-codex` after inspecting live worker diffs and seeing GPT-5.4 repeatedly collapse into near-identical 300-line eval rewrites.
268
+ - Updated the standard swarm commands in `README.md` and `docs/codex-swarm-plan.md` to use `gpt-5.3-codex` as the preferred local worker model.
269
+ - Next: run a single bounded `gpt-5.3-codex` wave, inspect the raw diffs directly, and only restore continuous mode if the patches become smaller and more diverse.
270
+
271
+ ## 2026-03-09 00:42 PST
272
+
273
+ - Refactored `outputs/codex_swarm/champion_eval.py` into explicit swarm hook lanes: `_structure_hook`, `_tactical_hook`, `_activity_hook`, `_pawn_endgame_hook`, and `_initiative_hook`.
274
+ - Preserved the prior champion behavior by wrapping the existing extra heuristics inside the new hook functions instead of changing the score formula itself.
275
+ - Updated `train/codex_swarm.py` so each worker role is now bound to one named hook rather than a vague lane description. Prompts and `AGENTS.override.md` now tell workers to edit only their assigned hook body.
276
+ - Verified the refactor with `python3 -m py_compile train/codex_swarm.py outputs/codex_swarm/champion_eval.py`, a dry-run coordinator pass, and a quick old-vs-new champion benchmark: `score=0.500` over `8` games.
277
+
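The hook-lane split described above can be sketched as a base score plus one bounded term per lane. The function names match the lanes named in the log, but the bodies here are empty placeholders and `evaluate_with_hooks` is a hypothetical wrapper, not the project's actual evaluator:

```python
def _structure_hook(info):
    return 0  # placeholder lane: pawn-structure terms go here


def _tactical_hook(info):
    return 0  # placeholder lane: tactical-pressure terms go here


def _activity_hook(info):
    return 0  # placeholder lane: piece-activity terms go here


def _pawn_endgame_hook(info):
    return 0  # placeholder lane: pawn-endgame terms go here


def _initiative_hook(info):
    return 0  # placeholder lane: initiative/tempo terms go here


HOOKS = (_structure_hook, _tactical_hook, _activity_hook,
         _pawn_endgame_hook, _initiative_hook)


def evaluate_with_hooks(base_score, info):
    """Final score = base evaluation plus one bounded term per swarm lane."""
    return base_score + sum(hook(info) for hook in HOOKS)
```

Binding each worker to exactly one hook body keeps diffs localized: a worker can only move the score through its own lane.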
278
+ ## 2026-03-09 01:06 PST
279
+
280
+ - Found a bug in the swarm diff gate: it was measuring candidate changes against repo `HEAD` instead of the frozen champion snapshot copied into each worker sandbox, which falsely made every worker look like a 300-line rewrite.
281
+ - Fixed `train/codex_swarm.py` to compute diff counts against the pre-run snapshot, not the git checkout below it.
282
+ - Fixed the worker-timeout path so `subprocess.TimeoutExpired` stdout/stderr bytes are decoded cleanly and timed-out workers still return structured results instead of crashing the coordinator.
283
+ - Ran a short hook-targeted `gpt-5.3-codex` probe and confirmed the new structure works: the worker produced a localized patch only inside `_structure_hook` rather than rewriting the evaluator.
284
+ - The first localized patch added king-shield, pawn-storm, and Chess960 castled-structure terms inside `_structure_hook`; benchmark measurement is slower than the interactive loop, but the swarm behavior is finally aligned with the intended patch surface.
285
+
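Measuring the patch budget against the frozen snapshot rather than the git checkout can be done with stdlib `difflib`. This is a minimal sketch of the idea, not the coordinator's actual code:

```python
import difflib


def changed_line_count(snapshot_text: str, candidate_text: str) -> int:
    """Count added/removed lines between the frozen snapshot and a candidate.

    Comparing against the snapshot (not repo HEAD) is the fix described
    above: the budget should measure what the worker actually changed.
    """
    diff = difflib.unified_diff(
        snapshot_text.splitlines(), candidate_text.splitlines(), lineterm=""
    )
    return sum(
        1
        for line in diff
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))  # skip the file headers
    )
```

A one-line edit then counts as 2 changed lines (one removed, one added), well under a `--max-diff-lines 40` budget.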
286
+ ## 2026-03-09 01:15 PST
287
+
288
+ - Added a staged benchmark funnel to `train/codex_swarm.py` so workers no longer all pay for the full held-out benchmark.
289
+ - New flow: every eligible patch gets a cheap screen benchmark (`--screen-positions`, default `8`), then only the best screen winner gets the heavier final benchmark (`--positions`, now the final-stage sample count).
290
+ - Added `--screen-positions` and `--screen-min-score` CLI flags; the default fast path is now `8` positions for screening and `16` for the final promotion check.
291
+ - Reduced the recommended worker timeout in the docs from `600s` to `180s` because workers now only patch and return, not benchmark locally.
292
+ - Smoke-tested the updated coordinator with `python3 -m py_compile train/codex_swarm.py`, `uv run python -m train.codex_swarm run --workers 1 --rounds 1 --dry-run --serial --screen-positions 4 --positions 8 --max-diff-lines 40`, and `uv run python -m train.codex_swarm run --help`.
293
+
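The screen-then-final funnel reduces to a small selection routine. `screen_fn` and `final_fn` are hypothetical stand-ins for the cheap screen benchmark and the heavier held-out benchmark:

```python
def staged_promotion(candidates, screen_fn, final_fn, screen_min=0.5):
    """Screen every candidate cheaply; only the best survivor pays for
    the expensive final benchmark."""
    screened = [(screen_fn(c), c) for c in candidates]
    eligible = [(s, c) for s, c in screened if s >= screen_min]
    if not eligible:
        return None                      # nobody earns the final benchmark
    best_screen, best = max(eligible, key=lambda sc: sc[0])
    return best, final_fn(best)          # exactly one heavy benchmark per round
```

With N workers this runs N cheap screens but at most one expensive match, which is the cost profile the log entry is after.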
294
+ ## 2026-03-09 01:28 PST
295
+
296
+ - The first hook-targeted screened round produced a real promotion: `worker-2` patched `_tactical_hook`, screened at `0.656` over `16` games, and held `0.578` over `32` games for an estimated `+54.7 Elo` versus the previous champion.
297
+ - Promoted that tactical hook patch into `outputs/codex_swarm/champion_eval.py` and saved the accepted snapshot as `outputs/codex_swarm/accepted/20260308T092035Z_worker-2_eval.py`.
298
+ - Tightened the round scheduler so it now reads the current champion and prioritizes underdeveloped hook lanes automatically: empty hooks (`return 0`) first, then simple passthrough hooks (`return _base_*`), then already-customized hooks last.
299
+ - Reordered the default worker specializations so the fast three-worker wave now naturally targets `structure`, `pawn_endgame`, and `initiative` once the tactical hook is already carrying custom logic.
300
+ - Smoke-tested the prioritizer with `python3 -m py_compile train/codex_swarm.py`, `uv run python -m train.codex_swarm run --workers 3 --rounds 1 --dry-run --serial --screen-positions 4 --positions 8 --worker-timeout-sec 60`, and a direct hook-state probe under `uv run python -`.
301
+
302
+ ## 2026-03-09 01:54 PST
303
+
304
+ - Multiple follow-up eval-only hook waves regressed on held-out screens: recent `_structure_hook`, `_pawn_endgame_hook`, and `_initiative_hook` candidates all scored below the current tactical-hook champion on fresh `train.benchmark_eval` probes.
305
+ - Concluded that the fastest path to a larger jump was no longer eval stacking but classical search quality, since `src/zero960/engine/search.py` was still a bare fixed-depth negamax with no quiescence or transposition memory.
306
+ - Upgraded `src/zero960/engine/search.py` with:
307
+ - quiescence search at depth-0 leaves (captures and promotions only),
308
+ - transposition-table probe/store using `board._transposition_key()`,
309
+ - killer-move and history-heuristic move ordering on quiet moves.
310
+ - Snapshotted the pre-change search/eval pair into `/tmp/0x960-search-baseline/` and benchmarked the new search against it with `train.benchmark_engine`.
311
+ - Internal engine-vs-engine results were dramatic:
312
+ - `positions=2`: `4.0/4`, `score=1.000`
313
+ - `positions=4`: `8.0/8`, `score=1.000`
314
+ - `positions=8`: `15.5/16`, `score=0.969`, estimated `+596.5 Elo`
315
+ - External anchor also improved sharply under the upgraded search:
316
+ - `uv run python -m train.benchmark_uci --candidate-file outputs/codex_swarm/champion_eval.py --engine-command stockfish --engine-option UCI_LimitStrength=true --engine-option UCI_Elo=1320 --positions 8 --candidate-depth 2 --engine-depth 1 --max-plies 120 --seed 42`
317
+ - result: `12.5/16`, `score=0.781`, estimated `+221.1 Elo` versus the `1320` anchor in this local setup.
318
+
319
+ ## 2026-03-09 03:08 PST
320
+
321
+ - Extended `train/codex_swarm.py` with a second swarm surface: `--surface search`.
322
+ - Search-mode workers now edit only `src/zero960/engine/search.py`, targeting one named search function per worker:
323
+ - `_move_order_score`
324
+ - `_quiescence`
325
+ - `negamax`
326
+ - `select_move`
327
+ - `_tactical_moves`
328
+ - Added `outputs/codex_swarm/champion_search.py` as the frozen swarm search baseline, parallel to `champion_eval.py`.
329
+ - The coordinator now snapshots a per-round baseline engine root and uses `train.benchmark_engine` for search-surface promotion, so each side gets its own eval plus its own searcher during held-out matches.
330
+ - Added benchmark timeout support to `train/codex_swarm.py` via `--benchmark-timeout-sec` so pathological search patches can be rejected instead of stalling the whole swarm.
331
+ - Updated `train/build_dashboard.py` to support `--include-engine-progress`, which benchmarks the current champion eval plus current repo search against `/tmp/0x960-search-baseline` and exposes that result in `dashboard_data.json` / `index.html`.
332
+ - Updated `README.md` and `docs/codex-swarm-plan.md` to document:
333
+ - the new search-surface swarm command
334
+ - the engine-progress dashboard command
335
+ - the difference between eval-surface and search-surface promotion
336
+ - Smoke-tested the new coordinator and dashboard code with:
337
+ - `python3 -m py_compile train/codex_swarm.py train/build_dashboard.py`
338
+ - `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial --surface search --screen-positions 2 --positions 4 --worker-timeout-sec 60 --benchmark-timeout-sec 30`
339
+
340
+ ## 2026-03-09 03:23 PST
341
+
342
+ - The first real search-surface Codex round produced clean small patches in `_move_order_score` and `_quiescence`, but both candidates timed out in the original search screen benchmark configuration.
343
+ - Tightened the search-surface coordinator path so search screening is now intentionally cheaper than eval screening:
344
+ - added `--search-screen-positions`
345
+ - added `--search-screen-depth`
346
+ - added `--search-screen-max-plies`
347
+ - added a separate `--final-benchmark-timeout-sec`
348
+ - Current intended search fast path is:
349
+ - cheap screen: `positions=1`, `depth=1`, `max_plies=20`
350
+ - final check: a slightly heavier engine-vs-engine match with its own timeout budget
351
+ - Fixed a worker snapshot refresh race in `_copy_tree()` by switching the pre-copy cleanup to `shutil.rmtree(..., ignore_errors=True)`, which avoids spurious `FileNotFoundError` failures when reusing local worker sandboxes under `/tmp/0x960-codex-swarm/`.
352
+ - Smoke-tested the cheaper search-screen path with:
353
+ - `python3 -m py_compile train/codex_swarm.py`
354
+ - `uv run python -m train.codex_swarm run --workers 2 --rounds 1 --dry-run --serial --surface search --search-screen-positions 1 --search-screen-depth 1 --search-screen-max-plies 20 --positions 4 --depth 2 --max-plies 120 --worker-timeout-sec 60 --benchmark-timeout-sec 20 --final-benchmark-timeout-sec 60`
355
+ - Direct fast-screen probes against the earlier search candidates finally returned promptly at `max_plies=20`:
356
+ - move-ordering patch: `score=0.500` over `2` games
357
+ - quiescence patch: `score=0.500` over `2` games
358
+ - That is not enough to claim improvement, but it proves the search-surface screen is now operational instead of timing out by default.
359
+ - The current engine also held up better than the earlier rough anchor read against a bigger `Stockfish UCI_Elo=1600` sample:
360
+ - `uv run python -m train.benchmark_uci --candidate-file outputs/codex_swarm/champion_eval.py --engine-command stockfish --engine-option UCI_LimitStrength=true --engine-option UCI_Elo=1600 --positions 4 --candidate-depth 2 --engine-depth 1 --max-plies 120 --seed 42`
361
+ - result: `4.5/8`, `score=0.5625`, estimated `+43.7 Elo` versus that local `1600` anchor setting
362
+ - Loosened the search-surface screen gate in `train/codex_swarm.py` so neutral search screens (`score == threshold`) can still advance to one heavier final benchmark. The ultra-fast `2`-game search screen is too coarse to treat `0.500` as automatic rejection.
363
+
364
+ ## 2026-03-09 04:14 PST
365
+
366
+ - Stopped waiting on slow full-match probes and moved to faster direct checks on the already-strong search baseline.
367
+ - Added selective root deepening to `src/zero960/engine/search.py` and synced the same change into `outputs/codex_swarm/champion_search.py`:
368
+ - when the root is in check,
369
+ - or the root move count is small (`<= 12`),
370
+ - or the game is in a low-material endgame with moderate branching,
371
+ - `select_move(..., depth=2)` now searches one extra ply at the root instead of paying for full-time `depth=3`.
372
+ - Timing sanity checks on the current champion eval with the new searcher:
373
+ - opening roots at nominal `depth=2` stayed fast (`~0.07s` to `0.11s` on three sampled Chess960 starts),
374
+ - a short 10-ply sample game mostly stayed under `~1.0s` per move, with a few heavier later plies around `1.0s`,
375
+ - full `depth=3` remained much slower (`~1.3s` to `1.5s` opening roots, growing to multi-second later plies), so selective root deepening is the better trade for now.
376
+ - Quick engine checks on the selective-depth searcher:
377
+ - internal engine-vs-engine smoke test against `/tmp/0x960-search-baseline` with `positions=1`, `depth=2`, `max_plies=80` still swept `2/2` games.
378
+ - local anchor smoke test against `Stockfish UCI_Elo=1600` with `positions=1`, `candidate_depth=2`, `engine_depth=1`, `max_plies=80` also scored `2/2` games.
379
+ - Synced the measured best eval surface into `src/zero960/workspace_template/eval.py` so the actual environment workspace now matches `outputs/codex_swarm/champion_eval.py` instead of lagging behind the swarm champion.
380
+
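The selective root-deepening rule above is just a depth policy at the root. A sketch with hypothetical boolean inputs in place of the real board queries:

```python
def root_depth(nominal: int, in_check: bool, legal_moves: int,
               low_material: bool, moderate_branching: bool) -> int:
    """One extra root ply for tactical or narrow roots, per the rule above."""
    if in_check or legal_moves <= 12 or (low_material and moderate_branching):
        return nominal + 1
    return nominal
```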
381
+ ## 2026-03-09 04:28 PST
382
+
383
+ - Fixed a real search-quality bug in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`: quiescence no longer uses stand-pat when the side to move is in check, and it now searches all legal evasions in that case instead of only tactical captures.
384
+ - Timing sanity after the in-check quiescence fix stayed healthy on sampled Chess960 openings at nominal `depth=2`:
385
+ - `0.099s`, `0.058s`, and `0.069s` on three sampled starts.
386
+ - Filled the previously empty `_structure_hook` in both `src/zero960/workspace_template/eval.py` and `outputs/codex_swarm/champion_eval.py` with conservative pawn-coordination terms:
387
+ - connected pawns,
388
+ - pawn chains,
389
+ - central pawn duos,
390
+ - modest bonuses for advanced central pawns,
391
+ - all phase-weighted so they matter in the middlegame without distorting late endgames.
392
+ - Avoided further king-safety duplication in that hook; the new structure terms are intended to complement the existing tactical/activity hooks rather than re-score the same shelter signals.
393
+
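The in-check quiescence fix can be illustrated over an abstract position type. Every callback here (`evaluate`, `in_check`, `tactical_moves`, `legal_moves`, `play`) is a hypothetical stand-in for the real board API, and scores follow the side-to-move negamax convention:

```python
def quiescence(pos, alpha, beta, *, evaluate, in_check, tactical_moves,
               legal_moves, play):
    """Quiescence with the fix described above: no stand-pat while in
    check, and all legal evasions are searched, not just captures."""
    if in_check(pos):
        moves = list(legal_moves(pos))     # every evasion
        best = -10**9                      # no stand-pat while in check
    else:
        stand_pat = evaluate(pos)
        if stand_pat >= beta:
            return stand_pat               # static score already refutes
        alpha = max(alpha, stand_pat)
        moves = list(tactical_moves(pos))  # captures / promotions only
        best = stand_pat
    for m in moves:
        score = -quiescence(play(pos, m), -beta, -alpha, evaluate=evaluate,
                            in_check=in_check, tactical_moves=tactical_moves,
                            legal_moves=legal_moves, play=play)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best
```

The key behavioral difference is that an in-check node can never "opt out" of the threat by taking the static evaluation.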
394
+ ## 2026-03-09 04:36 PST
395
+
396
+ - Added a persistent module-level transposition table in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py` instead of rebuilding the TT from scratch on every `select_move` call.
397
+ - Also started using the stored TT best move at the root for move ordering.
398
+ - This is a classical engine improvement rather than a prompt/surface change: later moves in the same game can now reuse earlier search work.
399
+ - Short same-game timing probe on Chess960 start `123` at nominal `depth=2` improved substantially versus the earlier selective-depth-only version:
400
+ - early plies dropped to roughly `0.05s` to `0.10s`,
401
+ - later mid-opening plies stayed around `0.32s` to `0.63s`,
402
+ - compared to the prior selective-depth run where similar later plies were around `0.62s` to `1.03s`.
403
+ - Kept the selective root deepening path in place, so the current searcher now combines:
404
+ - quiescence,
405
+ - TT probe/store,
406
+ - persistent TT reuse across moves,
407
+ - killer/history ordering,
408
+ - selective one-ply root extensions in tactical / low-branching roots.
409
+
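A persistent transposition table reduces to a module-level dict keyed by position hash that simply survives across `select_move` calls. The entry layout below is illustrative, not the project's exact one:

```python
_TT: dict = {}  # module-level, so it persists across moves in the same game


def tt_probe(key, depth):
    """Return a stored entry only if it was searched at least as deep."""
    entry = _TT.get(key)
    if entry is not None and entry["depth"] >= depth:
        return entry
    return None


def tt_store(key, depth, score, best_move):
    """Depth-preferred replacement: never overwrite with a shallower search."""
    old = _TT.get(key)
    if old is None or depth >= old["depth"]:
        _TT[key] = {"depth": depth, "score": score, "move": best_move}
```

The stored `move` is what enables the root move-ordering reuse the entry mentions: probe the TT at the root and try that move first.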
410
+ ## 2026-03-09 04:44 PST
411
+
412
+ - Added principal variation search (PVS) to both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`:
413
+ - first move at a node is searched on the full window,
414
+ - later moves use a zero-window search first,
415
+ - only fail-high candidates get the full re-search.
416
+ - This is another classical-engine speed optimization on top of the earlier alpha-beta + TT stack.
417
+ - Same 10-ply timing probe on Chess960 start `123` at nominal `depth=2` improved again versus the TT-persistent version:
418
+ - later plies that had been around `0.32s` to `0.63s` came down to roughly `0.25s` to `0.46s`,
419
+ - opening plies stayed in the same healthy range (`~0.05s` to `0.11s`).
420
+ - Current search stack is now:
421
+ - alpha-beta negamax,
422
+ - quiescence with in-check evasions,
423
+ - TT probe/store,
424
+ - persistent TT reuse across moves,
425
+ - TT root move ordering,
426
+ - killer/history ordering,
427
+ - PVS,
428
+ - selective one-ply root extensions.
429
+
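PVS as described can be sanity-checked against plain alpha-beta on a toy game tree; with correct move handling both must return the same root value. Both functions here are illustrative sketches, not the engine's code:

```python
def negamax(node, depth, alpha, beta, children, evaluate):
    """Plain alpha-beta reference for comparison."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    best = -10**9
    for child in kids:
        best = max(best, -negamax(child, depth - 1, -beta, -alpha,
                                  children, evaluate))
        alpha = max(alpha, best)
        if alpha >= beta:
            break
    return best


def pvs(node, depth, alpha, beta, children, evaluate):
    """First child on the full window; later children on a null window,
    with a full re-search only on a null-window fail-high."""
    kids = children(node)
    if depth == 0 or not kids:
        return evaluate(node)
    best = -10**9
    for i, child in enumerate(kids):
        if i == 0:
            score = -pvs(child, depth - 1, -beta, -alpha, children, evaluate)
        else:
            score = -pvs(child, depth - 1, -alpha - 1, -alpha,
                         children, evaluate)
            if alpha < score < beta:       # null-window fail-high
                score = -pvs(child, depth - 1, -beta, -score,
                             children, evaluate)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best
```

The speedup comes from the null-window searches cutting off almost immediately whenever move ordering puts the best move first.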
430
+ ## 2026-03-09 04:51 PST
431
+
432
+ - Spent part of the newly-won search speed on opening strength by widening selective root deepening for the very early game:
433
+ - new rule in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`:
434
+ - if `fullmove_number <= 2` and the root has `<= 24` legal moves, search one extra ply.
435
+ - Short timing probe on Chess960 start `123` at nominal `depth=2` after this change:
436
+ - first two plies were about `0.72s` to `0.79s`,
437
+ - later plies mostly stayed below `~1.0s`,
438
+ - move choices changed from the previous PVS-only run, which is exactly the intended effect.
439
+ - This is a deliberate trade:
440
+ - use the earlier TT/PVS speed wins to buy more opening search depth,
441
+ - keep the rest of the game closer to the cheaper depth-2 profile.
442
+
443
+ ## 2026-03-09 05:00 PST
444
+
445
+ - Added null-move pruning in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py` for non-check, non-endgame nodes at depth `>= 3`.
446
+ - Null-move did not produce a clean universal speed win on the sampled 10-ply probe, but it did alter the searched lines and reduced some later plies while making the earliest opening ply somewhat heavier. Kept it in place as a standard classical pruning rule pending larger-match validation.
447
+ - Added persistent history ordering across moves in both search files so quiet-move ordering can reuse what the engine has already learned earlier in the same game.
448
+ - Timing on the same Chess960 start after the last two changes stayed in the same general operating envelope:
449
+ - opening plies roughly `0.86s` to `1.21s` under the widened early-opening extension,
450
+ - later plies mostly around `0.10s` to `0.72s`,
451
+ - still materially better than the older pre-TT / pre-PVS search stack on later same-game plies.
452
+ - Tiny one-position `Stockfish UCI_Elo=1600` anchor probes are still too slow / flaky to treat as decision-grade, so the most reliable signal from this phase remains the measured same-game search speed improvements plus the earlier larger baseline/anchor results already recorded above.
453
+
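Null-move pruning with the stated guards (non-check, non-endgame, depth >= 3) can be sketched over hypothetical callbacks, using the conventional `R = 2` reduction; `hooks` bundles stand-ins for the real board queries:

```python
def negamax_nm(hooks, pos, depth, alpha, beta):
    """Negamax with null-move pruning. `hooks` holds hypothetical
    callbacks: children, evaluate, in_check, is_endgame, null_move."""
    children = hooks["children"](pos)
    if depth <= 0 or not children:
        return hooks["evaluate"](pos)
    if depth >= 3 and not hooks["in_check"](pos) and not hooks["is_endgame"](pos):
        # Give the opponent a free move; search depth - 1 - R (R = 2) on a
        # zero-width window around beta. If we still beat beta, prune.
        score = -negamax_nm(hooks, hooks["null_move"](pos), depth - 3,
                            -beta, -beta + 1)
        if score >= beta:
            return score
    best = -10**9
    for child in children:
        score = -negamax_nm(hooks, child, depth - 1, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break
    return best
```

The in-check and endgame guards exist because null-move is unsound there (zugzwang), which is consistent with the restriction the log entry describes.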
454
+ ## 2026-03-09 05:08 PST
455
+
456
+ - Added late move reductions (LMR) in both `src/zero960/engine/search.py` and `outputs/codex_swarm/champion_search.py`:
457
+ - later quiet moves at depth `>= 3` are searched at one reduced ply first,
458
+ - only moves that improve the window get the full re-search.
459
+ - Added aspiration windows at the root, seeded from the persistent TT score with automatic fallback to a full window on fail-low / fail-high.
460
+ - On the standard 10-ply Chess960 timing probe, these two changes kept the engine on a better branch:
461
+ - first ply dropped to about `0.86s`,
462
+ - second ply to about `0.61s`,
463
+ - later plies mostly in the `0.11s` to `0.46s` range.
464
+ - Also tried quiescence delta pruning as another leaf-speed optimization, but reverted it after it made several early plies materially worse on the same probe.
465
+ - Current kept search stack is therefore:
466
+ - alpha-beta negamax
467
+ - quiescence with in-check evasions
468
+ - TT probe/store
469
+ - persistent TT reuse across moves
470
+ - TT root move ordering
471
+ - persistent history ordering
472
+ - killer ordering
473
+ - PVS
474
+ - null-move pruning
475
+ - LMR
476
+ - aspiration windows at the root
477
+ - selective opening / tactical / endgame root extensions
478
+
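The root aspiration-window logic is small enough to sketch on its own. `search(alpha, beta)` stands in for one root search call, and the TT-seeded guess and window size are illustrative:

```python
def aspiration_search(search, tt_guess, window=50, full=(-10**9, 10**9)):
    """Try a narrow window around the TT score first; fall back to the
    full window on a fail-low or fail-high."""
    alpha, beta = tt_guess - window, tt_guess + window
    score = search(alpha, beta)
    if score <= alpha or score >= beta:   # fell outside the window
        score = search(*full)             # automatic full-window re-search
    return score
```

When the TT guess is close, the narrow window makes the first search much cheaper; when it is wrong, the cost is one extra full-window search.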
479
+ ## 2026-03-09 05:16 PST
480
+
481
+ - Tried widening the opening-depth policy from:
482
+   - `fullmove_number <= 2` / `<= 24` legal moves
483
+   to
484
+   - `fullmove_number <= 3` / `<= 22` legal moves.
485
+ - On the standard 10-ply timing probe, that pushed the first opening plies too high (`~1.40s` and `~1.05s`) without enough evidence of compensating benefit, so the change was reverted.
486
+ - Keeping the more conservative opening-depth rule that was already in place before that experiment.
docs/why_chess960.md CHANGED
@@ -1,4 +1,4 @@
1
- # why chess960
2
 
3
  ## short version
4
 
@@ -23,7 +23,7 @@ We should claim something narrower and more defensible:
23
  - strong standard-chess performance does not automatically transfer
24
  - this makes Chess960 a good downstream benchmark for a tool-using self-improvement environment
25
 
26
- ## relation to the project
27
 
28
  0x960 is not a move-prediction benchmark. The model does not play moves directly as its primary task.
29
 
 
1
+ # Why Chess960
2
 
3
  ## short version
4
 
 
23
  - strong standard-chess performance does not automatically transfer
24
  - this makes Chess960 a good downstream benchmark for a tool-using self-improvement environment
25
 
26
+ ## Relation to 0x960
27
 
28
  0x960 is not a move-prediction benchmark. The model does not play moves directly as its primary task.
29
 
media/submission/0x960_score_progression.png ADDED
media/submission/0x960_score_progression.txt ADDED
@@ -0,0 +1,4 @@
1
+ Champion score progression (all attempts)
2
+ points=9
3
+ min=0.4305
4
+ max=0.7219
media/submission/0x960_stockfish_anchors.png ADDED
media/submission/0x960_stockfish_anchors.txt ADDED
@@ -0,0 +1,3 @@
1
+ Stockfish anchor bars
2
+ anchors=2
3
+ elo range=1320-1600
media/submission/submission_summary.txt ADDED
@@ -0,0 +1,5 @@
1
+ Accepted samples:
2
+ round_20260308T063558Z_1: 0.6172 (yes)
3
+ round_20260308T070827Z_1: 0.5859 (yes)
4
+ round_20260308T091220Z_1: 0.5781 (yes)
5
+ round_20260308T111412Z_1: 0.6875 (yes)
scripts/generate_submission_media.py ADDED
@@ -0,0 +1,304 @@
1
+ #!/usr/bin/env python3
2
+ """Generate tracked PNG graphs from benchmark artifacts for submission media."""
3
+
4
+ from __future__ import annotations
5
+
6
+ import json
8
+ import struct
9
+ import zlib
10
+ from dataclasses import dataclass
11
+ from pathlib import Path
12
+
13
+
14
+ @dataclass(slots=True)
15
+ class Color:
16
+ r: int
17
+ g: int
18
+ b: int
19
+
20
+ def to_tuple(self) -> tuple[int, int, int]:
21
+ return (self.r, self.g, self.b)
22
+
23
+
24
+ WHITE = Color(245, 247, 250)
25
+ BG = Color(13, 17, 23)
26
+ AXIS = Color(132, 146, 165)
27
+ GRID = Color(44, 58, 73)
28
+ LINE = Color(88, 166, 255)
29
+ GOOD = Color(63, 185, 80)
30
+ BAD = Color(248, 81, 73)
31
+ MID = Color(210, 153, 34)
32
+ TEXT = Color(230, 237, 243)
33
+
34
+
35
+ class Canvas:
36
+ def __init__(self, width: int, height: int, bg: Color = BG) -> None:
37
+ self.width = width
38
+ self.height = height
39
+ self.pixels = [[bg.to_tuple() for _ in range(width)] for _ in range(height)]
40
+
41
+ def set_pixel(self, x: int, y: int, color: Color) -> None:
42
+ if 0 <= x < self.width and 0 <= y < self.height:
43
+ self.pixels[y][x] = color.to_tuple()
44
+
45
+ def line(self, x0: int, y0: int, x1: int, y1: int, color: Color) -> None:
46
+ dx = abs(x1 - x0)
47
+ dy = -abs(y1 - y0)
48
+ sx = 1 if x0 < x1 else -1
49
+ sy = 1 if y0 < y1 else -1
50
+ err = dx + dy
51
+ while True:
52
+ self.set_pixel(x0, y0, color)
53
+ if x0 == x1 and y0 == y1:
54
+ break
55
+ e2 = 2 * err
56
+ if e2 >= dy:
57
+ err += dy
58
+ x0 += sx
59
+ if e2 <= dx:
60
+ err += dx
61
+ y0 += sy
62
+
63
+ def rect(
64
+ self,
65
+ x0: int,
66
+ y0: int,
67
+ x1: int,
68
+ y1: int,
69
+ color: Color,
70
+ fill: bool = True,
71
+ ) -> None:
72
+ if fill:
73
+ for yy in range(max(0, y0), min(self.height, y1 + 1)):
74
+ for xx in range(max(0, x0), min(self.width, x1 + 1)):
75
+ self.pixels[yy][xx] = color.to_tuple()
76
+ else:
77
+ self.line(x0, y0, x1, y0, color)
78
+ self.line(x0, y1, x1, y1, color)
79
+ self.line(x0, y0, x0, y1, color)
80
+ self.line(x1, y0, x1, y1, color)
81
+
82
+ def circle(self, x: int, y: int, radius: int, color: Color) -> None:
83
+ for dy in range(-radius, radius + 1):
84
+ for dx in range(-radius, radius + 1):
85
+ if dx * dx + dy * dy <= radius * radius:
86
+ self.set_pixel(x + dx, y + dy, color)
87
+
88
+ def write_png(self, path: Path) -> None:
89
+ body = bytearray()
90
+ for row in self.pixels:
91
+ body.append(0)
92
+ row_bytes = bytearray()
93
+ for pixel in row:
94
+ row_bytes.extend(bytearray(pixel))
95
+ body.extend(row_bytes)
96
+ raw = zlib.compress(bytes(body), 9)
97
+
98
+ def chunk(chunk_type: bytes, data: bytes) -> bytes:
99
+ size = len(data)
100
+ head = struct.pack(">I", size) + chunk_type + data
101
+ crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
102
+ return struct.pack(">I", size) + chunk_type + data + struct.pack(">I", crc)
103
+
104
+ ihdr = struct.pack(
105
+ ">IIBBBBB",
106
+ self.width,
107
+ self.height,
108
+ 8,
109
+ 2,
110
+ 0,
111
+ 0,
112
+ 0,
113
+ )
114
+ png_data = (
115
+ b"\x89PNG\r\n\x1a\n"
116
+ + chunk(b"IHDR", ihdr)
117
+ + chunk(b"IDAT", raw)
118
+ + chunk(b"IEND", b"")
119
+ )
120
+ path.parent.mkdir(parents=True, exist_ok=True)
121
+ path.write_bytes(png_data)
122
+
123
+
124
+ def _draw_axes(chart: Canvas, left: int, right: int, top: int, bottom: int) -> None:
125
+ chart.line(left, bottom, right, bottom, AXIS)
126
+ chart.line(left, top, left, bottom, AXIS)
127
+ for i in range(5):
128
+ x = left + int((right - left) * (i / 4))
129
+ chart.line(x, top, x, bottom, GRID)
130
+ chart.line(left, top + int((bottom - top) * (i / 4)), right, top + int((bottom - top) * (i / 4)), GRID)
131
+
132
+
133
+ def _norm(value: float, lo: float, hi: float) -> float:
134
+ if hi == lo:
135
+ return 0.0
136
+ return (value - lo) / (hi - lo)
137
+
138
+
139
+ def _plot_line_chart(
140
+ out_path: Path,
141
+ points: list[tuple[str, float, bool]],
142
+ title: str,
143
+ ) -> None:
144
+ if not points:
145
+ return
146
+
147
+ width, height = 1200, 700
148
+ canvas = Canvas(width, height)
149
+ left, right = 100, width - 80
150
+ top, bottom = 120, height - 90
151
+ _draw_axes(canvas, left, right, top, bottom)
152
+
153
+ values = [p[1] for p in points]
154
+ min_v = min(values) * 0.95
155
+ max_v = max(values) * 1.05
156
+ if min_v == max_v:
157
+ min_v -= 0.1
158
+ max_v += 0.1
159
+
160
+ def point_to_xy(index: int, value: float) -> tuple[int, int]:
161
+ x = left + int((right - left) * (index / max(len(points) - 1, 1)))
162
+ y = bottom - int((_norm(value, min_v, max_v)) * (bottom - top))
163
+ return x, y
164
+
165
+ for idx in range(len(points) - 1):
166
+ x0, y0 = point_to_xy(idx, points[idx][1])
167
+ x1, y1 = point_to_xy(idx + 1, points[idx + 1][1])
168
+ color = GOOD if points[idx + 1][2] else MID
169
+ canvas.line(x0, y0, x1, y1, color)
170
+
171
+ for idx, (_, value, accepted) in enumerate(points):
172
+ x, y = point_to_xy(idx, value)
173
+ canvas.circle(x, y, 5, GOOD if accepted else BAD)
174
+
175
+ for x in range(len(points)):
176
+ px, py = point_to_xy(x, points[x][1])
177
+ canvas.line(px, py + 8, px, bottom, AXIS)
178
+ canvas.set_pixel(px, bottom + 2, TEXT)
179
+
180
+ canvas.line(left + 1, top + 20, right - 1, top + 20, GRID)
181
+ canvas.set_pixel(left + 2, top + 5, TEXT)
182
+
183
+ # Simple title marker shapes (no text, to avoid a font dependency)
184
+ canvas.rect(left + 4, 22, left + 14, 36, AXIS, fill=False)
185
+ canvas.line(right - 200, 34, right - 80, 34, AXIS)
186
+ canvas.set_pixel(right - 60, 34, TEXT)
187
+
188
+ canvas.write_png(out_path)
189
+ _write_caption(
190
+ out_path.with_suffix(".txt"),
191
+ [
192
+ title,
193
+ f"points={len(points)}",
194
+ f"min={min_v:.4f}",
195
+ f"max={max_v:.4f}",
196
+ ],
197
+ )
198
+
199
+
200
+ def _plot_anchor_bars(out_path: Path, anchors: list[dict[str, object]]) -> None:
201
+ width, height = 1200, 700
202
+ canvas = Canvas(width, height, BG)
203
+ left, right = 120, width - 80
204
+ top, bottom = 140, height - 130
205
+ _draw_axes(canvas, left, right, top, bottom)
206
+
207
+ if not anchors:
208
+ canvas.line(left + 1, bottom - 1, right - 1, top + 1, MID)
209
+ canvas.write_png(out_path)
210
+ return
211
+
212
+ bars = []
213
+ for row in anchors:
214
+ elo = float(row.get("uci_elo", 0))
215
+ score = float(row.get("score", 0.5))
216
+ bars.append((elo, score))
217
+
218
+ bar_space = (right - left) / max(len(bars), 1)
219
+ min_score = min(score for _, score in bars)
220
+ max_score = max(score for _, score in bars)
221
+ if min_score == max_score:
222
+ min_score -= 0.05
223
+ max_score += 0.05
224
+
225
+ for idx, (elo, score) in enumerate(bars):
226
+ x0 = int(left + idx * bar_space + bar_space * 0.2)
227
+ x1 = int(left + (idx + 1) * bar_space - bar_space * 0.2)
228
+ y = bottom - int(_norm(score, min_score, max_score) * (bottom - top))
229
+ canvas.rect(x0, y, x1, bottom, GOOD if score > 0.5 else BAD)
230
+ label = int(elo)
231
+ chart_pos = x0 + 6
232
+ for digit in str(label):
233
+ if chart_pos < width - 20:
234
+ chart_pos += 10
235
+
236
+ canvas.write_png(out_path)
237
+ _write_caption(
238
+ out_path.with_suffix(".txt"),
239
+ [
240
+ "Stockfish anchor bars",
241
+ f"anchors={len(anchors)}",
242
+ f"elo range={int(min(e for e, _ in bars))}-{int(max(e for e, _ in bars))}",
243
+ ],
244
+ )
245
+
246
+
247
+ def _write_caption(path: Path, lines: list[str]) -> None:
248
+ path.write_text("\n".join(lines), encoding="utf-8")
249
+
250
+
251
+ def load_dashboard_data(path: Path) -> dict:
252
+ if not path.exists():
253
+ raise FileNotFoundError(f"missing dashboard data: {path}")
254
+ return json.loads(path.read_text(encoding="utf-8"))
255
+
256
+
257
+ def main() -> None:
258
+ root = Path(__file__).resolve().parents[1]
259
+ data = load_dashboard_data(root / "outputs" / "dashboard" / "dashboard_data.json")
260
+ media_root = root / "media" / "submission"
261
+ media_root.mkdir(parents=True, exist_ok=True)
262
+
263
+ accepted = [
264
+ (row.get("round_name", f"#{idx}"), float(row.get("score", 0.5)), bool(row.get("accepted", False)))
265
+ for idx, row in enumerate(data.get("accepted_results", []))
266
+ ]
267
+ all_results = [
268
+ (row.get("round_name", f"#{idx}"), float(row.get("score", 0.5)), bool(row.get("accepted", False)))
269
+ for idx, row in enumerate(data.get("all_results", []))
270
+ ]
271
+
272
+ if all_results:
273
+ _plot_line_chart(
274
+ media_root / "0x960_score_progression.png",
275
+ all_results,
276
+ "Champion score progression (all attempts)",
277
+ )
278
+ else:
279
+ _plot_line_chart(
280
+ media_root / "0x960_score_progression.png",
281
+ [("n/a", 0.5, True)],
282
+ "Champion score progression (empty)",
283
+ )
284
+
285
+ _plot_anchor_bars(
286
+ media_root / "0x960_stockfish_anchors.png",
287
+ data.get("stockfish_anchors", []),
288
+ )
289
+
290
+ if accepted:
291
+ _write_caption(
292
+ media_root / "submission_summary.txt",
293
+ [
294
+ "Accepted samples:",
295
+ *(
296
+ f"{round_name}: {score:.4f} ({'yes' if accepted else 'no'})"
297
+ for round_name, score, accepted in accepted
298
+ ),
299
+ ],
300
+ )
301
+
302
+
303
+ if __name__ == "__main__":
304
+ main()
src/zero960/engine/default_eval.py CHANGED
@@ -5,33 +5,194 @@ import chess
5
  PIECE_VALUES = {
6
  chess.PAWN: 100,
7
  chess.KNIGHT: 320,
8
- chess.BISHOP: 330,
9
  chess.ROOK: 500,
10
  chess.QUEEN: 900,
11
  chess.KING: 0,
12
  }
13
 
14
- CENTER_SQUARES = {chess.D4, chess.E4, chess.D5, chess.E5}
15
 
16
 
17
- def evaluate(board: chess.Board) -> int:
18
- """Return a simple white-centric score in centipawns."""
19
- if board.is_checkmate():
20
- return -100_000 if board.turn == chess.WHITE else 100_000
21
- if board.is_stalemate() or board.is_insufficient_material():
22
  return 0
23

24
  score = 0
25
  for piece_type, piece_value in PIECE_VALUES.items():
26
- score += piece_value * len(board.pieces(piece_type, chess.WHITE))
27
- score -= piece_value * len(board.pieces(piece_type, chess.BLACK))
28
 
29
- for square in CENTER_SQUARES:
30
- piece = board.piece_at(square)
31
- if piece is None:
32
- continue
33
- score += 15 if piece.color == chess.WHITE else -15
34
 
35
- score += 2 * board.legal_moves.count() if board.turn == chess.WHITE else -2 * board.legal_moves.count()
 
 
 
 
 
36
  return score
37
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5
  PIECE_VALUES = {
6
  chess.PAWN: 100,
7
  chess.KNIGHT: 320,
8
+ chess.BISHOP: 335,
9
  chess.ROOK: 500,
10
  chess.QUEEN: 900,
11
  chess.KING: 0,
12
  }
13
 
14
+ CENTER_SQUARES = (chess.D4, chess.E4, chess.D5, chess.E5)
15
+ EXTENDED_CENTER = (
16
+ chess.C3, chess.D3, chess.E3, chess.F3,
17
+ chess.C4, chess.D4, chess.E4, chess.F4,
18
+ chess.C5, chess.D5, chess.E5, chess.F5,
19
+ chess.C6, chess.D6, chess.E6, chess.F6,
20
+ )
21
+ PIECE_MOBILITY_WEIGHTS = {
22
+ chess.KNIGHT: 4,
23
+ chess.BISHOP: 5,
24
+ chess.ROOK: 3,
25
+ chess.QUEEN: 2,
26
+ }
27
+ BISHOP_PAIR_BONUS = 35
28
+ ROOK_OPEN_FILE_BONUS = 20
29
+ ROOK_SEMIOPEN_FILE_BONUS = 10
30
+ DOUBLED_PAWN_PENALTY = 18
31
+ ISOLATED_PAWN_PENALTY = 14
32
+ BACK_RANK_MINOR_PENALTY = 10
33
+ CENTER_OCCUPANCY_BONUS = 14
34
+ CENTER_ATTACK_BONUS = 3
35
+ CASTLING_RIGHTS_BONUS = 12
36
+ TEMPO_BONUS = 8
37
+ PASSED_PAWN_BONUS_BY_RANK = [0, 5, 10, 18, 28, 42, 60, 0]
38
 
39
 
40
+ def _phase(board: chess.Board) -> int:
41
+ phase = 0
42
+ phase += 4 * (len(board.pieces(chess.QUEEN, chess.WHITE)) + len(board.pieces(chess.QUEEN, chess.BLACK)))
43
+ phase += 2 * (len(board.pieces(chess.ROOK, chess.WHITE)) + len(board.pieces(chess.ROOK, chess.BLACK)))
44
+ phase += len(board.pieces(chess.BISHOP, chess.WHITE)) + len(board.pieces(chess.BISHOP, chess.BLACK))
45
+ phase += len(board.pieces(chess.KNIGHT, chess.WHITE)) + len(board.pieces(chess.KNIGHT, chess.BLACK))
46
+ return min(phase, 24)
47
+
48
+
49
+ def _friendly(square: int, color: chess.Color, board: chess.Board) -> bool:
50
+ return board.color_at(square) == color
51
+
52
+
53
+ def _file_pawn_counts(board: chess.Board, color: chess.Color) -> list[int]:
54
+ counts = [0] * 8
55
+ for square in board.pieces(chess.PAWN, color):
56
+ counts[chess.square_file(square)] += 1
57
+ return counts
58
+
59
+
60
+ def _pawn_structure_score(board: chess.Board, color: chess.Color) -> int:
61
+ score = 0
62
+ pawns = sorted(board.pieces(chess.PAWN, color))
63
+ enemy_pawns = list(board.pieces(chess.PAWN, not color))
64
+ file_counts = _file_pawn_counts(board, color)
65
+
66
+ for count in file_counts:
67
+ if count > 1:
68
+ score -= DOUBLED_PAWN_PENALTY * (count - 1)
69
+
70
+ for square in pawns:
71
+ file_index = chess.square_file(square)
72
+ left_count = file_counts[file_index - 1] if file_index > 0 else 0
73
+ right_count = file_counts[file_index + 1] if file_index < 7 else 0
74
+ if left_count == 0 and right_count == 0:
75
+ score -= ISOLATED_PAWN_PENALTY
76
+
77
+ rank_index = chess.square_rank(square)
78
+ blocked = False
79
+ for enemy_square in enemy_pawns:
80
+ enemy_file = chess.square_file(enemy_square)
81
+ if abs(enemy_file - file_index) > 1:
82
+ continue
83
+ enemy_rank = chess.square_rank(enemy_square)
84
+ if color == chess.WHITE and enemy_rank > rank_index:
85
+ blocked = True
86
+ break
87
+ if color == chess.BLACK and enemy_rank < rank_index:
88
+ blocked = True
89
+ break
90
+ if not blocked:
91
+ advance = rank_index if color == chess.WHITE else 7 - rank_index
92
+ score += PASSED_PAWN_BONUS_BY_RANK[advance]
93
+
94
+ return score
95
+
96
+
97
+ def _mobility_score(board: chess.Board, color: chess.Color) -> int:
98
+ score = 0
99
+ friendly_mask = board.occupied_co[color]
100
+ for piece_type, weight in PIECE_MOBILITY_WEIGHTS.items():
101
+ for square in board.pieces(piece_type, color):
102
+ attacks = board.attacks_mask(square) & ~friendly_mask
103
+ score += weight * chess.popcount(attacks)
104
+ return score
105
+
106
+
107
+ def _center_score(board: chess.Board, color: chess.Color) -> int:
108
+ score = 0
109
+ for square in CENTER_SQUARES:
110
+ if _friendly(square, color, board):
111
+ score += CENTER_OCCUPANCY_BONUS
112
+
113
+ for square in EXTENDED_CENTER:
114
+ score += CENTER_ATTACK_BONUS * chess.popcount(board.attackers_mask(color, square))
115
+ return score
116
+
117
+
118
+ def _rook_file_score(board: chess.Board, color: chess.Color) -> int:
119
+ score = 0
120
+ friendly_pawns = board.pieces(chess.PAWN, color)
121
+ enemy_pawns = board.pieces(chess.PAWN, not color)
122
+ for square in board.pieces(chess.ROOK, color):
123
+ file_index = chess.square_file(square)
124
+ friendly_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in friendly_pawns)
125
+ enemy_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in enemy_pawns)
126
+ if not friendly_on_file:
127
+ score += ROOK_SEMIOPEN_FILE_BONUS
128
+ if not enemy_on_file:
129
+ score += ROOK_OPEN_FILE_BONUS
130
+ return score
131
+
132
+
133
+ def _king_safety_score(board: chess.Board, color: chess.Color, phase: int) -> int:
134
+ king_square = board.king(color)
135
+ if king_square is None:
136
+ return 0
137
+
138
+ score = 0
139
+ king_file = chess.square_file(king_square)
140
+ king_rank = chess.square_rank(king_square)
141
+
142
+ for file_index in range(max(0, king_file - 1), min(7, king_file + 1) + 1):
143
+ shelter_ranks = [king_rank + 1, king_rank + 2] if color == chess.WHITE else [king_rank - 1, king_rank - 2]
144
+ for rank_index in shelter_ranks:
145
+ if 0 <= rank_index < 8 and _friendly(chess.square(file_index, rank_index), color, board):
146
+ score += 4
147
+
148
+ enemy_pressure = 0
149
+ for square in chess.SquareSet(chess.BB_KING_ATTACKS[king_square]):
150
+ enemy_pressure += chess.popcount(board.attackers_mask(not color, square))
151
+ score -= enemy_pressure * (2 + phase // 8)
152
+
153
+ if board.has_castling_rights(color):
154
+ score += CASTLING_RIGHTS_BONUS * phase // 24
155
+ return score
156
+
157
+
158
+ def _development_score(board: chess.Board, color: chess.Color, phase: int) -> int:
159
+ if phase <= 8:
160
  return 0
161
 
162
+ home_rank = 0 if color == chess.WHITE else 7
163
+ penalty = 0
164
+ for piece_type in (chess.KNIGHT, chess.BISHOP):
165
+ for square in board.pieces(piece_type, color):
166
+ if chess.square_rank(square) == home_rank:
167
+ penalty += BACK_RANK_MINOR_PENALTY
168
+ return -penalty
169
+
170
+
171
+ def _side_score(board: chess.Board, color: chess.Color, phase: int) -> int:
172
  score = 0
173
  for piece_type, piece_value in PIECE_VALUES.items():
174
+ score += piece_value * len(board.pieces(piece_type, color))
 
175
 
176
+ if len(board.pieces(chess.BISHOP, color)) >= 2:
177
+ score += BISHOP_PAIR_BONUS
 
 
 
178
 
179
+ score += _pawn_structure_score(board, color)
180
+ score += _mobility_score(board, color)
181
+ score += _center_score(board, color)
182
+ score += _rook_file_score(board, color)
183
+ score += _king_safety_score(board, color, phase)
184
+ score += _development_score(board, color, phase)
185
  return score
186
 
187
+
188
+ def evaluate(board: chess.Board) -> int:
189
+ """Return a Chess960-safe white-centric score in centipawns."""
190
+ if board.is_checkmate():
191
+ return -100_000 if board.turn == chess.WHITE else 100_000
192
+ if board.is_stalemate() or board.is_insufficient_material():
193
+ return 0
194
+
195
+ phase = _phase(board)
196
+ score = _side_score(board, chess.WHITE, phase) - _side_score(board, chess.BLACK, phase)
197
+ score += TEMPO_BONUS if board.turn == chess.WHITE else -TEMPO_BONUS
198
+ return score
src/zero960/engine/search.py CHANGED
@@ -1,11 +1,45 @@
 from __future__ import annotations
 
 from collections.abc import Callable
+from typing import NamedTuple
 
 import chess
 
 EvalFn = Callable[[chess.Board], int]
 MATE_SCORE = 100_000
+TT_EXACT = "exact"
+TT_LOWER = "lower"
+TT_UPPER = "upper"
+MAX_TT_ENTRIES = 50_000
+CAPTURE_ORDER = {
+    chess.PAWN: 1,
+    chess.KNIGHT: 3,
+    chess.BISHOP: 3,
+    chess.ROOK: 5,
+    chess.QUEEN: 9,
+    chess.KING: 0,
+}
+ENDGAME_PHASE_THRESHOLD = 6
+LOW_BRANCHING_THRESHOLD = 12
+ENDGAME_BRANCHING_THRESHOLD = 18
+OPENING_FULLMOVE_LIMIT = 2
+OPENING_BRANCHING_THRESHOLD = 24
+NULL_MOVE_DEPTH_REDUCTION = 2
+NULL_MOVE_MIN_DEPTH = 3
+LMR_MIN_DEPTH = 3
+LMR_MIN_MOVE_INDEX = 3
+ASPIRATION_WINDOW = 60
+
+
+class TTEntry(NamedTuple):
+    depth: int
+    score: int
+    bound: str
+    best_move: chess.Move | None
+
+
+_GLOBAL_TT: dict[tuple[object, ...], TTEntry] = {}
+_GLOBAL_HISTORY: dict[tuple[int, int], int] = {}
 
 
 def _terminal_score(board: chess.Board) -> int:
@@ -14,30 +48,232 @@ def _terminal_score(board: chess.Board) -> int:
     return 0
 
 
+def _phase(board: chess.Board) -> int:
+    phase = 0
+    phase += 4 * (len(board.pieces(chess.QUEEN, chess.WHITE)) + len(board.pieces(chess.QUEEN, chess.BLACK)))
+    phase += 2 * (len(board.pieces(chess.ROOK, chess.WHITE)) + len(board.pieces(chess.ROOK, chess.BLACK)))
+    phase += len(board.pieces(chess.BISHOP, chess.WHITE)) + len(board.pieces(chess.BISHOP, chess.BLACK))
+    phase += len(board.pieces(chess.KNIGHT, chess.WHITE)) + len(board.pieces(chess.KNIGHT, chess.BLACK))
+    return min(phase, 24)
+
+
+def _selective_root_depth(board: chess.Board, depth: int, move_count: int) -> int:
+    if depth < 2:
+        return depth
+    if board.fullmove_number <= OPENING_FULLMOVE_LIMIT and move_count <= OPENING_BRANCHING_THRESHOLD:
+        return depth + 1
+    if board.is_check() or move_count <= LOW_BRANCHING_THRESHOLD:
+        return depth + 1
+    if _phase(board) <= ENDGAME_PHASE_THRESHOLD and move_count <= ENDGAME_BRANCHING_THRESHOLD:
+        return depth + 1
+    return depth
+
+
 def _score_for_turn(board: chess.Board, eval_fn: EvalFn) -> int:
     score = eval_fn(board)
     return score if board.turn == chess.WHITE else -score
 
 
+def _move_order_score(
+    board: chess.Board,
+    move: chess.Move,
+    *,
+    tt_move: chess.Move | None = None,
+    killer_moves: tuple[chess.Move, ...] = (),
+    history: dict[tuple[int, int], int] | None = None,
+) -> int:
+    if tt_move is not None and move == tt_move:
+        return 1_000_000
+
+    score = 0
+    if board.is_capture(move):
+        victim = board.piece_at(move.to_square)
+        attacker = board.piece_at(move.from_square)
+        if victim is not None:
+            score += 100 * CAPTURE_ORDER[victim.piece_type]
+        if attacker is not None:
+            score -= 10 * CAPTURE_ORDER[attacker.piece_type]
+    if move.promotion is not None:
+        score += 800 + CAPTURE_ORDER.get(move.promotion, 0)
+    if board.gives_check(move):
+        score += 50
+    if board.is_castling(move):
+        score += 25
+    if not board.is_capture(move) and move.promotion is None:
+        for index, killer in enumerate(killer_moves):
+            if move == killer:
+                score += 90_000 - index * 10_000
+                break
+        if history is not None:
+            piece_type = board.piece_type_at(move.from_square)
+            if piece_type is not None:
+                score += history.get((piece_type, move.to_square), 0)
+    return score
+
+
+def _ordered_moves(
+    board: chess.Board,
+    *,
+    tt_move: chess.Move | None = None,
+    killer_moves: tuple[chess.Move, ...] = (),
+    history: dict[tuple[int, int], int] | None = None,
+) -> list[chess.Move]:
+    return sorted(
+        board.legal_moves,
+        key=lambda move: _move_order_score(
+            board,
+            move,
+            tt_move=tt_move,
+            killer_moves=killer_moves,
+            history=history,
+        ),
+        reverse=True,
+    )
+
+
+def _tactical_moves(board: chess.Board) -> list[chess.Move]:
+    return [
+        move
+        for move in _ordered_moves(board)
+        if board.is_capture(move) or move.promotion is not None
+    ]
+
+
+def _record_killer(killers: dict[int, tuple[chess.Move, ...]], ply: int, move: chess.Move) -> None:
+    existing = tuple(candidate for candidate in killers.get(ply, ()) if candidate != move)
+    killers[ply] = (move, *existing[:1])
+
+
+def _record_history(
+    history: dict[tuple[int, int], int],
+    board: chess.Board,
+    move: chess.Move,
+    depth: int,
+) -> None:
+    piece_type = board.piece_type_at(move.from_square)
+    if piece_type is None:
+        return
+    key = (piece_type, move.to_square)
+    history[key] = history.get(key, 0) + depth * depth
+
+
+def _quiescence(board: chess.Board, alpha: int, beta: int, eval_fn: EvalFn) -> int:
+    if board.is_game_over(claim_draw=True):
+        return _terminal_score(board)
+
+    in_check = board.is_check()
+    if not in_check:
+        stand_pat = _score_for_turn(board, eval_fn)
+        if stand_pat >= beta:
+            return stand_pat
+        if stand_pat > alpha:
+            alpha = stand_pat
+
+    moves = _ordered_moves(board) if in_check else _tactical_moves(board)
+    for move in moves:
+        board.push(move)
+        score = -_quiescence(board, -beta, -alpha, eval_fn)
+        board.pop()
+        if score >= beta:
+            return score
+        if score > alpha:
+            alpha = score
+    return alpha
+
+
-def negamax(board: chess.Board, depth: int, alpha: int, beta: int, eval_fn: EvalFn) -> int:
-    if depth == 0 or board.is_game_over(claim_draw=True):
-        if board.is_game_over(claim_draw=True):
-            return _terminal_score(board)
-        return _score_for_turn(board, eval_fn)
+def negamax(
+    board: chess.Board,
+    depth: int,
+    alpha: int,
+    beta: int,
+    eval_fn: EvalFn,
+    tt: dict[tuple[object, ...], TTEntry],
+    killers: dict[int, tuple[chess.Move, ...]],
+    history: dict[tuple[int, int], int],
+    ply: int = 0,
+) -> int:
+    if board.is_game_over(claim_draw=True):
+        return _terminal_score(board)
+    if depth == 0:
+        return _quiescence(board, alpha, beta, eval_fn)
+
+    alpha_orig = alpha
+    key = board._transposition_key()
+    entry = tt.get(key)
+    tt_move = entry.best_move if entry is not None else None
+    if entry is not None and entry.depth >= depth:
+        if entry.bound == TT_EXACT:
+            return entry.score
+        if entry.bound == TT_LOWER:
+            alpha = max(alpha, entry.score)
+        elif entry.bound == TT_UPPER:
+            beta = min(beta, entry.score)
+        if alpha >= beta:
+            return entry.score
+
+    if (
+        depth >= NULL_MOVE_MIN_DEPTH
+        and not board.is_check()
+        and _phase(board) > ENDGAME_PHASE_THRESHOLD
+        and beta < MATE_SCORE
+    ):
+        board.push(chess.Move.null())
+        null_score = -negamax(
+            board,
+            depth - 1 - NULL_MOVE_DEPTH_REDUCTION,
+            -beta,
+            -beta + 1,
+            eval_fn,
+            tt,
+            killers,
+            history,
+            ply + 1,
+        )
+        board.pop()
+        if null_score >= beta:
+            return beta
 
     best_score = -MATE_SCORE
-    for move in board.legal_moves:
+    best_move: chess.Move | None = None
+    killer_moves = killers.get(ply, ())
+    for move_index, move in enumerate(_ordered_moves(board, tt_move=tt_move, killer_moves=killer_moves, history=history)):
         board.push(move)
-        score = -negamax(board, depth - 1, -beta, -alpha, eval_fn)
+        if move_index == 0:
+            score = -negamax(board, depth - 1, -beta, -alpha, eval_fn, tt, killers, history, ply + 1)
+        else:
+            reduced_depth = depth - 1
+            if (
+                depth >= LMR_MIN_DEPTH
+                and move_index >= LMR_MIN_MOVE_INDEX
+                and not board.is_check()
+                and not board.is_capture(move)
+                and move.promotion is None
+            ):
+                reduced_depth -= 1
+            score = -negamax(board, reduced_depth, -alpha - 1, -alpha, eval_fn, tt, killers, history, ply + 1)
+            if alpha < score < beta:
+                score = -negamax(board, depth - 1, -beta, -alpha, eval_fn, tt, killers, history, ply + 1)
         board.pop()
 
         if score > best_score:
             best_score = score
+            best_move = move
         if best_score > alpha:
             alpha = best_score
         if alpha >= beta:
+            if not board.is_capture(move) and move.promotion is None:
+                _record_killer(killers, ply, move)
+                _record_history(history, board, move, depth)
             break
 
+    bound = TT_EXACT
+    if best_score <= alpha_orig:
+        bound = TT_UPPER
+    elif best_score >= beta:
+        bound = TT_LOWER
+    if len(tt) >= MAX_TT_ENTRIES:
+        tt.clear()
+    tt[key] = TTEntry(depth=depth, score=best_score, bound=bound, best_move=best_move)
     return best_score
 
 
@@ -46,10 +282,61 @@ def select_move(board: chess.Board, depth: int, eval_fn: EvalFn) -> chess.Move:
     best_score = -MATE_SCORE
     alpha = -MATE_SCORE
     beta = MATE_SCORE
+    killers: dict[int, tuple[chess.Move, ...]] = {}
+    root_entry = _GLOBAL_TT.get(board._transposition_key())
+    if root_entry is not None and abs(root_entry.score) < MATE_SCORE // 2:
+        alpha = max(-MATE_SCORE, root_entry.score - ASPIRATION_WINDOW)
+        beta = min(MATE_SCORE, root_entry.score + ASPIRATION_WINDOW)
+    root_moves = _ordered_moves(
+        board,
+        tt_move=root_entry.best_move if root_entry is not None else None,
+        history=_GLOBAL_HISTORY,
+    )
+    search_depth = _selective_root_depth(board, depth, len(root_moves))
+    use_full_window = False
 
-    for move in board.legal_moves:
+    for move_index, move in enumerate(root_moves):
         board.push(move)
-        score = -negamax(board, depth - 1, -beta, -alpha, eval_fn)
+        if move_index == 0:
+            score = -negamax(board, search_depth - 1, -beta, -alpha, eval_fn, _GLOBAL_TT, killers, _GLOBAL_HISTORY, 1)
+        else:
+            reduced_depth = search_depth - 1
+            if (
+                search_depth >= LMR_MIN_DEPTH
+                and move_index >= LMR_MIN_MOVE_INDEX
+                and not board.is_check()
+                and not board.is_capture(move)
+                and move.promotion is None
+            ):
+                reduced_depth -= 1
+            score = -negamax(
+                board,
+                reduced_depth,
+                -alpha - 1,
+                -alpha,
+                eval_fn,
+                _GLOBAL_TT,
+                killers,
+                _GLOBAL_HISTORY,
+                1,
+            )
+            if alpha < score < beta:
+                score = -negamax(board, search_depth - 1, -beta, -alpha, eval_fn, _GLOBAL_TT, killers, _GLOBAL_HISTORY, 1)
+        if not use_full_window and (score <= alpha or score >= beta):
+            score = -negamax(
+                board,
+                search_depth - 1,
+                -MATE_SCORE,
+                MATE_SCORE,
+                eval_fn,
+                _GLOBAL_TT,
+                killers,
+                _GLOBAL_HISTORY,
+                1,
+            )
+            alpha = -MATE_SCORE
+            beta = MATE_SCORE
+            use_full_window = True
         board.pop()
 
         if best_move is None or score > best_score:
@@ -61,4 +348,3 @@ def select_move(board: chess.Board, depth: int, eval_fn: EvalFn) -> chess.Move:
     if best_move is None:
         raise RuntimeError("no legal move available")
     return best_move
-
src/zero960/runtime/episode.py CHANGED
@@ -14,9 +14,21 @@ from zero960.runtime.workspace import WorkspaceManager
 @dataclass(slots=True)
 class EpisodeConfig:
     max_steps: int = 6
-    search_depth: int = 2
-    training_games: int = 2
+    search_depth: int = 1
+    training_games: int = 1
     crash_penalty: float = 0.25
+    valid_write_bonus: float = 0.20
+    changed_write_bonus: float = 0.10
+    unchanged_write_penalty: float = 0.10
+    explicit_match_bonus: float = 0.15
+    finish_after_match_bonus: float = 0.05
+    repeated_static_eval_penalty: float = 0.15
+    static_eval_before_write_penalty: float = 0.20
+    redundant_read_penalty: float = 0.25
+    match_without_edit_penalty: float = 0.15
+    finish_without_edit_penalty: float = 0.45
+    finish_without_match_penalty: float = 0.20
+    finish_without_retest_penalty: float = 0.08
 
 
 class Zero960EpisodeRuntime:
@@ -28,6 +40,11 @@ class Zero960EpisodeRuntime:
         self.steps_taken = 0
         self.invalid_edit_count = 0
         self.last_match_score: float | None = None
+        self.has_valid_edit = False
+        self.has_run_match = False
+        self.wrote_since_match = False
+        self.shaping_reward_total = 0.0
+        self.last_action_type: str | None = None
 
     def reset(self, chess960_index: int | None = None) -> RuntimeObservation:
         self.close()
@@ -37,6 +54,11 @@ class Zero960EpisodeRuntime:
         self.steps_taken = 0
         self.invalid_edit_count = 0
         self.last_match_score = None
+        self.has_valid_edit = False
+        self.has_run_match = False
+        self.wrote_since_match = False
+        self.shaping_reward_total = 0.0
+        self.last_action_type = None
         return self._observation("episode reset")
 
     def close(self) -> None:
@@ -50,6 +72,7 @@ class Zero960EpisodeRuntime:
 
         done = False
         reward: float | None = None
+        step_reward = 0.0
        status_message = ""
        info: dict[str, object] = {}
 
@@ -59,11 +82,29 @@ class Zero960EpisodeRuntime:
                raise ValueError("read_file requires path")
            content = self.workspace.read_file(action.path)
            status_message = f"read {action.path} ({len(content)} bytes)"
+            if action.path == "eval.py":
+                step_reward -= self.config.redundant_read_penalty
+                status_message += "; eval.py was already visible"
        elif action.action_type == "write_file":
            if action.path is None or action.content is None:
                raise ValueError("write_file requires path and content")
+            previous_content = self.workspace.read_file(action.path)
            self.workspace.write_file(action.path, action.content)
-            status_message = f"wrote {action.path}"
+            try:
+                self.workspace.load_eval_function()
+            except Exception:
+                self.workspace.write_file(action.path, previous_content)
+                raise
+
+            if action.content == previous_content:
+                step_reward -= self.config.unchanged_write_penalty
+                status_message = f"wrote {action.path}; file unchanged"
+            else:
+                step_reward += self.config.valid_write_bonus + self.config.changed_write_bonus
+                self.has_valid_edit = True
+                self.wrote_since_match = True
+                status_message = f"wrote {action.path}; validated evaluate(board)"
+                info["code_changed"] = True
        elif action.action_type == "run_static_eval":
            eval_fn = self.workspace.load_eval_function()
            board = chess.Board.from_chess960_pos(self.start_position)
@@ -71,10 +112,20 @@ class Zero960EpisodeRuntime:
            score = eval_fn(board)
            status_message = f"static eval score={score}"
            info["static_eval_score"] = score
+            if not self.has_valid_edit:
+                step_reward -= self.config.static_eval_before_write_penalty
+            if self.last_action_type == "run_static_eval":
+                step_reward -= self.config.repeated_static_eval_penalty
        elif action.action_type == "run_match":
            self.last_match_score = self._run_training_match()
+            self.has_run_match = True
            status_message = f"match score={self.last_match_score:.3f}"
            info["match_score"] = self.last_match_score
+            if self.has_valid_edit and self.wrote_since_match:
+                step_reward += self.config.explicit_match_bonus
+                self.wrote_since_match = False
+            elif not self.has_valid_edit:
+                step_reward -= self.config.match_without_edit_penalty
        elif action.action_type == "finish":
            reward = self._final_reward()
            done = True
@@ -86,8 +137,13 @@ class Zero960EpisodeRuntime:
            status_message = f"action failed: {exc}"
            info["error"] = str(exc)
 
+        if not done:
+            reward = step_reward
+            self.shaping_reward_total += step_reward
+
        self.history.append(f"{action.action_type}: {status_message}")
        self.steps_taken += 1
+        self.last_action_type = action.action_type
 
        if not done and self.steps_taken >= self.config.max_steps:
            reward = self._final_reward()
@@ -113,8 +169,17 @@ class Zero960EpisodeRuntime:
    def _final_reward(self) -> float:
        if self.last_match_score is None:
            self.last_match_score = self._run_training_match()
+        reward = self.last_match_score + self.shaping_reward_total
+        if self.has_run_match:
+            reward += self.config.finish_after_match_bonus
+        if not self.has_valid_edit:
+            reward -= self.config.finish_without_edit_penalty
+        if not self.has_run_match:
+            reward -= self.config.finish_without_match_penalty
+        if self.wrote_since_match:
+            reward -= self.config.finish_without_retest_penalty
        penalty = self.invalid_edit_count * self.config.crash_penalty
-        return self.last_match_score - penalty
+        return reward - penalty
 
    def _observation(
        self,
@@ -124,10 +189,12 @@ class Zero960EpisodeRuntime:
    ) -> RuntimeObservation:
        if self.workspace is None:
            raise RuntimeError("workspace unavailable")
+        workflow_hint, suggested_actions = self._workflow_state()
        return RuntimeObservation(
            task=(
                "Improve eval.py for the current Chess960 engine. "
-                "Use bounded file edits and finish when ready for scoring."
+                "The full file is already visible below. Best loop: write_file a valid replacement, "
+                "run_match to test it, then finish. Repeated run_static_eval and early finish are penalized."
            ),
            status_message=status_message,
            file_contents={"eval.py": self.workspace.read_file("eval.py")},
@@ -136,7 +203,32 @@ class Zero960EpisodeRuntime:
            remaining_steps=max(self.config.max_steps - self.steps_taken, 0),
            last_match_score=self.last_match_score,
            invalid_edit_count=self.invalid_edit_count,
+            workflow_hint=workflow_hint,
+            suggested_actions=suggested_actions,
+            has_valid_edit=self.has_valid_edit,
+            has_run_match=self.has_run_match,
            reward=reward,
            done=done,
        )
 
+    def _workflow_state(self) -> tuple[str, list[str]]:
+        if not self.has_valid_edit:
+            return (
+                "eval.py is already shown below. Do not waste a turn on read_file. "
+                "Write a full valid replacement for eval.py next.",
+                ["write_file", "run_match", "finish"],
+            )
+        if self.wrote_since_match:
+            return (
+                "You have a valid untested edit. Run run_match next to measure it.",
+                ["run_match", "write_file", "finish"],
+            )
+        if self.has_run_match:
+            return (
+                "You have a tested edit. Finish if the score is acceptable, otherwise write_file again.",
+                ["finish", "write_file", "run_match"],
+            )
+        return (
+            "A valid edit exists but no explicit match has been run yet. Run run_match next.",
+            ["run_match", "finish", "write_file"],
+        )
src/zero960/runtime/types.py CHANGED
@@ -23,6 +23,10 @@ class RuntimeObservation:
     remaining_steps: int
     last_match_score: float | None
     invalid_edit_count: int
+    workflow_hint: str
+    suggested_actions: list[str]
+    has_valid_edit: bool
+    has_run_match: bool
     reward: float | None = None
     done: bool = False
 
@@ -33,4 +37,3 @@ class RuntimeStepResult:
     reward: float | None
     done: bool
     info: dict[str, Any] = field(default_factory=dict)
-
src/zero960/workspace_template/eval.py CHANGED
@@ -5,20 +5,385 @@ import chess
 PIECE_VALUES = {
     chess.PAWN: 100,
     chess.KNIGHT: 320,
-    chess.BISHOP: 330,
+    chess.BISHOP: 335,
     chess.ROOK: 500,
     chess.QUEEN: 900,
     chess.KING: 0,
 }
 
+CENTER_SQUARES = (chess.D4, chess.E4, chess.D5, chess.E5)
+EXTENDED_CENTER = (
+    chess.C3, chess.D3, chess.E3, chess.F3,
+    chess.C4, chess.D4, chess.E4, chess.F4,
+    chess.C5, chess.D5, chess.E5, chess.F5,
+    chess.C6, chess.D6, chess.E6, chess.F6,
+)
+PIECE_MOBILITY_WEIGHTS = {
+    chess.KNIGHT: 4,
+    chess.BISHOP: 5,
+    chess.ROOK: 3,
+    chess.QUEEN: 2,
+}
+CENTER_AXIS_BONUS = (0, 1, 2, 3, 3, 2, 1, 0)
+BISHOP_PAIR_BONUS = 35
+ROOK_OPEN_FILE_BONUS = 20
+ROOK_SEMIOPEN_FILE_BONUS = 10
+DOUBLED_PAWN_PENALTY = 18
+ISOLATED_PAWN_PENALTY = 14
+BACK_RANK_MINOR_PENALTY = 10
+CENTER_OCCUPANCY_BONUS = 14
+CENTER_ATTACK_BONUS = 3
+CASTLING_RIGHTS_BONUS = 12
+CASTLED_BONUS = 18
+KNIGHT_CENTER_BONUS = 6
+BISHOP_CENTER_BONUS = 2
+KING_ENDGAME_CENTER_BONUS = 5
+UNDEFENDED_TARGET_DIVISOR = 16
+OVERLOADED_TARGET_DIVISOR = 24
+TEMPO_BONUS = 8
+PASSED_PAWN_BONUS_BY_RANK = [0, 5, 10, 18, 28, 42, 60, 0]
+LOOSE_PIECE_DIVISOR = 24
+OUTNUMBERED_PIECE_DIVISOR = 40
+PAWN_HARASSMENT_PENALTY = 8
+CONNECTED_PAWN_BONUS = 4
+PAWN_CHAIN_BONUS = 5
+CENTRAL_PAWN_DUO_BONUS = 10
+ADVANCED_CENTRAL_PAWN_BONUS = 3
+
+
+def _phase(board: chess.Board) -> int:
+    phase = 0
+    phase += 4 * (len(board.pieces(chess.QUEEN, chess.WHITE)) + len(board.pieces(chess.QUEEN, chess.BLACK)))
+    phase += 2 * (len(board.pieces(chess.ROOK, chess.WHITE)) + len(board.pieces(chess.ROOK, chess.BLACK)))
+    phase += len(board.pieces(chess.BISHOP, chess.WHITE)) + len(board.pieces(chess.BISHOP, chess.BLACK))
+    phase += len(board.pieces(chess.KNIGHT, chess.WHITE)) + len(board.pieces(chess.KNIGHT, chess.BLACK))
+    return min(phase, 24)
+
+
+def _friendly(square: int, color: chess.Color, board: chess.Board) -> bool:
+    return board.color_at(square) == color
+
+
+def _center_axis_score(square: int) -> int:
+    return CENTER_AXIS_BONUS[chess.square_file(square)] + CENTER_AXIS_BONUS[chess.square_rank(square)]
+
+
+def _file_pawn_counts(board: chess.Board, color: chess.Color) -> list[int]:
+    counts = [0] * 8
+    for square in board.pieces(chess.PAWN, color):
+        counts[chess.square_file(square)] += 1
+    return counts
+
+
+def _pawn_structure_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    pawns = sorted(board.pieces(chess.PAWN, color))
+    enemy_pawns = list(board.pieces(chess.PAWN, not color))
+    file_counts = _file_pawn_counts(board, color)
+
+    for count in file_counts:
+        if count > 1:
+            score -= DOUBLED_PAWN_PENALTY * (count - 1)
+
+    for square in pawns:
+        file_index = chess.square_file(square)
+        left_count = file_counts[file_index - 1] if file_index > 0 else 0
+        right_count = file_counts[file_index + 1] if file_index < 7 else 0
+        if left_count == 0 and right_count == 0:
+            score -= ISOLATED_PAWN_PENALTY
+
+        rank_index = chess.square_rank(square)
+        blocked = False
+        for enemy_square in enemy_pawns:
+            enemy_file = chess.square_file(enemy_square)
+            if abs(enemy_file - file_index) > 1:
+                continue
+            enemy_rank = chess.square_rank(enemy_square)
+            if color == chess.WHITE and enemy_rank > rank_index:
+                blocked = True
+                break
+            if color == chess.BLACK and enemy_rank < rank_index:
+                blocked = True
+                break
+        if not blocked:
+            advance = rank_index if color == chess.WHITE else 7 - rank_index
+            score += PASSED_PAWN_BONUS_BY_RANK[advance]
+
+    return score
+
+
+def _mobility_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    friendly_mask = board.occupied_co[color]
+    for piece_type, weight in PIECE_MOBILITY_WEIGHTS.items():
+        for square in board.pieces(piece_type, color):
+            attacks = board.attacks_mask(square) & ~friendly_mask
+            score += weight * chess.popcount(attacks)
+    return score
+
+
+def _center_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    for square in CENTER_SQUARES:
+        if _friendly(square, color, board):
+            score += CENTER_OCCUPANCY_BONUS
+
+    for square in EXTENDED_CENTER:
+        score += CENTER_ATTACK_BONUS * chess.popcount(board.attackers_mask(color, square))
+    return score
+
+
+def _rook_file_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    friendly_pawns = board.pieces(chess.PAWN, color)
+    enemy_pawns = board.pieces(chess.PAWN, not color)
+    for square in board.pieces(chess.ROOK, color):
+        file_index = chess.square_file(square)
+        friendly_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in friendly_pawns)
+        enemy_on_file = any(chess.square_file(pawn_square) == file_index for pawn_square in enemy_pawns)
+        if not friendly_on_file:
+            score += ROOK_SEMIOPEN_FILE_BONUS
+            if not enemy_on_file:
+                score += ROOK_OPEN_FILE_BONUS
+    return score
+
+
+def _king_safety_score(board: chess.Board, color: chess.Color, phase: int) -> int:
+    king_square = board.king(color)
+    if king_square is None:
+        return 0
+
+    score = 0
+    king_file = chess.square_file(king_square)
+    king_rank = chess.square_rank(king_square)
+
+    for file_index in range(max(0, king_file - 1), min(7, king_file + 1) + 1):
+        shelter_ranks = [king_rank + 1, king_rank + 2] if color == chess.WHITE else [king_rank - 1, king_rank - 2]
+        for rank_index in shelter_ranks:
+            if 0 <= rank_index < 8 and _friendly(chess.square(file_index, rank_index), color, board):
+                score += 4
+
+    enemy_pressure = 0
+    for square in chess.SquareSet(chess.BB_KING_ATTACKS[king_square]):
+        enemy_pressure += chess.popcount(board.attackers_mask(not color, square))
+    score -= enemy_pressure * (2 + phase // 8)
+
+    if board.has_castling_rights(color):
+        score += CASTLING_RIGHTS_BONUS * phase // 24
+    return score
+
+
+def _development_score(board: chess.Board, color: chess.Color, phase: int) -> int:
+    if phase <= 8:
+        return 0
+
+    home_rank = 0 if color == chess.WHITE else 7
+    penalty = 0
+    for piece_type in (chess.KNIGHT, chess.BISHOP):
+        for square in board.pieces(piece_type, color):
+            if chess.square_rank(square) == home_rank:
+                penalty += BACK_RANK_MINOR_PENALTY
+    return -penalty
+
+
+def _base_piece_safety_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    for piece_type in (chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
+        for square in board.pieces(piece_type, color):
+            attackers_mask = board.attackers_mask(not color, square)
+            if not attackers_mask:
+                continue
+
+            attackers = chess.popcount(attackers_mask)
+            defenders = chess.popcount(board.attackers_mask(color, square))
+            if defenders == 0:
+                score -= PIECE_VALUES[piece_type] // LOOSE_PIECE_DIVISOR
+            elif attackers > defenders:
+                score -= PIECE_VALUES[piece_type] // OUTNUMBERED_PIECE_DIVISOR
+
+            if defenders <= attackers:
+                pawn_pressure = 0
+                for attacker_square in chess.SquareSet(attackers_mask):
+                    if board.piece_type_at(attacker_square) == chess.PAWN:
+                        pawn_pressure += 1
+                score -= pawn_pressure * PAWN_HARASSMENT_PENALTY
+    return score
+
+
+def _base_piece_placement_score(board: chess.Board, color: chess.Color, phase: int) -> int:
+    score = 0
+    for square in board.pieces(chess.KNIGHT, color):
+        score += KNIGHT_CENTER_BONUS * _center_axis_score(square)
+    for square in board.pieces(chess.BISHOP, color):
+        score += BISHOP_CENTER_BONUS * _center_axis_score(square)
+
+    king_square = board.king(color)
+    if king_square is not None:
+        back_rank = 0 if color == chess.WHITE else 7
+        if chess.square_rank(king_square) == back_rank:
+            if king_square in (chess.C1, chess.G1, chess.C8, chess.G8):
+                rook_square = chess.square(3 if chess.square_file(king_square) == 2 else 5, back_rank)
+                if _friendly(rook_square, color, board):
+                    score += CASTLED_BONUS * phase // 24
+        score += KING_ENDGAME_CENTER_BONUS * _center_axis_score(king_square) * (24 - phase) // 24
+    return score
+
+
+def _base_threat_score(board: chess.Board, color: chess.Color) -> int:
+    score = 0
+    for piece_type in (chess.PAWN, chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
+        for square in board.pieces(piece_type, not color):
+            attackers = chess.popcount(board.attackers_mask(color, square))
+            if attackers == 0:
+                continue
+            defenders = chess.popcount(board.attackers_mask(not color, square))
+            if defenders == 0:
+                score += PIECE_VALUES[piece_type] // UNDEFENDED_TARGET_DIVISOR
+            elif attackers > defenders:
+                score += PIECE_VALUES[piece_type] // OVERLOADED_TARGET_DIVISOR
+    return score
+
+
+def _structure_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: structure and castling heuristics."""
+    # SWARM_HOOK: structure
+    score = 0
+    pawns = board.pieces(chess.PAWN, color)
+    direction = 1 if color == chess.WHITE else -1
+
+    for square in pawns:
+        file_index = chess.square_file(square)
+        rank_index = chess.square_rank(square)
+
+        for neighbor_file in (file_index - 1, file_index + 1):
+            if 0 <= neighbor_file < 8:
+                neighbor_square = chess.square(neighbor_file, rank_index)
+                if neighbor_square in pawns:
+                    score += CONNECTED_PAWN_BONUS
+
+        support_rank = rank_index - direction
+        if 0 <= support_rank < 8:
+            for support_file in (file_index - 1, file_index + 1):
+                if 0 <= support_file < 8:
+                    support_square = chess.square(support_file, support_rank)
+                    if support_square in pawns:
+                        score += PAWN_CHAIN_BONUS
+
+        if file_index in (3, 4):
+            advance = rank_index if color == chess.WHITE else 7 - rank_index
+            if advance >= 3:
+                score += ADVANCED_CENTRAL_PAWN_BONUS
+
+    if chess.D4 in pawns and chess.E4 in pawns:
+        score += CENTRAL_PAWN_DUO_BONUS
+    if chess.D5 in pawns and chess.E5 in pawns:
+        score += CENTRAL_PAWN_DUO_BONUS
+
+    return score * phase // 24
+
+
+def _tactical_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: tactical safety and loose-piece pressure."""
+    # SWARM_HOOK: tactical
+    score = _base_piece_safety_score(board, color)
+    tactical_values = {
+        chess.PAWN: 100,
+        chess.KNIGHT: 320,
+        chess.BISHOP: 335,
+        chess.ROOK: 500,
+        chess.QUEEN: 900,
+        chess.KING: 1200,
+    }
+
+    def _least_tactical_value(attackers_mask: int) -> int:
+        least = 10_000
+        for attacker_square in chess.SquareSet(attackers_mask):
+            piece_type = board.piece_type_at(attacker_square)
+            if piece_type is None:
+                continue
+            value = tactical_values[piece_type]
+            if value < least:
+                least = value
+        return least
+
+    def _exchange_edge(attacking_color: chess.Color, square: int, piece_type: int) -> int:
+        attackers_mask = board.attackers_mask(attacking_color, square)
+        if not attackers_mask:
+            return 0
+
+        least_attacker = _least_tactical_value(attackers_mask)
+        target_value = PIECE_VALUES[piece_type]
+        if least_attacker >= target_value:
+            return 0
+
+        defenders_mask = board.attackers_mask(not attacking_color, square)
+        edge = 0
+        if not defenders_mask:
+            edge += (target_value - least_attacker) // 16 + 4
+        else:
+            least_defender = _least_tactical_value(defenders_mask)
+            if least_defender > least_attacker:
+                edge += (least_defender - least_attacker) // 24 + 1
+            if chess.popcount(attackers_mask) > chess.popcount(defenders_mask):
+                edge += target_value // 96
+        return edge
+
+    for piece_type in (chess.KNIGHT, chess.BISHOP, chess.ROOK, chess.QUEEN):
+        for square in board.pieces(piece_type, not color):
+            score += _exchange_edge(color, square, piece_type)
+        for square in board.pieces(piece_type, color):
+            score -= _exchange_edge(not color, square, piece_type)
+
+    return score
+
+
+def _activity_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: activity, centralization, and placement."""
+    # SWARM_HOOK: activity
+    return _base_piece_placement_score(board, color, phase)
+
+
+def _pawn_endgame_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: pawn structure and endgame conversion."""
+    # SWARM_HOOK: pawn_endgame
+    return 0
+
+
+def _initiative_hook(board: chess.Board, color: chess.Color, phase: int) -> int:
+    """Swarm lane: threats, tempo conversion, and initiative."""
+    # SWARM_HOOK: initiative
+    return _base_threat_score(board, color)
+
+
-def evaluate(board: chess.Board) -> int:
+def _side_score(board: chess.Board, color: chess.Color, phase: int) -> int:
     score = 0
     for piece_type, piece_value in PIECE_VALUES.items():
-        score += piece_value * len(board.pieces(piece_type, chess.WHITE))
-        score -= piece_value * len(board.pieces(piece_type, chess.BLACK))
+        score += piece_value * len(board.pieces(piece_type, color))
+
+    if len(board.pieces(chess.BISHOP, color)) >= 2:
+        score += BISHOP_PAIR_BONUS
+
+    score += _pawn_structure_score(board, color)
+    score += _mobility_score(board, color)
+    score += _center_score(board, color)
+    score += _rook_file_score(board, color)
+    score += _king_safety_score(board, color, phase)
+    score += _development_score(board, color, phase)
+    score += _structure_hook(board, color, phase)
+    score += _tactical_hook(board, color, phase)
+    score += _activity_hook(board, color, phase)
+    score += _pawn_endgame_hook(board, color, phase)
+    score += _initiative_hook(board, color, phase)
+    return score
+
+
+def evaluate(board: chess.Board) -> int:
+    if board.is_checkmate():
+        return -100_000 if board.turn == chess.WHITE else 100_000
+    if board.is_stalemate() or board.is_insufficient_material():
+        return 0
 
-    white_center = sum(1 for square in (chess.D4, chess.E4, chess.D5, chess.E5) if board.color_at(square) == chess.WHITE)
-    black_center = sum(1 for square in (chess.D4, chess.E4, chess.D5, chess.E5) if board.color_at(square) == chess.BLACK)
-    score += 15 * (white_center - black_center)
+    phase = _phase(board)
+    score = _side_score(board, chess.WHITE, phase) - _side_score(board, chess.BLACK, phase)
+    score += TEMPO_BONUS if board.turn == chess.WHITE else -TEMPO_BONUS
     return score
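The `_phase` helper in the diff above drives the tapered evaluation: 4 points per queen on the board, 2 per rook, 1 per bishop or knight, capped at 24 so bonuses like `CASTLED_BONUS * phase // 24` scale smoothly from opening to endgame. A minimal standalone sketch with the piece counts passed as plain integers (so it runs without python-chess):

```python
def phase(queens: int, rooks: int, bishops: int, knights: int) -> int:
    """Game-phase count: 24 at full material, 0 in a bare-kings endgame."""
    return min(4 * queens + 2 * rooks + bishops + knights, 24)

# Full starting material across both sides: 2 queens, 4 rooks, 4 bishops, 4 knights.
full = phase(2, 4, 4, 4)  # 4*2 + 2*4 + 4 + 4 = 24
```

A term weighted by `phase // 24` therefore contributes fully at the start and fades to zero as pieces leave the board, while `(24 - phase) // 24` does the opposite.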
src/zero960_env/models.py CHANGED
@@ -21,3 +21,7 @@ class Zero960Observation(Observation):
     remaining_steps: int = 0
     last_match_score: float | None = None
     invalid_edit_count: int = 0
+    workflow_hint: str = ""
+    suggested_actions: list[str] = Field(default_factory=list)
+    has_valid_edit: bool = False
+    has_run_match: bool = False
src/zero960_env/server/environment.py CHANGED
@@ -35,6 +35,10 @@ class Zero960Environment(Environment[Zero960Action, Zero960Observation, State]):
             remaining_steps=observation.remaining_steps,
             last_match_score=observation.last_match_score,
             invalid_edit_count=observation.invalid_edit_count,
+            workflow_hint=observation.workflow_hint,
+            suggested_actions=observation.suggested_actions,
+            has_valid_edit=observation.has_valid_edit,
+            has_run_match=observation.has_run_match,
         )
 
     def step(
@@ -61,6 +65,10 @@ class Zero960Environment(Environment[Zero960Action, Zero960Observation, State]):
             remaining_steps=obs.remaining_steps,
             last_match_score=obs.last_match_score,
            invalid_edit_count=obs.invalid_edit_count,
+            workflow_hint=obs.workflow_hint,
+            suggested_actions=obs.suggested_actions,
+            has_valid_edit=obs.has_valid_edit,
+            has_run_match=obs.has_run_match,
             reward=obs.reward,
             done=obs.done,
         )
train/benchmark_engine.py ADDED
@@ -0,0 +1,207 @@
+"""Benchmark two full engine roots so each side uses its own search and eval code."""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+from collections.abc import Callable
+from dataclasses import dataclass
+from pathlib import Path
+
+import chess
+
+from train.benchmark_eval import BenchmarkResult, _elo_from_score, _sample_positions
+
+EvalFn = Callable[[chess.Board], int]
+SelectMoveFn = Callable[[chess.Board, int, EvalFn], chess.Move]
+
+
+@dataclass(slots=True)
+class EngineHandle:
+    root: Path
+    eval_path: Path
+    search_path: Path
+    evaluate: EvalFn
+    select_move: SelectMoveFn
+
+
+def _load_module(path: Path, module_name: str) -> object:
+    spec = importlib.util.spec_from_file_location(module_name, path)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    return module
+
+
+def _load_engine(root: Path, eval_rel: str, search_rel: str, label: str) -> EngineHandle:
+    eval_path = (root / eval_rel).resolve()
+    search_path = (root / search_rel).resolve()
+    eval_module = _load_module(eval_path, f"zero960_eval_{label}")
+    search_module = _load_module(search_path, f"zero960_search_{label}")
+
+    evaluate = getattr(eval_module, "evaluate", None)
+    select_move = getattr(search_module, "select_move", None)
+    if evaluate is None or not callable(evaluate):
+        raise RuntimeError(f"{eval_path} does not define evaluate(board)")
+    if select_move is None or not callable(select_move):
+        raise RuntimeError(f"{search_path} does not define select_move(board, depth, eval_fn)")
+
+    return EngineHandle(
+        root=root.resolve(),
+        eval_path=eval_path,
+        search_path=search_path,
+        evaluate=evaluate,
+        select_move=select_move,
+    )
+
+
+def _new_board(chess960_index: int) -> chess.Board:
+    board = chess.Board.from_chess960_pos(chess960_index)
+    board.chess960 = True
+    return board
+
+
+def _play_game(
+    chess960_index: int,
+    white_engine: EngineHandle,
+    black_engine: EngineHandle,
+    *,
+    depth: int,
+    max_plies: int,
+) -> float:
+    board = _new_board(chess960_index)
+
+    for _ in range(max_plies):
+        if board.is_game_over(claim_draw=True):
+            break
+        engine = white_engine if board.turn == chess.WHITE else black_engine
+        move = engine.select_move(board, depth=depth, eval_fn=engine.evaluate)
+        board.push(move)
+
+    result = board.result(claim_draw=True)
+    if result == "1-0":
+        return 1.0
+    if result == "0-1":
+        return 0.0
+    return 0.5
+
+
+def benchmark_engine_roots(
+    candidate_root: Path,
+    baseline_root: Path,
+    *,
+    candidate_eval_rel: str = "src/zero960/workspace_template/eval.py",
+    baseline_eval_rel: str = "src/zero960/workspace_template/eval.py",
+    candidate_search_rel: str = "src/zero960/engine/search.py",
+    baseline_search_rel: str = "src/zero960/engine/search.py",
+    positions: int = 64,
+    depth: int = 2,
+    max_plies: int = 120,
+    seed: int = 42,
+) -> BenchmarkResult:
+    candidate = _load_engine(candidate_root, candidate_eval_rel, candidate_search_rel, "candidate")
+    baseline = _load_engine(baseline_root, baseline_eval_rel, baseline_search_rel, "baseline")
+    start_positions = _sample_positions(positions, seed)
+
+    wins = 0
+    draws = 0
+    losses = 0
+    points = 0.0
+
+    for chess960_index in start_positions:
+        white_result = _play_game(
+            chess960_index,
+            candidate,
+            baseline,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += white_result
+        if white_result == 1.0:
+            wins += 1
+        elif white_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+        black_result = 1.0 - _play_game(
+            chess960_index,
+            baseline,
+            candidate,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += black_result
+        if black_result == 1.0:
+            wins += 1
+        elif black_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+    total_games = len(start_positions) * 2
+    score = points / total_games if total_games else 0.0
+    return BenchmarkResult(
+        candidate_path=candidate.root,
+        baseline_path=baseline.root,
+        positions=len(start_positions),
+        depth=depth,
+        max_plies=max_plies,
+        seed=seed,
+        wins=wins,
+        draws=draws,
+        losses=losses,
+        points=points,
+        total_games=total_games,
+        score=score,
+        elo_delta_estimate=_elo_from_score(score),
+    )
+
+
+def parse_args() -> argparse.Namespace:
+    root = Path(__file__).resolve().parents[1]
+    parser = argparse.ArgumentParser(description=__doc__)
+    parser.add_argument("--candidate-root", default=str(root))
+    parser.add_argument("--baseline-root", default=str(root))
+    parser.add_argument("--candidate-eval-rel", default="src/zero960/workspace_template/eval.py")
+    parser.add_argument("--baseline-eval-rel", default="src/zero960/workspace_template/eval.py")
+    parser.add_argument("--candidate-search-rel", default="src/zero960/engine/search.py")
+    parser.add_argument("--baseline-search-rel", default="src/zero960/engine/search.py")
+    parser.add_argument("--positions", type=int, default=64)
+    parser.add_argument("--depth", type=int, default=2)
+    parser.add_argument("--max-plies", type=int, default=120)
+    parser.add_argument("--seed", type=int, default=42)
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    result = benchmark_engine_roots(
+        Path(args.candidate_root).resolve(),
+        Path(args.baseline_root).resolve(),
+        candidate_eval_rel=args.candidate_eval_rel,
+        baseline_eval_rel=args.baseline_eval_rel,
+        candidate_search_rel=args.candidate_search_rel,
+        baseline_search_rel=args.baseline_search_rel,
+        positions=args.positions,
+        depth=args.depth,
+        max_plies=args.max_plies,
+        seed=args.seed,
+    )
+
+    print(f"candidate_root: {result.candidate_path}")
+    print(f"baseline_root: {result.baseline_path}")
+    print(
+        f"positions={result.positions} depth={result.depth} max_plies={result.max_plies} "
+        f"games={result.total_games} seed={result.seed}"
+    )
+    print(
+        f"record={result.wins}-{result.draws}-{result.losses} "
+        f"points={result.points:.1f}/{result.total_games}"
+    )
+    print(f"score={result.score:.3f} elo_delta_estimate={result.elo_delta_estimate:.1f}")
+
+
+if __name__ == "__main__":
+    main()
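`_load_module` above uses the standard `importlib.util` pattern to execute an arbitrary `eval.py` or `search.py` from a file path without it being on `sys.path`. A self-contained sketch of that pattern, demonstrated against a throwaway module written to a temp directory (the `toy_eval` name and file are illustrative, not part of the repo):

```python
import importlib.util
import tempfile
from pathlib import Path


def load_module_from_path(path: Path, module_name: str):
    """Build a module spec from a file path and execute it, as _load_module does."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise RuntimeError(f"failed to load module from {path}")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Write a minimal eval-style module to disk, then load it dynamically.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "toy_eval.py"
    source.write_text("def evaluate(board):\n    return 0\n", encoding="utf-8")
    module = load_module_from_path(source, "toy_eval")
    result = module.evaluate(None)  # -> 0
```

Passing distinct `module_name` values (as `_load_engine` does with its `label` argument) keeps the candidate and baseline modules from colliding even when both files are named `eval.py`.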
train/benchmark_eval.py ADDED
@@ -0,0 +1,184 @@
+"""Benchmark two Chess960 eval functions against each other."""
+
+from __future__ import annotations
+
+import argparse
+import importlib.util
+import math
+import random
+from collections.abc import Callable
+from dataclasses import asdict, dataclass
+from pathlib import Path
+
+import chess
+
+from zero960.engine.match import play_game
+
+EvalFn = Callable[[chess.Board], int]
+
+
+@dataclass(slots=True)
+class BenchmarkResult:
+    candidate_path: Path
+    baseline_path: Path
+    positions: int
+    depth: int
+    max_plies: int
+    seed: int
+    wins: int
+    draws: int
+    losses: int
+    points: float
+    total_games: int
+    score: float
+    elo_delta_estimate: float
+
+    def to_json(self) -> dict[str, object]:
+        payload = asdict(self)
+        payload["candidate_path"] = str(self.candidate_path)
+        payload["baseline_path"] = str(self.baseline_path)
+        return payload
+
+
+def _load_eval(path: Path) -> EvalFn:
+    spec = importlib.util.spec_from_file_location(f"zero960_benchmark_{path.stem}", path)
+    if spec is None or spec.loader is None:
+        raise RuntimeError(f"failed to load module from {path}")
+
+    module = importlib.util.module_from_spec(spec)
+    spec.loader.exec_module(module)
+    evaluate = getattr(module, "evaluate", None)
+    if evaluate is None or not callable(evaluate):
+        raise RuntimeError(f"{path} does not define evaluate(board)")
+    return evaluate
+
+
+def _sample_positions(count: int, seed: int) -> list[int]:
+    rng = random.Random(seed)
+    population = list(range(960))
+    if count <= len(population):
+        return rng.sample(population, count)
+    return [rng.choice(population) for _ in range(count)]
+
+
+def _elo_from_score(score: float) -> float:
+    clipped = min(max(score, 0.01), 0.99)
+    return -400.0 * math.log10((1.0 / clipped) - 1.0)
+
+
+def benchmark_eval_files(
+    candidate_path: Path,
+    baseline_path: Path,
+    *,
+    positions: int = 64,
+    depth: int = 2,
+    max_plies: int = 120,
+    seed: int = 42,
+) -> BenchmarkResult:
+    candidate_eval = _load_eval(candidate_path)
+    baseline_eval = _load_eval(baseline_path)
+    start_positions = _sample_positions(positions, seed)
+
+    wins = 0
+    draws = 0
+    losses = 0
+    points = 0.0
+
+    for chess960_index in start_positions:
+        white_result = play_game(
+            chess960_index,
+            candidate_eval,
+            baseline_eval,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += white_result
+        if white_result == 1.0:
+            wins += 1
+        elif white_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+        black_result = 1.0 - play_game(
+            chess960_index,
+            baseline_eval,
+            candidate_eval,
+            depth=depth,
+            max_plies=max_plies,
+        )
+        points += black_result
+        if black_result == 1.0:
+            wins += 1
+        elif black_result == 0.5:
+            draws += 1
+        else:
+            losses += 1
+
+    total_games = len(start_positions) * 2
+    score = points / total_games if total_games else 0.0
+    return BenchmarkResult(
+        candidate_path=candidate_path,
+        baseline_path=baseline_path,
+        positions=len(start_positions),
+        depth=depth,
+        max_plies=max_plies,
+        seed=seed,
+        wins=wins,
+        draws=draws,
+        losses=losses,
+        points=points,
+        total_games=total_games,
+        score=score,
+        elo_delta_estimate=_elo_from_score(score),
+    )
+
+
+def parse_args() -> argparse.Namespace:
+    root = Path(__file__).resolve().parents[1]
+    parser = argparse.ArgumentParser(description="Benchmark two Chess960 eval functions.")
+    parser.add_argument(
+        "--candidate-file",
+        default=str(root / "src/zero960/workspace_template/eval.py"),
+        help="Path to the candidate eval.py file.",
+    )
+    parser.add_argument(
+        "--baseline-file",
+        default=str(root / "src/zero960/engine/default_eval.py"),
+        help="Path to the baseline eval.py file.",
+    )
+    parser.add_argument("--positions", type=int, default=64)
+    parser.add_argument("--depth", type=int, default=2)
+    parser.add_argument("--max-plies", type=int, default=120)
+    parser.add_argument("--seed", type=int, default=42)
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = parse_args()
+    candidate_path = Path(args.candidate_file).resolve()
+    baseline_path = Path(args.baseline_file).resolve()
+    result = benchmark_eval_files(
+        candidate_path,
+        baseline_path,
+        positions=args.positions,
+        depth=args.depth,
+        max_plies=args.max_plies,
+        seed=args.seed,
+    )
+
+    print(f"candidate: {candidate_path}")
+    print(f"baseline: {baseline_path}")
+    print(
+        f"positions={result.positions} depth={result.depth} max_plies={result.max_plies} "
+        f"games={result.total_games} seed={result.seed}"
+    )
+    print(
+        f"record={result.wins}-{result.draws}-{result.losses} "
+        f"points={result.points:.1f}/{result.total_games}"
+    )
+    print(f"score={result.score:.3f} elo_delta_estimate={result.elo_delta_estimate:.1f}")
+
+
+if __name__ == "__main__":
+    main()
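The `_elo_from_score` helper above inverts the logistic Elo expectation: a match score s (points per game) maps to a rating delta of -400 * log10(1/s - 1), with s clipped to [0.01, 0.99] so a perfect sweep stays finite. A minimal standalone sketch of the same formula:

```python
import math


def elo_from_score(score: float) -> float:
    """Estimated Elo delta for a given per-game score, clipped away from 0 and 1."""
    clipped = min(max(score, 0.01), 0.99)
    return -400.0 * math.log10((1.0 / clipped) - 1.0)


even = elo_from_score(0.5)     # 0.0: equal strength
strong = elo_from_score(0.75)  # ~+191: scoring 75% of the points
```

The mapping is antisymmetric about 0.5 (a 25% score estimates roughly -191), and the clipping means any score at or above 0.99 saturates at the same ceiling, so very lopsided small samples do not produce unbounded estimates.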
train/benchmark_league.py ADDED
@@ -0,0 +1,249 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+ """Benchmark a candidate eval against a league of accepted swarm champions."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ from dataclasses import asdict, dataclass
+ from pathlib import Path
+
+ from train.benchmark_eval import BenchmarkResult, benchmark_eval_files
+
+
+ @dataclass(slots=True)
+ class LeagueOpponentResult:
+     opponent_path: Path
+     label: str
+     result: BenchmarkResult
+
+     def to_json(self) -> dict[str, object]:
+         payload = asdict(self)
+         payload["opponent_path"] = str(self.opponent_path)
+         payload["result"] = self.result.to_json()
+         return payload
+
+
+ @dataclass(slots=True)
+ class LeagueResult:
+     candidate_path: Path
+     opponents: list[LeagueOpponentResult]
+     total_points: float
+     total_games: int
+     overall_score: float
+     overall_elo_delta_estimate: float
+
+     def to_json(self) -> dict[str, object]:
+         return {
+             "candidate_path": str(self.candidate_path),
+             "opponents": [opponent.to_json() for opponent in self.opponents],
+             "total_points": self.total_points,
+             "total_games": self.total_games,
+             "overall_score": self.overall_score,
+             "overall_elo_delta_estimate": self.overall_elo_delta_estimate,
+         }
+
+
+ def _repo_root() -> Path:
+     return Path(__file__).resolve().parents[1]
+
+
+ def _default_candidate(root: Path) -> Path:
+     return root / "outputs" / "codex_swarm" / "champion_eval.py"
+
+
+ def _default_baseline(root: Path) -> Path:
+     return root / "src" / "zero960" / "engine" / "default_eval.py"
+
+
+ def _accepted_snapshots(root: Path) -> list[Path]:
+     accepted_dir = root / "outputs" / "codex_swarm" / "accepted"
+     if not accepted_dir.exists():
+         return []
+     return sorted(accepted_dir.glob("*_eval.py"))
+
+
+ def _dedupe_paths(paths: list[Path]) -> list[Path]:
+     seen: set[Path] = set()
+     ordered: list[Path] = []
+     for path in paths:
+         resolved = path.resolve()
+         if resolved in seen or not resolved.exists():
+             continue
+         seen.add(resolved)
+         ordered.append(resolved)
+     return ordered
+
+
+ def _same_contents(left: Path, right: Path) -> bool:
+     return left.read_text(encoding="utf-8") == right.read_text(encoding="utf-8")
+
+
+ def _label_for_path(root: Path, path: Path) -> str:
+     resolved = path.resolve()
+     champion = (root / "outputs" / "codex_swarm" / "champion_eval.py").resolve()
+     baseline = (root / "src" / "zero960" / "engine" / "default_eval.py").resolve()
+     if resolved == champion:
+         return "current_champion"
+     if resolved == baseline:
+         return "original_baseline"
+     return resolved.stem
+
+
+ def default_league_opponents(
+     *,
+     candidate_path: Path,
+     include_baseline: bool,
+     include_champion: bool,
+     accepted_limit: int | None,
+ ) -> list[Path]:
+     root = _repo_root()
+     opponents: list[Path] = []
+     if include_baseline:
+         opponents.append(_default_baseline(root))
+     if include_champion:
+         opponents.append(_default_candidate(root))
+
+     accepted = _accepted_snapshots(root)
+     if accepted_limit is not None:
+         accepted = accepted[-accepted_limit:]
+     opponents.extend(accepted)
+
+     filtered: list[Path] = []
+     for path in _dedupe_paths(opponents):
+         if path.resolve() == candidate_path.resolve():
+             continue
+         if _same_contents(path, candidate_path):
+             continue
+         filtered.append(path)
+     return filtered
+
+
+ def benchmark_league(
+     candidate_path: Path,
+     opponent_paths: list[Path],
+     *,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> LeagueResult:
+     root = _repo_root()
+     opponent_results: list[LeagueOpponentResult] = []
+     total_points = 0.0
+     total_games = 0
+
+     for offset, opponent_path in enumerate(opponent_paths):
+         result = benchmark_eval_files(
+             candidate_path,
+             opponent_path,
+             positions=positions,
+             depth=depth,
+             max_plies=max_plies,
+             seed=seed + offset,
+         )
+         opponent_results.append(
+             LeagueOpponentResult(
+                 opponent_path=opponent_path,
+                 label=_label_for_path(root, opponent_path),
+                 result=result,
+             )
+         )
+         total_points += result.points
+         total_games += result.total_games
+
+     overall_score = total_points / total_games if total_games else 0.0
+     overall_elo = 0.0
+     if total_games:
+         from train.benchmark_eval import _elo_from_score  # local reuse
+
+         overall_elo = _elo_from_score(overall_score)
+
+     return LeagueResult(
+         candidate_path=candidate_path,
+         opponents=opponent_results,
+         total_points=total_points,
+         total_games=total_games,
+         overall_score=overall_score,
+         overall_elo_delta_estimate=overall_elo,
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     root = _repo_root()
+     parser = argparse.ArgumentParser(description=__doc__)
+     parser.add_argument(
+         "--candidate-file",
+         default=str(_default_candidate(root)),
+         help="Path to the candidate eval.py file.",
+     )
+     parser.add_argument(
+         "--opponent-file",
+         action="append",
+         default=[],
+         help="Optional explicit opponent file. Repeat to add more than one.",
+     )
+     parser.add_argument("--positions", type=int, default=16)
+     parser.add_argument("--depth", type=int, default=2)
+     parser.add_argument("--max-plies", type=int, default=120)
+     parser.add_argument("--seed", type=int, default=42)
+     parser.add_argument(
+         "--accepted-limit",
+         type=int,
+         default=4,
+         help="How many accepted swarm snapshots to include by default.",
+     )
+     parser.add_argument("--no-baseline", action="store_true", help="Exclude the original baseline from the league.")
+     parser.add_argument("--no-champion", action="store_true", help="Exclude the current champion from the league.")
+     parser.add_argument("--json", action="store_true", help="Print the full result as JSON.")
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     candidate_path = Path(args.candidate_file).resolve()
+     explicit_opponents = [Path(path).resolve() for path in args.opponent_file]
+
+     opponents = _dedupe_paths(explicit_opponents)
+     if not opponents:
+         opponents = default_league_opponents(
+             candidate_path=candidate_path,
+             include_baseline=not args.no_baseline,
+             include_champion=not args.no_champion,
+             accepted_limit=args.accepted_limit,
+         )
+
+     if not opponents:
+         raise SystemExit("No league opponents found.")
+
+     result = benchmark_league(
+         candidate_path,
+         opponents,
+         positions=args.positions,
+         depth=args.depth,
+         max_plies=args.max_plies,
+         seed=args.seed,
+     )
+
+     if args.json:
+         print(json.dumps(result.to_json(), indent=2, sort_keys=True))
+         return
+
+     print(f"candidate: {candidate_path}")
+     print(f"league opponents: {len(result.opponents)}")
+     for opponent in result.opponents:
+         benchmark = opponent.result
+         print(
+             f"- {opponent.label}: record={benchmark.wins}-{benchmark.draws}-{benchmark.losses} "
+             f"points={benchmark.points:.1f}/{benchmark.total_games} score={benchmark.score:.3f} "
+             f"elo_delta_estimate={benchmark.elo_delta_estimate:.1f}"
+         )
+     print(
+         f"overall: points={result.total_points:.1f}/{result.total_games} "
+         f"score={result.overall_score:.3f} elo_delta_estimate={result.overall_elo_delta_estimate:.1f}"
+     )
+
+
+ if __name__ == "__main__":
+     main()
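The league runner above pools points and games across all opponents before converting the pooled score into a single Elo estimate. A standalone sketch of that pooling plus the logistic score-to-Elo mapping (mirroring `_elo_from_score` from `train/benchmark_eval.py`; the per-opponent numbers here are made up for illustration):

```python
import math

def elo_from_score(score: float) -> float:
    # Invert the logistic expected-score model; clamp to avoid infinities at 0 or 1.
    clipped = min(max(score, 0.01), 0.99)
    return -400.0 * math.log10((1.0 / clipped) - 1.0)

# Pool per-opponent (points, games) pairs the way benchmark_league does.
per_opponent = [(6.5, 10), (4.0, 10), (7.0, 10)]  # hypothetical match results
total_points = sum(points for points, _ in per_opponent)
total_games = sum(games for _, games in per_opponent)
overall_score = total_points / total_games if total_games else 0.0

print(f"{overall_score:.3f}")                    # 0.583
print(f"{elo_from_score(overall_score):.1f}")    # 58.5
```

Pooling before the Elo conversion weights each opponent by games played, rather than averaging per-opponent Elo deltas, which would overweight short matches.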
train/benchmark_uci.py ADDED
@@ -0,0 +1,307 @@
+ """Benchmark a local Chess960 eval file against a UCI engine such as Stockfish."""
+
+ from __future__ import annotations
+
+ import argparse
+ import importlib.util
+ import math
+ import random
+ from collections.abc import Callable
+ from dataclasses import asdict, dataclass
+ from pathlib import Path
+
+ import chess
+ import chess.engine
+
+ from zero960.engine.search import select_move
+
+ EvalFn = Callable[[chess.Board], int]
+
+
+ @dataclass(slots=True)
+ class UciBenchmarkResult:
+     candidate_path: Path
+     engine_command: str
+     engine_options: dict[str, bool | int | float | str]
+     positions: int
+     max_plies: int
+     seed: int
+     candidate_depth: int | None
+     candidate_nodes: int | None
+     engine_depth: int | None
+     engine_nodes: int | None
+     wins: int
+     draws: int
+     losses: int
+     points: float
+     total_games: int
+     score: float
+     elo_delta_estimate: float
+
+     def to_json(self) -> dict[str, object]:
+         payload = asdict(self)
+         payload["candidate_path"] = str(self.candidate_path)
+         return payload
+
+
+ def _load_eval(path: Path) -> EvalFn:
+     spec = importlib.util.spec_from_file_location(f"zero960_uci_benchmark_{path.stem}", path)
+     if spec is None or spec.loader is None:
+         raise RuntimeError(f"failed to load module from {path}")
+
+     module = importlib.util.module_from_spec(spec)
+     spec.loader.exec_module(module)
+     evaluate = getattr(module, "evaluate", None)
+     if evaluate is None or not callable(evaluate):
+         raise RuntimeError(f"{path} does not define evaluate(board)")
+     return evaluate
+
+
+ def _sample_positions(count: int, seed: int) -> list[int]:
+     rng = random.Random(seed)
+     population = list(range(960))
+     if count <= len(population):
+         return rng.sample(population, count)
+     return [rng.choice(population) for _ in range(count)]
+
+
+ def _elo_from_score(score: float) -> float:
+     clipped = min(max(score, 0.01), 0.99)
+     return -400.0 * math.log10((1.0 / clipped) - 1.0)
+
+
+ def _new_board(chess960_index: int) -> chess.Board:
+     board = chess.Board.from_chess960_pos(chess960_index)
+     board.chess960 = True
+     return board
+
+
+ def _engine_limit(depth: int | None, nodes: int | None) -> chess.engine.Limit:
+     if depth is not None:
+         return chess.engine.Limit(depth=depth)
+     if nodes is not None:
+         return chess.engine.Limit(nodes=nodes)
+     raise ValueError("expected depth or nodes limit")
+
+
+ def _parse_option_value(raw_value: str) -> bool | int | float | str:
+     lowered = raw_value.lower()
+     if lowered in {"true", "false"}:
+         return lowered == "true"
+     try:
+         return int(raw_value)
+     except ValueError:
+         pass
+     try:
+         return float(raw_value)
+     except ValueError:
+         pass
+     return raw_value
+
+
+ def _parse_engine_options(pairs: list[str]) -> dict[str, bool | int | float | str]:
+     options: dict[str, bool | int | float | str] = {}
+     for pair in pairs:
+         if "=" not in pair:
+             raise ValueError(f"invalid --engine-option {pair!r}; expected NAME=VALUE")
+         name, raw_value = pair.split("=", 1)
+         option_name = name.strip()
+         if not option_name:
+             raise ValueError(f"invalid --engine-option {pair!r}; missing option name")
+         options[option_name] = _parse_option_value(raw_value.strip())
+     return options
+
+
+ def _play_game_vs_engine(
+     chess960_index: int,
+     candidate_eval: EvalFn,
+     engine: chess.engine.SimpleEngine,
+     *,
+     candidate_is_white: bool,
+     candidate_depth: int | None,
+     candidate_nodes: int | None,
+     engine_depth: int | None,
+     engine_nodes: int | None,
+     max_plies: int,
+ ) -> float:
+     board = _new_board(chess960_index)
+     candidate_limit = _engine_limit(candidate_depth, candidate_nodes)
+     opponent_limit = _engine_limit(engine_depth, engine_nodes)
+
+     for _ in range(max_plies):
+         if board.is_game_over(claim_draw=True):
+             break
+
+         candidate_turn = board.turn == chess.WHITE if candidate_is_white else board.turn == chess.BLACK
+         if candidate_turn:
+             if candidate_limit.depth is not None:
+                 move = select_move(board, depth=candidate_limit.depth, eval_fn=candidate_eval)
+             else:
+                 raise ValueError("candidate_nodes is not supported by the local engine path")
+         else:
+             result = engine.play(board, opponent_limit)
+             move = result.move
+             if move is None:
+                 raise RuntimeError("UCI engine returned no move")
+
+         board.push(move)
+
+     result = board.result(claim_draw=True)
+     if result == "1-0":
+         return 1.0 if candidate_is_white else 0.0
+     if result == "0-1":
+         return 0.0 if candidate_is_white else 1.0
+     return 0.5
+
+
+ def benchmark_eval_vs_uci(
+     candidate_path: Path,
+     engine_command: str,
+     *,
+     engine_options: dict[str, bool | int | float | str] | None = None,
+     positions: int = 32,
+     candidate_depth: int = 2,
+     candidate_nodes: int | None = None,
+     engine_depth: int = 1,
+     engine_nodes: int | None = None,
+     max_plies: int = 120,
+     seed: int = 42,
+ ) -> UciBenchmarkResult:
+     candidate_eval = _load_eval(candidate_path)
+     start_positions = _sample_positions(positions, seed)
+     configured_engine_options = dict(engine_options or {})
+
+     wins = 0
+     draws = 0
+     losses = 0
+     points = 0.0
+
+     with chess.engine.SimpleEngine.popen_uci(engine_command) as engine:
+         if configured_engine_options:
+             engine.configure(configured_engine_options)
+         for chess960_index in start_positions:
+             white_result = _play_game_vs_engine(
+                 chess960_index,
+                 candidate_eval,
+                 engine,
+                 candidate_is_white=True,
+                 candidate_depth=candidate_depth,
+                 candidate_nodes=candidate_nodes,
+                 engine_depth=engine_depth,
+                 engine_nodes=engine_nodes,
+                 max_plies=max_plies,
+             )
+             points += white_result
+             if white_result == 1.0:
+                 wins += 1
+             elif white_result == 0.5:
+                 draws += 1
+             else:
+                 losses += 1
+
+             black_result = _play_game_vs_engine(
+                 chess960_index,
+                 candidate_eval,
+                 engine,
+                 candidate_is_white=False,
+                 candidate_depth=candidate_depth,
+                 candidate_nodes=candidate_nodes,
+                 engine_depth=engine_depth,
+                 engine_nodes=engine_nodes,
+                 max_plies=max_plies,
+             )
+             points += black_result
+             if black_result == 1.0:
+                 wins += 1
+             elif black_result == 0.5:
+                 draws += 1
+             else:
+                 losses += 1
+
+     total_games = len(start_positions) * 2
+     score = points / total_games if total_games else 0.0
+     return UciBenchmarkResult(
+         candidate_path=candidate_path,
+         engine_command=engine_command,
+         engine_options=configured_engine_options,
+         positions=len(start_positions),
+         max_plies=max_plies,
+         seed=seed,
+         candidate_depth=candidate_depth,
+         candidate_nodes=candidate_nodes,
+         engine_depth=engine_depth,
+         engine_nodes=engine_nodes,
+         wins=wins,
+         draws=draws,
+         losses=losses,
+         points=points,
+         total_games=total_games,
+         score=score,
+         elo_delta_estimate=_elo_from_score(score),
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     root = Path(__file__).resolve().parents[1]
+     parser = argparse.ArgumentParser(description="Benchmark a local eval file against a UCI engine.")
+     parser.add_argument(
+         "--candidate-file",
+         default=str(root / "src/zero960/workspace_template/eval.py"),
+         help="Path to the candidate eval.py file.",
+     )
+     parser.add_argument(
+         "--engine-command",
+         default="stockfish",
+         help="UCI engine command, for example 'stockfish'.",
+     )
+     parser.add_argument(
+         "--engine-option",
+         action="append",
+         default=[],
+         help="Repeated engine option in NAME=VALUE form, for example UCI_LimitStrength=true.",
+     )
+     parser.add_argument("--positions", type=int, default=32)
+     parser.add_argument("--candidate-depth", type=int, default=2)
+     parser.add_argument("--candidate-nodes", type=int, default=None)
+     parser.add_argument("--engine-depth", type=int, default=1)
+     parser.add_argument("--engine-nodes", type=int, default=None)
+     parser.add_argument("--max-plies", type=int, default=120)
+     parser.add_argument("--seed", type=int, default=42)
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     candidate_path = Path(args.candidate_file).resolve()
+     engine_options = _parse_engine_options(args.engine_option)
+     result = benchmark_eval_vs_uci(
+         candidate_path,
+         args.engine_command,
+         engine_options=engine_options,
+         positions=args.positions,
+         candidate_depth=args.candidate_depth,
+         candidate_nodes=args.candidate_nodes,
+         engine_depth=args.engine_depth,
+         engine_nodes=args.engine_nodes,
+         max_plies=args.max_plies,
+         seed=args.seed,
+     )
+
+     print(f"candidate: {result.candidate_path}")
+     print(f"engine: {result.engine_command}")
+     if result.engine_options:
+         print(f"engine_options={result.engine_options}")
+     print(
+         f"positions={result.positions} max_plies={result.max_plies} games={result.total_games} seed={result.seed} "
+         f"candidate_depth={result.candidate_depth} engine_depth={result.engine_depth} "
+         f"candidate_nodes={result.candidate_nodes} engine_nodes={result.engine_nodes}"
+     )
+     print(
+         f"record={result.wins}-{result.draws}-{result.losses} "
+         f"points={result.points:.1f}/{result.total_games}"
+     )
+     print(f"score={result.score:.3f} elo_delta_estimate={result.elo_delta_estimate:.1f}")
+
+
+ if __name__ == "__main__":
+     main()
train/build_dashboard.py ADDED
@@ -0,0 +1,656 @@
+ """Build a self-contained HTML dashboard for swarm and benchmark results."""
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import tempfile
+ from dataclasses import asdict, dataclass
+ from datetime import datetime
+ from pathlib import Path
+
+ from train.benchmark_engine import benchmark_engine_roots
+ from train.benchmark_league import benchmark_league, default_league_opponents
+ from train.benchmark_uci import benchmark_eval_vs_uci
+
+
+ @dataclass(slots=True)
+ class DashboardData:
+     generated_at: str
+     current_champion: str
+     accepted_count: int
+     all_results: list[dict[str, object]]
+     accepted_results: list[dict[str, object]]
+     engine_progress: dict[str, object] | None
+     league: dict[str, object] | None
+     stockfish_anchors: list[dict[str, object]]
+
+     def to_json(self) -> dict[str, object]:
+         return asdict(self)
+
+
+ def _repo_root() -> Path:
+     return Path(__file__).resolve().parents[1]
+
+
+ def _load_jsonl(path: Path) -> list[dict[str, object]]:
+     if not path.exists():
+         return []
+     rows: list[dict[str, object]] = []
+     for line in path.read_text(encoding="utf-8").splitlines():
+         line = line.strip()
+         if not line:
+             continue
+         rows.append(json.loads(line))
+     return rows
+
+
+ def _short_summary(summary: str, limit: int = 180) -> str:
+     compact = " ".join(summary.split())
+     if len(compact) <= limit:
+         return compact
+     return compact[: limit - 3] + "..."
+
+
+ def _normalize_result(entry: dict[str, object]) -> dict[str, object]:
+     benchmark = entry.get("benchmark") or {}
+     round_dir = str(entry.get("round_dir", ""))
+     round_name = Path(round_dir).name if round_dir else "unknown"
+     return {
+         "worker_name": entry.get("worker_name"),
+         "accepted": bool(entry.get("accepted")),
+         "winner": bool(entry.get("winner")),
+         "round_name": round_name,
+         "score": benchmark.get("score"),
+         "elo_delta_estimate": benchmark.get("elo_delta_estimate"),
+         "wins": benchmark.get("wins"),
+         "draws": benchmark.get("draws"),
+         "losses": benchmark.get("losses"),
+         "points": benchmark.get("points"),
+         "total_games": benchmark.get("total_games"),
+         "candidate_file": entry.get("candidate_file"),
+         "summary": _short_summary(str(entry.get("summary", ""))),
+         "surface": entry.get("surface", "eval"),
+     }
+
+
+ def _copy_file(src: Path, dst: Path) -> None:
+     dst.parent.mkdir(parents=True, exist_ok=True)
+     dst.write_text(src.read_text(encoding="utf-8"), encoding="utf-8")
+
+
+ def _build_engine_progress(
+     root: Path,
+     champion_eval_path: Path,
+     *,
+     baseline_root: Path,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> dict[str, object] | None:
+     if not baseline_root.exists():
+         return None
+     baseline_eval = baseline_root / "src" / "zero960" / "workspace_template" / "eval.py"
+     baseline_search = baseline_root / "src" / "zero960" / "engine" / "search.py"
+     if not baseline_eval.exists() or not baseline_search.exists():
+         return None
+
+     with tempfile.TemporaryDirectory(prefix="0x960-dashboard-engine-") as temp_dir:
+         candidate_root = Path(temp_dir)
+         _copy_file(
+             champion_eval_path,
+             candidate_root / "src" / "zero960" / "workspace_template" / "eval.py",
+         )
+         _copy_file(
+             root / "src" / "zero960" / "engine" / "search.py",
+             candidate_root / "src" / "zero960" / "engine" / "search.py",
+         )
+         result = benchmark_engine_roots(
+             candidate_root,
+             baseline_root,
+             positions=positions,
+             depth=depth,
+             max_plies=max_plies,
+             seed=seed,
+         )
+         return {
+             "label": "Current engine vs search baseline",
+             "candidate_eval_path": str(champion_eval_path),
+             "candidate_search_path": str((root / "src" / "zero960" / "engine" / "search.py").resolve()),
+             "baseline_root": str(baseline_root),
+             "result": result.to_json(),
+         }
+
+
+ def _build_stockfish_anchors(
+     candidate_path: Path,
+     *,
+     positions: int,
+     candidate_depth: int,
+     engine_depth: int,
+     max_plies: int,
+     seed: int,
+     engine_command: str,
+     anchor_elos: list[int],
+ ) -> list[dict[str, object]]:
+     rows: list[dict[str, object]] = []
+     for elo in anchor_elos:
+         result = benchmark_eval_vs_uci(
+             candidate_path,
+             engine_command,
+             engine_options={"UCI_LimitStrength": True, "UCI_Elo": elo},
+             positions=positions,
+             candidate_depth=candidate_depth,
+             engine_depth=engine_depth,
+             max_plies=max_plies,
+             seed=seed,
+         )
+         rows.append(
+             {
+                 "label": f"Stockfish {elo}",
+                 "uci_elo": elo,
+                 "score": result.score,
+                 "elo_delta_estimate": result.elo_delta_estimate,
+                 "wins": result.wins,
+                 "draws": result.draws,
+                 "losses": result.losses,
+                 "points": result.points,
+                 "total_games": result.total_games,
+             }
+         )
+     return rows
+
+
+ def _build_dashboard_data(args: argparse.Namespace) -> DashboardData:
+     root = _repo_root()
+     ledger_path = root / "outputs" / "codex_swarm" / "ledger.jsonl"
+     champion_path = Path(args.candidate_file).resolve()
+     ledger_rows = _load_jsonl(ledger_path)
+     normalized_rows = [_normalize_result(row) for row in ledger_rows if row.get("benchmark") is not None]
+     accepted_rows = [row for row in normalized_rows if row["accepted"]]
+
+     league_payload: dict[str, object] | None = None
+     opponents = default_league_opponents(
+         candidate_path=champion_path,
+         include_baseline=True,
+         include_champion=True,
+         accepted_limit=args.league_accepted_limit,
+     )
+     if opponents:
+         league_result = benchmark_league(
+             champion_path,
+             opponents,
+             positions=args.league_positions,
+             depth=args.depth,
+             max_plies=args.max_plies,
+             seed=args.seed,
+         )
+         league_payload = league_result.to_json()
+
+     stockfish_rows: list[dict[str, object]] = []
+     engine_progress: dict[str, object] | None = None
+     if args.include_engine_progress:
+         engine_progress = _build_engine_progress(
+             root,
+             champion_path,
+             baseline_root=Path(args.engine_baseline_root).resolve(),
+             positions=args.engine_positions,
+             depth=args.depth,
+             max_plies=args.max_plies,
+             seed=args.seed,
+         )
+     if args.include_stockfish:
+         stockfish_rows = _build_stockfish_anchors(
+             champion_path,
+             positions=args.stockfish_positions,
+             candidate_depth=args.depth,
+             engine_depth=args.stockfish_depth,
+             max_plies=args.max_plies,
+             seed=args.seed,
+             engine_command=args.engine_command,
+             anchor_elos=args.stockfish_elo,
+         )
+
+     return DashboardData(
+         generated_at=datetime.now().isoformat(timespec="seconds"),
+         current_champion=str(champion_path),
+         accepted_count=len(accepted_rows),
+         all_results=normalized_rows,
+         accepted_results=accepted_rows,
+         engine_progress=engine_progress,
+         league=league_payload,
+         stockfish_anchors=stockfish_rows,
+     )
+
+
+ def _dashboard_html(payload: dict[str, object]) -> str:
+     data_json = json.dumps(payload)
+     template = """<!doctype html>
+ <html lang="en">
+ <head>
+ <meta charset="utf-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1">
+ <title>0x960 Dashboard</title>
+ <style>
+ :root {{
+   --bg: #0d1117;
+   --panel: #151b23;
+   --panel-2: #1d2733;
+   --text: #e6edf3;
+   --muted: #9fb0c0;
+   --green: #3fb950;
+   --red: #f85149;
+   --amber: #d29922;
+   --blue: #58a6ff;
+   --border: #2d3a49;
+ }}
+ * {{ box-sizing: border-box; }}
+ body {{
+   margin: 0;
+   font-family: ui-sans-serif, -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+   background:
+     radial-gradient(circle at top left, rgba(88,166,255,0.14), transparent 28%),
+     radial-gradient(circle at top right, rgba(63,185,80,0.12), transparent 22%),
+     linear-gradient(180deg, #0b1016 0%, var(--bg) 100%);
+   color: var(--text);
+ }}
+ .wrap {{
+   width: min(1200px, calc(100vw - 32px));
+   margin: 0 auto;
+   padding: 28px 0 48px;
+ }}
+ h1, h2, h3, p {{ margin: 0; }}
+ .hero {{
+   display: grid;
+   gap: 12px;
+   margin-bottom: 20px;
+ }}
+ .hero p {{ color: var(--muted); }}
+ .grid {{
+   display: grid;
+   grid-template-columns: repeat(12, 1fr);
+   gap: 16px;
+ }}
+ .card {{
+   background: linear-gradient(180deg, rgba(255,255,255,0.02), rgba(255,255,255,0.01));
+   border: 1px solid var(--border);
+   border-radius: 18px;
+   padding: 18px;
+   backdrop-filter: blur(8px);
+   box-shadow: 0 18px 50px rgba(0,0,0,0.22);
+ }}
+ .span-3 {{ grid-column: span 3; }}
+ .span-4 {{ grid-column: span 4; }}
+ .span-5 {{ grid-column: span 5; }}
+ .span-6 {{ grid-column: span 6; }}
+ .span-7 {{ grid-column: span 7; }}
+ .span-8 {{ grid-column: span 8; }}
+ .span-12 {{ grid-column: span 12; }}
+ .kpi-label {{ color: var(--muted); font-size: 13px; margin-bottom: 8px; }}
+ .kpi-value {{ font-size: 34px; font-weight: 700; letter-spacing: -0.03em; }}
+ .kpi-sub {{ color: var(--muted); margin-top: 6px; font-size: 13px; }}
+ .section-title {{ font-size: 18px; margin-bottom: 14px; }}
+ .chart {{ width: 100%; height: 280px; }}
+ .bars .row, .table-row {{
+   display: grid;
+   gap: 12px;
+   align-items: center;
+ }}
+ .bars .row {{
+   grid-template-columns: 190px 1fr 72px;
+   margin-bottom: 10px;
+ }}
+ .bar-track {{
+   height: 12px;
+   border-radius: 999px;
+   background: rgba(255,255,255,0.08);
+   overflow: hidden;
+ }}
+ .bar-fill {{
+   height: 100%;
+   border-radius: 999px;
+   background: linear-gradient(90deg, var(--blue), #8ed0ff);
+ }}
+ .good {{ color: var(--green); }}
+ .bad {{ color: var(--red); }}
+ .muted {{ color: var(--muted); }}
+ .table-head, .table-row {{
+   grid-template-columns: 126px 70px 82px 120px 1fr;
+   font-size: 13px;
+   padding: 10px 0;
+   border-bottom: 1px solid rgba(255,255,255,0.06);
+ }}
+ .table-head {{
+   color: var(--muted);
+   text-transform: uppercase;
+   letter-spacing: 0.06em;
+   font-size: 11px;
+ }}
+ .pill {{
+   display: inline-block;
+   border-radius: 999px;
+   padding: 4px 10px;
+   font-size: 11px;
+   font-weight: 700;
+   letter-spacing: 0.04em;
+   text-transform: uppercase;
+   background: rgba(255,255,255,0.08);
+ }}
+ .pill.win {{ background: rgba(63,185,80,0.16); color: var(--green); }}
+ .pill.loss {{ background: rgba(248,81,73,0.14); color: var(--red); }}
+ .pill.flat {{ background: rgba(210,153,34,0.14); color: var(--amber); }}
+ .league-list {{
+   display: grid;
+   gap: 12px;
+ }}
+ .league-item {{
+   display: grid;
+   grid-template-columns: 1fr auto auto;
+   gap: 12px;
+   padding: 12px 0;
+   border-bottom: 1px solid rgba(255,255,255,0.06);
+ }}
+ .footer {{
+   margin-top: 16px;
+   font-size: 12px;
+   color: var(--muted);
+ }}
+ @media (max-width: 900px) {{
+   .span-3, .span-4, .span-5, .span-6, .span-7, .span-8, .span-12 {{
+     grid-column: span 12;
+   }}
+   .bars .row {{ grid-template-columns: 1fr; }}
+   .table-head, .table-row {{ grid-template-columns: 1fr; gap: 6px; }}
+   .league-item {{ grid-template-columns: 1fr; }}
+ }}
+ </style>
+ </head>
+ <body>
+ <div class="wrap">
+   <div class="hero">
+     <h1>0x960 Engine Dashboard</h1>
+     <p>Swarm progress, internal Elo deltas, league self-play, and optional Stockfish anchors in one static page.</p>
+   </div>
+   <div class="grid" id="app"></div>
+   <div class="footer" id="footer"></div>
+ </div>
+ <script type="application/json" id="dashboard-data">__DASHBOARD_JSON__</script>
+ <script>
+ const data = JSON.parse(document.getElementById('dashboard-data').textContent);
+ const app = document.getElementById('app');
+ const footer = document.getElementById('footer');
+
+ const accepted = data.accepted_results || [];
+ const league = data.league;
+ const anchors = data.stockfish_anchors || [];
+ const engineProgress = data.engine_progress;
+ const all = data.all_results || [];
+
+ const latestAccepted = accepted.length ? accepted[accepted.length - 1] : null;
+ const bestAccepted = accepted.length
+   ? accepted.reduce((best, row) => (row.score > best.score ? row : best), accepted[0])
+   : null;
+ const bestRejected = all.filter((row) => !row.accepted && row.score !== null).reduce((best, row) => {{
+   if (!best || row.score > best.score) return row;
+   return best;
+ }}, null);
+
+ function card(cls, inner) {{
+   const el = document.createElement('section');
+   el.className = `card ${{cls}}`;
+   el.innerHTML = inner;
+   return el;
+ }}
+
+ function scoreClass(value) {{
+   if (value > 0.5) return 'good';
+   if (value < 0.5) return 'bad';
+   return 'muted';
+ }}
+
+ function eloClass(value) {{
+   if (value > 0) return 'good';
+   if (value < 0) return 'bad';
+   return 'muted';
+ }}
+
+ const kpis = [
+   {{
+     label: 'Accepted Champions',
+     value: String(data.accepted_count),
+     sub: latestAccepted ? `Latest: ${{latestAccepted.worker_name}}` : 'No accepted challenger yet'
+   }},
+   {{
+     label: 'Current Internal Score',
+     value: latestAccepted ? latestAccepted.score.toFixed(3) : 'n/a',
+     sub: latestAccepted ? 'vs previous champion' : 'Awaiting accepted run'
+   }},
+   {{
+     label: 'Current Internal Elo',
+     value: latestAccepted ? `${{latestAccepted.elo_delta_estimate.toFixed(1)}}` : 'n/a',
+     sub: latestAccepted ? 'delta vs prior champion' : 'Awaiting accepted run'
+   }},
+   {{
+     label: 'League Score',
+     value: league ? league.overall_score.toFixed(3) : 'n/a',
+     sub: league ? `${{league.total_points.toFixed(1)}}/${{league.total_games}} points` : 'League not available'
+   }}
+ ];
+
+ if (engineProgress) {{
+   kpis.push({{
+     label: 'Search Gain',
+     value: `${{engineProgress.result.elo_delta_estimate.toFixed(1)}}`,
+     sub: `${{engineProgress.result.points.toFixed(1)}}/${{engineProgress.result.total_games}} vs baseline`
+   }});
+ }}
+
+ for (const kpi of kpis) {{
+   app.appendChild(card('span-3', `
+     <div class="kpi-label">${{kpi.label}}</div>
+     <div class="kpi-value">${{kpi.value}}</div>
+     <div class="kpi-sub">${{kpi.sub}}</div>
+   `));
+ }}
+
+ function lineChart(rows) {{
+   if (!rows.length) {{
+     return '<p class="muted">No accepted results yet.</p>';
+   }}
+   const width = 640;
+   const height = 260;
+   const padding = 28;
+ const padding = 28;
464
+ const xs = rows.map((_, index) => padding + (index * (width - padding * 2) / Math.max(rows.length - 1, 1)));
465
+ const ys = rows.map((row) => {{
466
+ const score = row.score ?? 0.5;
467
+ return height - padding - ((score - 0.35) / 0.35) * (height - padding * 2);
468
+ }});
469
+ const points = xs.map((x, index) => `${{x}},${{ys[index]}}`).join(' ');
470
+ const circles = xs.map((x, index) =>
471
+ `<circle cx="${{x}}" cy="${{ys[index]}}" r="5" fill="#58a6ff"><title>${{rows[index].worker_name}}: ${{rows[index].score.toFixed(3)}}</title></circle>`
472
+ ).join('');
473
+ return `
474
+ <svg viewBox="0 0 ${{width}} ${{height}}" class="chart" role="img" aria-label="Accepted score progression">
475
+ <line x1="${{padding}}" y1="${{height - padding}}" x2="${{width - padding}}" y2="${{height - padding}}" stroke="rgba(255,255,255,0.18)" />
476
+ <line x1="${{padding}}" y1="${{padding}}" x2="${{padding}}" y2="${{height - padding}}" stroke="rgba(255,255,255,0.18)" />
477
+ <line x1="${{padding}}" y1="${{height - padding - ((0.5 - 0.35) / 0.35) * (height - padding * 2)}}" x2="${{width - padding}}" y2="${{height - padding - ((0.5 - 0.35) / 0.35) * (height - padding * 2)}}" stroke="rgba(210,153,34,0.35)" stroke-dasharray="4 4" />
478
+ <polyline fill="none" stroke="#58a6ff" stroke-width="3" points="${{points}}" />
479
+ ${{circles}}
480
+ </svg>
481
+ `;
482
+ }}
483
+
484
+ app.appendChild(card('span-7', `
485
+ <h2 class="section-title">Accepted Score Progression</h2>
486
+ ${{lineChart(accepted)}}
487
+ `));
488
+
489
+ const summaryRows = [
490
+ latestAccepted ? `<div class="league-item"><div><strong>Latest winner</strong><div class="muted">${{latestAccepted.worker_name}} in ${{latestAccepted.round_name}}</div></div><div class="${{eloClass(latestAccepted.elo_delta_estimate)}}">${{latestAccepted.elo_delta_estimate.toFixed(1)}} Elo</div><div class="${{scoreClass(latestAccepted.score)}}">${{latestAccepted.score.toFixed(3)}} score</div></div>` : '',
491
+ bestAccepted ? `<div class="league-item"><div><strong>Best accepted score</strong><div class="muted">${{bestAccepted.worker_name}}</div></div><div class="${{eloClass(bestAccepted.elo_delta_estimate)}}">${{bestAccepted.elo_delta_estimate.toFixed(1)}} Elo</div><div class="${{scoreClass(bestAccepted.score)}}">${{bestAccepted.score.toFixed(3)}} score</div></div>` : '',
492
+ bestRejected ? `<div class="league-item"><div><strong>Best rejected try</strong><div class="muted">${{bestRejected.worker_name}} in ${{bestRejected.round_name}}</div></div><div class="${{eloClass(bestRejected.elo_delta_estimate)}}">${{bestRejected.elo_delta_estimate.toFixed(1)}} Elo</div><div class="${{scoreClass(bestRejected.score)}}">${{bestRejected.score.toFixed(3)}} score</div></div>` : ''
493
+ ].join('');
494
+
495
+ app.appendChild(card('span-5', `
496
+ <h2 class="section-title">Swarm Snapshot</h2>
497
+ <div class="league-list">${{summaryRows || '<p class="muted">No benchmark rows yet.</p>'}}</div>
498
+ `));
499
+
500
+ if (engineProgress) {{
501
+ app.appendChild(card('span-12', `
502
+ <h2 class="section-title">Engine Search Progress</h2>
503
+ <div class="league-list">
504
+ <div class="league-item">
505
+ <div>
506
+ <strong>${{engineProgress.label}}</strong>
507
+ <div class="muted">${{engineProgress.result.wins}}-${{engineProgress.result.draws}}-${{engineProgress.result.losses}}</div>
508
+ </div>
509
+ <div class="${{scoreClass(engineProgress.result.score)}}">${{engineProgress.result.score.toFixed(3)}} score</div>
510
+ <div class="${{eloClass(engineProgress.result.elo_delta_estimate)}}">${{engineProgress.result.elo_delta_estimate.toFixed(1)}} Elo</div>
511
+ </div>
512
+ </div>
513
+ <div class="kpi-sub" style="margin-top: 10px;">
514
+ Candidate search: ${{engineProgress.candidate_search_path}}<br>
515
+ Baseline root: ${{engineProgress.baseline_root}}
516
+ </div>
517
+ `));
518
+ }}
519
+
520
+ function barRows(rows, key, formatter) {{
521
+ if (!rows.length) {{
522
+ return '<p class="muted">No data yet.</p>';
523
+ }}
524
+ const values = rows.map((row) => Math.abs(row[key] ?? 0));
525
+ const max = Math.max(...values, 1);
526
+ return rows.map((row) => {{
527
+ const value = row[key] ?? 0;
528
+ const width = Math.max(6, Math.round(Math.abs(value) / max * 100));
529
+ const cls = value > 0 ? 'good' : value < 0 ? 'bad' : 'muted';
530
+ const fill = value > 0 ? 'var(--green)' : value < 0 ? 'var(--red)' : 'var(--amber)';
531
+ return `
532
+ <div class="row">
533
+ <div>${{row.worker_name}}</div>
534
+ <div class="bar-track"><div class="bar-fill" style="width:${{width}}%; background:${{fill}}"></div></div>
535
+ <div class="${{cls}}">${{formatter(value)}}</div>
536
+ </div>
537
+ `;
538
+ }}).join('');
539
+ }}
540
+
541
+ app.appendChild(card('span-6', `
542
+ <h2 class="section-title">Accepted Internal Elo Deltas</h2>
543
+ <div class="bars">${{barRows(accepted, 'elo_delta_estimate', (value) => value.toFixed(1))}}</div>
544
+ `));
545
+
546
+ app.appendChild(card('span-6', `
547
+ <h2 class="section-title">League Self-Play</h2>
548
+ ${{
549
+ league
550
+ ? `<div class="league-list">${{league.opponents.map((opponent) => `
551
+ <div class="league-item">
552
+ <div>
553
+ <strong>${{opponent.label}}</strong>
554
+ <div class="muted">${{opponent.result.wins}}-${{opponent.result.draws}}-${{opponent.result.losses}}</div>
555
+ </div>
556
+ <div class="${{scoreClass(opponent.result.score)}}">${{opponent.result.score.toFixed(3)}} score</div>
557
+ <div class="${{eloClass(opponent.result.elo_delta_estimate)}}">${{opponent.result.elo_delta_estimate.toFixed(1)}} Elo</div>
558
+ </div>
559
+ `).join('')}}</div>
560
+ <div class="kpi-sub" style="margin-top: 10px;">Overall: ${{league.overall_score.toFixed(3)}} score, ${{league.overall_elo_delta_estimate.toFixed(1)}} Elo delta estimate</div>`
561
+ : '<p class="muted">League benchmark not available.</p>'
562
+ }}
563
+ `));
564
+
565
+ if (anchors.length) {{
566
+ app.appendChild(card('span-12', `
567
+ <h2 class="section-title">Stockfish Anchor Ladder</h2>
568
+ <div class="bars">${{anchors.map((row) => `
569
+ <div class="row">
570
+ <div>${{row.label}}</div>
571
+ <div class="bar-track"><div class="bar-fill" style="width:${{Math.max(6, Math.round(row.score * 100))}}%; background:${{row.score >= 0.5 ? 'var(--green)' : 'var(--blue)'}}"></div></div>
572
+ <div class="${{scoreClass(row.score)}}">${{row.score.toFixed(3)}}</div>
573
+ </div>
574
+ `).join('')}}</div>
575
+ `));
576
+ }}
577
+
578
+ const rows = all.slice().reverse().map((row) => {{
579
+ const pillClass = row.accepted ? 'win' : (row.score > 0.5 ? 'flat' : 'loss');
580
+ const pillText = row.accepted ? 'accepted' : 'rejected';
581
+ return `
582
+ <div class="table-row">
583
+ <div>${{row.round_name}}</div>
584
+ <div>${{row.worker_name}}</div>
585
+ <div><span class="pill ${{pillClass}}">${{pillText}}</span></div>
586
+ <div class="${{scoreClass(row.score)}}">${{row.score !== null ? row.score.toFixed(3) : 'n/a'}}</div>
587
+ <div class="muted">${{row.summary}}</div>
588
+ </div>
589
+ `;
590
+ }}).join('');
591
+
592
+ app.appendChild(card('span-12', `
593
+ <h2 class="section-title">Recent Swarm Results</h2>
594
+ <div class="table-head">
595
+ <div>Round</div>
596
+ <div>Worker</div>
597
+ <div>Status</div>
598
+ <div>Score</div>
599
+ <div>Summary</div>
600
+ </div>
601
+ ${{rows || '<p class="muted">No swarm results yet.</p>'}}
602
+ `));
603
+
604
+ footer.textContent = `Generated ${{data.generated_at}} | champion: ${{data.current_champion}}`;
605
+ </script>
606
+ </body>
607
+ </html>
608
+ """
609
+ template = template.replace("{{", "{").replace("}}", "}")
610
+ return template.replace("__DASHBOARD_JSON__", data_json)
611
+
612
+
613
+ def parse_args() -> argparse.Namespace:
614
+ root = _repo_root()
615
+ parser = argparse.ArgumentParser(description=__doc__)
616
+ parser.add_argument(
617
+ "--candidate-file",
618
+ default=str(root / "outputs" / "codex_swarm" / "champion_eval.py"),
619
+ help="Candidate file to treat as the current engine in the dashboard.",
620
+ )
621
+ parser.add_argument(
622
+ "--output-dir",
623
+ default=str(root / "outputs" / "dashboard"),
624
+ help="Directory where index.html and dashboard_data.json will be written.",
625
+ )
626
+ parser.add_argument("--depth", type=int, default=2)
627
+ parser.add_argument("--max-plies", type=int, default=120)
628
+ parser.add_argument("--seed", type=int, default=42)
629
+ parser.add_argument("--league-positions", type=int, default=8)
630
+ parser.add_argument("--league-accepted-limit", type=int, default=4)
631
+ parser.add_argument("--include-engine-progress", action="store_true")
632
+ parser.add_argument("--engine-baseline-root", default="/tmp/0x960-search-baseline")
633
+ parser.add_argument("--engine-positions", type=int, default=8)
634
+ parser.add_argument("--include-stockfish", action="store_true")
635
+ parser.add_argument("--engine-command", default="stockfish")
636
+ parser.add_argument("--stockfish-depth", type=int, default=1)
637
+ parser.add_argument("--stockfish-positions", type=int, default=4)
638
+ parser.add_argument("--stockfish-elo", type=int, action="append", default=[1320, 1600])
639
+ return parser.parse_args()
640
+
641
+
642
+ def main() -> None:
643
+ args = parse_args()
644
+ output_dir = Path(args.output_dir).resolve()
645
+ output_dir.mkdir(parents=True, exist_ok=True)
646
+
647
+ payload = _build_dashboard_data(args).to_json()
648
+ (output_dir / "dashboard_data.json").write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
649
+ (output_dir / "index.html").write_text(_dashboard_html(payload), encoding="utf-8")
650
+
651
+ print(f"wrote {(output_dir / 'index.html')}")
652
+ print(f"wrote {(output_dir / 'dashboard_data.json')}")
653
+
654
+
655
+ if __name__ == "__main__":
656
+ main()
train/codex_distill.py ADDED
@@ -0,0 +1,332 @@
+ """Collect teacher trajectories from Codex for 0x960 and emit SFT-ready samples.
+
+ This script keeps the teacher inside the same bounded action space as the student:
+ the model sees the current observation and returns exactly one JSON action per turn.
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import shutil
+ import subprocess
+ import tempfile
+ import time
+ from dataclasses import dataclass
+ from pathlib import Path
+
+ from zero960_env.client import Zero960Client
+ from zero960_env.models import Zero960Action, Zero960Observation
+
+ from train.minimal_trl_openenv import SYSTEM_PROMPT, format_observation_as_prompt
+
+ ACTION_SCHEMA = {
+ "type": "object",
+ "additionalProperties": False,
+ "properties": {
+ "action_type": {
+ "type": "string",
+ "enum": ["read_file", "write_file", "run_static_eval", "run_match", "finish"],
+ },
+ "path": {"type": ["string", "null"]},
+ "content": {"type": ["string", "null"]},
+ },
+ "required": ["action_type", "path", "content"],
+ }
+
+ TEACHER_INSTRUCTIONS = """You are the teacher policy for 0x960.
+
+ Return exactly one JSON action object that matches the provided schema.
+
+ Constraints:
+ - Act only through the bounded action schema. Do not describe actions.
+ - Do not use shell commands or external tools.
+ - The current eval.py contents are already included in the observation.
+ - Prefer the high-reward loop: write_file -> run_match -> finish.
+ - Avoid repeated run_static_eval unless it is truly necessary.
+ - Always include all three JSON keys: action_type, path, content.
+ - Use null for unused fields. Example: {"action_type":"run_match","path":null,"content":null}
+ - If you choose write_file, return a full valid replacement for eval.py.
+ """
+
+
+ @dataclass(slots=True)
+ class TeacherTurn:
+ action: Zero960Action
+ raw_response: str
+ elapsed_s: float
+
+
+ def _action_payload(action: Zero960Action) -> dict:
+ return {
+ "action_type": action.action_type,
+ "path": action.path,
+ "content": action.content,
+ }
+
+
+ def _find_codex_binary(explicit_path: str | None) -> str:
+ if explicit_path:
+ return explicit_path
+ codex_bin = shutil.which("codex")
+ if codex_bin is None:
+ raise RuntimeError("codex CLI not found on PATH; install or pass --codex-bin")
+ return codex_bin
+
+
+ def _teacher_prompt(observation: Zero960Observation) -> str:
+ return (
+ f"{TEACHER_INSTRUCTIONS}\n"
+ "Use the same environment contract as the student prompt below.\n\n"
+ f"System prompt:\n{SYSTEM_PROMPT}\n\n"
+ f"Observation:\n{format_observation_as_prompt(observation)}\n"
+ )
+
+
+ def _run_codex_turn(
+ codex_bin: str,
+ model: str,
+ workdir: Path,
+ observation: Zero960Observation,
+ timeout_s: int,
+ ) -> TeacherTurn:
+ prompt = _teacher_prompt(observation)
+
+ with tempfile.TemporaryDirectory(prefix="zero960_codex_") as temp_dir_str:
+ temp_dir = Path(temp_dir_str)
+ schema_path = temp_dir / "action.schema.json"
+ output_path = temp_dir / "action.json"
+ schema_path.write_text(json.dumps(ACTION_SCHEMA))
+
+ command = [
+ codex_bin,
+ "exec",
+ "--model",
+ model,
+ "--cd",
+ str(workdir),
+ "--ephemeral",
+ "--color",
+ "never",
+ "--output-schema",
+ str(schema_path),
+ "--output-last-message",
+ str(output_path),
+ "-",
+ ]
+
+ started = time.time()
+ result = subprocess.run(
+ command,
+ input=prompt,
+ text=True,
+ capture_output=True,
+ timeout=timeout_s,
+ check=False,
+ )
+ elapsed_s = round(time.time() - started, 2)
+
+ if result.returncode != 0:
+ stderr = result.stderr.strip()
+ if "refresh_token_reused" in stderr:
+ raise RuntimeError(
+ "codex auth is stale; run `codex logout` then `codex login` and retry"
+ )
+ if "usage limit" in stderr.lower():
+ raise RuntimeError("codex usage limit reached; stop the batch and retry later")
+ raise RuntimeError(f"codex exec failed with exit code {result.returncode}: {stderr}")
+ if not output_path.exists():
+ raise RuntimeError("codex exec did not write an output message")
+
+ raw_response = output_path.read_text().strip()
+ if not raw_response:
+ raise RuntimeError("codex exec returned an empty final message")
+ action = Zero960Action.model_validate_json(raw_response)
+ return TeacherTurn(action=action, raw_response=raw_response, elapsed_s=elapsed_s)
+
+
+ def _append_jsonl(path: Path, payload: dict) -> None:
+ with path.open("a") as handle:
+ handle.write(json.dumps(payload, default=str) + "\n")
+
+
+ def _sft_sample(observation: Zero960Observation, action: Zero960Action, metadata: dict) -> dict:
+ return {
+ "messages": [
+ {"role": "system", "content": SYSTEM_PROMPT},
+ {"role": "user", "content": format_observation_as_prompt(observation)},
+ {"role": "assistant", "content": json.dumps(_action_payload(action))},
+ ],
+ "metadata": metadata,
+ }
+
+
+ def collect_teacher_rollouts(
+ base_url: str,
+ model: str,
+ episodes: int,
+ max_turns: int,
+ timeout_s: int,
+ output_dir: Path,
+ min_reward: float,
+ codex_bin: str | None,
+ ) -> tuple[Path, Path]:
+ output_dir.mkdir(parents=True, exist_ok=True)
+ trace_path = output_dir / f"teacher_rollouts_{int(time.time())}.jsonl"
+ sft_path = output_dir / f"sft_samples_{int(time.time())}.jsonl"
+ trace_path.touch()
+ sft_path.touch()
+
+ codex_executable = _find_codex_binary(codex_bin)
+ workdir = Path(__file__).resolve().parents[1]
+
+ with Zero960Client(base_url=base_url) as client:
+ stop_reason: str | None = None
+ for episode_index in range(episodes):
+ reset_result = client.reset()
+ observation = reset_result.observation
+ episode_turns: list[dict] = []
+ forced_finish = False
+
+ for turn_index in range(max_turns):
+ if reset_result.done:
+ break
+
+ pre_action_observation = observation
+ try:
+ teacher_turn = _run_codex_turn(
+ codex_bin=codex_executable,
+ model=model,
+ workdir=workdir,
+ observation=pre_action_observation,
+ timeout_s=timeout_s,
+ )
+ except RuntimeError as exc:
+ if "usage limit reached" in str(exc):
+ stop_reason = str(exc)
+ break
+ raise
+
+ step_result = client.step(teacher_turn.action)
+ observation = step_result.observation
+
+ turn_payload = {
+ "episode_index": episode_index,
+ "turn_index": turn_index,
+ "teacher_model": model,
+ "elapsed_s": teacher_turn.elapsed_s,
+ "raw_response": teacher_turn.raw_response,
+ "action": _action_payload(teacher_turn.action),
+ "observation_before": pre_action_observation.model_dump(),
+ "observation_after": observation.model_dump(),
+ "reward": step_result.reward,
+ "done": step_result.done,
+ }
+ episode_turns.append(turn_payload)
+
+ if step_result.done:
+ reset_result = step_result
+ break
+ reset_result = step_result
+
+ if stop_reason is not None:
+ break
+
+ if not reset_result.done:
+ forced_finish = True
+ finish_result = client.step(Zero960Action(action_type="finish"))
+ observation = finish_result.observation
+ episode_turns.append(
+ {
+ "episode_index": episode_index,
+ "turn_index": len(episode_turns),
+ "teacher_model": model,
+ "elapsed_s": 0.0,
+ "raw_response": json.dumps({"action_type": "finish"}),
+ "action": {"action_type": "finish"},
+ "observation_before": reset_result.observation.model_dump(),
+ "observation_after": observation.model_dump(),
+ "reward": finish_result.reward,
+ "done": finish_result.done,
+ "forced_finish": True,
+ }
+ )
+ reset_result = finish_result
+
+ final_reward = float(reset_result.reward or 0.0)
+ accepted = (
+ final_reward >= min_reward
+ and observation.has_valid_edit
+ and observation.has_run_match
+ )
+
+ episode_payload = {
+ "episode_index": episode_index,
+ "teacher_model": model,
+ "forced_finish": forced_finish,
+ "accepted_for_sft": accepted,
+ "final_reward": final_reward,
+ "final_status": observation.status_message,
+ "turns": episode_turns,
+ }
+ _append_jsonl(trace_path, episode_payload)
+
+ if accepted:
+ for turn in episode_turns:
+ if turn.get("forced_finish"):
+ continue
+ sample = _sft_sample(
+ observation=Zero960Observation.model_validate(turn["observation_before"]),
+ action=Zero960Action.model_validate(turn["action"]),
+ metadata={
+ "episode_index": episode_index,
+ "turn_index": turn["turn_index"],
+ "teacher_model": model,
+ "final_reward": final_reward,
+ },
+ )
+ _append_jsonl(sft_path, sample)
+
+ print(
+ {
+ "episode": episode_index,
+ "final_reward": final_reward,
+ "accepted_for_sft": accepted,
+ "turns": len(episode_turns),
+ "final_status": observation.status_message,
+ }
+ )
+
+ if stop_reason is not None:
+ print({"stopped_early": True, "reason": stop_reason})
+
+ return trace_path, sft_path
+
+
+ def main() -> None:
+ parser = argparse.ArgumentParser(description="Collect Codex teacher rollouts for 0x960.")
+ parser.add_argument("--base-url", default="http://127.0.0.1:8000")
+ parser.add_argument("--model", default="gpt-5.4")
+ parser.add_argument("--episodes", type=int, default=20)
+ parser.add_argument("--max-turns", type=int, default=6)
+ parser.add_argument("--timeout-s", type=int, default=180)
+ parser.add_argument("--min-reward", type=float, default=0.4)
+ parser.add_argument("--codex-bin", default=None)
+ parser.add_argument("--output-dir", default="outputs/codex_distill")
+ args = parser.parse_args()
+
+ trace_path, sft_path = collect_teacher_rollouts(
+ base_url=args.base_url,
+ model=args.model,
+ episodes=args.episodes,
+ max_turns=args.max_turns,
+ timeout_s=args.timeout_s,
+ output_dir=Path(args.output_dir),
+ min_reward=args.min_reward,
+ codex_bin=args.codex_bin,
+ )
+ print({"trace_path": str(trace_path), "sft_path": str(sft_path)})
+
+
+ if __name__ == "__main__":
+ main()
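The `ACTION_SCHEMA` and `TEACHER_INSTRUCTIONS` above define a strict three-key action contract. As a minimal sketch (not part of the commit; the helper name `is_valid_action` is hypothetical, and the real pipeline validates with `Zero960Action.model_validate_json` instead), the contract can be checked by hand with only the stdlib:

```python
import json

# Mirrors ACTION_SCHEMA from codex_distill.py: every action object must carry
# exactly these three keys, with null (None) for the unused ones.
REQUIRED_KEYS = {"action_type", "path", "content"}
ALLOWED_TYPES = {"read_file", "write_file", "run_static_eval", "run_match", "finish"}


def is_valid_action(raw: str) -> bool:
    """Return True when `raw` is a JSON object matching the teacher contract."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict) or set(payload) != REQUIRED_KEYS:
        return False
    return payload["action_type"] in ALLOWED_TYPES


# The example action given verbatim in TEACHER_INSTRUCTIONS passes:
print(is_valid_action('{"action_type":"run_match","path":null,"content":null}'))  # True
# An object missing the path/content keys is rejected:
print(is_valid_action('{"action_type":"run_match"}'))  # False
```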
train/codex_swarm.py ADDED
@@ -0,0 +1,1114 @@
+ """Local Codex swarm coordinator for champion/challenger engine iteration."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import argparse
6
+ import difflib
7
+ import json
8
+ import shutil
9
+ import subprocess
10
+ import time
11
+ from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, TimeoutError, as_completed
12
+ from dataclasses import dataclass
13
+ from datetime import UTC, datetime
14
+ from pathlib import Path
15
+
16
+ from train.benchmark_engine import benchmark_engine_roots
17
+ from train.benchmark_eval import BenchmarkResult, benchmark_eval_files
18
+
19
+ DEFAULT_MODEL = "gpt-5.3-codex"
20
+ DEFAULT_WORKER_COUNT = 5
21
+ DEFAULT_SCREEN_POSITIONS = 8
22
+ DEFAULT_POSITIONS = 32
23
+ DEFAULT_DEPTH = 2
24
+ DEFAULT_MAX_PLIES = 120
25
+ DEFAULT_SEARCH_SCREEN_POSITIONS = 1
26
+ DEFAULT_SEARCH_SCREEN_DEPTH = 1
27
+ DEFAULT_SEARCH_SCREEN_MAX_PLIES = 20
28
+ DEFAULT_FINAL_BENCHMARK_TIMEOUT_SEC = 180
29
+ DEFAULT_SCREEN_MIN_SCORE = 0.52
30
+ DEFAULT_MIN_SCORE = 0.53
31
+ DEFAULT_MAX_DIFF_LINES = 80
32
+ DEFAULT_EDITABLE_FILES = ("src/zero960/workspace_template/eval.py",)
33
+ DEFAULT_WORKER_TIMEOUT_SEC = 600
34
+ DEFAULT_BENCHMARK_TIMEOUT_SEC = 180
35
+ DEFAULT_REFERENCE_PATHS = (
36
+ "AGENTS.md",
37
+ "README.md",
38
+ "docs/codex-swarm-plan.md",
39
+ "docs/process.md",
40
+ "train/benchmark_eval.py",
41
+ "train/benchmark_engine.py",
42
+ "train/benchmark_league.py",
43
+ "train/benchmark_uci.py",
44
+ "src/zero960/engine/default_eval.py",
45
+ "src/zero960/engine/search.py",
46
+ )
47
+ DEFAULT_SYNC_PATHS = (
48
+ *DEFAULT_REFERENCE_PATHS,
49
+ "src/zero960/workspace_template/eval.py",
50
+ )
51
+ DEFAULT_SURFACE = "eval"
52
+ DEFAULT_EVAL_EDITABLE_FILES = ("src/zero960/workspace_template/eval.py",)
53
+ DEFAULT_SEARCH_EDITABLE_FILES = ("src/zero960/engine/search.py",)
54
+ DEFAULT_WORKER_SPECIALIZATIONS = (
55
+ (
56
+ "Structure Researcher",
57
+ "Study Chess960-specific king safety, castling structure, and pawn cover weaknesses around both kings.",
58
+ "_structure_hook",
59
+ ),
60
+ (
61
+ "Pawn-Endgame Researcher",
62
+ "Study pawn structure, passed-pawn pressure, rook-file coordination, and simple endgame conversion bonuses.",
63
+ "_pawn_endgame_hook",
64
+ ),
65
+ (
66
+ "Initiative Tuner",
67
+ "Study tempo, mobility pressure, queen safety, and initiative terms that might convert shallow-search advantages faster.",
68
+ "_initiative_hook",
69
+ ),
70
+ (
71
+ "Activity Researcher",
72
+ "Study piece activity, development, space, and centralization terms that help when search depth is limited.",
73
+ "_activity_hook",
74
+ ),
75
+ (
76
+ "Tactical Safety Researcher",
77
+ "Study loose-piece pressure, attacked-undefended pieces, and tactical safety terms that matter at shallow search depth.",
78
+ "_tactical_hook",
79
+ ),
80
+ )
81
+ DEFAULT_SEARCH_SPECIALIZATIONS = (
82
+ (
83
+ "Move Ordering Researcher",
84
+ "Study move ordering, capture ordering, and cheap heuristics that push strong moves to the front early.",
85
+ "_move_order_score",
86
+ ),
87
+ (
88
+ "Quiescence Researcher",
89
+ "Study tactical horizon control and leaf evaluation extension without exploding the tree.",
90
+ "_quiescence",
91
+ ),
92
+ (
93
+ "Tree Search Researcher",
94
+ "Study alpha-beta search control, pruning safety, and transposition-table usage in the main negamax loop.",
95
+ "negamax",
96
+ ),
97
+ (
98
+ "Root Policy Researcher",
99
+ "Study root move selection, aspiration behavior, and tie-breaking that helps shallow search convert edges.",
100
+ "select_move",
101
+ ),
102
+ (
103
+ "Tactical Move Filter Researcher",
104
+ "Study which tactical moves should survive into the quiescence frontier without causing pointless explosion.",
105
+ "_tactical_moves",
106
+ ),
107
+ )
108
+
109
+
110
+ @dataclass(slots=True)
111
+ class SwarmPaths:
112
+ repo_root: Path
113
+ state_root: Path
114
+ worktree_root: Path
115
+ champion_eval: Path
116
+ champion_search: Path
117
+ ledger_path: Path
118
+
119
+
120
+ @dataclass(slots=True)
121
+ class WorkerResult:
122
+ worker_name: str
123
+ worktree_dir: Path
124
+ round_dir: Path
125
+ prompt_path: Path
126
+ final_message_path: Path
127
+ stdout_path: Path
128
+ stderr_path: Path
129
+ candidate_file: Path
130
+ changed_files: list[str]
131
+ diff_lines_added: int
132
+ diff_lines_deleted: int
133
+ screen_benchmark: BenchmarkResult | None
134
+ benchmark: BenchmarkResult | None
135
+ exit_code: int | None
136
+ accepted: bool
137
+ summary: str
138
+ sandbox_mode: str
139
+
140
+ def to_json(self) -> dict[str, object]:
141
+ return {
142
+ "worker_name": self.worker_name,
143
+ "worktree_dir": str(self.worktree_dir),
144
+ "round_dir": str(self.round_dir),
145
+ "prompt_path": str(self.prompt_path),
146
+ "final_message_path": str(self.final_message_path),
147
+ "stdout_path": str(self.stdout_path),
148
+ "stderr_path": str(self.stderr_path),
149
+ "candidate_file": str(self.candidate_file),
150
+ "changed_files": self.changed_files,
151
+ "diff_lines_added": self.diff_lines_added,
152
+ "diff_lines_deleted": self.diff_lines_deleted,
153
+ "screen_benchmark": None if self.screen_benchmark is None else self.screen_benchmark.to_json(),
154
+ "benchmark": None if self.benchmark is None else self.benchmark.to_json(),
155
+ "exit_code": self.exit_code,
156
+ "accepted": self.accepted,
157
+ "summary": self.summary,
158
+ "sandbox_mode": self.sandbox_mode,
159
+ }
160
+
161
+
162
+ def _repo_root() -> Path:
163
+ return Path(__file__).resolve().parents[1]
164
+
165
+
166
+ def _default_paths() -> SwarmPaths:
167
+ root = _repo_root()
168
+ state_root = root / "outputs" / "codex_swarm"
169
+ return SwarmPaths(
170
+ repo_root=root,
171
+ state_root=state_root,
172
+ worktree_root=Path("/tmp") / "0x960-codex-swarm",
173
+ champion_eval=state_root / "champion_eval.py",
174
+ champion_search=state_root / "champion_search.py",
175
+ ledger_path=state_root / "ledger.jsonl",
176
+ )
177
+
178
+
+ def _run(
+     args: list[str],
+     *,
+     cwd: Path,
+     capture_output: bool = True,
+     check: bool = True,
+     input_text: str | None = None,
+ ) -> subprocess.CompletedProcess[str]:
+     return subprocess.run(
+         args,
+         cwd=cwd,
+         input=input_text,
+         text=True,
+         capture_output=capture_output,
+         check=check,
+     )
+
+
+ def _git_output(repo_root: Path, args: list[str]) -> str:
+     result = _run(["git", *args], cwd=repo_root)
+     return result.stdout.strip()
+
+
+ def _ensure_state_dirs(paths: SwarmPaths) -> None:
+     paths.state_root.mkdir(parents=True, exist_ok=True)
+     (paths.state_root / "runs").mkdir(parents=True, exist_ok=True)
+     (paths.state_root / "accepted").mkdir(parents=True, exist_ok=True)
+
+
+ def _copy_file(src: Path, dst: Path) -> None:
+     dst.parent.mkdir(parents=True, exist_ok=True)
+     shutil.copy2(src, dst)
+
+
+ def _copy_tree(src: Path, dst: Path) -> None:
+     if dst.exists():
+         shutil.rmtree(dst, ignore_errors=True)
+     shutil.copytree(src, dst)
+
+
+ def _prepare_worker_dir(worker_dir: Path) -> Path:
+     if worker_dir.exists() and not any(worker_dir.iterdir()):
+         worker_dir.rmdir()
+     return worker_dir
+
+
+ def _infer_repo_mode(worker_dir: Path) -> str:
+     git_path = worker_dir / ".git"
+     if git_path.is_file():
+         return "worktree"
+     if git_path.is_dir():
+         return "clone"
+     return "unknown"
+
+
+ def _sync_worker_snapshot(paths: SwarmPaths, worker_dir: Path, sync_paths: tuple[str, ...]) -> None:
+     for rel_path in sync_paths:
+         src = paths.repo_root / rel_path
+         dst = worker_dir / rel_path
+         if src.is_file():
+             _copy_file(src, dst)
+     _copy_file(paths.champion_eval, worker_dir / "src" / "zero960" / "workspace_template" / "eval.py")
+     _copy_file(paths.champion_search, worker_dir / "src" / "zero960" / "engine" / "search.py")
+     accepted_src = paths.state_root / "accepted"
+     accepted_dst = worker_dir / "outputs" / "codex_swarm" / "accepted"
+     if accepted_src.exists():
+         _copy_tree(accepted_src, accepted_dst)
+     else:
+         accepted_dst.mkdir(parents=True, exist_ok=True)
+     _copy_file(paths.champion_eval, worker_dir / "outputs" / "codex_swarm" / "champion_eval.py")
+     _copy_file(paths.champion_search, worker_dir / "outputs" / "codex_swarm" / "champion_search.py")
+     ledger_copy = worker_dir / "outputs" / "codex_swarm" / "ledger.jsonl"
+     ledger_copy.parent.mkdir(parents=True, exist_ok=True)
+     if paths.ledger_path.exists():
+         shutil.copy2(paths.ledger_path, ledger_copy)
+     else:
+         ledger_copy.write_text("", encoding="utf-8")
+
+
+ def _setup_workers(paths: SwarmPaths, worker_count: int, sync_paths: tuple[str, ...]) -> list[tuple[Path, str]]:
+     worker_dirs: list[tuple[Path, str]] = []
+     paths.worktree_root.mkdir(parents=True, exist_ok=True)
+     for worker_index in range(1, worker_count + 1):
+         worker_dir = paths.worktree_root / f"worker-{worker_index}"
+         sandbox_mode = "existing"
+         if not (worker_dir / ".git").exists():
+             worker_dir = _prepare_worker_dir(worker_dir)
+             try:
+                 _run(
+                     ["git", "worktree", "add", "--detach", str(worker_dir), "HEAD"],
+                     cwd=paths.repo_root,
+                 )
+                 sandbox_mode = "worktree"
+             except subprocess.CalledProcessError:
+                 worker_dir = _prepare_worker_dir(worker_dir)
+                 _run(
+                     ["git", "clone", "--shared", str(paths.repo_root), str(worker_dir)],
+                     cwd=paths.repo_root,
+                 )
+                 sandbox_mode = "clone"
+         else:
+             sandbox_mode = _infer_repo_mode(worker_dir)
+         _sync_worker_snapshot(paths, worker_dir, sync_paths)
+         worker_dirs.append((worker_dir, sandbox_mode))
+     return worker_dirs
+
+
+ def _last_ledger_entries(paths: SwarmPaths, limit: int = 5) -> list[dict[str, object]]:
+     if not paths.ledger_path.exists():
+         return []
+     lines = [line for line in paths.ledger_path.read_text(encoding="utf-8").splitlines() if line.strip()]
+     entries = [json.loads(line) for line in lines[-limit:]]
+     return entries
+
+
+ def _extract_hook_body(champion_text: str, hook_name: str) -> str:
+     marker = f"def {hook_name}("
+     start = champion_text.find(marker)
+     if start == -1:
+         return ""
+     next_def = champion_text.find("\ndef ", start + len(marker))
+     if next_def == -1:
+         next_def = len(champion_text)
+     return champion_text[start:next_def]
+
+
+ def _hook_state_rank(champion_text: str, hook_name: str) -> int:
+     body = _extract_hook_body(champion_text, hook_name)
+     if not body:
+         return 99
+     terminal_lines = [line.strip() for line in body.splitlines() if line.strip()]
+     terminal_return = terminal_lines[-1] if terminal_lines else ""
+     if terminal_return == "return 0":
+         return 0
+     if terminal_return.startswith("return _base_"):
+         return 1
+     return 2
+
+
+ def _ordered_specializations(paths: SwarmPaths, surface: str) -> list[tuple[str, str, str]]:
+     if surface == "search":
+         return list(DEFAULT_SEARCH_SPECIALIZATIONS)
+     champion_text = paths.champion_eval.read_text(encoding="utf-8") if paths.champion_eval.exists() else ""
+     return sorted(
+         DEFAULT_WORKER_SPECIALIZATIONS,
+         key=lambda spec: (_hook_state_rank(champion_text, spec[2]), DEFAULT_WORKER_SPECIALIZATIONS.index(spec)),
+     )
+
+
+ def _build_worker_prompt(
+     *,
+     worker_name: str,
+     worker_role: str,
+     worker_lane: str,
+     target_hook: str,
+     target_file: str,
+     recent_entries: list[dict[str, object]],
+ ) -> str:
+     history_lines: list[str] = []
+     for entry in recent_entries:
+         if not entry.get("accepted"):
+             continue
+         benchmark = entry.get("benchmark") or {}
+         history_lines.append(
+             f"- {entry['worker_name']}: score={benchmark.get('score')} "
+             f"elo={benchmark.get('elo_delta_estimate')} summary={entry.get('summary')}"
+         )
+     recent_history = "\n".join(history_lines) if history_lines else "- no accepted candidates yet"
+     return f"""Improve the current Chess960 champion in `{target_file}`.
+
+ Lane:
+ - {worker_lane}
+
+ Target hook:
+ - edit only `{target_hook}` and keep the rest of the file unchanged
+ - if you need helper values, define them directly inside `{target_hook}` or make the smallest possible local constant change
+
+ Before editing, inspect:
+ - `{target_file}`
+ - `outputs/codex_swarm/champion_eval.py`
+ - `outputs/codex_swarm/champion_search.py`
+ - `outputs/codex_swarm/ledger.jsonl`
+ - `outputs/codex_swarm/accepted/`
+
+ Requirements:
+ - edit only `{target_file}`
+ - make one small surgical patch inside `{target_hook}`
+ - avoid duplicating prior accepted winners
+ - do not run held-out benchmarks; the coordinator does that
+ - finish quickly with a short summary of the patch and why it should help
+
+ Recent accepted candidates:
+ {recent_history}
+ """
+
+
+ def _build_worker_agents_override(
+     *,
+     worker_name: str,
+     worker_role: str,
+     worker_lane: str,
+     target_hook: str,
+     target_file: str,
+     max_diff_lines: int,
+ ) -> str:
+     return f"""# Codex swarm worker override
+
+ You are {worker_name}, the {worker_role}, in the 0x960 Codex swarm.
+
+ Primary lane:
+ - {worker_lane}
+
+ Hard requirements:
+ - Edit only `{target_file}`.
+ - Touch only the `{target_hook}` function body unless a tiny adjacent constant change is absolutely necessary.
+ - Use `apply_patch` or similarly surgical edits. Do not rewrite the whole file.
+ - Keep the final diff within about {max_diff_lines} changed lines total.
+ - If your current diff exceeds that budget, revert the excess and reduce the patch before finishing.
+ - Run at most one small local probe. Do not run held-out benchmarks yourself; the coordinator handles them.
+ - Do not browse the web or use internet-dependent tools for this task.
+ - Stop immediately after the patch and one tiny sanity check. Do not spend time on extra diffs, `rg`, or `sed` inspections once the patch is in place.
+ - Write a concise summary of the change and why it should help, then exit.
+ """
+
+
+ def _surface_config(surface: str) -> tuple[tuple[str, ...], str]:
+     if surface == "search":
+         return DEFAULT_SEARCH_EDITABLE_FILES, DEFAULT_SEARCH_EDITABLE_FILES[0]
+     return DEFAULT_EVAL_EDITABLE_FILES, DEFAULT_EVAL_EDITABLE_FILES[0]
+
+
+ def _screen_settings(args: argparse.Namespace) -> tuple[int, int, int]:
+     if args.surface == "search":
+         return args.search_screen_positions, args.search_screen_depth, args.search_screen_max_plies
+     return args.screen_positions, args.depth, args.max_plies
+
+
+ def _baseline_snapshot_root(paths: SwarmPaths, round_dir: Path) -> Path:
+     baseline_root = round_dir / "baseline_root"
+     _copy_file(
+         paths.champion_eval,
+         baseline_root / "src" / "zero960" / "workspace_template" / "eval.py",
+     )
+     _copy_file(
+         paths.champion_search,
+         baseline_root / "src" / "zero960" / "engine" / "search.py",
+     )
+     return baseline_root
+
+
+ def _snapshot_files(worker_dir: Path, rel_paths: tuple[str, ...]) -> dict[str, str]:
+     snapshot: dict[str, str] = {}
+     for rel_path in rel_paths:
+         file_path = worker_dir / rel_path
+         if file_path.exists():
+             snapshot[rel_path] = file_path.read_text(encoding="utf-8")
+     return snapshot
+
+
+ def _changed_snapshot_paths(before: dict[str, str], worker_dir: Path, rel_paths: tuple[str, ...]) -> list[str]:
+     changed: list[str] = []
+     for rel_path in rel_paths:
+         file_path = worker_dir / rel_path
+         after = file_path.read_text(encoding="utf-8") if file_path.exists() else ""
+         if before.get(rel_path, "") != after:
+             changed.append(rel_path)
+     return changed
+
+
+ def _snapshot_diff_line_counts(
+     before: dict[str, str],
+     worker_dir: Path,
+     rel_paths: tuple[str, ...],
+ ) -> tuple[int, int]:
+     added = 0
+     deleted = 0
+     for rel_path in rel_paths:
+         before_text = before.get(rel_path, "")
+         file_path = worker_dir / rel_path
+         after_text = file_path.read_text(encoding="utf-8") if file_path.exists() else ""
+         for line in difflib.unified_diff(
+             before_text.splitlines(),
+             after_text.splitlines(),
+             fromfile=rel_path,
+             tofile=rel_path,
+             lineterm="",
+         ):
+             if line.startswith(("---", "+++", "@@")):
+                 continue
+             if line.startswith("+"):
+                 added += 1
+             elif line.startswith("-"):
+                 deleted += 1
+     return added, deleted
+
+
+ def _write_json(path: Path, payload: dict[str, object]) -> None:
+     path.parent.mkdir(parents=True, exist_ok=True)
+     path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n", encoding="utf-8")
+
+
+ def _append_jsonl(path: Path, payload: dict[str, object]) -> None:
+     path.parent.mkdir(parents=True, exist_ok=True)
+     with path.open("a", encoding="utf-8") as handle:
+         handle.write(json.dumps(payload, sort_keys=True) + "\n")
+
+
+ def _decode_timeout_output(payload: str | bytes | None) -> str:
+     if payload is None:
+         return ""
+     if isinstance(payload, bytes):
+         return payload.decode("utf-8", errors="replace")
+     return payload
+
+
+ def _run_worker(
+     *,
+     paths: SwarmPaths,
+     worker_dir: Path,
+     round_dir: Path,
+     worker_name: str,
+     worker_role: str,
+     worker_lane: str,
+     target_hook: str,
+     target_file: str,
+     model: str,
+     editable_files: tuple[str, ...],
+     candidate_file_rel: str,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+     min_score: float,
+     max_diff_lines: int,
+     worker_timeout_sec: int,
+     dry_run: bool,
+     sandbox_mode: str,
+ ) -> WorkerResult:
+     worker_dir = worker_dir.resolve()
+     candidate_file = worker_dir / candidate_file_rel
+     prompt_path = round_dir / f"{worker_name}_prompt.txt"
+     final_message_path = round_dir / f"{worker_name}_final.txt"
+     stdout_path = round_dir / f"{worker_name}_stdout.log"
+     stderr_path = round_dir / f"{worker_name}_stderr.log"
+     recent_entries = _last_ledger_entries(paths)
+     prompt = _build_worker_prompt(
+         worker_name=worker_name,
+         worker_role=worker_role,
+         worker_lane=worker_lane,
+         target_hook=target_hook,
+         target_file=target_file,
+         recent_entries=recent_entries,
+     )
+     prompt_path.write_text(prompt, encoding="utf-8")
+     (worker_dir / "AGENTS.override.md").write_text(
+         _build_worker_agents_override(
+             worker_name=worker_name,
+             worker_role=worker_role,
+             worker_lane=worker_lane,
+             target_hook=target_hook,
+             target_file=target_file,
+             max_diff_lines=max_diff_lines,
+         ),
+         encoding="utf-8",
+     )
+     before_snapshot = _snapshot_files(worker_dir, editable_files)
+
+     if dry_run:
+         stdout_path.write_text("dry-run\n", encoding="utf-8")
+         stderr_path.write_text("", encoding="utf-8")
+         final_message_path.write_text("dry-run\n", encoding="utf-8")
+         changed_files = _changed_snapshot_paths(before_snapshot, worker_dir, editable_files)
+         return WorkerResult(
+             worker_name=worker_name,
+             worktree_dir=worker_dir,
+             round_dir=round_dir,
+             prompt_path=prompt_path,
+             final_message_path=final_message_path,
+             stdout_path=stdout_path,
+             stderr_path=stderr_path,
+             candidate_file=candidate_file,
+             changed_files=changed_files,
+             diff_lines_added=0,
+             diff_lines_deleted=0,
+             screen_benchmark=None,
+             benchmark=None,
+             exit_code=0,
+             accepted=False,
+             summary="dry-run",
+             sandbox_mode=sandbox_mode,
+         )
+
+     try:
+         completed = subprocess.run(
+             [
+                 "codex",
+                 "exec",
+                 "-m",
+                 model,
+                 "--full-auto",
+                 "--ephemeral",
+                 "--json",
+                 "-c",
+                 'web_search="disabled"',
+                 "--color",
+                 "never",
+                 "--output-last-message",
+                 str(final_message_path),
+                 "-",
+             ],
+             cwd=worker_dir,
+             input=prompt,
+             text=True,
+             capture_output=True,
+             check=False,
+             timeout=worker_timeout_sec,
+         )
+         stdout_path.write_text(completed.stdout, encoding="utf-8")
+         stderr_path.write_text(completed.stderr, encoding="utf-8")
+     except subprocess.TimeoutExpired as exc:
+         stdout_text = _decode_timeout_output(exc.stdout)
+         stderr_text = _decode_timeout_output(exc.stderr)
+         stdout_path.write_text(stdout_text, encoding="utf-8")
+         stderr_path.write_text(stderr_text + f"\nTimed out after {worker_timeout_sec} seconds.\n", encoding="utf-8")
+         if not final_message_path.exists():
+             final_message_path.write_text("", encoding="utf-8")
+         changed_files = _changed_snapshot_paths(before_snapshot, worker_dir, editable_files)
+         diff_lines_added, diff_lines_deleted = _snapshot_diff_line_counts(before_snapshot, worker_dir, editable_files)
+         return WorkerResult(
+             worker_name=worker_name,
+             worktree_dir=worker_dir,
+             round_dir=round_dir,
+             prompt_path=prompt_path,
+             final_message_path=final_message_path,
+             stdout_path=stdout_path,
+             stderr_path=stderr_path,
+             candidate_file=candidate_file,
+             changed_files=changed_files,
+             diff_lines_added=diff_lines_added,
+             diff_lines_deleted=diff_lines_deleted,
+             screen_benchmark=None,
+             benchmark=None,
+             exit_code=None,
+             accepted=False,
+             summary=f"timed out after {worker_timeout_sec}s",
+             sandbox_mode=sandbox_mode,
+         )
+     if not final_message_path.exists():
+         final_message_path.write_text("", encoding="utf-8")
+
+     changed_files = _changed_snapshot_paths(before_snapshot, worker_dir, editable_files)
+     diff_lines_added, diff_lines_deleted = _snapshot_diff_line_counts(before_snapshot, worker_dir, editable_files)
+     summary = final_message_path.read_text(encoding="utf-8").strip()
+     if completed.returncode != 0:
+         summary = summary or "codex exec failed"
+
+     return WorkerResult(
+         worker_name=worker_name,
+         worktree_dir=worker_dir,
+         round_dir=round_dir,
+         prompt_path=prompt_path,
+         final_message_path=final_message_path,
+         stdout_path=stdout_path,
+         stderr_path=stderr_path,
+         candidate_file=candidate_file,
+         changed_files=changed_files,
+         diff_lines_added=diff_lines_added,
+         diff_lines_deleted=diff_lines_deleted,
+         screen_benchmark=None,
+         benchmark=None,
+         exit_code=completed.returncode,
+         accepted=False,
+         summary=summary,
+         sandbox_mode=sandbox_mode,
+     )
+
+
+ def _eligible_for_screen(result: WorkerResult, max_diff_lines: int) -> bool:
+     if result.exit_code not in (0, None):
+         return False
+     if not result.candidate_file.exists():
+         return False
+     if not result.changed_files:
+         return False
+     return (result.diff_lines_added + result.diff_lines_deleted) <= max_diff_lines
+
+
+ def _candidate_compiles(candidate_file: Path) -> bool:
+     completed = subprocess.run(
+         ["python3", "-m", "py_compile", str(candidate_file)],
+         cwd=_repo_root(),
+         capture_output=True,
+         text=True,
+         check=False,
+     )
+     return completed.returncode == 0
+
+
+ def _benchmark_eval_task(
+     candidate_file: str,
+     baseline_file: str,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> BenchmarkResult:
+     return benchmark_eval_files(
+         Path(candidate_file).resolve(),
+         Path(baseline_file).resolve(),
+         positions=positions,
+         depth=depth,
+         max_plies=max_plies,
+         seed=seed,
+     )
+
+
+ def _benchmark_engine_task(
+     candidate_root: str,
+     baseline_root: str,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+ ) -> BenchmarkResult:
+     return benchmark_engine_roots(
+         Path(candidate_root).resolve(),
+         Path(baseline_root).resolve(),
+         positions=positions,
+         depth=depth,
+         max_plies=max_plies,
+         seed=seed,
+     )
+
+
+ def _run_benchmark_with_timeout(
+     *,
+     surface: str,
+     candidate_path: Path,
+     baseline_path: Path,
+     positions: int,
+     depth: int,
+     max_plies: int,
+     seed: int,
+     timeout_sec: int,
+ ) -> BenchmarkResult | None:
+     task = _benchmark_engine_task if surface == "search" else _benchmark_eval_task
+     with ProcessPoolExecutor(max_workers=1) as executor:
+         future = executor.submit(
+             task,
+             str(candidate_path),
+             str(baseline_path),
+             positions,
+             depth,
+             max_plies,
+             seed,
+         )
+         try:
+             return future.result(timeout=timeout_sec)
+         except TimeoutError:
+             future.cancel()
+             return None
+
+
+ def _best_screened(results: list[WorkerResult], screen_min_score: float, surface: str) -> WorkerResult | None:
+     comparator = (
+         (lambda score: score >= screen_min_score)
+         if surface == "search"
+         else (lambda score: score > screen_min_score)
+     )
+     screened = [
+         result
+         for result in results
+         if result.screen_benchmark is not None and comparator(result.screen_benchmark.score)
+     ]
+     if not screened:
+         return None
+     return max(screened, key=lambda result: result.screen_benchmark.score)
+
+
+ def _promote_winner(paths: SwarmPaths, winner: WorkerResult, promote_source: bool) -> None:
+     accepted_dir = paths.state_root / "accepted"
+     timestamp = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
+     if "src/zero960/workspace_template/eval.py" in winner.changed_files:
+         _copy_file(winner.worktree_dir / "src/zero960/workspace_template/eval.py", paths.champion_eval)
+         _copy_file(
+             winner.worktree_dir / "src/zero960/workspace_template/eval.py",
+             accepted_dir / f"{timestamp}_{winner.worker_name}_eval.py",
+         )
+     if "src/zero960/engine/search.py" in winner.changed_files:
+         _copy_file(winner.worktree_dir / "src/zero960/engine/search.py", paths.champion_search)
+         _copy_file(
+             winner.worktree_dir / "src/zero960/engine/search.py",
+             accepted_dir / f"{timestamp}_{winner.worker_name}_search.py",
+         )
+     if promote_source and "src/zero960/workspace_template/eval.py" in winner.changed_files:
+         _copy_file(winner.candidate_file, paths.repo_root / "src/zero960/workspace_template/eval.py")
+         _copy_file(winner.candidate_file, paths.repo_root / "src/zero960/engine/default_eval.py")
+     if promote_source and "src/zero960/engine/search.py" in winner.changed_files:
+         _copy_file(winner.worktree_dir / "src/zero960/engine/search.py", paths.repo_root / "src/zero960/engine/search.py")
+
+
+ def _state_summary(paths: SwarmPaths) -> str:
+     entries = [
+         entry
+         for entry in _last_ledger_entries(paths, limit=20)
+         if (entry.get("benchmark") or {}).get("score") is not None
+     ]
+     if not paths.champion_eval.exists():
+         return "no champion yet"
+     if not entries:
+         return f"champion={paths.champion_eval}"
+     last = entries[-1]
+     benchmark = last.get("benchmark") or {}
+     return (
+         f"champion={paths.champion_eval} "
+         f"last_worker={last.get('worker_name')} "
+         f"score={benchmark.get('score')} "
+         f"elo={benchmark.get('elo_delta_estimate')}"
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     paths = _default_paths()
+     parser = argparse.ArgumentParser(description=__doc__)
+     subparsers = parser.add_subparsers(dest="command", required=True)
+
+     setup = subparsers.add_parser("setup", help="Create local worker worktrees and initialize the champion.")
+     setup.add_argument("--workers", type=int, default=DEFAULT_WORKER_COUNT)
+     setup.add_argument("--worktree-root", default=str(paths.worktree_root))
+     setup.add_argument("--reset-champion", action="store_true")
+
+     run = subparsers.add_parser("run", help="Run one or more champion/challenger rounds.")
+     run.add_argument("--workers", type=int, default=DEFAULT_WORKER_COUNT)
+     run.add_argument("--rounds", type=int, default=1)
+     run.add_argument("--model", default=DEFAULT_MODEL)
+     run.add_argument("--surface", choices=("eval", "search"), default=DEFAULT_SURFACE)
+     run.add_argument("--worktree-root", default=str(paths.worktree_root))
+     run.add_argument("--screen-positions", type=int, default=DEFAULT_SCREEN_POSITIONS)
+     run.add_argument("--positions", type=int, default=DEFAULT_POSITIONS)
+     run.add_argument("--depth", type=int, default=DEFAULT_DEPTH)
+     run.add_argument("--max-plies", type=int, default=DEFAULT_MAX_PLIES)
+     run.add_argument("--search-screen-positions", type=int, default=DEFAULT_SEARCH_SCREEN_POSITIONS)
+     run.add_argument("--search-screen-depth", type=int, default=DEFAULT_SEARCH_SCREEN_DEPTH)
+     run.add_argument("--search-screen-max-plies", type=int, default=DEFAULT_SEARCH_SCREEN_MAX_PLIES)
+     run.add_argument("--seed", type=int, default=42)
+     run.add_argument("--screen-min-score", type=float, default=DEFAULT_SCREEN_MIN_SCORE)
+     run.add_argument("--min-score", type=float, default=DEFAULT_MIN_SCORE)
+     run.add_argument("--max-diff-lines", type=int, default=DEFAULT_MAX_DIFF_LINES)
+     run.add_argument("--worker-timeout-sec", type=int, default=DEFAULT_WORKER_TIMEOUT_SEC)
+     run.add_argument("--benchmark-timeout-sec", type=int, default=DEFAULT_BENCHMARK_TIMEOUT_SEC)
+     run.add_argument("--final-benchmark-timeout-sec", type=int, default=DEFAULT_FINAL_BENCHMARK_TIMEOUT_SEC)
+     run.add_argument("--dry-run", action="store_true")
+     run.add_argument("--serial", action="store_true", help="Run workers sequentially instead of in parallel.")
+     run.add_argument("--promote-source", action="store_true")
+     run.add_argument("--continuous", action="store_true", help="Keep running rounds until interrupted or stalled.")
+     run.add_argument(
+         "--max-stall-rounds",
+         type=int,
+         default=3,
+         help="Stop continuous mode after this many consecutive non-promotion rounds. Use 0 to disable.",
+     )
+     run.add_argument("--sleep-sec", type=float, default=0.0, help="Sleep between continuous rounds.")
+
+     status = subparsers.add_parser("status", help="Print the current champion summary and recent results.")
+
+     promote = subparsers.add_parser("promote", help="Copy the current swarm champion into the source tree.")
+     promote.add_argument("--source-only", action="store_true", help="Skip copying to default_eval.py.")
+
+     return parser.parse_args()
+
+
+ def _resolve_paths(args: argparse.Namespace) -> SwarmPaths:
+     paths = _default_paths()
+     if hasattr(args, "worktree_root"):
+         paths.worktree_root = Path(args.worktree_root).resolve()
+     return paths
+
+
+ def _setup_command(args: argparse.Namespace) -> int:
+     paths = _resolve_paths(args)
+     _ensure_state_dirs(paths)
+     if args.reset_champion or not paths.champion_eval.exists():
+         _copy_file(paths.repo_root / "src/zero960/workspace_template/eval.py", paths.champion_eval)
+     if args.reset_champion or not paths.champion_search.exists():
+         _copy_file(paths.repo_root / "src/zero960/engine/search.py", paths.champion_search)
+     worker_dirs = _setup_workers(paths, args.workers, DEFAULT_SYNC_PATHS)
+     print(f"initialized champion: {paths.champion_eval}")
+     for worker_dir, sandbox_mode in worker_dirs:
+         print(f"worker: {worker_dir} mode={sandbox_mode}")
+     return 0
+
+
+ def _run_command(args: argparse.Namespace) -> int:
+     paths = _resolve_paths(args)
+     _ensure_state_dirs(paths)
+     if not paths.champion_eval.exists():
+         _copy_file(paths.repo_root / "src/zero960/workspace_template/eval.py", paths.champion_eval)
+     if not paths.champion_search.exists():
+         _copy_file(paths.repo_root / "src/zero960/engine/search.py", paths.champion_search)
+     worker_dirs = _setup_workers(paths, args.workers, DEFAULT_SYNC_PATHS)
+     editable_files, candidate_file_rel = _surface_config(args.surface)
+
+     round_index = 0
+     stall_rounds = 0
+     target_rounds = None if args.continuous else args.rounds
+
+     while target_rounds is None or round_index < target_rounds:
+         round_index += 1
+         round_seed = args.seed + round_index - 1
+         round_timestamp = datetime.now(UTC).strftime("%Y%m%dT%H%M%SZ")
+         round_dir = paths.state_root / "runs" / f"round_{round_timestamp}_{round_index}"
+         round_dir.mkdir(parents=True, exist_ok=True)
+         baseline_root = _baseline_snapshot_root(paths, round_dir)
+         round_specializations = _ordered_specializations(paths, args.surface)
+         screen_positions, screen_depth, screen_max_plies = _screen_settings(args)
+         print(f"round {round_index}: champion frozen at {paths.champion_eval}")
+         print(f"round {round_index}: surface={args.surface}")
+         print(
+             "round hooks: "
+             + ", ".join(spec[2] for spec in round_specializations[: len(worker_dirs)])
+         )
+
+         jobs = []
+         if args.serial or args.dry_run:
+             for worker_index, (worker_dir, sandbox_mode) in enumerate(worker_dirs, start=1):
+                 _sync_worker_snapshot(paths, worker_dir, DEFAULT_SYNC_PATHS)
+                 worker_role, worker_lane, target_hook = round_specializations[(worker_index - 1) % len(round_specializations)]
+                 result = _run_worker(
+                     paths=paths,
+                     worker_dir=worker_dir,
+                     round_dir=round_dir,
+                     worker_name=f"worker-{worker_index}",
+                     worker_role=worker_role,
+                     worker_lane=worker_lane,
+                     target_hook=target_hook,
+                     target_file=candidate_file_rel,
+                     model=args.model,
+                     editable_files=editable_files,
+                     candidate_file_rel=candidate_file_rel,
+                     positions=args.positions,
+                     depth=args.depth,
+                     max_plies=args.max_plies,
+                     seed=round_seed,
+                     min_score=args.min_score,
+                     max_diff_lines=args.max_diff_lines,
+                     worker_timeout_sec=args.worker_timeout_sec,
+                     dry_run=args.dry_run,
+                     sandbox_mode=sandbox_mode,
+                 )
+                 jobs.append(result)
+         else:
+             with ThreadPoolExecutor(max_workers=len(worker_dirs)) as executor:
+                 futures = []
+                 for worker_index, (worker_dir, sandbox_mode) in enumerate(worker_dirs, start=1):
+                     _sync_worker_snapshot(paths, worker_dir, DEFAULT_SYNC_PATHS)
+                     worker_role, worker_lane, target_hook = round_specializations[(worker_index - 1) % len(round_specializations)]
+                     futures.append(
+                         executor.submit(
+                             _run_worker,
+                             paths=paths,
+                             worker_dir=worker_dir,
+                             round_dir=round_dir,
+                             worker_name=f"worker-{worker_index}",
+                             worker_role=worker_role,
+                             worker_lane=worker_lane,
+                             target_hook=target_hook,
+                             target_file=candidate_file_rel,
+                             model=args.model,
+                             editable_files=editable_files,
+                             candidate_file_rel=candidate_file_rel,
+                             positions=args.positions,
+                             depth=args.depth,
+                             max_plies=args.max_plies,
+                             seed=round_seed,
+                             min_score=args.min_score,
+                             max_diff_lines=args.max_diff_lines,
+                             worker_timeout_sec=args.worker_timeout_sec,
+                             dry_run=args.dry_run,
+                             sandbox_mode=sandbox_mode,
+                         )
+                     )
+                 for future in as_completed(futures):
+                     jobs.append(future.result())
+
+         jobs.sort(key=lambda result: result.worker_name)
+         for result in jobs:
+             diff_total = result.diff_lines_added + result.diff_lines_deleted
+             if result.exit_code not in (0, None):
+                 continue
+             if not result.candidate_file.exists():
+                 continue
+             if not result.changed_files:
+                 rejection = "rejected before benchmark: no file changes"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+                 continue
+             if diff_total > args.max_diff_lines:
+                 overflow = diff_total - args.max_diff_lines
+                 rejection = f"rejected before benchmark: diff budget exceeded by {overflow} lines"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+                 continue
+             if not _candidate_compiles(result.candidate_file):
+                 rejection = "rejected before benchmark: candidate failed py_compile"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+                 continue
+             screen_candidate = result.worktree_dir.resolve() if args.surface == "search" else result.candidate_file.resolve()
+             screen_baseline = (
+                 baseline_root.resolve()
+                 if args.surface == "search"
+                 else (baseline_root / "src" / "zero960" / "workspace_template" / "eval.py").resolve()
+             )
+             result.screen_benchmark = _run_benchmark_with_timeout(
+                 surface=args.surface,
+                 candidate_path=screen_candidate,
+                 baseline_path=screen_baseline,
+                 positions=screen_positions,
+                 depth=screen_depth,
+                 max_plies=screen_max_plies,
+                 seed=round_seed,
+                 timeout_sec=args.benchmark_timeout_sec,
+             )
+             if result.screen_benchmark is None:
+                 rejection = f"rejected during screen benchmark: timed out after {args.benchmark_timeout_sec}s"
+                 result.summary = f"{result.summary}\n{rejection}".strip() if result.summary else rejection
+
+         winner = _best_screened(jobs, args.screen_min_score, args.surface)
+         if winner is not None:
+             final_candidate = winner.worktree_dir.resolve() if args.surface == "search" else winner.candidate_file.resolve()
+             final_baseline = (
+                 baseline_root.resolve()
+                 if args.surface == "search"
+                 else (baseline_root / "src" / "zero960" / "workspace_template" / "eval.py").resolve()
+             )
+             winner.benchmark = _run_benchmark_with_timeout(
+                 surface=args.surface,
+                 candidate_path=final_candidate,
+                 baseline_path=final_baseline,
+                 positions=args.positions,
+                 depth=args.depth,
+                 max_plies=args.max_plies,
+                 seed=round_seed,
+                 timeout_sec=args.final_benchmark_timeout_sec,
+             )
+             winner.accepted = winner.benchmark is not None and winner.benchmark.score > args.min_score
+             if winner.benchmark is None:
+                 rejection = f"screen winner timed out in final benchmark after {args.final_benchmark_timeout_sec}s"
+                 winner.summary = f"{winner.summary}\n{rejection}".strip() if winner.summary else rejection
+                 winner = None
+             elif not winner.accepted:
+                 rejection = (
+                     f"screen winner failed final benchmark: "
+                     f"{winner.benchmark.score:.3f} <= {args.min_score:.3f}"
+                 )
+                 winner.summary = f"{winner.summary}\n{rejection}".strip() if winner.summary else rejection
+                 winner = None
+
+         for result in jobs:
+             payload = result.to_json()
+             payload["round_index"] = round_index
+             payload["winner"] = bool(winner and winner.worker_name == result.worker_name)
+             payload["surface"] = args.surface
+             _write_json(round_dir / f"{result.worker_name}_result.json", payload)
+             if not args.dry_run:
+                 _append_jsonl(paths.ledger_path, payload)
+             screen_text = "n/a" if result.screen_benchmark is None else f"{result.screen_benchmark.score:.3f}"
+             final_text = "n/a" if result.benchmark is None else f"{result.benchmark.score:.3f}"
+             print(
+                 f"{result.worker_name}: exit={result.exit_code} "
+                 f"screen={screen_text} final={final_text} accepted={result.accepted} changed={len(result.changed_files)} "
+                 f"diff=+{result.diff_lines_added}/-{result.diff_lines_deleted} mode={result.sandbox_mode}"
+             )
+
+         if winner is None:
+             print(f"round {round_index}: no challenger beat the champion")
+             stall_rounds += 1
+             if args.continuous and args.max_stall_rounds and stall_rounds >= args.max_stall_rounds:
+                 print(f"stopping after {stall_rounds} consecutive non-promotion rounds")
+                 break
+             if args.continuous and args.sleep_sec > 0:
+                 time.sleep(args.sleep_sec)
+             continue
+
+         _promote_winner(paths, winner, args.promote_source)
+         stall_rounds = 0
+         print(
+             f"round {round_index}: promoted {winner.worker_name} "
1064
+ f"score={winner.benchmark.score:.3f} elo={winner.benchmark.elo_delta_estimate:.1f}"
1065
+ )
1066
+ if args.continuous and args.sleep_sec > 0:
1067
+ time.sleep(args.sleep_sec)
1068
+
1069
+ print(_state_summary(paths))
1070
+ return 0
1071
+
1072
+
1073
+ def _status_command() -> int:
1074
+ paths = _default_paths()
1075
+ print(_state_summary(paths))
1076
+ for entry in _last_ledger_entries(paths):
1077
+ benchmark = entry.get("benchmark") or {}
1078
+ if benchmark.get("score") is None:
1079
+ continue
1080
+ print(
1081
+ f"{entry.get('worker_name')}: accepted={entry.get('accepted')} "
1082
+ f"score={benchmark.get('score')} elo={benchmark.get('elo_delta_estimate')}"
1083
+ )
1084
+ return 0
1085
+
1086
+
1087
+ def _promote_command(args: argparse.Namespace) -> int:
1088
+ paths = _default_paths()
1089
+ if not paths.champion_eval.exists():
1090
+ raise SystemExit("no champion available; run setup or run first")
1091
+ _copy_file(paths.champion_eval, paths.repo_root / "src/zero960/workspace_template/eval.py")
1092
+ if paths.champion_search.exists():
1093
+ _copy_file(paths.champion_search, paths.repo_root / "src/zero960/engine/search.py")
1094
+ if not args.source_only:
1095
+ _copy_file(paths.champion_eval, paths.repo_root / "src/zero960/engine/default_eval.py")
1096
+ print(f"promoted champion from {paths.champion_eval}")
1097
+ return 0
1098
+
1099
+
1100
+ def main() -> None:
1101
+ args = parse_args()
1102
+ if args.command == "setup":
1103
+ raise SystemExit(_setup_command(args))
1104
+ if args.command == "run":
1105
+ raise SystemExit(_run_command(args))
1106
+ if args.command == "status":
1107
+ raise SystemExit(_status_command())
1108
+ if args.command == "promote":
1109
+ raise SystemExit(_promote_command(args))
1110
+ raise SystemExit(f"unknown command: {args.command}")
1111
+
1112
+
1113
+ if __name__ == "__main__":
1114
+ main()
train/minimal_trl_openenv.py CHANGED
@@ -9,6 +9,7 @@ Modes:
 from __future__ import annotations
 
 import argparse
 import json
 import os
 import re
@@ -19,17 +20,78 @@ from pathlib import Path
 from zero960_env.client import Zero960Client
 from zero960_env.models import Zero960Action
 
 SYSTEM_PROMPT = (
     "You are a Chess960 evaluation engineer. You can take ONE action per turn.\n"
-    "Actions (respond with valid JSON only, no other text):\n"
     '  {"action_type":"read_file","path":"eval.py"}\n'
-    '  {"action_type":"write_file","path":"eval.py","content":"<new code>"}\n'
     '  {"action_type":"run_static_eval"}\n'
     '  {"action_type":"run_match"}\n'
     '  {"action_type":"finish"}\n'
     "\n"
-    "Goal: improve eval.py so the Chess960 engine beats the baseline.\n"
-    "Strategy: read eval.py edit it run_match to test finish when satisfied."
 )
@@ -44,9 +106,14 @@ def format_observation_as_prompt(obs, system_prompt: str = SYSTEM_PROMPT) -> str
         f"Position index: {obs.start_position}\n"
         f"Steps remaining: {obs.remaining_steps}\n"
         f"Last match score: {obs.last_match_score}\n"
         f"History: {obs.history}\n\n"
         f"Current eval.py:\n```python\n{eval_code}\n```\n\n"
-        "Choose your next action (JSON only)."
     )
     return user_msg
@@ -59,24 +126,380 @@ def format_messages(obs) -> list[dict[str, str]]:
     ]
 
 
 def parse_llm_output(text: str) -> Zero960Action:
     """Best-effort parse of LLM output into a Zero960Action."""
-    # Try to find JSON with nested braces (for write_file with content)
-    json_match = re.search(r'\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\}', text, re.DOTALL)
-    if json_match:
-        try:
-            data = json.loads(json_match.group())
-            return Zero960Action(**data)
-        except (json.JSONDecodeError, ValueError):
-            pass
-    # Simpler JSON match
-    json_match = re.search(r'\{[^}]+\}', text, re.DOTALL)
-    if json_match:
         try:
-            data = json.loads(json_match.group())
             return Zero960Action(**data)
         except (json.JSONDecodeError, ValueError):
-            pass
     return Zero960Action(action_type="finish")
@@ -92,18 +515,22 @@ class RolloutSummary:
 
 
 def run_handcrafted_rollout(base_url: str) -> RolloutSummary:
-    """Quick demo: connect, read eval, run match, finish."""
     with Zero960Client(base_url=base_url) as client:
         result = client.reset()
         obs = result.observation
 
-        result = client.step(Zero960Action(action_type="read_file", path="eval.py"))
         obs = result.observation
 
-        if obs.remaining_steps > 1:
-            result = client.step(Zero960Action(action_type="run_static_eval"))
-            obs = result.observation
-
         if obs.remaining_steps > 1:
             result = client.step(Zero960Action(action_type="run_match"))
             obs = result.observation
@@ -125,13 +552,15 @@ def run_handcrafted_rollout(base_url: str) -> RolloutSummary:
 def run_inference_test(
     base_url: str,
     model_name: str = "Qwen/Qwen3.5-9B",
     max_episode_steps: int = 6,
 ) -> RolloutSummary:
     """Run a single episode with Qwen generating actions against the live env."""
     from transformers import AutoModelForCausalLM, AutoTokenizer
 
     print(f"Loading {model_name}...")
-    tokenizer = AutoTokenizer.from_pretrained(model_name)
     model = AutoModelForCausalLM.from_pretrained(
         model_name, torch_dtype="auto", device_map="auto",
     )
@@ -144,32 +573,28 @@ def run_inference_test(
         if result.done:
             break
 
-        msgs = format_messages(obs)
-        prompt = tokenizer.apply_chat_template(
-            msgs, tokenize=False, add_generation_prompt=True,
-        )
-        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
-        outputs = model.generate(
-            **inputs, max_new_tokens=1024,
-            temperature=0.7, top_p=0.9, do_sample=True,
         )
-        generated = tokenizer.decode(
-            outputs[0][inputs["input_ids"].shape[1]:],
-            skip_special_tokens=True,
         )
-
-        print(f"\n--- Step {step_i + 1} ---")
-        print(f"LLM output: {generated[:300]}...")
-
-        action = parse_llm_output(generated)
         print(f"Parsed action: {action.action_type}", end="")
         if action.path:
             print(f" path={action.path}", end="")
         print()
 
         result = client.step(action)
         obs = result.observation
         print(f"Status: {obs.status_message}")
 
         if not result.done:
             result = client.step(Zero960Action(action_type="finish"))
@@ -217,10 +642,8 @@ def run_grpo_training(
     prompts = []
     for _ in range(n):
         result = env.reset()
-        msgs = format_messages(result.observation)
-        prompt_text = tokenizer.apply_chat_template(
-            msgs, tokenize=False, add_generation_prompt=True,
-        )
         prompts.append(prompt_text)
     return Dataset.from_dict({"prompt": prompts})
@@ -250,36 +673,43 @@ def run_grpo_training(
             "gen": i,
             "completion_preview": completion[:500],
             "completion_len": len(completion),
         }
         try:
             result = env.reset()
             obs = result.observation
 
-            # First action from the model's completion
-            action = parse_llm_output(completion)
             entry["parsed_action"] = action.action_type
             entry["parsed_path"] = action.path
             if action.action_type == "write_file" and action.content:
                 entry["code_preview"] = action.content[:300]
                 entry["code_len"] = len(action.content)
 
             result = env.step(action)
             obs = result.observation
             entry["env_status_1"] = obs.status_message
 
             # If the model wrote code, run a match to get a real score
             if not result.done and action.action_type == "write_file":
                 result = env.step(Zero960Action(action_type="run_match"))
                 obs = result.observation
                 entry["match_score"] = obs.last_match_score
 
             # Finish to get terminal reward
             if not result.done:
                 result = env.step(Zero960Action(action_type="finish"))
 
-            reward = float(result.reward or 0.0)
             rewards.append(reward)
             entry["reward"] = reward
         except Exception as exc:
             rewards.append(0.0)
             entry["reward"] = 0.0
@@ -338,10 +768,9 @@ def run_grpo_training(
         learning_rate=5e-6,
         logging_steps=1,
         num_generations=num_generations,
-        max_completion_length=512,
         bf16=True,
-        gradient_checkpointing=True,
-        gradient_checkpointing_kwargs={"use_reentrant": False},
         report_to="none",
     )
@@ -395,6 +824,11 @@ def main() -> None:
     )
     parser.add_argument("--base-url", default="http://127.0.0.1:8000")
    parser.add_argument("--model", default="Qwen/Qwen3.5-9B")
    parser.add_argument("--steps", type=int, default=20)
    parser.add_argument("--num-generations", type=int, default=4)
    parser.add_argument("--max-turns", type=int, default=6)
@@ -411,6 +845,7 @@ def main() -> None:
        summary = run_inference_test(
            base_url=args.base_url,
            model_name=args.model,
        )
        print({
            "reward": summary.reward,
 
 from __future__ import annotations
 
 import argparse
+import ast
 import json
 import os
 import re
 
 from zero960_env.client import Zero960Client
 from zero960_env.models import Zero960Action
 
+EXAMPLE_WRITE_ACTION = json.dumps(
+    {
+        "action_type": "write_file",
+        "path": "eval.py",
+        "content": (
+            "from __future__ import annotations\n\n"
+            "import chess\n\n"
+            "PIECE_VALUES = {\n"
+            "    chess.PAWN: 100,\n"
+            "    chess.KNIGHT: 320,\n"
+            "    chess.BISHOP: 330,\n"
+            "    chess.ROOK: 500,\n"
+            "    chess.QUEEN: 900,\n"
+            "    chess.KING: 0,\n"
+            "}\n\n"
+            "def evaluate(board: chess.Board) -> int:\n"
+            "    score = 0\n"
+            "    for piece_type, piece_value in PIECE_VALUES.items():\n"
+            "        score += piece_value * len(board.pieces(piece_type, chess.WHITE))\n"
+            "        score -= piece_value * len(board.pieces(piece_type, chess.BLACK))\n"
+            "    return score\n"
+        ),
+    }
+)
+
+ACTION_SCHEMA_TEXT = (
+    "Return exactly one JSON object matching one of these shapes:\n"
+    '1. {"action_type":"write_file","path":"eval.py","content":"<full eval.py source>"}\n'
+    '2. {"action_type":"run_match"}\n'
+    '3. {"action_type":"finish"}\n'
+    '4. {"action_type":"run_static_eval"}\n'
+    '5. {"action_type":"read_file","path":"eval.py"}'
+)
+
+ACTION_CHOICE_MAP = {
+    "1": "write_file",
+    "2": "run_match",
+    "3": "finish",
+    "4": "run_static_eval",
+    "5": "read_file",
+}
+
+TRAIN_ACTION_REWARD_BIAS = {
+    "write_file": 0.35,
+    "run_match": -0.15,
+    "finish": -0.30,
+    "run_static_eval": -0.25,
+    "read_file": -0.30,
+}
+
 SYSTEM_PROMPT = (
     "You are a Chess960 evaluation engineer. You can take ONE action per turn.\n"
+    "Respond with exactly one JSON object and no extra text.\n"
+    f"{ACTION_SCHEMA_TEXT}\n"
+    "Actions:\n"
     '  {"action_type":"read_file","path":"eval.py"}\n'
+    '  {"action_type":"write_file","path":"eval.py","content":"<full replacement eval.py>"}\n'
     '  {"action_type":"run_static_eval"}\n'
     '  {"action_type":"run_match"}\n'
     '  {"action_type":"finish"}\n'
     "\n"
+    "Important rules:\n"
+    "- The full current eval.py is already included in the observation, so read_file is usually unnecessary.\n"
+    "- High-reward loop: write_file a valid full replacement, run_match, then finish.\n"
+    "- Repeating run_static_eval, finishing before a write, or finishing before an explicit match is penalized.\n"
+    "- If you write code, keep it short and valid Python that defines evaluate(board).\n"
+    "- Do not output analysis, markdown, XML tags, or prose. Do not emit <think> blocks.\n"
+    "\n"
+    "Examples:\n"
+    f"Fresh episode best first move:\n{EXAMPLE_WRITE_ACTION}\n"
+    'After a valid write, best next move:\n{"action_type":"run_match"}\n'
+    'After a match score is available, best next move:\n{"action_type":"finish"}'
 )
 
  f"Position index: {obs.start_position}\n"
107
  f"Steps remaining: {obs.remaining_steps}\n"
108
  f"Last match score: {obs.last_match_score}\n"
109
+ f"Has valid edit: {obs.has_valid_edit}\n"
110
+ f"Has explicit match: {obs.has_run_match}\n"
111
+ f"Suggested actions: {', '.join(obs.suggested_actions)}\n"
112
+ f"Workflow hint: {obs.workflow_hint}\n"
113
  f"History: {obs.history}\n\n"
114
  f"Current eval.py:\n```python\n{eval_code}\n```\n\n"
115
+ f"{ACTION_SCHEMA_TEXT}\n"
116
+ "Choose your next action. Output JSON only."
117
  )
118
  return user_msg
119
 
 
     ]
 
 
+def format_action_selection_messages(obs) -> list[dict[str, str]]:
+    """Ask the model to choose only the next action type ID."""
+    eval_code = obs.file_contents.get("eval.py", "<not read yet>")
+    return [
+        {
+            "role": "system",
+            "content": (
+                "Choose the next action for a Chess960 eval-editing task.\n"
+                "Return exactly one digit and nothing else.\n"
+                "1 = write_file\n"
+                "2 = run_match\n"
+                "3 = finish\n"
+                "4 = run_static_eval\n"
+                "5 = read_file"
+            ),
+        },
+        {
+            "role": "user",
+            "content": (
+                f"Steps remaining: {obs.remaining_steps}\n"
+                f"Last match score: {obs.last_match_score}\n"
+                f"Has valid edit: {obs.has_valid_edit}\n"
+                f"Has explicit match: {obs.has_run_match}\n"
+                f"Suggested actions: {', '.join(obs.suggested_actions)}\n"
+                f"Workflow hint: {obs.workflow_hint}\n"
+                f"History: {obs.history}\n\n"
+                f"Current eval.py:\n```python\n{eval_code}\n```\n\n"
+                "Choose the best next action ID. Return exactly one digit."
+            ),
+        },
+    ]
+
+
+def format_write_messages(obs) -> list[dict[str, str]]:
+    """Ask the model to output a full replacement eval.py file only."""
+    eval_code = obs.file_contents.get("eval.py", "<not read yet>")
+    write_prefix = build_write_prefix(eval_code)
+    return [
+        {
+            "role": "system",
+            "content": (
+                "Continue a Python file for a Chess960 engine.\n"
+                "The assistant response is appended directly after a provided prefix.\n"
+                "Output only the remaining Python lines after the prefix.\n"
+                "Do not repeat the prefix. No markdown, no prose, no JSON, no <think>."
+            ),
+        },
+        {
+            "role": "user",
+            "content": (
+                f"Steps remaining: {obs.remaining_steps}\n"
+                f"Last match score: {obs.last_match_score}\n"
+                f"Workflow hint: {obs.workflow_hint}\n"
+                "Improve the evaluation function while keeping valid Python that defines evaluate(board).\n"
+                "You are completing the file after this exact prefix:\n\n"
+                f"```python\n{write_prefix}```"
+            ),
+        },
+    ]
+
+
+def apply_action_chat_template(tokenizer, messages: list[dict[str, str]]) -> str:
+    """Apply Qwen chat template while disabling thinking when the template supports it."""
+    template_attempts = [
+        {"chat_template_kwargs": {"enable_thinking": False}},
+        {"enable_thinking": False},
+        {},
+    ]
+    for extra_kwargs in template_attempts:
+        try:
+            return tokenizer.apply_chat_template(
+                messages,
+                tokenize=False,
+                add_generation_prompt=True,
+                **extra_kwargs,
+            )
+        except TypeError:
+            continue
+    return tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+
+
+def strip_reasoning(text: str) -> str:
+    """Remove common reasoning wrappers before JSON parsing."""
+    cleaned = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL | re.IGNORECASE)
+    cleaned = re.sub(r"<\|im_start\|>assistant\s*", "", cleaned)
+    cleaned = re.sub(r"<\|im_end\|>", "", cleaned)
+    return cleaned.strip()
+
+
+def extract_python_source(text: str) -> str:
+    """Extract raw Python source from model output."""
+    cleaned = strip_reasoning(text)
+    fenced = re.findall(r"```(?:python)?\s*(.*?)\s*```", cleaned, re.DOTALL)
+    if fenced:
+        return fenced[0].strip()
+    return cleaned.strip()
+
+
+def build_write_prefix(current_code: str) -> str:
+    """Build a stable file prefix that the model must continue."""
+    marker = "def evaluate(board: chess.Board) -> int:\n"
+    match = re.search(re.escape(marker), current_code)
+    if match:
+        return current_code[:match.end()] + "    score = 0\n"
+    return (
+        "from __future__ import annotations\n\n"
+        "import chess\n\n"
+        "PIECE_VALUES = {\n"
+        "    chess.PAWN: 100,\n"
+        "    chess.KNIGHT: 320,\n"
+        "    chess.BISHOP: 330,\n"
+        "    chess.ROOK: 500,\n"
+        "    chess.QUEEN: 900,\n"
+        "    chess.KING: 0,\n"
+        "}\n\n"
+        "def evaluate(board: chess.Board) -> int:\n"
+        "    score = 0\n"
+    )
+
+
+def extract_python_continuation(text: str) -> str:
+    """Extract only indented Python lines for the evaluate() body continuation."""
+    cleaned = extract_python_source(text)
+    lines = cleaned.splitlines()
+    kept: list[str] = []
+    started = False
+
+    for line in lines:
+        if not started:
+            if not line.strip():
+                continue
+            if line.startswith(" ") or line.startswith("\t"):
+                started = True
+                kept.append(line)
+                continue
+            if re.match(r"(for|if|elif|else|while|return|score|white_|black_|center_|mobility_|piece_|pawn_|king_|board)", line.strip()):
+                started = True
+                kept.append(f"    {line.strip()}")
+                continue
+            continue
+
+        if line.strip() and not (line.startswith(" ") or line.startswith("\t")):
+            break
+        kept.append(line)
+
+    return "\n".join(kept).rstrip() + "\n" if kept else ""
+
+
+def fallback_eval_tail(current_code: str) -> str:
+    """Reuse the existing evaluate() body tail as a safe syntax fallback."""
+    marker = "    score = 0\n"
+    index = current_code.find(marker)
+    if index == -1:
+        return "    return score\n"
+    return current_code[index + len(marker):].rstrip() + "\n"
+
+
+def choose_action_id(model, tokenizer, prompt: str) -> tuple[str, dict[str, float]]:
+    """Score a fixed set of action IDs and return the most likely one."""
+    import torch
+
+    prompt_inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    prompt_input_ids = prompt_inputs["input_ids"]
+    prompt_attention_mask = prompt_inputs["attention_mask"]
+    option_scores: dict[str, float] = {}
+
+    with torch.no_grad():
+        for option_id in ACTION_CHOICE_MAP:
+            option_tokens = tokenizer(option_id, add_special_tokens=False, return_tensors="pt")
+            option_input_ids = option_tokens["input_ids"].to(model.device)
+            option_attention_mask = option_tokens["attention_mask"].to(model.device)
+
+            full_input_ids = torch.cat([prompt_input_ids, option_input_ids], dim=1)
+            full_attention_mask = torch.cat([prompt_attention_mask, option_attention_mask], dim=1)
+            outputs = model(input_ids=full_input_ids, attention_mask=full_attention_mask)
+            log_probs = outputs.logits[:, prompt_input_ids.shape[1] - 1:-1, :].log_softmax(dim=-1)
+
+            token_log_prob = 0.0
+            for index in range(option_input_ids.shape[1]):
+                token_id = option_input_ids[0, index]
+                token_log_prob += float(log_probs[0, index, token_id])
+            option_scores[option_id] = token_log_prob / max(option_input_ids.shape[1], 1)
+
+    best_option = max(option_scores, key=option_scores.get)
+    return best_option, option_scores
+
+
+def parse_action_choice(text: str) -> str:
+    """Parse a one-token action ID from a completion."""
+    cleaned = strip_reasoning(text)
+    digit_match = re.search(r"\b([1-5])\b", cleaned)
+    if digit_match:
+        return digit_match.group(1)
+
+    lowered = cleaned.lower()
+    for action_id, action_type in ACTION_CHOICE_MAP.items():
+        if action_type in lowered:
+            return action_id
+    return "3"
+
+
+def build_training_write_code(current_code: str, variant_index: int = 0) -> str:
+    """Apply a deterministic valid edit so GRPO can learn the task loop first."""
+    candidates = [
+        (
+            "CENTER_ATTACK_BONUS = 3",
+            "CENTER_ATTACK_BONUS = 4",
+        ),
+        (
+            "BISHOP_PAIR_BONUS = 35",
+            "BISHOP_PAIR_BONUS = 45",
+        ),
+        (
+            "ROOK_OPEN_FILE_BONUS = 20",
+            "ROOK_OPEN_FILE_BONUS = 24",
+        ),
+        (
+            "PASSED_PAWN_BONUS_BY_RANK = [0, 5, 10, 18, 28, 42, 60, 0]",
+            "PASSED_PAWN_BONUS_BY_RANK = [0, 6, 12, 20, 32, 48, 68, 0]",
+        ),
+    ]
+
+    for offset in range(len(candidates)):
+        source, target = candidates[(variant_index + offset) % len(candidates)]
+        if source in current_code:
+            candidate_code = current_code.replace(source, target, 1)
+            try:
+                ast.parse(candidate_code, filename="eval.py")
+            except SyntaxError:
+                continue
+            if candidate_code != current_code:
+                return candidate_code
+    return current_code
+
+
+def build_training_action(choice_id: str, obs, variant_index: int = 0) -> Zero960Action:
+    """Convert an action-choice completion into a concrete env action."""
+    action_type = ACTION_CHOICE_MAP.get(choice_id, "finish")
+    if action_type == "write_file":
+        current_code = obs.file_contents.get("eval.py", "")
+        content = build_training_write_code(current_code, variant_index=variant_index)
+        return Zero960Action(action_type="write_file", path="eval.py", content=content)
+    if action_type == "read_file":
+        return Zero960Action(action_type="read_file", path="eval.py")
+    return Zero960Action(action_type=action_type)
+
+
+def generate_write_action(model, tokenizer, obs) -> tuple[Zero960Action, str]:
+    """Generate the full eval.py replacement after action type selection."""
+    current_code = obs.file_contents.get("eval.py", "")
+    write_prefix = build_write_prefix(current_code)
+    write_prompt = apply_action_chat_template(tokenizer, format_write_messages(obs)) + write_prefix
+    inputs = tokenizer(write_prompt, return_tensors="pt").to(model.device)
+    outputs = model.generate(
+        **inputs,
+        max_new_tokens=256,
+        do_sample=False,
+    )
+    generated = tokenizer.decode(
+        outputs[0][inputs["input_ids"].shape[1]:],
+        skip_special_tokens=True,
+    )
+    continuation = extract_python_continuation(generated)
+    if not continuation:
+        continuation = fallback_eval_tail(current_code)
+    code = write_prefix + continuation
+    try:
+        ast.parse(code, filename="eval.py")
+    except SyntaxError:
+        code = write_prefix + fallback_eval_tail(current_code)
+    return Zero960Action(action_type="write_file", path="eval.py", content=code), generated
+
+
+def choose_structured_action(
+    model,
+    tokenizer,
+    obs,
+    deterministic_write: bool = False,
+) -> tuple[Zero960Action, dict[str, float], str | None]:
+    """Choose action type via fixed-option scoring, then generate code only if needed."""
+    action_prompt = apply_action_chat_template(tokenizer, format_action_selection_messages(obs))
+    action_id, scores = choose_action_id(model, tokenizer, action_prompt)
+    adjusted_scores = dict(scores)
+
+    # Make the policy respect the environment workflow instead of repeatedly editing.
+    if obs.has_valid_edit and not obs.has_run_match:
+        adjusted_scores["2"] += 3.0
+        adjusted_scores["1"] -= 2.0
+        adjusted_scores["5"] -= 1.5
+        adjusted_scores["4"] -= 1.5
+    elif obs.has_run_match:
+        if obs.last_match_score is not None and (obs.last_match_score >= 0.25 or obs.remaining_steps <= 2):
+            adjusted_scores["3"] += 2.5
+            adjusted_scores["1"] -= 1.0
+            adjusted_scores["5"] -= 1.0
+            adjusted_scores["4"] -= 1.0
+        else:
+            adjusted_scores["1"] += 1.0
+            adjusted_scores["3"] += 0.5
+
+    action_id = max(adjusted_scores, key=adjusted_scores.get)
+    action_type = ACTION_CHOICE_MAP[action_id]
+    if action_type == "write_file":
+        if deterministic_write:
+            action = build_training_action("1", obs, variant_index=max(obs.remaining_steps, 0))
+            return action, adjusted_scores, "[deterministic write template]"
+        action, raw_code_output = generate_write_action(model, tokenizer, obs)
+        return action, adjusted_scores, raw_code_output
+    if action_type == "read_file":
+        return Zero960Action(action_type="read_file", path="eval.py"), adjusted_scores, None
+    return Zero960Action(action_type=action_type), adjusted_scores, None
+
+
+def _extract_balanced_json_objects(text: str) -> list[str]:
+    """Return brace-balanced JSON object candidates from free-form model output."""
+    candidates: list[str] = []
+    start: int | None = None
+    depth = 0
+    in_string = False
+    escape = False
+
+    for index, char in enumerate(text):
+        if start is None:
+            if char == "{":
+                start = index
+                depth = 1
+                in_string = False
+                escape = False
+            continue
+
+        if in_string:
+            if escape:
+                escape = False
+            elif char == "\\":
+                escape = True
+            elif char == '"':
+                in_string = False
+            continue
+
+        if char == '"':
+            in_string = True
+        elif char == "{":
+            depth += 1
+        elif char == "}":
+            depth -= 1
+            if depth == 0:
+                candidates.append(text[start:index + 1])
+                start = None
+
+    return candidates
+
 def parse_llm_output(text: str) -> Zero960Action:
     """Best-effort parse of LLM output into a Zero960Action."""
+    cleaned = strip_reasoning(text)
+    fenced_match = re.findall(r"```(?:json)?\s*(\{.*?\})\s*```", cleaned, re.DOTALL)
+    for candidate in fenced_match + _extract_balanced_json_objects(cleaned):
         try:
+            data = json.loads(candidate)
             return Zero960Action(**data)
         except (json.JSONDecodeError, ValueError):
+            continue
+
+    action_match = re.search(r'"action_type"\s*:\s*"([^"]+)"', cleaned)
+    if action_match:
+        action_type = action_match.group(1)
+        if action_type in {"run_static_eval", "run_match", "finish"}:
+            return Zero960Action(action_type=action_type)
+
+    lowered = cleaned.lower()
+    if "run_match" in lowered:
+        return Zero960Action(action_type="run_match")
+    if "write_file" in lowered and "eval.py" in lowered:
+        return Zero960Action(action_type="read_file", path="eval.py")
     return Zero960Action(action_type="finish")
 
 def run_handcrafted_rollout(base_url: str) -> RolloutSummary:
+    """Quick demo: apply a tiny valid edit, run a match, then finish."""
     with Zero960Client(base_url=base_url) as client:
         result = client.reset()
         obs = result.observation
 
+        current_code = obs.file_contents["eval.py"]
+        edited_code = current_code.replace("score += 15 *", "score += 20 *", 1)
+        result = client.step(
+            Zero960Action(
+                action_type="write_file",
+                path="eval.py",
+                content=edited_code,
+            )
+        )
         obs = result.observation
 
         if obs.remaining_steps > 1:
             result = client.step(Zero960Action(action_type="run_match"))
             obs = result.observation
 
 def run_inference_test(
     base_url: str,
     model_name: str = "Qwen/Qwen3.5-9B",
+    tokenizer_name: str | None = None,
     max_episode_steps: int = 6,
+    deterministic_write: bool = True,
 ) -> RolloutSummary:
     """Run a single episode with Qwen generating actions against the live env."""
     from transformers import AutoModelForCausalLM, AutoTokenizer
 
     print(f"Loading {model_name}...")
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name or model_name)
     model = AutoModelForCausalLM.from_pretrained(
         model_name, torch_dtype="auto", device_map="auto",
     )
 
         if result.done:
             break
 
+        print(f"\n--- Step {step_i + 1} ---")
+        action, action_scores, raw_code_output = choose_structured_action(
+            model,
+            tokenizer,
+            obs,
+            deterministic_write=deterministic_write,
         )
+        score_text = ", ".join(
+            f"{choice}:{score:.3f}" for choice, score in sorted(action_scores.items())
         )
+        print(f"Action scores: {score_text}")
         print(f"Parsed action: {action.action_type}", end="")
         if action.path:
             print(f" path={action.path}", end="")
         print()
+        if raw_code_output is not None:
+            print(f"Write output: {raw_code_output[:300]}...")
 
         result = client.step(action)
         obs = result.observation
         print(f"Status: {obs.status_message}")
+        print(f"Step reward: {result.reward}")
 
         if not result.done:
             result = client.step(Zero960Action(action_type="finish"))
642
  prompts = []
643
  for _ in range(n):
644
  result = env.reset()
645
+ msgs = format_action_selection_messages(result.observation)
646
+ prompt_text = apply_action_chat_template(tokenizer, msgs)
 
 
647
  prompts.append(prompt_text)
648
  return Dataset.from_dict({"prompt": prompts})
649
 
 
              "gen": i,
              "completion_preview": completion[:500],
              "completion_len": len(completion),
+             "step_rewards": [],
          }
          try:
              result = env.reset()
              obs = result.observation

+             choice_id = parse_action_choice(completion)
+             action = build_training_action(choice_id, obs, variant_index=step_n + i)
+             entry["choice_id"] = choice_id
              entry["parsed_action"] = action.action_type
              entry["parsed_path"] = action.path
              if action.action_type == "write_file" and action.content:
                  entry["code_preview"] = action.content[:300]
                  entry["code_len"] = len(action.content)
+                 entry["code_changed"] = action.content != obs.file_contents.get("eval.py", "")

              result = env.step(action)
              obs = result.observation
              entry["env_status_1"] = obs.status_message
+             entry["step_rewards"].append(float(result.reward or 0.0))

              # If the model wrote code, run a match to get a real score
              if not result.done and action.action_type == "write_file":
                  result = env.step(Zero960Action(action_type="run_match"))
                  obs = result.observation
                  entry["match_score"] = obs.last_match_score
+                 entry["step_rewards"].append(float(result.reward or 0.0))

              # Finish to get terminal reward
              if not result.done:
                  result = env.step(Zero960Action(action_type="finish"))
+                 entry["step_rewards"].append(float(result.reward or 0.0))

+             reward = float(result.reward or 0.0) + TRAIN_ACTION_REWARD_BIAS[action.action_type]
              rewards.append(reward)
              entry["reward"] = reward
+             entry["reward_bias"] = TRAIN_ACTION_REWARD_BIAS[action.action_type]
          except Exception as exc:
              rewards.append(0.0)
              entry["reward"] = 0.0
 
          learning_rate=5e-6,
          logging_steps=1,
          num_generations=num_generations,
+         max_completion_length=4,
          bf16=True,
+         gradient_checkpointing=False,
          report_to="none",
      )
 
 
      )
      parser.add_argument("--base-url", default="http://127.0.0.1:8000")
      parser.add_argument("--model", default="Qwen/Qwen3.5-9B")
+     parser.add_argument(
+         "--tokenizer",
+         default=None,
+         help="Optional tokenizer path/name for infer mode when loading a checkpoint without tokenizer files.",
+     )
      parser.add_argument("--steps", type=int, default=20)
      parser.add_argument("--num-generations", type=int, default=4)
      parser.add_argument("--max-turns", type=int, default=6)
 
      summary = run_inference_test(
          base_url=args.base_url,
          model_name=args.model,
+         tokenizer_name=args.tokenizer,
      )
      print({
          "reward": summary.reward,
train/sft_student.py ADDED
@@ -0,0 +1,243 @@
+ """Supervised fine-tuning for a bounded-action 0x960 student policy."""
+
+ from __future__ import annotations
+
+ import argparse
+ import glob
+ import json
+ import random
+ from collections import Counter
+ from pathlib import Path
+
+ from datasets import Dataset
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from trl import SFTConfig, SFTTrainer
+
+
+ ALLOWED_ACTION_KEYS = {"action_type", "path", "content"}
+
+
+ def _resolve_input_paths(explicit_paths: list[str], data_glob: str) -> list[Path]:
+     paths = [Path(path) for path in explicit_paths]
+     paths.extend(Path(path) for path in glob.glob(data_glob))
+     unique_paths = sorted({path.resolve() for path in paths if path.exists()})
+     if not unique_paths:
+         raise FileNotFoundError(
+             "no SFT data files found; pass --data-path or adjust --data-glob"
+         )
+     return unique_paths
+
+
+ def _validate_record(payload: dict, source_path: Path, line_number: int) -> dict | None:
+     messages = payload.get("messages")
+     metadata = payload.get("metadata", {})
+     if not isinstance(messages, list) or len(messages) != 3:
+         return None
+
+     roles = [message.get("role") for message in messages if isinstance(message, dict)]
+     if roles != ["system", "user", "assistant"]:
+         return None
+
+     assistant_content = messages[-1].get("content")
+     if not isinstance(assistant_content, str):
+         return None
+
+     try:
+         action_payload = json.loads(assistant_content)
+     except json.JSONDecodeError:
+         return None
+
+     if not isinstance(action_payload, dict):
+         return None
+     if set(action_payload) != ALLOWED_ACTION_KEYS:
+         return None
+     if action_payload["action_type"] not in {
+         "read_file",
+         "write_file",
+         "run_static_eval",
+         "run_match",
+         "finish",
+     }:
+         return None
+
+     final_reward = metadata.get("final_reward")
+     if final_reward is not None:
+         final_reward = float(final_reward)
+
+     return {
+         "messages": messages,
+         "metadata": {
+             "source_path": str(source_path),
+             "line_number": line_number,
+             "episode_index": metadata.get("episode_index"),
+             "turn_index": metadata.get("turn_index"),
+             "teacher_model": metadata.get("teacher_model"),
+             "final_reward": final_reward,
+         },
+         "action_type": action_payload["action_type"],
+     }
+
+
+ def load_sft_records(
+     input_paths: list[Path],
+     min_final_reward: float,
+     max_examples: int | None,
+     seed: int,
+ ) -> tuple[list[dict], dict]:
+     records: list[dict] = []
+     skipped_invalid = 0
+     skipped_low_reward = 0
+     dedupe_keys: set[str] = set()
+
+     for input_path in input_paths:
+         for line_number, line in enumerate(input_path.read_text().splitlines(), start=1):
+             if not line.strip():
+                 continue
+             payload = json.loads(line)
+             record = _validate_record(payload, input_path, line_number)
+             if record is None:
+                 skipped_invalid += 1
+                 continue
+             final_reward = record["metadata"]["final_reward"]
+             if final_reward is not None and final_reward < min_final_reward:
+                 skipped_low_reward += 1
+                 continue
+
+             dedupe_key = json.dumps(record["messages"], sort_keys=True)
+             if dedupe_key in dedupe_keys:
+                 continue
+             dedupe_keys.add(dedupe_key)
+             records.append(record)
+
+     random.Random(seed).shuffle(records)
+     if max_examples is not None:
+         records = records[:max_examples]
+
+     stats = {
+         "input_files": [str(path) for path in input_paths],
+         "records_kept": len(records),
+         "skipped_invalid": skipped_invalid,
+         "skipped_low_reward": skipped_low_reward,
+         "action_counts": dict(Counter(record["action_type"] for record in records)),
+     }
+     return records, stats
+
+
+ def split_records(records: list[dict], eval_fraction: float) -> tuple[list[dict], list[dict]]:
+     if not records:
+         return [], []
+     if eval_fraction <= 0 or len(records) < 10:
+         return records, []
+     eval_size = max(1, int(len(records) * eval_fraction))
+     if eval_size >= len(records):
+         eval_size = len(records) - 1
+     return records[eval_size:], records[:eval_size]
+
+
+ def build_dataset(records: list[dict]) -> Dataset:
+     return Dataset.from_list(
+         [
+             {
+                 "messages": record["messages"],
+                 "metadata": record["metadata"],
+             }
+             for record in records
+         ]
+     )
+
+
+ def parse_args() -> argparse.Namespace:
+     parser = argparse.ArgumentParser(description="SFT a bounded-action 0x960 student model.")
+     parser.add_argument("--model", default="Qwen/Qwen3.5-0.8B")
+     parser.add_argument("--data-path", action="append", default=[])
+     parser.add_argument("--data-glob", default="outputs/codex_distill/sft_samples_*.jsonl")
+     parser.add_argument("--output-dir", default="outputs/sft_student")
+     parser.add_argument("--min-final-reward", type=float, default=0.4)
+     parser.add_argument("--max-examples", type=int, default=None)
+     parser.add_argument("--eval-fraction", type=float, default=0.1)
+     parser.add_argument("--seed", type=int, default=42)
+     parser.add_argument("--per-device-train-batch-size", type=int, default=1)
+     parser.add_argument("--per-device-eval-batch-size", type=int, default=1)
+     parser.add_argument("--gradient-accumulation-steps", type=int, default=8)
+     parser.add_argument("--learning-rate", type=float, default=2e-5)
+     parser.add_argument("--num-train-epochs", type=float, default=3.0)
+     parser.add_argument("--max-steps", type=int, default=-1)
+     parser.add_argument("--logging-steps", type=int, default=5)
+     parser.add_argument("--save-total-limit", type=int, default=2)
+     parser.add_argument("--max-length", type=int, default=1024)
+     parser.add_argument("--assistant-only-loss", action="store_true")
+     parser.add_argument("--dry-run", action="store_true")
+     return parser.parse_args()
+
+
+ def main() -> None:
+     args = parse_args()
+     input_paths = _resolve_input_paths(args.data_path, args.data_glob)
+     records, stats = load_sft_records(
+         input_paths=input_paths,
+         min_final_reward=args.min_final_reward,
+         max_examples=args.max_examples,
+         seed=args.seed,
+     )
+     if not records:
+         raise RuntimeError("no usable SFT rows found after validation and filtering")
+
+     train_records, eval_records = split_records(records, args.eval_fraction)
+     stats["train_records"] = len(train_records)
+     stats["eval_records"] = len(eval_records)
+     print(stats)
+
+     if args.dry_run:
+         return
+
+     output_dir = Path(args.output_dir)
+     output_dir.mkdir(parents=True, exist_ok=True)
+
+     tokenizer = AutoTokenizer.from_pretrained(args.model)
+     if tokenizer.pad_token is None:
+         tokenizer.pad_token = tokenizer.eos_token
+
+     use_cuda = torch.cuda.is_available()
+     use_bf16 = use_cuda and torch.cuda.is_bf16_supported()
+     model_kwargs = {"torch_dtype": torch.bfloat16} if use_bf16 else {}
+     if tokenizer.padding_side != "right":
+         tokenizer.padding_side = "right"
+
+     train_dataset = build_dataset(train_records)
+     eval_dataset = build_dataset(eval_records) if eval_records else None
+
+     trainer = SFTTrainer(
+         model=AutoModelForCausalLM.from_pretrained(args.model, **model_kwargs),
+         args=SFTConfig(
+             output_dir=str(output_dir),
+             per_device_train_batch_size=args.per_device_train_batch_size,
+             per_device_eval_batch_size=args.per_device_eval_batch_size,
+             gradient_accumulation_steps=args.gradient_accumulation_steps,
+             learning_rate=args.learning_rate,
+             num_train_epochs=args.num_train_epochs,
+             max_steps=args.max_steps,
+             logging_steps=args.logging_steps,
+             save_strategy="epoch",
+             eval_strategy="epoch" if eval_dataset is not None else "no",
+             save_total_limit=args.save_total_limit,
+             report_to="none",
+             bf16=use_bf16,
+             gradient_checkpointing=use_cuda,
+             assistant_only_loss=args.assistant_only_loss,
+             max_length=args.max_length,
+             remove_unused_columns=False,
+             dataset_num_proc=1,
+             seed=args.seed,
+         ),
+         train_dataset=train_dataset,
+         eval_dataset=eval_dataset,
+         processing_class=tokenizer,
+     )
+     trainer.train()
+     trainer.save_model(str(output_dir / "final"))
+     tokenizer.save_pretrained(str(output_dir / "final"))
+
+
+ if __name__ == "__main__":
+     main()
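For reference, a minimal sketch of one JSONL row that the validator in `train/sft_student.py` would accept: exactly three system/user/assistant messages, with the assistant content a JSON object carrying exactly the `action_type`/`path`/`content` keys and a whitelisted action type. The message texts and metadata values below are made up for illustration.

```python
import json

# Hypothetical SFT row matching the validator's structural rules; the
# prompt text, path, and reward are illustrative, not from real data.
record = {
    "messages": [
        {"role": "system", "content": "You improve a Chess960 engine by editing eval.py."},
        {"role": "user", "content": "Observation: eval.py currently scores material only."},
        {
            "role": "assistant",
            "content": json.dumps(
                {"action_type": "read_file", "path": "eval.py", "content": None}
            ),
        },
    ],
    "metadata": {"episode_index": 0, "turn_index": 0, "final_reward": 0.6},
}

# Replicate the validator's checks: role order, exact action keys, action whitelist.
action = json.loads(record["messages"][-1]["content"])
assert [m["role"] for m in record["messages"]] == ["system", "user", "assistant"]
assert set(action) == {"action_type", "path", "content"}
assert action["action_type"] in {"read_file", "write_file", "run_static_eval", "run_match", "finish"}

line = json.dumps(record)  # one JSONL line as consumed by load_sft_records
```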
uv.lock ADDED
The diff for this file is too large to render. See raw diff